AWS’s Bedrock LLM Service Now Includes Prompt Routing and Caching


As businesses move generative AI from small prototypes into production, they are becoming more cost-conscious. Using large language models (LLMs) is, after all, not cheap. One way to cut costs is to return to an old idea: caching. Another is to route simpler queries to smaller, more economical models. At its re:Invent conference in Las Vegas on Wednesday, AWS announced both of these features for its Bedrock LLM hosting service.

First, the caching service. Imagine there is a document and several people are asking questions about it. “Every time you’re paying,” Atul Deo, Bedrock’s director of product, told me. He added that context windows keep getting longer; with Nova, for example, there will be 300k [tokens of] context and 2 million [tokens of] context, and he expects that to climb even higher next year.

In essence, caching means you don’t have to pay for the model to do redundant work and process the same (or very similar) queries over and over again. AWS says this can cut costs by up to 90%, with the added benefit of much lower latency for getting an answer back from the model (up to 85% lower, according to AWS). Adobe, which tested prompt caching for some of its generative AI applications on Bedrock, saw a 72% reduction in response time.
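To make the mechanics a bit more concrete, here is a rough sketch of what this can look like from a developer’s side, using boto3 and Bedrock’s Converse API. The cachePoint marker, the model ID, and the file name are assumptions for illustration rather than a definitive implementation; which fields and models support caching depends on the current Bedrock documentation.

```python
# Minimal sketch: reuse a long, shared document across many questions so the
# model only has to process it once. The cachePoint block and model ID are
# assumptions for illustration; check Bedrock's docs for exact field names.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# A long document that many users ask questions about.
document_text = open("annual_report.txt").read()

def ask(question: str) -> str:
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # assumed model ID for illustration
        system=[
            {"text": "Answer questions using only the document provided."},
            {"text": document_text},
            # Everything before this marker can be cached and reused across
            # requests, so the long document isn't reprocessed every time.
            {"cachePoint": {"type": "default"}},
        ],
        messages=[
            {"role": "user", "content": [{"text": question}]},
        ],
    )
    return response["output"]["message"]["content"][0]["text"]

# Repeated questions about the same document hit the cached prefix.
print(ask("What were total revenues last quarter?"))
print(ask("Summarize the risk factors section."))
```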

The other major new feature is intelligent prompt routing for Bedrock. With it, Bedrock can automatically route prompts to different models within the same model family, helping businesses strike the right balance between performance and cost. The system uses a small language model to predict how each model will perform on a given query and then routes the request accordingly.

Sometimes a query might be very simple. Does it really need to go to the slowest and most expensive model? Probably not. In essence, “you want to create this notion of ‘Hey, at run time, send the right query to the right model based on the incoming prompt,’” Deo explained.
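The idea is simple enough to sketch in a few lines of code. Below is a conceptual illustration of runtime routing with boto3, in which a small, cheap model acts as the judge of prompt complexity. This is not Bedrock’s actual router, which does this server-side on your behalf; the model IDs and the judging prompt are assumptions for illustration.

```python
# Conceptual sketch of prompt routing: a small "judge" model decides whether a
# prompt needs the large model, and the request is sent to the cheaper or the
# more capable model accordingly. Model IDs and the scoring prompt are assumed.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

SMALL_MODEL = "amazon.nova-lite-v1:0"   # cheap and fast (assumed ID)
LARGE_MODEL = "amazon.nova-pro-v1:0"    # more capable, more expensive (assumed ID)

def needs_large_model(prompt: str) -> bool:
    """Ask the small model to judge whether the prompt is complex."""
    judgment = bedrock.converse(
        modelId=SMALL_MODEL,
        messages=[{
            "role": "user",
            "content": [{"text": (
                "Answer only YES or NO. Does the following request require "
                "multi-step reasoning or specialist knowledge?\n\n" + prompt
            )}],
        }],
        inferenceConfig={"maxTokens": 5, "temperature": 0},
    )
    answer = judgment["output"]["message"]["content"][0]["text"]
    return answer.strip().upper().startswith("YES")

def route(prompt: str) -> str:
    """Send the prompt to whichever model the judge deems sufficient."""
    model_id = LARGE_MODEL if needs_large_model(prompt) else SMALL_MODEL
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# A trivial question stays on the small model; a harder one gets routed up.
print(route("What is the capital of France?"))
print(route("Draft a migration plan from our on-prem Oracle cluster to Aurora."))
```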

Of course, the idea of LLM routing is not new. Startups like Martian and a number of open source projects also tackle this, but AWS would likely argue that what sets its offering apart is that the router can direct queries automatically without much human input. It is also limited, though, in that it can only route queries to models within the same model family. Deo told me, however, that the team plans to expand the system and give users more customization options over time.

Finally, AWS is also launching a new Bedrock marketplace. The idea, according to Deo, is that while Amazon partners with many of the larger model providers, there are now hundreds of specialized models, some of which may only have a small number of dedicated users. Because those customers have asked AWS to support them, the company is introducing a marketplace for these models. The main difference is that users will have to provision and manage their infrastructure capacity themselves, something Bedrock normally handles automatically. In total, AWS will offer about 100 of these emerging and specialized models, with more to come.