Google Develops ‘Implicit Caching’ To Cut AI Model Expenses

Google is rolling out a feature in its Gemini API that it says will lower the cost of using its latest AI models for third-party developers.

Google calls the feature “implicit caching” and claims it can deliver 75% savings on “repetitive context” sent to models through the Gemini API. The feature supports Google’s Gemini 2.5 Pro and 2.5 Flash models.

Developers will likely welcome the news, given that the cost of using frontier models keeps climbing.

Thanks to the implicit caching we just shipped in the Gemini API, the Gemini 2.5 models automatically enable a 75% cost savings when your request hits a cache 🚢

Additionally, we lowered the minimum tokens required to hit caches to 2K on 2.5 Pro and 1K on 2.5 Flash!

— Logan Kilpatrick (@OfficialLoganK), May 8, 2025

Caching, a technique widely used across the AI industry, cuts computing needs and costs by reusing frequently accessed or pre-computed data from models. For example, a cache can store a model’s answers to questions users ask often, eliminating the need to regenerate a response to the same query.

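As a rough illustration of the idea, here is a toy Python sketch (not Google’s implementation; `generate_answer` is a hypothetical stand-in for an expensive model call):

```python
import hashlib

def generate_answer(prompt: str) -> str:
    """Hypothetical stand-in for an expensive model call."""
    return f"answer to: {prompt}"

_cache: dict[str, str] = {}

def cached_answer(prompt: str) -> str:
    """Serve repeated prompts from the cache instead of re-running the model."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate_answer(prompt)  # cache miss: compute once
    return _cache[key]                         # cache hit: reuse the stored answer
```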

Google previously offered model prompt caching, but only explicitly: developers had to designate their highest-frequency prompts themselves. And while explicit prompt caching was meant to guarantee cost savings, it typically involved a lot of manual work.

Some developers complained that Google’s explicit caching implementation for Gemini 2.5 Pro could lead to unexpectedly high API bills. As the complaints mounted over the past week, the Gemini team apologized and pledged to make changes.

Implicit caching, by contrast, is automatic. It is enabled by default for Gemini 2.5 models and passes cost savings along whenever a Gemini API request hits a cache.

“[W]hen you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it’s eligible for a cache hit,” Google explained in a blog post. “We will dynamically pass cost savings back to you.”

According to Google’s developer documentation, the minimum prompt token count for implicit caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro. That’s not a high bar, so it shouldn’t take much to trigger these automatic savings. (Tokens are the raw units of data that models work with; a thousand tokens is equivalent to roughly 750 words.)

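For a sense of those thresholds, here is a minimal sketch, assuming the `google-genai` Python SDK (the `count_tokens` call and field names reflect our reading of the SDK; `reference_doc.txt` is a hypothetical shared context):

```python
from google import genai

client = genai.Client()  # picks up the Gemini API key from the environment

long_context = open("reference_doc.txt").read()  # hypothetical shared context

count = client.models.count_tokens(
    model="gemini-2.5-flash",
    contents=long_context,
)
# 2.5 Flash needs at least 1,024 prompt tokens for implicit caching;
# 2.5 Pro needs 2,048.
status = "eligible" if count.total_tokens >= 1024 else "below the minimum"
print(f"{count.total_tokens} tokens ({status})")
```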

The new feature does come with some caveats, particularly since Google’s previous claims of cost savings from caching went awry. To increase the chance of implicit cache hits, Google advises developers to keep repeated context at the beginning of requests and to append the context that changes from request to request at the end, as in the sketch below.

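Under those guidelines, a request pattern might look like the following sketch (again assuming the `google-genai` SDK; `manual.txt` and the questions are hypothetical placeholders, and `cached_content_token_count` is, as we read the SDK, the usage field reporting how many input tokens were served from cache):

```python
from google import genai

client = genai.Client()

# Stable, repeated context goes first so consecutive requests share a prefix.
shared_prefix = (
    "You are a support assistant. Answer using only this manual:\n\n"
    + open("manual.txt").read()
)

for question in ["How do I reset my password?", "How do I cancel my plan?"]:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        # Variable content goes last, after the shared prefix.
        contents=shared_prefix + "\n\nQuestion: " + question,
    )
    # Report how many input tokens were served from the cache, if any.
    print(response.usage_metadata.cached_content_token_count)
```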

Google has also offered no third-party verification that the new implicit caching system delivers the automatic savings it promises, so we’ll have to see what early adopters report.