Problem Statement
The current `GeminiModel` implementation does not support Gemini's explicit context caching feature, which offers up to a 90% cost reduction on cached tokens. While Gemini 2.5 models support implicit caching, it does not work reliably with Strands' request structure (system prompt and tools are sent in `config` rather than `contents`).
Current behavior:
- Every request sends full system prompt + tools (e.g., 13,494 tokens)
- No visibility into cached tokens
- No control over cache lifecycle
- `cached_content_token_count` always returns `None`
Expected behavior:
- Ability to explicitly cache system prompt + tools
- 75-90% discount on cached tokens
- Cache visibility via `usage_metadata.cached_content_token_count`
- Cache lifecycle management (create, delete, TTL)
Proposed Solution
Add explicit context caching support to `GeminiModel`, similar to how `BedrockModel` implements its `cache_prompt` parameter.
API design:

    from strands.models.gemini import GeminiModel

    model = GeminiModel(
        model_id="gemini-2.5-flash",
        client_args={"api_key": "..."},
        enable_caching=True,  # Enable auto-caching
        cache_ttl="3600s",    # Cache TTL (default 1 hour)
    )

    # Or manual cache management
    model.create_cache(system_prompt, tool_specs, ttl="7200s")
    model.delete_cache()
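A minimal sketch of how `create_cache()`/`delete_cache()` might wrap the google-genai SDK's caches API. The `client.caches.create` and `client.caches.delete` calls are real SDK methods; the `GeminiCacheManager` wrapper, its attribute names, and the dict-style config are illustrative assumptions, not Strands or Strands-proposed API.

```python
class GeminiCacheManager:
    """Hypothetical helper sketching the explicit-cache lifecycle.

    `client` is expected to be a google-genai `genai.Client`; only the
    `client.caches.create` / `client.caches.delete` calls are real SDK
    surface, everything else here is an illustrative assumption.
    """

    def __init__(self, client, model_id):
        self.client = client
        self.model_id = model_id
        self.cache = None  # handle to the server-side CachedContent

    def create_cache(self, system_prompt, tool_specs, ttl="3600s"):
        # Store the system prompt + tools server-side once; later requests
        # reference the cache by name instead of resending ~9K tokens.
        self.cache = self.client.caches.create(
            model=self.model_id,
            config={
                "system_instruction": system_prompt,
                "tools": tool_specs,
                "ttl": ttl,
            },
        )
        return self.cache.name

    def delete_cache(self):
        # Explicit deletion frees the cache before its TTL expires.
        if self.cache is not None:
            self.client.caches.delete(name=self.cache.name)
            self.cache = None
```

A test double for `client` is enough to exercise the lifecycle without network access, which is also how the eventual unit tests could work.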
Key features:
- Auto-cache creation: automatically creates a cache on the first request when `enable_caching=True`
- Cache validation: reuses the cache when the system prompt + tools match
- Visibility: exposes `cachedTokens` in `metadata.usage`
- Cache lifecycle: methods to create/delete/manage the cache
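Cache validation could be as simple as fingerprinting the cacheable request prefix; the helper below is a hypothetical sketch of that check, not proposed API.

```python
import hashlib
import json


def cache_fingerprint(system_prompt, tool_specs):
    """Stable hash of the cacheable request prefix (hypothetical helper).

    If this matches the fingerprint recorded when the cache was created,
    the existing cached_content can be reused; otherwise it is stale and
    a new cache must be created.
    """
    payload = json.dumps(
        {"system": system_prompt, "tools": tool_specs},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```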
Implementation Details
Changes needed in `strands/models/gemini.py`:
- Add `enable_caching` and `cache_ttl` to `GeminiConfig`
- Add `create_cache()` and `delete_cache()` methods
- Modify `_format_request_config()` to accept a `cached_content` parameter
- Add cache validation logic in `_format_request()`
- Expose `cached_content_token_count` in metadata
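A rough sketch of the `_format_request_config()` change, assuming the config is assembled as a plain dict before being handed to the google-genai SDK (the real method and its signature may differ; `cached_content` is a real field on the SDK's `GenerateContentConfig`).

```python
def format_request_config(params, system_prompt, tool_specs, cached_content=None):
    """Hypothetical stand-in for GeminiModel._format_request_config().

    When cached_content is set, the system prompt and tools are omitted:
    the Gemini API does not accept a request that resends fields already
    stored in the referenced cache, and omitting them is precisely what
    yields the cached-token discount.
    """
    config = dict(params or {})
    if cached_content is not None:
        config["cached_content"] = cached_content  # e.g. "cachedContents/abc"
    else:
        if system_prompt:
            config["system_instruction"] = system_prompt
        if tool_specs:
            config["tools"] = tool_specs
    return config
```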
Alternative Solutions
- Do nothing: Users pay 5-10x more in token costs
- Rely on implicit caching: Unreliable, no visibility, no control
Additional Context
Tested implementation shows:
- 68% token reduction on real workload
- `cached_content_token_count`: 9,255 out of 13,564 total tokens
- Works with 30+ tools and complex system prompts
- Compatible with existing Strands agent loop
I'm happy to submit a PR with the implementation if this feature request is accepted.
Use Case
Agents with large system prompts or many tools (e.g., 30 tools = ~9K tokens) incur high costs on every request. For production workloads with 1,000+ messages/day, this becomes expensive quickly.
Example cost impact:
- Without caching: 13,564 tokens/msg × 30K msgs/month × $0.00000035 = ~$142/month
- With caching: ~6,600 effective tokens/msg (4,309 uncached + 9,255 cached tokens billed at 25% of the input rate) × 30K msgs/month = ~$69/month
- Savings: ~$73/month (~$876/year)
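The cost figures above can be checked directly, counting each cached token at 25% of a full input token (the 75% end of the quoted discount range):

```python
# Sanity-check of the cost example, using the figures from this issue.
PRICE_PER_TOKEN = 0.00000035  # $/input token
MSGS_PER_MONTH = 30_000
TOTAL_TOKENS = 13_564   # tokens per message
CACHED_TOKENS = 9_255   # cached portion, from the tested implementation

without_caching = TOTAL_TOKENS * MSGS_PER_MONTH * PRICE_PER_TOKEN

# Cached tokens still cost something, just 4x less: fold them into an
# "effective token" count so one multiplication gives the monthly bill.
effective_tokens = (TOTAL_TOKENS - CACHED_TOKENS) + CACHED_TOKENS * 0.25
with_caching = effective_tokens * MSGS_PER_MONTH * PRICE_PER_TOKEN

print(f"without: ${without_caching:.2f}/mo")  # $142.42/mo
print(f"with:    ${with_caching:.2f}/mo")     # $69.54/mo
print(f"savings: ${(without_caching - with_caching) * 12:.2f}/yr")  # $874.60/yr
```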