AI Foundations Training


            ← Back to Course

            
                Claude API in Practice – A Complete Developer Reference
                
                    Lesson 5: Production Patterns – Reliability and Cost at Scale                

                            Log in to enroll.
            
                
                
                    Lesson Objectives
By the end of this lesson, students should be able to:
Implement exponential backoff retry for rate limit and server errors
Apply prompt caching to reduce cost for repeated context
Use batch processing for high-volume offline workloads
Design an observability strategy for production API integrations
Lesson Content
Retry strategy.
Rate limits (429) and transient server errors (500, 529) require retry with exponential backoff:
“`python import time, anthropic from anthropic import RateLimitError, APIStatusError
def call_with_retry(client, kwargs, max_retries=3): for attempt in range(max_retries): try: return client.messages.create(kwargs) except RateLimitError: wait = (2  attempt) + (random.random() * 0.5)  # jitter time.sleep(wait) except APIStatusError as e: if e.status_code >= 500 and attempt < max_retries – 1: time.sleep(2  attempt) else: raise raise Exception("Max retries exceeded") “`
Jitter (random.random() * 0.5) prevents retry storms when multiple instances hit the limit simultaneously.
Prompt caching.
For integrations where a large system prompt or context document is sent on every request, prompt caching significantly reduces cost. Prompt caching stores a portion of the prompt on Anthropic's servers and charges a lower rate for cache hits (verify current caching pricing and parameters at docs.anthropic.com).
Use case: a customer support integration with a 5,000-token product knowledge base sent on every request. With prompt caching enabled, the first request caches the knowledge base; subsequent requests within the cache TTL pay a reduced rate for the cached portion.
Batch processing.
For high-volume offline workloads (classifying 100,000 documents, processing a dataset), the Anthropic Batch API reduces cost by processing requests asynchronously at a lower rate. Verify current batch API availability and pricing at docs.anthropic.com.
Do not use the batch API for real-time user interactions – it is for offline, non-latency-sensitive workloads.
Observability design.
Minimum production observability for API integrations:
“`python
Log per request
logger.info({ "request_id": response.id, "model": response.model, "input_tokens": response.usage.input_tokens, "output_tokens": response.usage.output_tokens, "stop_reason": response.stop_reason, "latency_ms": elapsed_ms, "feature": "customer_support_chat",   # application context "user_id": user_id                     # for attribution }) “`
This enables: cost attribution by feature, latency monitoring, max_tokens ceiling detection (stop_reason: max_tokens), and debugging of production anomalies.
Cost anomaly detection.
Set alerts:
Per-request token count exceeds 2× average – investigate input validation
Daily spend exceeds budget threshold – Anthropic console alert
Error rate exceeds 1% – alerting system notification
Practical Example
A developer reduces API costs 35% without changing model or output quality: (1) enables prompt caching on a 4,000-token system prompt repeated on every request (−25% cost on cache hits), (2) moves a nightly batch classification job from synchronous API calls to the batch API (−50% cost on that workload), (3) adds input length validation to prevent a class of long user inputs that were inflating token counts (−10% from outlier reduction).
Total: −35%.
All three changes are configuration and application logic – not model changes.
Safety Notes
Retry logic with aggressive backoff and multiple retries can mask application logic errors that produce consistent API errors. Monitor error rates in observability dashboards – a consistent 10% error rate that retries successfully is a symptom that deserves investigation, not just retry coverage. Retries handle transient failures; consistent errors indicate a systematic problem.
                

                            Log in and enroll to access lesson quizzes.
            
                        
            
                                    
                        Previous Lesson
                        ← Streaming Responses
                    
                            
            
                            
        
                    

            ← Back to Course