Claude API in Practice – A Complete Developer Reference By the end of this lesson, students should be able to: Retry strategy. Rate limits (429) and transient server errors (500, 529) require retry with exponential backoff: “`python import time, anthropic from anthropic import RateLimitError, APIStatusError def call_with_retry(client, kwargs, max_retries=3): for attempt in range(max_retries): try: return client.messages.create(kwargs) except RateLimitError: wait = (2 attempt) + (random.random() * 0.5) # jitter time.sleep(wait) except APIStatusError as e: if e.status_code >= 500 and attempt < max_retries – 1: time.sleep(2 attempt) else: raise raise Exception("Max retries exceeded") “` Jitter (random.random() * 0.5) prevents retry storms when multiple instances hit the limit simultaneously. Prompt caching. For integrations where a large system prompt or context document is sent on every request, prompt caching significantly reduces cost. Prompt caching stores a portion of the prompt on Anthropic's servers and charges a lower rate for cache hits (verify current caching pricing and parameters at docs.anthropic.com). Use case: a customer support integration with a 5,000-token product knowledge base sent on every request. With prompt caching enabled, the first request caches the knowledge base; subsequent requests within the cache TTL pay a reduced rate for the cached portion. Batch processing. For high-volume offline workloads (classifying 100,000 documents, processing a dataset), the Anthropic Batch API reduces cost by processing requests asynchronously at a lower rate. Verify current batch API availability and pricing at docs.anthropic.com. Do not use the batch API for real-time user interactions – it is for offline, non-latency-sensitive workloads. Observability design. Minimum production observability for API integrations: “`python logger.info({ "request_id": response.id, "model": response.model, "input_tokens": response.usage.input_tokens, "output_tokens": response.usage.output_tokens, "stop_reason": response.stop_reason, "latency_ms": elapsed_ms, "feature": "customer_support_chat", # application context "user_id": user_id # for attribution }) “` This enables: cost attribution by feature, latency monitoring, max_tokens ceiling detection (stop_reason: max_tokens), and debugging of production anomalies. Cost anomaly detection. Set alerts: A developer reduces API costs 35% without changing model or output quality: (1) enables prompt caching on a 4,000-token system prompt repeated on every request (−25% cost on cache hits), (2) moves a nightly batch classification job from synchronous API calls to the batch API (−50% cost on that workload), (3) adds input length validation to prevent a class of long user inputs that were inflating token counts (−10% from outlier reduction). Total: −35%. All three changes are configuration and application logic – not model changes. Retry logic with aggressive backoff and multiple retries can mask application logic errors that produce consistent API errors. Monitor error rates in observability dashboards – a consistent 10% error rate that retries successfully is a symptom that deserves investigation, not just retry coverage. Retries handle transient failures; consistent errors indicate a systematic problem. Log in and enroll to access lesson quizzes.
Lesson 5: Production Patterns – Reliability and Cost at Scale
Lesson Objectives
Lesson Content
Log per request
Practical Example
Safety Notes