daBongo LMS AI Training Courses

Claude API in Practice – A Complete Developer Reference

Lesson 5: Production Patterns – Reliability and Cost at Scale

Lesson Objectives

By the end of this lesson, students should be able to:

  • Implement exponential backoff retry for rate limit and server errors
  • Apply prompt caching to reduce cost for repeated context
  • Use batch processing for high-volume offline workloads
  • Design an observability strategy for production API integrations

Lesson Content

Retry strategy.

Rate limits (429) and transient server errors (500, 529) require retry with exponential backoff:

“`python import time, anthropic from anthropic import RateLimitError, APIStatusError

def call_with_retry(client, kwargs, max_retries=3): for attempt in range(max_retries): try: return client.messages.create(kwargs) except RateLimitError: wait = (2 attempt) + (random.random() * 0.5) # jitter time.sleep(wait) except APIStatusError as e: if e.status_code >= 500 and attempt < max_retries – 1: time.sleep(2 attempt) else: raise raise Exception("Max retries exceeded") “`

Jitter (random.random() * 0.5) prevents retry storms when multiple instances hit the limit simultaneously.

Prompt caching.

For integrations where a large system prompt or context document is sent on every request, prompt caching significantly reduces cost. Prompt caching stores a portion of the prompt on Anthropic's servers and charges a lower rate for cache hits (verify current caching pricing and parameters at docs.anthropic.com).

Use case: a customer support integration with a 5,000-token product knowledge base sent on every request. With prompt caching enabled, the first request caches the knowledge base; subsequent requests within the cache TTL pay a reduced rate for the cached portion.

Batch processing.

For high-volume offline workloads (classifying 100,000 documents, processing a dataset), the Anthropic Batch API reduces cost by processing requests asynchronously at a lower rate. Verify current batch API availability and pricing at docs.anthropic.com.

Do not use the batch API for real-time user interactions – it is for offline, non-latency-sensitive workloads.

Observability design.

Minimum production observability for API integrations:

“`python

Log per request

logger.info({ "request_id": response.id, "model": response.model, "input_tokens": response.usage.input_tokens, "output_tokens": response.usage.output_tokens, "stop_reason": response.stop_reason, "latency_ms": elapsed_ms, "feature": "customer_support_chat", # application context "user_id": user_id # for attribution }) “`

This enables: cost attribution by feature, latency monitoring, max_tokens ceiling detection (stop_reason: max_tokens), and debugging of production anomalies.

Cost anomaly detection.

Set alerts:

  • Per-request token count exceeds 2× average – investigate input validation
  • Daily spend exceeds budget threshold – Anthropic console alert
  • Error rate exceeds 1% – alerting system notification

Practical Example

A developer reduces API costs 35% without changing model or output quality: (1) enables prompt caching on a 4,000-token system prompt repeated on every request (−25% cost on cache hits), (2) moves a nightly batch classification job from synchronous API calls to the batch API (−50% cost on that workload), (3) adds input length validation to prevent a class of long user inputs that were inflating token counts (−10% from outlier reduction).

Total: −35%.

All three changes are configuration and application logic – not model changes.

Safety Notes

Retry logic with aggressive backoff and multiple retries can mask application logic errors that produce consistent API errors. Monitor error rates in observability dashboards – a consistent 10% error rate that retries successfully is a symptom that deserves investigation, not just retry coverage. Retries handle transient failures; consistent errors indicate a systematic problem.

Log in and enroll to access lesson quizzes.

Scroll to Top