Distributed Redis Caching Strategy for High-Frequency API Calls
Our approach to caching real-time freight rate data with Redis, balancing freshness against performance at 2000+ RPS: the layered L1/L2 caching architecture, cache invalidation via pub/sub, and the operational lessons from running it in production.
The Problem
Real-time freight rates change, but not that fast. Carrier APIs were being hammered with the same origin+destination+container combination thousands of times per minute, with most of those calls returning identical data.
The symptom was clear: P95 response time of 2,800ms on rate queries, carrier API rate limits being hit during peak hours, and an infrastructure bill driven primarily by carrier API call volume rather than actual compute. The root cause was equally clear: we had no caching layer, so every user request hit the carrier API directly.
The challenge was not simply adding Redis. The nuance was designing a caching strategy that:
- Kept rate data fresh enough for commercial accuracy (stale rates that have moved can cost the business real money on quoted contracts)
- Eliminated the carrier API call volume that was causing rate limits and latency
- Handled cache invalidation correctly when carriers pushed rate updates
- Maintained correctness under horizontal scaling (multiple API server instances reading and writing cache)
- Did not introduce new failure modes — a caching layer that fails under load should degrade gracefully, not bring down the rate service
Why Layered Caching
A single Redis cache would have solved most of the problem. We added an L1 in-process memory cache because of one specific observation: the same rate query was being made multiple times within the same second from different user sessions viewing the same route. Redis has low latency (~1ms within the same datacenter), but for queries happening at 2000+ RPS, even 1ms added up — and Redis itself would have become a bottleneck at sustained high throughput.
The layered architecture:
Request → In-Memory L1 (10s TTL)
↓ miss
Redis L2 (5min TTL)
↓ miss
Carrier API + write-through to Redis + L1

L1 (in-memory) provides sub-millisecond response for high-frequency identical queries within a 10-second window. L2 (Redis) provides cross-instance data sharing with a 5-minute freshness guarantee. The carrier API is called only when both cache layers miss, which is when genuinely fresh data is needed.
The TTLs were determined empirically: a 2-week analysis of carrier rate change events showed that fewer than 3% of queries would see a rate change within any 5-minute window. A 5-minute L2 TTL therefore meant accepting stale data in roughly 3% of queries in exchange for the carrier API call reduction, a trade-off that worked for our commercial model (we disclosed that quoted rates were valid for 10 minutes and refreshed them on booking).
Implementation
public class CachedRateService : IRateService
{
    private readonly IDistributedCache _redis;
    private readonly IMemoryCache _local;
    private readonly IRateService _inner;
    private readonly ILogger<CachedRateService> _logger;

    public CachedRateService(
        IDistributedCache redis,
        IMemoryCache local,
        IRateService inner,
        ILogger<CachedRateService> logger)
    {
        _redis = redis;
        _local = local;
        _inner = inner;
        _logger = logger;
    }

    public async Task<List<CarrierRate>> GetRatesAsync(RouteKey key)
    {
        var cacheKey = $"rates:{key}";

        // L1: in-process memory
        if (_local.TryGetValue(cacheKey, out List<CarrierRate>? cached))
        {
            _logger.LogDebug("L1 cache hit for {RouteKey}", key);
            return cached!;
        }

        // L2: distributed Redis
        var bytes = await _redis.GetAsync(cacheKey);
        if (bytes is not null)
        {
            var rates = MessagePackSerializer.Deserialize<List<CarrierRate>>(bytes);

            // Populate L1 from the Redis hit to warm the local cache
            _local.Set(cacheKey, rates, TimeSpan.FromSeconds(10));
            _logger.LogDebug("L2 cache hit for {RouteKey}", key);
            return rates;
        }

        // Origin: fetch and populate both layers
        _logger.LogInformation("Cache miss for {RouteKey}; fetching from carrier API", key);
        var fresh = await _inner.GetRatesAsync(key);
        var packed = MessagePackSerializer.Serialize(fresh);
        await _redis.SetAsync(cacheKey, packed, new DistributedCacheEntryOptions
        {
            AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5)
        });
        _local.Set(cacheKey, fresh, TimeSpan.FromSeconds(10));
        return fresh;
    }
}
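The cache key interpolates RouteKey directly, so the type needs a stable string form. The production type is not shown in this post; a minimal shape, assumed here for illustration:

public sealed record RouteKey(string Origin, string Destination, string ContainerType)
{
    // A deterministic ToString makes $"rates:{key}" identical across instances,
    // which is what lets L1 and L2 share entries for the same route
    public override string ToString() => $"{Origin}:{Destination}:{ContainerType}";
}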
Serialisation Choice: MessagePack Over JSON
The serialisation choice mattered at 2000+ RPS. We benchmarked three options:
| Format | Serialise (µs) | Deserialise (µs) | Size (bytes) |
|--------|----------------|------------------|--------------|
| System.Text.Json | 18.2 | 22.4 | 1,284 |
| Newtonsoft.Json | 31.7 | 38.1 | 1,284 |
| MessagePack | 3.1 | 4.8 | 487 |
MessagePack's binary format is 6x faster to serialise, 5x faster to deserialise, and 62% smaller. At 2000+ RPS, this saves approximately 15ms of CPU per second on serialisation alone — and the reduced payload size decreases both Redis memory usage and network transfer time.
The trade-off: MessagePack binaries are not human-readable, which makes debugging cache content harder. We addressed this by building a diagnostic endpoint that serialises a cache entry to JSON on demand, used only for debugging.
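A minimal sketch of such an endpoint, assuming ASP.NET Core minimal APIs and MessagePack-CSharp's ConvertToJson helper (the route and names are illustrative, not the production endpoint):

// Hypothetical diagnostic endpoint; converts a MessagePack cache entry to JSON
app.MapGet("/debug/cache/{routeKey}", async (string routeKey, IDistributedCache redis) =>
{
    var bytes = await redis.GetAsync($"rates:{routeKey}");
    if (bytes is null)
        return Results.NotFound($"No cache entry for rates:{routeKey}");

    // MessagePackSerializer.ConvertToJson renders any MessagePack payload as JSON text
    return Results.Content(MessagePackSerializer.ConvertToJson(bytes), "application/json");
});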
Preventing Cache Stampede
A naive implementation has a stampede problem: when L2 expires for a popular route, dozens of concurrent requests simultaneously miss the cache and all hit the carrier API at once. This is the cache stampede (or thundering herd) problem.
Our fix uses a distributed lock:
private async Task<List<CarrierRate>> GetRatesWithLockAsync(RouteKey key, string cacheKey)
{
    var lockKey = $"lock:{cacheKey}";
    var lockToken = Guid.NewGuid().ToString();

    // Attempt to acquire the lock (SET NX with a 10s expiry prevents deadlock
    // if the holder crashes). Note: this uses the raw StackExchange.Redis
    // IDatabase (_redisDb), since IDistributedCache has no conditional SET.
    bool acquiredLock = await _redisDb.StringSetAsync(
        lockKey, lockToken, TimeSpan.FromSeconds(10), When.NotExists);

    if (acquiredLock)
    {
        try
        {
            var fresh = await _inner.GetRatesAsync(key);
            await WriteToCacheAsync(cacheKey, fresh);
            return fresh;
        }
        finally
        {
            // Release the lock only if we still own it
            await ReleaseLockAsync(lockKey, lockToken);
        }
    }
    else
    {
        // Another instance is fetching; wait briefly and read from cache
        await Task.Delay(200);
        var bytes = await _redis.GetAsync(cacheKey);
        if (bytes is not null)
            return MessagePackSerializer.Deserialize<List<CarrierRate>>(bytes);

        // If the cache is still empty after waiting, fetch directly (last resort)
        return await _inner.GetRatesAsync(key);
    }
}

This ensures that for any given route, only one server instance fetches from the carrier API when the cache expires. Other instances wait briefly and then read the freshly populated cache.
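The ReleaseLockAsync helper is not shown in the post. A common implementation, sketched here as an assumption, uses a Lua script so the ownership check and the delete happen atomically:

// Assumed implementation of the unshown ReleaseLockAsync helper.
// The Lua script makes "check token, then delete" atomic; a separate
// GET followed by DEL could delete a lock another instance has since acquired
private async Task ReleaseLockAsync(string lockKey, string lockToken)
{
    const string script = @"
        if redis.call('GET', KEYS[1]) == ARGV[1] then
            return redis.call('DEL', KEYS[1])
        else
            return 0
        end";

    await _redisDb.ScriptEvaluateAsync(
        script,
        new RedisKey[] { lockKey },
        new RedisValue[] { lockToken });
}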
Cache Invalidation via Redis Pub/Sub
TTL-based expiration works for normal operation, but carriers occasionally push proactive rate updates — price adjustments, surcharge changes, lane closures. When these events arrive, we want to invalidate the affected cache entries immediately rather than waiting for the TTL to expire.
Redis pub/sub provides the cross-instance invalidation channel:
// Publisher: runs when a carrier rate update event arrives.
// Property names match RateInvalidationEvent's casing, since
// System.Text.Json is case-sensitive by default.
await _subscriber.PublishAsync(
    "rate-invalidation",
    JsonSerializer.Serialize(new { Carrier = "MAERSK", Lanes = affectedLanes })
);

// Subscriber: registered on every API server instance
_subscriber.Subscribe("rate-invalidation", async (channel, message) =>
{
    var invalidation = JsonSerializer.Deserialize<RateInvalidationEvent>(message!);
    foreach (var lane in invalidation!.Lanes)
    {
        var cacheKey = $"rates:{lane}";

        // Remove from Redis L2
        await _redis.RemoveAsync(cacheKey);

        // Remove from in-process L1
        _local.Remove(cacheKey);

        _logger.LogInformation(
            "Invalidated cache for {Carrier} lane {Lane}",
            invalidation.Carrier, lane);
    }
});

The pub/sub pattern ensures all server instances invalidate their L1 cache simultaneously when a carrier pushes an update, not just the instance that received the carrier event. Without this, instances would serve inconsistent stale data until their L1 TTLs expired.
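The RateInvalidationEvent type is not shown in the post; a minimal shape consistent with the published payload, assumed for illustration:

// Assumed shape of the invalidation event; System.Text.Json matches property
// names case-sensitively by default, so publisher and subscriber must agree
// on casing (or set PropertyNameCaseInsensitive = true)
public sealed record RateInvalidationEvent(string Carrier, List<string> Lanes);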
The Invalidation Edge Case: What If the Invalidation Event Is Lost?
Redis pub/sub is fire-and-forget — messages sent when no subscriber is listening are lost. For our use case, a lost invalidation message means some instances serve stale rates until the 5-minute TTL expires. This was acceptable given our rate staleness SLA.
For use cases where invalidation correctness is critical, the pattern shifts to Redis Streams (persistent, durable message delivery) or a separate notification via a message bus with at-least-once delivery guarantees.
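For illustration, a sketch of the Streams variant using StackExchange.Redis; the stream, group, and field names are assumptions, and the consumer group must be created once (for example via StreamCreateConsumerGroupAsync):

// Publisher: XADD persists the event even if no consumer is connected yet
await _redisDb.StreamAddAsync("rate-invalidations", new NameValueEntry[]
{
    new("carrier", "MAERSK"),
    new("lanes", string.Join(',', affectedLanes))
});

// Consumer loop (per instance): XREADGROUP plus XACK gives at-least-once delivery
var entries = await _redisDb.StreamReadGroupAsync(
    "rate-invalidations", "invalidators", Environment.MachineName,
    StreamPosition.NewMessages, count: 10);

foreach (var entry in entries)
{
    // ... invalidate the affected L1/L2 keys using the fields in entry.Values ...
    await _redisDb.StreamAcknowledgeAsync("rate-invalidations", "invalidators", entry.Id);
}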
Redis Cluster Configuration
Production Redis configuration for this workload:
# redis.conf — production settings
maxmemory 6gb
maxmemory-policy allkeys-lru # Evict LRU keys when memory is full
activerehashing yes
# Persistence — we accept potential data loss on crash (caches regenerate)
appendonly no
save ""
# Cluster
cluster-enabled yes
cluster-node-timeout 5000
# Performance
hz 20 # Increased background task frequency for more responsive key expiry
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes

Key decisions:
- allkeys-lru eviction: when Redis memory fills (which it does with bursty traffic), we evict the least recently used cache entries. This is correct for a cache; evicted data is simply regenerated on the next miss.
- No persistence: this is a cache, not a database. A Redis restart should regenerate from the carrier API, not from potentially stale persisted data.
- 3-node cluster: primary/replica pairs for two nodes, with a third primary for distribution. Provides both read scaling and failure tolerance.
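On the application side, connecting to such a cluster with StackExchange.Redis might look like the following sketch (endpoint names are placeholders):

// Illustrative cluster connection; endpoint names are placeholders
var mux = await ConnectionMultiplexer.ConnectAsync(new ConfigurationOptions
{
    EndPoints = { "redis-1:6379", "redis-2:6379", "redis-3:6379" },
    AbortOnConnectFail = false, // keep retrying instead of failing at startup
    ConnectTimeout = 2000
});

// The cluster-aware client routes each key to the owning node automatically
IDatabase redisDb = mux.GetDatabase();

This is also where the raw IDatabase used by the lock and pub/sub code would come from, alongside the IDistributedCache registration.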
Operational Monitoring
Metrics we tracked in production:
// Custom metrics via OpenTelemetry.
// _hitRate and _requestCount are Counter<long> instruments created once
// from a Meter at startup (setup not shown).
private void RecordCacheMetrics(string cacheLevel, bool hit, string routeKey)
{
    // Guard against route keys shorter than the 6-character prefix
    var prefix = routeKey.Length >= 6 ? routeKey[..6] : routeKey;

    _hitRate.Add(hit ? 1 : 0,
        new KeyValuePair<string, object?>("level", cacheLevel),
        new KeyValuePair<string, object?>("route_prefix", prefix));

    _requestCount.Add(1,
        new KeyValuePair<string, object?>("level", cacheLevel));
}

Dashboards we built:
- L1 hit rate by route prefix (identifies routes that do not cache well)
- L2 hit rate by route prefix
- Carrier API call rate and error rate (the signal that matters most commercially)
- Redis memory utilisation per node
- Cache key count by TTL bucket (identifies unexpected cache accumulation)
- Invalidation event processing latency (how quickly pub/sub messages reach all instances)
Alert thresholds:
- Carrier API error rate > 1% for 2 minutes → page on-call (indicates carrier API degradation)
- L2 hit rate < 85% → warning (suggests TTL tuning opportunity or new traffic pattern)
- Redis memory > 75% per node → warning (approaching eviction pressure)
Results
- API cache hit rate: 94% (split: ~65% L1, ~29% L2)
- Carrier API calls reduced by ~90%
- P95 response time: 2,800ms → 85ms
- Redis cluster: 3-node, 6GB total — handling 2,000+ RPS comfortably
- Infrastructure cost: 40% reduction in carrier API spend from reduced call volume
- Zero production incidents attributable to the caching layer in 18 months of operation
The 85ms P95 response time breaks down approximately as 2ms for the L1 lookup, 8ms for the L2 lookup, and 75ms for the carrier API fetch and parsing (on requests that fall through to L2 or the origin). The roughly 65% of requests that hit L1 respond in under 5ms end-to-end.
When Not to Use This Pattern
Layered distributed caching is appropriate when:
- Data is read far more frequently than it is written
- Acceptable staleness window exists (seconds to minutes)
- Cache invalidation events can be reliably detected and published
- Multiple service instances share the same data requirements
It is not appropriate when:
- Data must be current to the millisecond (use direct reads or WebSocket subscriptions)
- Write volume is comparable to read volume (cache becomes a synchronisation problem)
- Data is user-specific and not shared across sessions (personalised caches are usually better served by client-side caching or session-scoped server state)
- Correctness is more important than performance and the data is complex to invalidate reliably
The pattern looks simple but the operational complexity — stampede prevention, cross-instance invalidation correctness, eviction strategy, monitoring — only becomes visible in production. Getting it right requires planning these concerns up front, not discovering them as incidents.
Muhammad Moid Shams is a Lead Software Engineer specialising in .NET, Azure, and distributed systems performance engineering.