Systems Design Reference

Distributed Systems
& High-Availability
Architecture

A comprehensive deep-dive into caching strategies, data partitioning, consistency models, and resilience patterns for building systems that scale to millions.

Caching Consistent Hashing CAP Theorem CRDT Circuit Breaker Saga Pattern Raft Consensus Two-Phase Commit Bulkhead Event Sourcing
SCROLL
01

Distributed Systems Foundations

The immovable laws and hard constraints every architect must internalize.

99.9%
Three Nines SLA
8.7h
Downtime / year
99.99%
Four Nines
52m
Downtime / year
99.999%
Five Nines
5.3m
Downtime / year

⚖️ CAP Theorem

A distributed system can guarantee at most two of the three properties simultaneously:

Consistency Availability Partition Tolerance

Since network partitions are inevitable in real distributed systems, you must choose between CP (HBase, Zookeeper, etcd) and AP (Cassandra, DynamoDB, CouchDB) systems. CA systems exist only in single-node or perfect-network scenarios.

💡

Modern view: PACELC extends CAP — even when the network is stable, there's a trade-off between latency and consistency. A lower-latency system often returns slightly stale data.

📜 Fallacies of Distributed Computing

Eight assumptions that kill systems in production (Deutsch, 1994):

Fallacy 1

The network is reliable

Fallacy 2

Latency is zero

Fallacy 3

Bandwidth is infinite

Fallacy 4

The network is secure

Fallacy 5

Topology doesn't change

Fallacy 6

There is one administrator

Fallacy 7

Transport cost is zero

Fallacy 8

The network is homogeneous

🧮 Key Distributed Systems Formulas

Amdahl's Law — Parallel speedup ceiling:

Speedup = 1 / (S + (1 - S) / N)

where S = serial fraction, N = number of processors

Little's Law — Queue depth:

L = λ × W

L = avg items in system, λ = arrival rate, W = avg wait time

Availability Composition:

A_series = A₁ × A₂ × ... × Aₙ

Each dependency multiplies downtime risk

Redundant Availability:

A_parallel = 1 − (1−A₁)(1−A₂)

Redundancy dramatically raises overall availability

02

Caching Strategies

Caching is the single highest-leverage latency reduction technique in distributed systems.

▸ Cache Topology — Multi-Layer Architecture
Client Browser / CDN Edge
API Gateway
+ Rate Limiter
App Server 1
App Server 2
App Server N
Redis Cluster
(L2 Shared Cache)
Memcached
(Session Cache)
Primary DB
Read Replica ×2
Object Store
📥

Cache-Aside (Lazy Loading)

Application code manages the cache manually. On a miss, load from DB and populate the cache. Most common pattern.

Read-heavy Low write Redis / Memcached

Pros: Only requested data is cached; resilient to cache failure.
Cons: Cold start penalty; potential for stale data.

📤

Write-Through

Write to cache and DB simultaneously. Cache is always in sync with the database.

Strong consistency Write overhead

Pros: No stale reads after write.
Cons: Write latency doubled; unused data clutters cache.

🔄

Write-Behind (Write-Back)

Write to cache immediately, async-flush to DB later. Dramatically reduces write latency for bursty workloads.

High throughput Eventual
⚠️

Risk: Data loss if cache crashes before flush. Requires durable queues (WAL / Kafka) as buffer.

🌊

Read-Through

Cache sits in front of DB; on a miss, the cache (not app code) fetches data. Clean separation of concerns.

Transparent CDN model
💡

Popular in CDN and HTTP proxy setups. The cache is the single interface — no dual-path logic in application code.

⏱️

Eviction Policies

How entries are removed when the cache is full:

LRU

Least Recently Used — evict coldest entry

LFU

Least Frequently Used — evict rarest access

ARC

Adaptive Replacement Cache — self-tuning

TTL

Time-To-Live — expiry-based eviction

SLRU

Segmented LRU — hot/warm/cold tiers

💥

Cache Failure Modes

Cache Stampede

Many requests miss simultaneously and overwhelm DB. Fix: probabilistic early expiry or mutex locks.

Cache Penetration

Requests for non-existent keys bypass cache every time. Fix: Bloom filters or null-value caching.

Cache Avalanche

Many keys expire at once causing a DB spike. Fix: jittered TTLs and staggered seeding.

Pattern — Cache-Aside with Stampede Protection (Python)
import random, time

def get_with_xlru(key, ttl, recompute_fn):
    # XFetch / probabilistic early recomputation
    entry = cache.get(key)
    if entry is not None:
        value, expiry, delta = entry
        # Recompute early with probability proportional to staleness
        now = time.time()
        if now - delta * log(random.random()) >= expiry:
            pass  # fall through and recompute
        else:
            return value

    start = time.time()
    value = recompute_fn()          # expensive DB call
    delta = time.time() - start
    cache.set(key, (value, time.time() + ttl, delta), ex=ttl)
    return value
Strategy Read Latency Write Latency Consistency Complexity Best For
Cache-Aside Low (hit) Normal Eventual Medium Read-heavy workloads
Write-Through Low 2× DB Strong Medium Read-after-write critical
Write-Behind Low Very Low Weak High Write-heavy, tolerate risk
Read-Through Low (hit) Normal Eventual Low CDN, proxy layers
03

Data Partitioning & Sharding

Split data across nodes to overcome single-machine limits on storage, memory, and throughput.

🔢

Range Partitioning

Keys are divided into ordered ranges. Each shard owns a contiguous segment.

📊

Example: HBase, MongoDB range sharding, PostgreSQL declarative partitioning.
Hotspot risk: Sequential inserts (timestamps) overwhelm the latest shard.

🎲

Hash Partitioning

A hash function distributes keys uniformly across shards, eliminating hotspots for uniform access patterns.

🔑

Trade-off: Range queries become scatter-gather across all shards. Good for KV stores like Cassandra.

🌐

Consistent Hashing

Map both nodes and keys to a ring. Only K/N keys are remapped when a node joins or leaves (not a full reshuffle).

Used by Cassandra, DynamoDB, Riak, and Memcached. Virtual nodes (vnodes) further smooth the load distribution.

📋

Directory-Based Sharding

A lookup service (shard map) tracks which shard owns each key. Maximum flexibility — resharding is just metadata update.

⚠️

The directory becomes a critical dependency. Must be highly available, fast, and durable (often stored in Zookeeper or etcd).

▸ Consistent Hashing Ring with Virtual Nodes
N1 N2 N3 N4 N5 K:user:42 K:order:7 K:cart:99 Hash Ring clockwise routing

Hot Spot Mitigation

Key Salting

Append a random suffix (e.g., user:7742:3) to spread writes across shards, then scatter-gather on reads.

Write Sharding

Celebrity / celebrity-key problem: partition a single viral entity across N shards using a fan-out writer.

Time-Based Bucketing

Include time component in partition key to avoid all writes landing on today's shard (e.g., event:2024-W21:hash).

Cross-Shard Challenges

Scatter-Gather

Queries touching multiple shards fan out, then the coordinator merges results. Increases tail latency significantly.

Distributed Joins

Typically avoided. Co-location (keeping related data on the same shard) is the standard answer.

Resharding

Double-write both old and new shards, verify consistency, cut over. Minimize impact with gradual traffic migration.

Consistent Hashing — Virtual Nodes Implementation (Go-style Pseudocode)
type Ring struct {
    replicas int                  // virtual nodes per physical node
    sorted   []int               // sorted hash ring positions
    hashMap  map[int]string       // position -> node address
}

func (r *Ring) AddNode(node string) {
    for i := 0; i < r.replicas; i++ {
        key := hash(fmt.Sprintf("%s#%d", node, i))
        r.sorted = insertSorted(r.sorted, key)
        r.hashMap[key] = node
    }
}

func (r *Ring) Lookup(k string) string {
    h := hash(k)
    idx := binarySearchCeil(r.sorted, h)  // clockwise next node
    return r.hashMap[r.sorted[idx % len(r.sorted)]]
}
04

Consistency Models

Understanding what "correct" means for distributed state is the foundation of every database design decision.

ModelGuaranteeLatency ImpactExamplesUse When
Linearizability Every operation appears atomic and ordered globally; reads always see latest write High etcd, Zookeeper, Spanner Locks, leader election, financial balances
Sequential Consistency All operations appear in some global order; each process's ops are in program order Medium Early CPUs, some DBs When global wall-clock ordering isn't required
Causal Consistency Causally related ops appear in order. Concurrent ops may be seen differently per node Low-Medium MongoDB sessions, COPS Social feeds, collaborative editing
Read-Your-Writes You always see your own writes; others may not immediately Low Cassandra LOCAL_QUORUM, DynamoDB User profile updates, preferences
Eventual Consistency If no new writes occur, all replicas converge to the same value eventually Lowest DNS, Cassandra ALL-relaxed, S3 View counts, analytics, carts
Monotonic Read Once you read a value, subsequent reads never return older values Low Sticky sessions, client routing Time-series, log tailing

🔐 Consensus Algorithms

Raft (used by etcd, CockroachDB, TiKV) — Designed for understandability:

Leader Election

Random election timeouts; first to get majority votes becomes leader.

Log Replication

Leader appends entries; commits once majority acknowledges.

Safety

A node cannot win election without all committed entries — no data loss.

Paxos — Theoretical foundation but notoriously hard to implement correctly. Multi-Paxos, Fast Paxos, and Flexible Paxos trade latency for quorum flexibility.

🔁 CRDTs — Conflict-Free Replicated Data Types

Data structures that can be concurrently updated on multiple replicas and automatically merged without conflicts.

G-Counter PN-Counter OR-Set LWW-Register RGA (Text)

Used in: Riak, Redis CRDT module, collaborative editors (Figma, Google Docs), shopping carts.

💡

CRDTs guarantee strong eventual consistency — all correct replicas that have received all updates are in the same state.

🕐 Quorum Reads & Writes (NWR Model)

In a system with N replicas, choose W (write quorum) and R (read quorum) such that:

W + R > N → Strong Consistency
W > N/2 → Write-Write conflict-free

Example: N=3, W=2, R=2 — classic quorum. Tolerates 1 failure with both reads and writes.

Setting W=1, R=N gives high write availability; W=N, R=1 gives strong durability.

Two-Phase Commit (2PC) — Coordinator Logic
async def two_phase_commit(coordinator, participants, txn):
    # ── Phase 1: Prepare ──────────────────────────────
    votes = await asyncio.gather(*[
        p.prepare(txn) for p in participants
    ], return_exceptions=True)

    if all(v == "YES" for v in votes):
        # ── Phase 2: Commit ───────────────────────────────
        coordinator.log_decision("COMMIT")    # durable WAL write
        await asyncio.gather(*[
            p.commit(txn) for p in participants
        ])
    else:
        coordinator.log_decision("ABORT")
        await asyncio.gather(*[
            p.rollback(txn) for p in participants
        ])

# ⚠️  2PC is blocking: if coordinator crashes after COMMIT log
#    but before sending, participants are stuck.
#    3PC and Paxos Commit address this but add round-trips.
05

Resiliency Patterns

Systems fail. Design for failure as a first-class concern, not an afterthought.

Circuit Breaker

Three states: Closed (normal), Open (failing, fast-fail), Half-Open (probe recovery).

Prevents cascading failures by stopping calls to a degraded service. Exponential backoff gates transition back to Closed.

HystrixResilience4jEnvoy
🔁

Retry with Backoff & Jitter

Automatically retry transient failures. Exponential backoff prevents thundering herd. Jitter (±random %) prevents synchronized retry storms.

delay = min(cap, base × 2ⁿ) × (0.5 + rand(0, 0.5))
🧱

Bulkhead

Isolate resources (thread pools, connection pools, semaphores) per consumer/service. One slow dependency cannot exhaust shared resources and bring down everything else.

🚢

Named after watertight compartments in a ship — a breach in one bulkhead doesn't sink the vessel.

Timeout & Deadline Propagation

Every network call must have a timeout. Critically, propagate a context deadline down the entire call chain so downstream services abort work that the client will never use.

⚠️

A timeout without deadline propagation causes "orphan work" — resources consumed by abandoned requests.

🏃

Rate Limiting & Load Shedding

Protect services from overload. Common algorithms:

Token Bucket Leaky Bucket Fixed Window Sliding Window

Load shedding actively drops low-priority work under pressure to preserve capacity for high-priority requests.

🪞

Fallback & Graceful Degradation

Return a cached response, default value, or reduced-feature response when the primary path fails. Always prefer partial availability over hard failure.

💡

Design user experiences around degraded states: "Recommendations unavailable, showing popular items instead."

▸ Circuit Breaker State Machine
CLOSED — Normal Operation
requests pass through, failure counter accumulates
failure threshold exceeded
OPEN — Fast Fail
all calls rejected immediately; reset timer starts
reset timeout elapsed
HALF-OPEN — Probe
limited requests allowed; success → CLOSED, fail → OPEN
Token Bucket Rate Limiter (Redis Lua Script)
-- KEYS[1] = bucket key, ARGV[1] = capacity, ARGV[2] = refill/sec, ARGV[3] = requested tokens
local key       = KEYS[1]
local capacity  = tonumber(ARGV[1])
local refill    = tonumber(ARGV[2])
local requested = tonumber(ARGV[3])
local now       = redis.call('TIME')
local t         = now[1] + now[2] / 1e6

local last   = tonumber(redis.call('HGET', key, 'last')) or t
local tokens = tonumber(redis.call('HGET', key, 'tokens')) or capacity

-- Refill proportional to elapsed time
tokens = math.min(capacity, tokens + (t - last) * refill)

if tokens >= requested then
    redis.call('HMSET', key, 'tokens', tokens - requested, 'last', t)
    redis.call('EXPIRE', key, 3600)
    return 1   -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last', t)
    return 0   -- throttled
end
06

Architectural Patterns

High-level blueprints for structuring distributed systems with clear trade-offs.

📮 Saga Pattern

Manage long-running transactions across microservices without distributed locks. Each step publishes an event; failures trigger compensating transactions.

Choreography

Services react to events with no central coordinator. Simple but hard to trace/debug at scale.

Orchestration

A Saga Orchestrator drives the workflow, calling each service in turn. Easier to observe; single point of failure risk.

⚠️

Sagas guarantee eventual consistency, not ACID atomicity. Design compensating transactions to be idempotent.

📝 Event Sourcing + CQRS

Event Sourcing: Never mutate state — store an immutable sequence of facts. Rebuild current state by replaying the event log.

CQRS: Separate read and write models. Write side handles commands and emits events; read side maintains denormalized projections optimized for queries.

Kafka EventStore Axon Projections
💡

The event log is your audit trail, time-travel debugger, and the source of truth. Read models are derived and discardable.

🔄 Outbox Pattern

Atomically write a DB record and an outbox event in the same local transaction. A relay process (Debezium CDC) polls the outbox and publishes to the message broker. Eliminates dual-write problems.

-- Atomic: order write + outbox in same txn
BEGIN;
  INSERT INTO orders (id, ...) VALUES (...);
  INSERT INTO outbox (event_type, payload)
    VALUES ('OrderPlaced', '{"id":42,...}');
COMMIT;

🔐 Distributed Locking

For mutual exclusion across services. Options ranked by safety:

Redlock (Redis)

Lock majority of N Redis masters. Watch for clock skew; GC pauses can cause false expiry.

etcd / Zookeeper

Strongly consistent; use ephemeral nodes + watches for lease-based locks. Safer than Redis.

DB Serializable Txn

SELECT FOR UPDATE on a row. Simple, safe, but limited to a single DB instance.

▸ Choreography Saga — E-Commerce Order Flow
Order Service
→ OrderCreated →
Payment Service
PaymentProcessed / Failed
Inventory Service
Shipping Service
← ReserveInventory FAIL →
Compensate: Refund
07

Observability & SLOs

You cannot run distributed systems without deep observability. Instrument everything.

📊

The Three Pillars

Metrics

Aggregated numerical data over time — CPU, request rate, error rate, latency percentiles (p50/p95/p99). Prometheus + Grafana.

Logs

Structured (JSON) timestamped events. Correlation IDs link log lines across services. ELK / Loki stack.

Traces

End-to-end request journey across services with spans. Identify bottlenecks in complex call graphs. Jaeger / Zipkin / Tempo.

🎯

SLIs, SLOs & Error Budgets

SLI — The actual metric (e.g., request success rate, p99 latency).

SLO — Target for the SLI (e.g., "99.9% requests <200ms over 30 days").

Error Budget = 1 − SLO. The amount of unreliability you're allowed to spend on risky deployments. Once exhausted, freeze releases.

💡

Error budgets align engineering incentives — reliability work gets prioritized automatically when budgets burn down.

⚗️

USE & RED Methods

USE (Brendan Gregg — for resources):

Utilization Saturation Errors

RED (Tom Wilkie — for services):

Rate Errors Duration

RED aligns directly with the user experience. Always instrument your services with RED metrics at every layer.

Failure TypeDetection SignalMitigationRecovery
Node Crash Missing heartbeat, health check failure Replication, redundancy Automatic failover via leader election
Network Partition Increased timeouts, split-brain detection Fencing tokens, majority quorum Manual resolution or automatic merge (CRDT)
Slow Service (Grey Failure) p99 latency spike, queue depth rising Timeout, circuit breaker, bulkhead Auto-scaling, load shedding
Cascade Failure Error rate propagating upstream Circuit breaker, fallback Gradual traffic ramp-up
Data Corruption Checksum mismatch, unexpected values Checksums, replication Restore from last good snapshot + WAL replay
🏗️

Chaos Engineering — Deliberately inject failures in production (or staging) to find weaknesses before users do. Netflix Chaos Monkey pioneered this. Modern tools: Gremlin, Chaos Mesh, Litmus. The goal is building confidence in the system's resiliency through empirical validation — not hoping it works.