Distributed Systems & High-Availability Architecture

Distributed Systems Foundations

The immovable laws and hard constraints every architect must internalize.

99.9%

Three Nines SLA

8.7h

Downtime / year

99.99%

Four Nines

52m

Downtime / year

99.999%

Five Nines

5.3m

Downtime / year

⚖️ CAP Theorem

A distributed system can guarantee at most two of the three properties simultaneously:

Consistency Availability Partition Tolerance

Since network partitions are inevitable in real distributed systems, you must choose between CP (HBase, Zookeeper, etcd) and AP (Cassandra, DynamoDB, CouchDB) systems. CA systems exist only in single-node or perfect-network scenarios.

💡

Modern view: PACELC extends CAP — even when the network is stable, there's a trade-off between latency and consistency. A lower-latency system often returns slightly stale data.

📜 Fallacies of Distributed Computing

Eight assumptions that kill systems in production (Deutsch, 1994):

Fallacy 1

The network is reliable

Fallacy 2

Latency is zero

Fallacy 3

Bandwidth is infinite

Fallacy 4

The network is secure

Fallacy 5

Topology doesn't change

Fallacy 6

There is one administrator

Fallacy 7

Transport cost is zero

Fallacy 8

The network is homogeneous

🧮 Key Distributed Systems Formulas

Amdahl's Law — Parallel speedup ceiling:

Speedup = 1 / (S + (1 - S) / N)

where S = serial fraction, N = number of processors

Little's Law — Queue depth:

L = λ × W

L = avg items in system, λ = arrival rate, W = avg wait time

Availability Composition:

A_series = A₁ × A₂ × ... × Aₙ

Each dependency multiplies downtime risk

Redundant Availability:

A_parallel = 1 − (1−A₁)(1−A₂)

Redundancy dramatically raises overall availability

Caching Strategies

Caching is the single highest-leverage latency reduction technique in distributed systems.

▸ Cache Topology — Multi-Layer Architecture

Client Browser / CDN Edge

↓

API Gateway
+ Rate Limiter

↓

App Server 1

App Server 2

App Server N

↓

Redis Cluster
(L2 Shared Cache)

Memcached
(Session Cache)

↓

Primary DB

Read Replica ×2

Object Store

📥

Cache-Aside (Lazy Loading)

Application code manages the cache manually. On a miss, load from DB and populate the cache. Most common pattern.

Read-heavy Low write Redis / Memcached

⚡

Pros: Only requested data is cached; resilient to cache failure.
Cons: Cold start penalty; potential for stale data.

📤

Write-Through

Write to cache and DB simultaneously. Cache is always in sync with the database.

Strong consistency Write overhead

⚡

Pros: No stale reads after write.
Cons: Write latency doubled; unused data clutters cache.

🔄

Write-Behind (Write-Back)

Write to cache immediately, async-flush to DB later. Dramatically reduces write latency for bursty workloads.

High throughput Eventual

⚠️

Risk: Data loss if cache crashes before flush. Requires durable queues (WAL / Kafka) as buffer.

🌊

Read-Through

Cache sits in front of DB; on a miss, the cache (not app code) fetches data. Clean separation of concerns.

Transparent CDN model

💡

Popular in CDN and HTTP proxy setups. The cache is the single interface — no dual-path logic in application code.

⏱️

Eviction Policies

How entries are removed when the cache is full:

LRU

Least Recently Used — evict coldest entry

LFU

Least Frequently Used — evict rarest access

ARC

Adaptive Replacement Cache — self-tuning

TTL

Time-To-Live — expiry-based eviction

SLRU

Segmented LRU — hot/warm/cold tiers

💥

Cache Failure Modes

Cache Stampede

Many requests miss simultaneously and overwhelm DB. Fix: probabilistic early expiry or mutex locks.

Cache Penetration

Requests for non-existent keys bypass cache every time. Fix: Bloom filters or null-value caching.

Cache Avalanche

Many keys expire at once causing a DB spike. Fix: jittered TTLs and staggered seeding.

Pattern — Cache-Aside with Stampede Protection (Python)

import random, time

def get_with_xlru(key, ttl, recompute_fn):
    # XFetch / probabilistic early recomputation
    entry = cache.get(key)
    if entry is not None:
        value, expiry, delta = entry
        # Recompute early with probability proportional to staleness
        now = time.time()
        if now - delta * log(random.random()) >= expiry:
            pass  # fall through and recompute
        else:
            return value

    start = time.time()
    value = recompute_fn()          # expensive DB call
    delta = time.time() - start
    cache.set(key, (value, time.time() + ttl, delta), ex=ttl)
    return value

Strategy	Read Latency	Write Latency	Consistency	Complexity	Best For
Cache-Aside	Low (hit)	Normal	Eventual	Medium	Read-heavy workloads
Write-Through	Low	2× DB	Strong	Medium	Read-after-write critical
Write-Behind	Low	Very Low	Weak	High	Write-heavy, tolerate risk
Read-Through	Low (hit)	Normal	Eventual	Low	CDN, proxy layers

Data Partitioning & Sharding

Split data across nodes to overcome single-machine limits on storage, memory, and throughput.

🔢

Range Partitioning

Keys are divided into ordered ranges. Each shard owns a contiguous segment.

📊

Example: HBase, MongoDB range sharding, PostgreSQL declarative partitioning.
Hotspot risk: Sequential inserts (timestamps) overwhelm the latest shard.

🎲

Hash Partitioning

A hash function distributes keys uniformly across shards, eliminating hotspots for uniform access patterns.

🔑

Trade-off: Range queries become scatter-gather across all shards. Good for KV stores like Cassandra.

🌐

Consistent Hashing

Map both nodes and keys to a ring. Only K/N keys are remapped when a node joins or leaves (not a full reshuffle).

⭕

Used by Cassandra, DynamoDB, Riak, and Memcached. Virtual nodes (vnodes) further smooth the load distribution.

📋

Directory-Based Sharding

A lookup service (shard map) tracks which shard owns each key. Maximum flexibility — resharding is just metadata update.

⚠️

The directory becomes a critical dependency. Must be highly available, fast, and durable (often stored in Zookeeper or etcd).

▸ Consistent Hashing Ring with Virtual Nodes

Hot Spot Mitigation

Key Salting

Append a random suffix (e.g., user:7742:3) to spread writes across shards, then scatter-gather on reads.

Write Sharding

Celebrity / celebrity-key problem: partition a single viral entity across N shards using a fan-out writer.

Time-Based Bucketing

Include time component in partition key to avoid all writes landing on today's shard (e.g., event:2024-W21:hash).

Cross-Shard Challenges

Scatter-Gather

Queries touching multiple shards fan out, then the coordinator merges results. Increases tail latency significantly.

Distributed Joins

Typically avoided. Co-location (keeping related data on the same shard) is the standard answer.

Resharding

Double-write both old and new shards, verify consistency, cut over. Minimize impact with gradual traffic migration.

Consistent Hashing — Virtual Nodes Implementation (Go-style Pseudocode)

type Ring struct {
    replicas int                  // virtual nodes per physical node
    sorted   []int               // sorted hash ring positions
    hashMap  map[int]string       // position -> node address
}

func (r *Ring) AddNode(node string) {
    for i := 0; i < r.replicas; i++ {
        key := hash(fmt.Sprintf("%s#%d", node, i))
        r.sorted = insertSorted(r.sorted, key)
        r.hashMap[key] = node
    }
}

func (r *Ring) Lookup(k string) string {
    h := hash(k)
    idx := binarySearchCeil(r.sorted, h)  // clockwise next node
    return r.hashMap[r.sorted[idx % len(r.sorted)]]
}

Consistency Models

Understanding what "correct" means for distributed state is the foundation of every database design decision.

Model	Guarantee	Latency Impact	Examples	Use When
Linearizability	Every operation appears atomic and ordered globally; reads always see latest write	High	etcd, Zookeeper, Spanner	Locks, leader election, financial balances
Sequential Consistency	All operations appear in some global order; each process's ops are in program order	Medium	Early CPUs, some DBs	When global wall-clock ordering isn't required
Causal Consistency	Causally related ops appear in order. Concurrent ops may be seen differently per node	Low-Medium	MongoDB sessions, COPS	Social feeds, collaborative editing
Read-Your-Writes	You always see your own writes; others may not immediately	Low	Cassandra LOCAL_QUORUM, DynamoDB	User profile updates, preferences
Eventual Consistency	If no new writes occur, all replicas converge to the same value eventually	Lowest	DNS, Cassandra ALL-relaxed, S3	View counts, analytics, carts
Monotonic Read	Once you read a value, subsequent reads never return older values	Low	Sticky sessions, client routing	Time-series, log tailing

🔐 Consensus Algorithms

Raft (used by etcd, CockroachDB, TiKV) — Designed for understandability:

Leader Election

Random election timeouts; first to get majority votes becomes leader.

Log Replication

Leader appends entries; commits once majority acknowledges.

Safety

A node cannot win election without all committed entries — no data loss.

Paxos — Theoretical foundation but notoriously hard to implement correctly. Multi-Paxos, Fast Paxos, and Flexible Paxos trade latency for quorum flexibility.

🔁 CRDTs — Conflict-Free Replicated Data Types

Data structures that can be concurrently updated on multiple replicas and automatically merged without conflicts.

G-Counter PN-Counter OR-Set LWW-Register RGA (Text)

Used in: Riak, Redis CRDT module, collaborative editors (Figma, Google Docs), shopping carts.

💡

CRDTs guarantee strong eventual consistency — all correct replicas that have received all updates are in the same state.

🕐 Quorum Reads & Writes (NWR Model)

In a system with N replicas, choose W (write quorum) and R (read quorum) such that:

W + R > N → Strong Consistency

W > N/2 → Write-Write conflict-free

Example: N=3, W=2, R=2 — classic quorum. Tolerates 1 failure with both reads and writes.

Setting W=1, R=N gives high write availability; W=N, R=1 gives strong durability.

Two-Phase Commit (2PC) — Coordinator Logic

async def two_phase_commit(coordinator, participants, txn):
    # ── Phase 1: Prepare ──────────────────────────────
    votes = await asyncio.gather(*[
        p.prepare(txn) for p in participants
    ], return_exceptions=True)

    if all(v == "YES" for v in votes):
        # ── Phase 2: Commit ───────────────────────────────
        coordinator.log_decision("COMMIT")    # durable WAL write
        await asyncio.gather(*[
            p.commit(txn) for p in participants
        ])
    else:
        coordinator.log_decision("ABORT")
        await asyncio.gather(*[
            p.rollback(txn) for p in participants
        ])

# ⚠️  2PC is blocking: if coordinator crashes after COMMIT log
#    but before sending, participants are stuck.
#    3PC and Paxos Commit address this but add round-trips.

Resiliency Patterns

Systems fail. Design for failure as a first-class concern, not an afterthought.

⚡

Circuit Breaker

Three states: Closed (normal), Open (failing, fast-fail), Half-Open (probe recovery).

Prevents cascading failures by stopping calls to a degraded service. Exponential backoff gates transition back to Closed.

HystrixResilience4jEnvoy

🔁

Retry with Backoff & Jitter

Automatically retry transient failures. Exponential backoff prevents thundering herd. Jitter (±random %) prevents synchronized retry storms.

delay = min(cap, base × 2ⁿ) × (0.5 + rand(0, 0.5))

🧱

Bulkhead

Isolate resources (thread pools, connection pools, semaphores) per consumer/service. One slow dependency cannot exhaust shared resources and bring down everything else.

🚢

Named after watertight compartments in a ship — a breach in one bulkhead doesn't sink the vessel.

⏳

Timeout & Deadline Propagation

Every network call must have a timeout. Critically, propagate a context deadline down the entire call chain so downstream services abort work that the client will never use.

⚠️

A timeout without deadline propagation causes "orphan work" — resources consumed by abandoned requests.

🏃

Rate Limiting & Load Shedding

Protect services from overload. Common algorithms:

Token Bucket Leaky Bucket Fixed Window Sliding Window

Load shedding actively drops low-priority work under pressure to preserve capacity for high-priority requests.

🪞

Fallback & Graceful Degradation

Return a cached response, default value, or reduced-feature response when the primary path fails. Always prefer partial availability over hard failure.

💡

Design user experiences around degraded states: "Recommendations unavailable, showing popular items instead."

▸ Circuit Breaker State Machine

CLOSED — Normal Operation
requests pass through, failure counter accumulates

↓ failure threshold exceeded

OPEN — Fast Fail
all calls rejected immediately; reset timer starts

↓ reset timeout elapsed

HALF-OPEN — Probe
limited requests allowed; success → CLOSED, fail → OPEN

Token Bucket Rate Limiter (Redis Lua Script)

-- KEYS[1] = bucket key, ARGV[1] = capacity, ARGV[2] = refill/sec, ARGV[3] = requested tokens
local key       = KEYS[1]
local capacity  = tonumber(ARGV[1])
local refill    = tonumber(ARGV[2])
local requested = tonumber(ARGV[3])
local now       = redis.call('TIME')
local t         = now[1] + now[2] / 1e6

local last   = tonumber(redis.call('HGET', key, 'last')) or t
local tokens = tonumber(redis.call('HGET', key, 'tokens')) or capacity

-- Refill proportional to elapsed time
tokens = math.min(capacity, tokens + (t - last) * refill)

if tokens >= requested then
    redis.call('HMSET', key, 'tokens', tokens - requested, 'last', t)
    redis.call('EXPIRE', key, 3600)
    return 1   -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last', t)
    return 0   -- throttled
end

Architectural Patterns

High-level blueprints for structuring distributed systems with clear trade-offs.

📮 Saga Pattern

Manage long-running transactions across microservices without distributed locks. Each step publishes an event; failures trigger compensating transactions.

Choreography

Services react to events with no central coordinator. Simple but hard to trace/debug at scale.

Orchestration

A Saga Orchestrator drives the workflow, calling each service in turn. Easier to observe; single point of failure risk.

⚠️

Sagas guarantee eventual consistency, not ACID atomicity. Design compensating transactions to be idempotent.

📝 Event Sourcing + CQRS

Event Sourcing: Never mutate state — store an immutable sequence of facts. Rebuild current state by replaying the event log.

CQRS: Separate read and write models. Write side handles commands and emits events; read side maintains denormalized projections optimized for queries.

Kafka EventStore Axon Projections

💡

The event log is your audit trail, time-travel debugger, and the source of truth. Read models are derived and discardable.

🔄 Outbox Pattern

Atomically write a DB record and an outbox event in the same local transaction. A relay process (Debezium CDC) polls the outbox and publishes to the message broker. Eliminates dual-write problems.

-- Atomic: order write + outbox in same txn
BEGIN;
  INSERT INTO orders (id, ...) VALUES (...);
  INSERT INTO outbox (event_type, payload)
    VALUES ('OrderPlaced', '{"id":42,...}');
COMMIT;

🔐 Distributed Locking

For mutual exclusion across services. Options ranked by safety:

Redlock (Redis)

Lock majority of N Redis masters. Watch for clock skew; GC pauses can cause false expiry.

etcd / Zookeeper

Strongly consistent; use ephemeral nodes + watches for lease-based locks. Safer than Redis.

DB Serializable Txn

SELECT FOR UPDATE on a row. Simple, safe, but limited to a single DB instance.

▸ Choreography Saga — E-Commerce Order Flow

Order Service

→ OrderCreated →

Payment Service

→

PaymentProcessed / Failed

→

Inventory Service

→

Shipping Service

← ReserveInventory FAIL →

Compensate: Refund

Observability & SLOs

You cannot run distributed systems without deep observability. Instrument everything.

📊

The Three Pillars

Metrics

Aggregated numerical data over time — CPU, request rate, error rate, latency percentiles (p50/p95/p99). Prometheus + Grafana.

Logs

Structured (JSON) timestamped events. Correlation IDs link log lines across services. ELK / Loki stack.

Traces

End-to-end request journey across services with spans. Identify bottlenecks in complex call graphs. Jaeger / Zipkin / Tempo.

🎯

SLIs, SLOs & Error Budgets

SLI — The actual metric (e.g., request success rate, p99 latency).

SLO — Target for the SLI (e.g., "99.9% requests <200ms over 30 days").

Error Budget = 1 − SLO. The amount of unreliability you're allowed to spend on risky deployments. Once exhausted, freeze releases.

💡

Error budgets align engineering incentives — reliability work gets prioritized automatically when budgets burn down.

⚗️

USE & RED Methods

USE (Brendan Gregg — for resources):

Utilization Saturation Errors

RED (Tom Wilkie — for services):

Rate Errors Duration

RED aligns directly with the user experience. Always instrument your services with RED metrics at every layer.

Failure Type	Detection Signal	Mitigation	Recovery
Node Crash	Missing heartbeat, health check failure	Replication, redundancy	Automatic failover via leader election
Network Partition	Increased timeouts, split-brain detection	Fencing tokens, majority quorum	Manual resolution or automatic merge (CRDT)
Slow Service (Grey Failure)	p99 latency spike, queue depth rising	Timeout, circuit breaker, bulkhead	Auto-scaling, load shedding
Cascade Failure	Error rate propagating upstream	Circuit breaker, fallback	Gradual traffic ramp-up
Data Corruption	Checksum mismatch, unexpected values	Checksums, replication	Restore from last good snapshot + WAL replay

🏗️

Chaos Engineering — Deliberately inject failures in production (or staging) to find weaknesses before users do. Netflix Chaos Monkey pioneered this. Modern tools: Gremlin, Chaos Mesh, Litmus. The goal is building confidence in the system's resiliency through empirical validation — not hoping it works.

Distributed Systems& High-AvailabilityArchitecture

Distributed Systems Foundations

⚖️ CAP Theorem

📜 Fallacies of Distributed Computing

🧮 Key Distributed Systems Formulas

Caching Strategies

Cache-Aside (Lazy Loading)

Write-Through

Write-Behind (Write-Back)

Read-Through

Eviction Policies

Cache Failure Modes

Data Partitioning & Sharding

Range Partitioning

Hash Partitioning

Consistent Hashing

Directory-Based Sharding

Hot Spot Mitigation

Cross-Shard Challenges

Consistency Models

🔐 Consensus Algorithms

🔁 CRDTs — Conflict-Free Replicated Data Types

🕐 Quorum Reads & Writes (NWR Model)

Resiliency Patterns

Circuit Breaker

Retry with Backoff & Jitter

Bulkhead

Timeout & Deadline Propagation

Rate Limiting & Load Shedding

Fallback & Graceful Degradation

Architectural Patterns

📮 Saga Pattern

📝 Event Sourcing + CQRS

🔄 Outbox Pattern

🔐 Distributed Locking

Observability & SLOs

The Three Pillars

SLIs, SLOs & Error Budgets

USE & RED Methods

Distributed Systems
& High-Availability
Architecture