Distributed Systems Course

Welcome to the Distributed Systems Course! This course will take you from foundational concepts to building a working consensus-based system.

Why Learn Distributed Systems?

Distributed systems are everywhere. Every time you use a modern web service, you're interacting with a distributed system:

  • Social media platforms handling billions of users
  • E-commerce sites processing millions of transactions
  • Streaming services delivering content globally
  • Cloud databases storing and replicating data across continents

Understanding distributed systems is essential for building scalable, reliable applications.

Course Overview

This course teaches distributed systems concepts through hands-on implementation. Over 10 sessions, you will build four progressively complex distributed applications:

Application              Sessions   Concepts
Queue/Work System        1-2        Producer-consumer, message passing, fault tolerance
Store with Replication   3-5        Partitioning, CAP theorem, leader election, consistency
Chat System              6-7        WebSockets, pub/sub, message ordering
Consensus System         8-10       Raft algorithm, log replication, state machine

What You'll Learn

By the end of this course, you will be able to:

  1. Explain distributed systems concepts including CAP theorem, consistency models, and consensus
  2. Build a working message queue system with producer-consumer pattern
  3. Implement a replicated key-value store with leader election
  4. Create a real-time chat system with pub/sub messaging
  5. Develop a consensus-based system using the Raft algorithm
  6. Deploy all systems using Docker Compose on your local machine

Target Audience

This course is designed for developers who:

  • Have basic programming experience (functions, classes, basic OOP)
  • Are new to distributed systems
  • Want to understand how modern distributed applications work
  • Prefer learning by doing over pure theory

Prerequisites

  • Programming: Comfortable with either TypeScript or Python
  • Command Line: Basic familiarity with terminal commands
  • Docker: We'll cover Docker setup in the Docker Setup section

No prior distributed systems experience is required!

Course Progression

graph TB
    subgraph "Part I: Fundamentals"
        A1[What is a DS?] --> A2[Message Passing]
        A2 --> A3[Queue System]
    end

    subgraph "Part II: Data Store"
        B1[Partitioning] --> B2[CAP Theorem]
        B2 --> B3[Replication]
        B3 --> B4[Consistency]
    end

    subgraph "Part III: Real-Time"
        C1[WebSockets] --> C2[Pub/Sub]
        C2 --> C3[Chat System]
    end

    subgraph "Part IV: Consensus"
        D1[What is Consensus?] --> D2[Raft Algorithm]
        D2 --> D3[Leader Election]
        D3 --> D4[Log Replication]
        D4 --> D5[Consensus System]
    end

    A3 --> B1
    B4 --> C1
    C3 --> D1

Course Format

Each 1.5-hour session follows this structure:

graph LR
    A[Review<br/>5 min] --> B[Concept<br/>20 min]
    B --> C[Diagram<br/>10 min]
    C --> D[Demo<br/>15 min]
    D --> E[Exercise<br/>25 min]
    E --> F[Test<br/>10 min]
    F --> G[Summary<br/>5 min]

Session Components

  • Concept Explanation: Clear, beginner-friendly explanations of core concepts
  • Visual Diagrams: Mermaid diagrams showing architecture and data flow
  • Live Demo: Step-by-step code walkthrough
  • Hands-on Exercise: Practical exercises to reinforce learning
  • Run & Test: Verify your implementation works correctly

Code Examples

Every concept includes implementations in both TypeScript and Python:

// TypeScript example
interface Message {
  id: string;
  content: string;
}

# Python example
@dataclass
class Message:
    id: str
    content: str

Choose the language you're most comfortable with, or learn both!

Before You Begin

1. Set Up Your Environment

Follow the Docker Setup Guide to install:

  • Docker and Docker Compose
  • Your preferred programming language (TypeScript or Python)

2. Verify Your Installation

docker --version
docker-compose --version

3. Choose Your Language

Decide whether you'll work with TypeScript or Python throughout the course. Both languages have complete examples for every concept.

Learning Tips

  • Don't rush: Each concept builds on the previous ones
  • Run the code: Follow along with the examples in your terminal
  • Experiment: Modify the code and see what happens
  • Ask questions: Use the troubleshooting guide when stuck
  • Build in public: Share your progress and learn from others

What You'll Build

By the end of this course, you'll have four working distributed systems:

  1. Queue System - A fault-tolerant task processing system
  2. Replicated Store - A key-value store with leader election
  3. Chat System - A real-time messaging system with presence
  4. Consensus System - A Raft-based distributed database

All systems run locally using Docker Compose—no cloud infrastructure required!

Let's Get Started!

Ready to dive in? Continue to Chapter 1: What is a Distributed System?

What is a Distributed System?

Session 1, Part 1 - 20 minutes

Learning Objectives

  • Define what a distributed system is
  • Identify key characteristics of distributed systems
  • Understand why distributed systems matter
  • Recognize distributed systems in everyday life

Definition

A distributed system is a collection of independent computers that appears to its users as a single coherent system.

graph TB
    subgraph "Users See"
        Single["Single System"]
    end

    subgraph "Reality"
        N1["Node 1"]
        N2["Node 2"]
        N3["Node 3"]
        N4["Node N"]

        N1 <--> N2
        N2 <--> N3
        N3 <--> N4
        N4 <--> N1
    end

    Single -->|"appears as"| N1
    Single -->|"appears as"| N2
    Single -->|"appears as"| N3

Key Insight

The defining characteristic is the illusion of unity—users interact with what seems like one system, while behind the scenes, multiple machines work together.

Three Key Characteristics

According to Leslie Lamport, a distributed system is:

"One in which the failure of a computer you didn't even know existed can render your own computer unusable."

This quip captures the pain of independent failure. More formally, distributed systems share three fundamental characteristics:

1. Concurrency (Multiple Things Happen At Once)

Multiple components execute simultaneously, leading to complex interactions.

sequenceDiagram
    participant U as User
    participant A as Server A
    participant B as Server B
    participant C as Server C

    U->>A: Request
    A->>B: Query
    A->>C: Update
    B-->>A: Response
    C-->>A: Ack
    A-->>U: Result

2. No Global Clock

Each node has its own clock. There's no single "now" across the system.

graph LR
    A[Clock A: 10:00:01.123]
    B[Clock B: 10:00:02.456]
    C[Clock C: 09:59:59.789]

    A -.->|network latency| B
    B -.->|network latency| C
    C -.->|network latency| A

Implication: You can't rely on timestamps to order events across nodes. You need logical clocks (more on this in later sessions!).
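
As a preview, here is a minimal Lamport clock sketch in TypeScript: each node keeps a counter that it increments on local events and advances past any timestamp it receives, which yields a consistent event ordering without any wall clock.

// Minimal Lamport clock sketch (logical clocks are covered in a later session)
class LamportClock {
  private time = 0;

  // Local event or message send: advance the counter
  tick(): number {
    return ++this.time;
  }

  // Message receipt: jump past the sender's timestamp
  receive(remoteTime: number): number {
    this.time = Math.max(this.time, remoteTime) + 1;
    return this.time;
  }
}

const a = new LamportClock();
const b = new LamportClock();
const sent = a.tick();        // node A sends at logical time 1
console.log(b.receive(sent)); // node B's clock jumps to 2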

3. Independent Failure

Components can fail independently. When one part fails, the rest may continue—or may become unusable.

stateDiagram-v2
    [*] --> AllHealthy: System Start
    AllHealthy --> PartialFailure: One Node Fails
    AllHealthy --> CompleteFailure: Critical Nodes Fail
    PartialFailure --> AllHealthy: Recovery
    PartialFailure --> CompleteFailure: Cascading Failure
    CompleteFailure --> [*]

Why Distributed Systems?

Scalability

Vertical Scaling (Scale Up):

  • Add more resources to a single machine
  • Eventually hits hardware/cost limits

Horizontal Scaling (Scale Out):

  • Add more machines to the system
  • Virtually unlimited scaling potential

graph TB
    subgraph "Vertical Scaling"
        Big[Big Expensive Server<br/>$100,000]
    end

    subgraph "Horizontal Scaling"
        S1[Commodity Server<br/>$1,000]
        S2[Commodity Server<br/>$1,000]
        S3[Commodity Server<br/>$1,000]
        S4[...]
    end

Reliability & Availability

A single point of failure is unacceptable for critical services:

graph TB
    subgraph "Single System"
        S[Single Server]
        S -.-> X[❌ Failure = No Service]
    end

    subgraph "Distributed System"
        N1[Node 1]
        N2[Node 2]
        N3[Node 3]

        N1 <--> N2
        N2 <--> N3
        N3 <--> N1

        N1 -.-> X2[❌ One Fails]
        X2 --> OK[✓ Others Continue]
    end

Latency (Geographic Distribution)

Placing data closer to users improves experience:

graph TB
    User[User in NYC]

    subgraph "Global Distribution"
        NYC[NYC Datacenter<br/>10ms latency]
        LON[London Datacenter<br/>70ms latency]
        TKY[Tokyo Datacenter<br/>150ms latency]
    end

    User --> NYC
    User -.-> LON
    User -.-> TKY

    NYC <--> LON
    LON <--> TKY
    TKY <--> NYC

Examples of Distributed Systems

Everyday Examples

System           Description                                   Benefit
Web Search       Query servers, index servers, cache servers   Fast responses, always available
Streaming Video  Content delivery networks (CDNs)              Low latency, high quality
Online Shopping  Product catalog, cart, payment, inventory     Handles traffic spikes
Social Media     Posts, comments, likes, notifications         Real-time updates

Technical Examples

Database Replication:

graph LR
    W[Write to Primary] --> P[(Primary DB)]
    P --> R1[(Replica 1)]
    P --> R2[(Replica 2)]
    P --> R3[(Replica 3)]
    R1 --> Read1[Read from Replica]
    R2 --> Read2[Read from Replica]
    R3 --> Read3[Read from Replica]
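
In application code, this often reduces to a routing decision: writes go to the primary, reads are spread across replicas. A minimal sketch (the host names are illustrative):

// Sketch: route writes to the primary, round-robin reads across replicas
const primary = 'db-primary:5432';
const replicas = ['db-replica-1:5432', 'db-replica-2:5432', 'db-replica-3:5432'];
let next = 0;

function connectionFor(operation: 'read' | 'write'): string {
  if (operation === 'write') return primary; // all writes hit one node
  next = (next + 1) % replicas.length;       // rotate reads across replicas
  return replicas[next];
}

console.log(connectionFor('write')); // db-primary:5432
console.log(connectionFor('read'));  // db-replica-2:5432, then -3, then -1, ...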

Load Balancing:

graph TB
    Users[Users]
    LB[Load Balancer]

    Users --> LB
    LB --> S1[Server 1]
    LB --> S2[Server 2]
    LB --> S3[Server 3]
    LB --> S4[Server N]

Trade-offs

Distributed systems introduce complexity:

Challenge         Description
Network Issues    Unreliable, variable latency, partitions
Concurrency       Race conditions, deadlocks, coordination
Partial Failures  Some components work, others don't
Consistency       Keeping data in sync across nodes

The Fundamental Dilemma:

"Is the benefits of distribution worth the added complexity?"

For most modern applications, the answer is yes—which is why we're learning this!

Summary

Key Takeaways

  1. Distributed systems = multiple computers acting as one
  2. Three characteristics: concurrency, no global clock, independent failure
  3. Benefits: scalability, reliability, lower latency
  4. Costs: complexity, network issues, consistency challenges

Check Your Understanding

  • Can you explain why there's no global clock in a distributed system?
  • Give an example of a distributed system you use daily
  • Why does independent failure make distributed systems harder to build?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What's Next

Now that we understand what distributed systems are, let's explore how they communicate: Message Passing

Message Passing

Session 1, Part 2 - 25 minutes

Learning Objectives

  • Understand message passing as a fundamental pattern in distributed systems
  • Distinguish between synchronous and asynchronous messaging
  • Learn different message delivery guarantees
  • Implement basic message passing in TypeScript and Python

What is Message Passing?

In distributed systems, message passing is how nodes communicate. Instead of shared memory or direct function calls, components send messages to each other over the network.

graph LR
    A[Node A]
    B[Node B]
    M[Message]

    A -->|send| M
    M -->|network| B
    B -->|process| M

Key Insight

"In distributed systems, communication is not a function call—it's a request sent over an unreliable network."

This simple fact has profound implications for everything we build.

Synchronous vs Asynchronous

Synchronous Messaging (Request-Response)

The sender waits for a response before continuing.

sequenceDiagram
    participant C as Client
    participant S as Server

    C->>S: Request
    Note over C: Waiting...
    S-->>C: Response
    Note over C: Continue

Characteristics:

  • Simple to understand and implement
  • Caller is blocked during the call
  • Easier error handling (immediate feedback)
  • Can lead to poor performance and cascading failures

Asynchronous Messaging (Fire-and-Forget)

The sender continues without waiting for a response.

sequenceDiagram
    participant P as Producer
    participant Q as Queue
    participant W as Worker

    P->>Q: Send Message
    Note over P: Continue immediately

    Q->>W: Process Later
    Note over W: Working...
    W-->>P: Result (optional)

Characteristics:

  • Non-blocking, better throughput
  • More complex error handling
  • Requires correlation IDs to track requests
  • Enables loose coupling between components

Message Delivery Guarantees

Three Delivery Semantics

graph TB
    subgraph "At Most Once"
        A1[Send] --> A2[May be lost]
        A2 --> A3[Never duplicated]
    end

    subgraph "At Least Once"
        B1[Send] --> B2[Retries until ack]
        B2 --> B3[May be duplicated]
    end

    subgraph "Exactly Once"
        C1[Send] --> C2[Deduplication]
        C2 --> C3[Perfect delivery]
    end

Comparison

Guarantee      Description                                  Cost     Use Case
At Most Once   Message may be lost, never duplicated        Lowest   Logging, metrics, non-critical data
At Least Once  Message guaranteed to arrive, may duplicate  Medium   Notifications, job queues
Exactly Once   Perfect delivery, no duplicates              Highest  Financial transactions, payments
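
In practice, "exactly once" is usually approximated by layering deduplication on top of at-least-once delivery. A minimal sketch, assuming every message carries a unique ID:

// Sketch: at-least-once delivery + deduplication ≈ exactly-once processing
const seen = new Set<string>();

function handleDelivery(msg: { id: string; data: unknown }) {
  if (seen.has(msg.id)) return; // duplicate from a retry: drop it
  seen.add(msg.id);
  processData(msg.data);        // runs at most once per message ID
}

function processData(data: unknown) {
  console.log('processing', data);
}

handleDelivery({ id: 'm1', data: 'hello' }); // processed
handleDelivery({ id: 'm1', data: 'hello' }); // ignored (redelivery)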

The Two Generals Problem

A classic result showing that guaranteed agreement over an unreliable network is impossible:

graph LR
    A[General A<br/>City 1]
    B[General B<br/>City 2]

    A -->|"Attack at 8pm?"| B
    B -->|"Ack: received"| A
    A -->|"Ack: received your ack"| B
    B -->|"Ack: received your ack of ack"| A

    Note[A: infinite messages needed]

Implication: You can never be 100% certain a message was received without infinite acknowledgments.

In practice, we accept uncertainty and design systems that tolerate it.

Architecture Patterns

Direct Communication

graph LR
    A[Service A] --> B[Service B]
    A --> C[Service C]
    B --> D[Service D]
    C --> D

  • Simple, straightforward
  • Tight coupling
  • Difficult to scale independently

Message Queue (Indirect Communication)

graph TB
    P[Producer 1] --> Q[Message Queue]
    P2[Producer 2] --> Q
    P3[Producer N] --> Q

    Q --> W1[Worker 1]
    Q --> W2[Worker 2]
    Q --> W3

  • Loose coupling
  • Easy to scale
  • Buffers requests during traffic spikes
  • Enables retry and error handling

Implementation Examples

TypeScript: HTTP (Synchronous)

// server.ts
import http from 'http';

const server = http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/message') {
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      const message = JSON.parse(body);
      console.log('Received:', message);

      // Send response back (synchronous)
      res.writeHead(200);
      res.end(JSON.stringify({ status: 'processed', id: message.id }));
    });
  }
});

server.listen(3000, () => console.log('Server on :3000'));

// client.ts
import http from 'http';

function sendMessage(data: any): Promise<any> {
  return new Promise((resolve, reject) => {
    const postData = JSON.stringify(data);

    const options = {
      hostname: 'localhost',
      port: 3000,
      method: 'POST',
      path: '/message',
      headers: { 'Content-Type': 'application/json' }
    };

    const req = http.request(options, (res) => {
      let body = '';
      res.on('data', chunk => body += chunk);
      res.on('end', () => resolve(JSON.parse(body)));
    });

    req.on('error', reject);
    req.write(postData);
    req.end();
  });
}

// Usage: waits for response
sendMessage({ id: '1', content: 'Hello' })
  .then(response => console.log('Got:', response));

Python: HTTP (Synchronous)

# server.py
from http.server import HTTPServer, BaseHTTPRequestHandler
import json

class MessageHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == '/message':
            content_length = int(self.headers['Content-Length'])
            post_data = self.rfile.read(content_length)
            message = json.loads(post_data.decode())

            print(f"Received: {message}")

            # Send response back (synchronous)
            response = json.dumps({'status': 'processed', 'id': message['id']})
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(response.encode())

server = HTTPServer(('localhost', 3000), MessageHandler)
print("Server on :3000")
server.serve_forever()

# client.py
import requests
import json

def send_message(data):
    # Synchronous: waits for response
    response = requests.post(
        'http://localhost:3000/message',
        json=data
    )
    return response.json()

# Usage
result = send_message({'id': '1', 'content': 'Hello'})
print(f"Got: {result}")

TypeScript: Simple Queue (Asynchronous)

// queue.ts
interface Message {
  id: string;
  data: any;
  timestamp: number;
}

class MessageQueue {
  private messages: Message[] = [];
  private handlers: Map<string, (msg: Message) => void> = new Map();

  publish(topic: string, data: any): string {
    const message: Message = {
      id: `${Date.now()}-${Math.random()}`,
      data,
      timestamp: Date.now()
    };

    this.messages.push(message);
    console.log(`Published to ${topic}:`, message.id);

    // Fire and forget - don't wait for processing
    setImmediate(() => this.process(topic, message));

    return message.id;
  }

  subscribe(topic: string, handler: (msg: Message) => void) {
    this.handlers.set(topic, handler);
  }

  private process(topic: string, message: Message) {
    const handler = this.handlers.get(topic);
    if (handler) {
      // Handle asynchronously - caller doesn't wait
      handler(message);
    }
  }
}

// Usage
const queue = new MessageQueue();

queue.subscribe('tasks', (msg) => {
  console.log(`Processing task ${msg.id}:`, msg.data);
  // Simulate async work
  setTimeout(() => console.log(`Task ${msg.id} complete`), 1000);
});

// Publish returns immediately - doesn't wait for processing
const taskId = queue.publish('tasks', { type: 'email', to: 'user@example.com' });
console.log(`Task ${taskId} queued (not yet processed)`);

Python: Simple Queue (Asynchronous)

# queue.py
import time
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Any
import uuid

@dataclass
class Message:
    id: str
    data: Any
    timestamp: float

class MessageQueue:
    def __init__(self):
        self.messages = []
        self.handlers: Dict[str, Callable[[Message], None]] = {}
        self.lock = threading.Lock()

    def publish(self, topic: str, data: Any) -> str:
        message = Message(
            id=f"{int(time.time()*1000)}-{uuid.uuid4().hex[:8]}",
            data=data,
            timestamp=time.time()
        )

        with self.lock:
            self.messages.append(message)

        print(f"Published to {topic}: {message.id}")

        # Fire and forget - don't wait for processing
        threading.Thread(
            target=self._process,
            args=(topic, message),
            daemon=True
        ).start()

        return message.id

    def subscribe(self, topic: str, handler: Callable[[Message], None]):
        self.handlers[topic] = handler

    def _process(self, topic: str, message: Message):
        handler = self.handlers.get(topic)
        if handler:
            # Handle asynchronously - caller doesn't wait
            handler(message)

# Usage
queue = MessageQueue()

def handle_task(msg: Message):
    print(f"Processing task {msg.id}: {msg.data}")
    # Simulate async work
    time.sleep(1)
    print(f"Task {msg.id} complete")

queue.subscribe('tasks', handle_task)

# Publish returns immediately - doesn't wait for processing
task_id = queue.publish('tasks', {'type': 'email', 'to': 'user@example.com'})
print(f"Task {task_id} queued (not yet processed)")

# Keep main thread alive to see processing
time.sleep(2)

Common Message Patterns

Request-Response

// Call and wait for answer
const answer = await ask(question);

Fire-and-Forget

// Send and continue
notify(user);

Publish-Subscribe

// Many receivers, one sender
broker.publish('events', data);

Request-Reply (Correlation)

// Send request, get reply later
const replyTo = createReplyQueue();
broker.send(request, { replyTo });
// ... later
const reply = await replyTo.receive();

Error Handling

Message passing over networks is unreliable. Common issues:

Error               Cause                      Handling Strategy
Timeout             No response, network slow  Retry with backoff
Connection Refused  Service down               Circuit breaker, queue for later
Message Lost        Network failure            Acknowledgments, retries
Duplicate           Retry after slow ack       Idempotent operations

Retry Pattern

async function sendMessageWithRetry(
  message: any,
  maxRetries = 3
): Promise<any> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await sendMessage(message);
    } catch (error) {
      if (attempt === maxRetries) throw error;

      // Exponential backoff: 100ms, 200ms, 400ms
      const delay = 100 * Math.pow(2, attempt - 1);
      await new Promise(r => setTimeout(r, delay));
      console.log(`Retry ${attempt}/${maxRetries}`);
    }
  }
}

Summary

Key Takeaways

  1. Message passing is how distributed systems communicate
  2. Synchronous = wait for response; Asynchronous = fire and forget
  3. Delivery guarantees: at-most-once, at-least-once, exactly-once
  4. Network is unreliable - design for failures and retries
  5. Choose the right pattern for your use case

Check Your Understanding

  • When would you use synchronous vs asynchronous messaging?
  • What's the difference between at-least-once and exactly-once?
  • Why is perfect communication impossible in distributed systems?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What's Next

Now let's apply message passing to build our first distributed system: Queue System Implementation

Queue System Implementation

Session 2 - Full session (90 minutes)

Learning Objectives

  • Understand the producer-consumer pattern
  • Build a working queue system with concurrent workers
  • Implement fault tolerance with retry logic
  • Deploy and test the system using Docker Compose

The Producer-Consumer Pattern

The producer-consumer pattern is a fundamental distributed systems pattern where:

  • Producers create and send tasks to a queue
  • Queue buffers tasks between producers and consumers
  • Workers (consumers) process tasks from the queue

graph TB
    subgraph "Producers"
        P1[Producer 1<br/>API Server]
        P2[Producer 2<br/>Scheduler]
        P3[Producer N<br/>Webhook]
    end

    subgraph "Queue"
        Q[Message Queue<br/>Task Buffer]
    end

    subgraph "Workers"
        W1[Worker 1<br/>Process]
        W2[Worker 2<br/>Process]
        W3[Worker 3<br/>Process]
    end

    P1 --> Q
    P2 --> Q
    P3 --> Q
    Q --> W1
    Q --> W2
    Q --> W3

    style Q fill:#f9f,stroke:#333,stroke-width:4px

Key Benefits

Benefit      Explanation
Decoupling   Producers don't need to know about workers
Buffering    Queue handles traffic spikes
Scalability  Add/remove workers independently
Reliability  Tasks persist if workers fail
Retry        Failed tasks can be requeued

System Architecture

Full System View

sequenceDiagram
    participant C as Client
    participant P as Producer
    participant Q as Queue
    participant W as Worker
    participant DB as Result Store

    C->>P: HTTP POST /task
    P->>Q: Enqueue Task
    Q-->>P: Task ID
    P-->>C: 202 Accepted

    Note over Q,W: Async Processing

    Q->>W: Fetch Task
    W->>W: Process Task
    W->>DB: Save Result

    W->>Q: Ack (Success)
    Q->>Q: Remove Task

Task Lifecycle

stateDiagram-v2
    [*] --> Pending: Producer creates
    Pending --> Processing: Worker fetches
    Processing --> Completed: Success
    Processing --> Failed: Error
    Processing --> Pending: Retry
    Failed --> Pending: Max retries not reached
    Failed --> DeadLetter: Max retries reached
    Completed --> [*]
    DeadLetter --> [*]

Implementation

Data Models

Task Definition:

interface Task {
  id: string;
  type: string;           // 'email', 'image', 'report', etc.
  payload: any;
  status: 'pending' | 'processing' | 'completed' | 'failed';
  createdAt: number;
  retries: number;
  maxRetries: number;
  result?: any;
  error?: string;
}

import time
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Task:
    id: str
    type: str  # 'email', 'image', 'report', etc.
    payload: Any
    status: str = 'pending'  # pending, processing, completed, failed
    created_at: float = field(default_factory=time.time)
    retries: int = 0
    max_retries: int = 3
    result: Optional[Any] = None
    error: Optional[str] = None

TypeScript Implementation

Project Structure

queue-system-ts/
├── package.json
├── docker-compose.yml
├── src/
│   ├── queue.ts          # Queue implementation
│   ├── producer.ts       # Producer API
│   ├── worker.ts         # Worker implementation
│   └── types.ts          # Type definitions
└── Dockerfile

Complete TypeScript Code

queue-system-ts/src/types.ts

export interface Task {
  id: string;
  type: string;
  payload: any;
  status: 'pending' | 'processing' | 'completed' | 'failed';
  createdAt: number;
  retries: number;
  maxRetries: number;
  result?: any;
  error?: string;
}

export interface QueueMessage {
  task: Task;
  timestamp: number;
}

queue-system-ts/src/queue.ts

import { Task, QueueMessage } from './types';

export class Queue {
  private pending: Task[] = [];
  private processing: Map<string, Task> = new Map();
  private completed: Task[] = [];
  private failed: Task[] = [];

  // Enqueue a new task
  enqueue(type: string, payload: any): string {
    const task: Task = {
      id: this.generateId(),
      type,
      payload,
      status: 'pending',
      createdAt: Date.now(),
      retries: 0,
      maxRetries: 3
    };

    this.pending.push(task);
    console.log(`[Queue] Enqueued task ${task.id} (${type})`);
    return task.id;
  }

  // Get next pending task (for workers)
  dequeue(): Task | null {
    if (this.pending.length === 0) return null;

    const task = this.pending.shift()!;
    task.status = 'processing';
    this.processing.set(task.id, task);

    console.log(`[Queue] Dequeued task ${task.id}`);
    return task;
  }

  // Mark task as completed
  complete(taskId: string, result?: any): void {
    const task = this.processing.get(taskId);
    if (!task) return;

    task.status = 'completed';
    task.result = result;
    this.processing.delete(taskId);
    this.completed.push(task);

    console.log(`[Queue] Completed task ${taskId}`);
  }

  // Mark task as failed (will retry if possible)
  fail(taskId: string, error: string): void {
    const task = this.processing.get(taskId);
    if (!task) return;

    task.retries++;
    task.error = error;

    if (task.retries >= task.maxRetries) {
      task.status = 'failed';
      this.processing.delete(taskId);
      this.failed.push(task);
      console.log(`[Queue] Task ${taskId} failed permanently after ${task.retries} retries`);
    } else {
      task.status = 'pending';
      this.processing.delete(taskId);
      this.pending.push(task);
      console.log(`[Queue] Task ${taskId} failed, retrying (${task.retries}/${task.maxRetries})`);
    }
  }

  // Get queue statistics
  getStats() {
    return {
      pending: this.pending.length,
      processing: this.processing.size,
      completed: this.completed.length,
      failed: this.failed.length
    };
  }

  private generateId(): string {
    return `task-${Date.now()}-${Math.random().toString(36).slice(2, 11)}`;
  }
}

queue-system-ts/src/producer.ts

import http from 'http';
import { Queue } from './queue';

const queue = new Queue();

// Small helper: write a JSON response
function sendJson(res: http.ServerResponse, status: number, payload: any) {
  res.writeHead(status, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify(payload));
}

const server = http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/task') {
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const { type, payload } = JSON.parse(body);

        if (!type || !payload) {
          sendJson(res, 400, { error: 'type and payload required' });
          return;
        }

        const taskId = queue.enqueue(type, payload);

        sendJson(res, 202, { // Accepted
          taskId,
          message: 'Task enqueued',
          stats: queue.getStats()
        });
      } catch (error) {
        sendJson(res, 400, { error: 'Invalid JSON' });
      }
    });
  } else if (req.method === 'GET' && req.url === '/dequeue') {
    // Workers poll this endpoint for the next pending task
    const task = queue.dequeue();
    if (!task) {
      res.writeHead(204); // No Content: queue is empty
      res.end();
    } else {
      sendJson(res, 200, task);
    }
  } else if (
    req.method === 'POST' && req.url &&
    (req.url.startsWith('/complete/') || req.url.startsWith('/fail/'))
  ) {
    // Workers report task outcomes here
    const [, action, taskId] = req.url.split('/');
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      const data = JSON.parse(body || '{}');
      if (action === 'complete') {
        queue.complete(taskId, data.result);
      } else {
        queue.fail(taskId, data.error ?? 'unknown error');
      }
      sendJson(res, 200, { ok: true });
    });
  } else if (req.method === 'GET' && req.url === '/stats') {
    sendJson(res, 200, queue.getStats());
  } else {
    sendJson(res, 404, { error: 'Not found' });
  }
});

const PORT = Number(process.env.PORT) || 3000;
server.listen(PORT, () => {
  console.log(`Producer API listening on port ${PORT}`);
});

export { queue };

queue-system-ts/src/worker.ts

import http from 'http';
import { Task } from './types';

// Simulate task processing
async function processTask(task: Task): Promise<any> {
  console.log(`[Worker] Processing task ${task.id} (${task.type})`);

  // Simulate work
  await new Promise(resolve => setTimeout(resolve, 1000 + Math.random() * 2000));

  // Simulate occasional failures (20% chance)
  if (Math.random() < 0.2) {
    throw new Error('Random processing error');
  }

  // Process based on task type
  switch (task.type) {
    case 'email':
      return { sent: true, to: task.payload.to };
    case 'image':
      return { processed: true, url: task.payload.url };
    case 'report':
      return { generated: true, format: 'pdf' };
    default:
      return { result: 'processed' };
  }
}

class Worker {
  private id: string;
  private queueUrl: string;
  private running: boolean = false;

  constructor(id: string, queueUrl: string) {
    this.id = id;
    this.queueUrl = queueUrl;
  }

  async start(): Promise<void> {
    this.running = true;
    console.log(`[Worker ${this.id}] Started`);

    while (this.running) {
      try {
        await this.processNextTask();
      } catch (error) {
        console.error(`[Worker ${this.id}] Error:`, error);
        await this.sleep(1000); // Wait before retrying
      }
    }
  }

  private async processNextTask(): Promise<void> {
    // Fetch task from queue
    const task = await this.fetchTask();
    if (!task) {
      await this.sleep(1000); // No task, wait
      return;
    }

    try {
      // Process the task
      const result = await processTask(task);

      // Mark as complete
      await this.completeTask(task.id, result);
    } catch (error: any) {
      // Mark as failed
      await this.failTask(task.id, error.message);
    }
  }

  private async fetchTask(): Promise<Task | null> {
    return new Promise((resolve, reject) => {
      http.get(`${this.queueUrl}/dequeue`, (res) => {
        let body = '';
        res.on('data', chunk => body += chunk);
        res.on('end', () => {
          if (res.statusCode === 204) {
            resolve(null); // No tasks available
          } else if (res.statusCode === 200) {
            resolve(JSON.parse(body));
          } else {
            reject(new Error(`Unexpected status: ${res.statusCode}`));
          }
        });
      }).on('error', reject);
    });
  }

  private async completeTask(taskId: string, result: any): Promise<void> {
    return this.postJson(`/complete/${taskId}`, { result });
  }

  private async failTask(taskId: string, error: string): Promise<void> {
    return this.postJson(`/fail/${taskId}`, { error });
  }

  // Shared helper: POST a JSON body to the queue API (uses queueUrl,
  // so it also works when the queue is not on localhost)
  private postJson(path: string, payload: any): Promise<void> {
    return new Promise((resolve, reject) => {
      const data = JSON.stringify(payload);
      const url = new URL(this.queueUrl);
      const req = http.request({
        hostname: url.hostname,
        port: url.port || 80,
        path,
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Content-Length': Buffer.byteLength(data)
        }
      }, (res) => {
        if (res.statusCode === 200) {
          res.resume(); // drain the response body
          resolve();
        } else {
          reject(new Error(`Queue API returned ${res.statusCode} for ${path}`));
        }
      });
      req.on('error', reject);
      req.end(data);
    });
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  stop(): void {
    this.running = false;
  }
}

// Start worker
const workerId = process.env.WORKER_ID || 'worker-1';
const queueUrl = process.env.QUEUE_URL || 'http://localhost:3000';
const worker = new Worker(workerId, queueUrl);
worker.start();

Python Implementation

Project Structure

queue-system-py/
├── requirements.txt
├── docker-compose.yml
├── src/
│   ├── queue.py          # Queue implementation
│   ├── producer.py       # Producer API
│   └── worker.py         # Worker implementation
└── Dockerfile

Complete Python Code

queue-system-py/src/queue.py

import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional, List, Dict
from enum import Enum

class TaskStatus(Enum):
    PENDING = 'pending'
    PROCESSING = 'processing'
    COMPLETED = 'completed'
    FAILED = 'failed'

@dataclass
class Task:
    id: str
    type: str
    payload: Any
    status: str = TaskStatus.PENDING.value
    created_at: float = field(default_factory=time.time)
    retries: int = 0
    max_retries: int = 3
    result: Optional[Any] = None
    error: Optional[str] = None

class Queue:
    def __init__(self):
        self.pending: List[Task] = []
        self.processing: Dict[str, Task] = {}
        self.completed: List[Task] = []
        self.failed: List[Task] = []

    def enqueue(self, task_type: str, payload: Any) -> str:
        """Enqueue a new task."""
        task = Task(
            id=f"task-{int(time.time()*1000)}-{uuid.uuid4().hex[:8]}",
            type=task_type,
            payload=payload
        )
        self.pending.append(task)
        print(f"[Queue] Enqueued task {task.id} ({task_type})")
        return task.id

    def dequeue(self) -> Optional[Task]:
        """Get next pending task."""
        if not self.pending:
            return None

        task = self.pending.pop(0)
        task.status = TaskStatus.PROCESSING.value
        self.processing[task.id] = task
        print(f"[Queue] Dequeued task {task.id}")
        return task

    def complete(self, task_id: str, result: Any = None) -> None:
        """Mark task as completed."""
        task = self.processing.pop(task_id, None)
        if not task:
            return

        task.status = TaskStatus.COMPLETED.value
        task.result = result
        self.completed.append(task)
        print(f"[Queue] Completed task {task_id}")

    def fail(self, task_id: str, error: str) -> None:
        """Mark task as failed (will retry if possible)."""
        task = self.processing.pop(task_id, None)
        if not task:
            return

        task.retries += 1
        task.error = error

        if task.retries >= task.max_retries:
            task.status = TaskStatus.FAILED.value
            self.failed.append(task)
            print(f"[Queue] Task {task_id} failed permanently after {task.retries} retries")
        else:
            task.status = TaskStatus.PENDING.value
            self.pending.append(task)
            print(f"[Queue] Task {task_id} failed, retrying ({task.retries}/{task.max_retries})")

    def get_stats(self) -> Dict[str, int]:
        """Get queue statistics."""
        return {
            'pending': len(self.pending),
            'processing': len(self.processing),
            'completed': len(self.completed),
            'failed': len(self.failed)
        }

queue-system-py/src/producer.py

from http.server import HTTPServer, BaseHTTPRequestHandler
from dataclasses import asdict
import json
from queue import Queue  # local src/queue.py (shadows the stdlib queue module)

queue = Queue()

class ProducerHandler(BaseHTTPRequestHandler):
    def _send_json(self, payload, status=200):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        content_length = int(self.headers.get('Content-Length', 0))
        post_data = self.rfile.read(content_length)

        if self.path == '/task':
            try:
                data = json.loads(post_data.decode())
            except json.JSONDecodeError:
                self.send_error(400, 'Invalid JSON')
                return

            task_type = data.get('type')
            payload = data.get('payload')
            if not task_type or not payload:
                self.send_error(400, 'type and payload required')
                return

            task_id = queue.enqueue(task_type, payload)
            self._send_json({
                'taskId': task_id,
                'message': 'Task enqueued',
                'stats': queue.get_stats()
            }, status=202)  # Accepted

        elif self.path.startswith('/complete/') or self.path.startswith('/fail/'):
            # Workers report task outcomes here
            task_id = self.path.split('/')[2]
            data = json.loads(post_data.decode() or '{}')
            if self.path.startswith('/complete/'):
                queue.complete(task_id, data.get('result'))
            else:
                queue.fail(task_id, data.get('error', 'unknown error'))
            self._send_json({'ok': True})

    def do_GET(self):
        if self.path == '/stats':
            self._send_json(queue.get_stats())
        elif self.path == '/dequeue':
            # Workers poll this endpoint for the next pending task
            task = queue.dequeue()
            if task is None:
                self.send_response(204)  # No Content: queue is empty
                self.end_headers()
            else:
                self._send_json(asdict(task))

    def log_message(self, format, *args):
        pass  # Suppress default logging

if __name__ == '__main__':
    import os
    port = int(os.environ.get('PORT', 3000))
    server = HTTPServer(('0.0.0.0', port), ProducerHandler)
    print(f"Producer API listening on port {port}")
    server.serve_forever()

queue-system-py/src/worker.py

import os
import time
import random
import requests
from typing import Optional, Dict, Any

# Simulate task processing (tasks arrive from the queue API as plain dicts)
def process_task(task: Dict) -> Any:
    print(f"[Worker] Processing task {task['id']} ({task['type']})")

    # Simulate work
    time.sleep(1 + random.random() * 2)

    # Simulate occasional failures (20% chance)
    if random.random() < 0.2:
        raise Exception('Random processing error')

    # Process based on task type
    payload = task.get('payload') or {}
    if task['type'] == 'email':
        return {'sent': True, 'to': payload.get('to')}
    elif task['type'] == 'image':
        return {'processed': True, 'url': payload.get('url')}
    elif task['type'] == 'report':
        return {'generated': True, 'format': 'pdf'}
    else:
        return {'result': 'processed'}

class Worker:
    def __init__(self, worker_id: str, queue_url: str):
        self.id = worker_id
        self.queue_url = queue_url
        self.running = False

    def start(self):
        """Start the worker loop."""
        self.running = True
        print(f"[Worker {self.id}] Started")

        while self.running:
            try:
                self.process_next_task()
            except Exception as e:
                print(f"[Worker {self.id}] Error: {e}")
                time.sleep(1)

    def process_next_task(self):
        """Fetch and process the next task."""
        task = self.fetch_task()
        if not task:
            time.sleep(1)  # No task, wait
            return

        try:
            result = process_task(task)
            self.complete_task(task['id'], result)
        except Exception as e:
            self.fail_task(task['id'], str(e))

    def fetch_task(self) -> Optional[Dict]:
        """Fetch next task from queue."""
        try:
            response = requests.get(f"{self.queue_url}/dequeue", timeout=5)
            if response.status_code == 204:
                return None  # No tasks
            return response.json()
        except requests.RequestException:
            return None

    def complete_task(self, task_id: str, result: Any):
        """Mark task as complete."""
        requests.post(
            f"{self.queue_url}/complete/{task_id}",
            json={'result': result},
            timeout=5
        )

    def fail_task(self, task_id: str, error: str):
        """Mark task as failed."""
        requests.post(
            f"{self.queue_url}/fail/{task_id}",
            json={'error': error},
            timeout=5
        )

    def stop(self):
        """Stop the worker."""
        self.running = False

if __name__ == '__main__':
    worker_id = os.environ.get('WORKER_ID', 'worker-1')
    queue_url = os.environ.get('QUEUE_URL', 'http://localhost:3000')
    worker = Worker(worker_id, queue_url)
    worker.start()

Docker Compose Setup

TypeScript Version (docker-compose.yml)

version: '3.8'

services:
  producer:
    build: .
    ports:
      - "3000:3000"
    environment:
      - PORT=3000
    volumes:
      - ./src:/app/src
    command: npm run start:producer

  worker-1:
    build: .
    environment:
      - WORKER_ID=worker-1
      - QUEUE_URL=http://producer:3000
    depends_on:
      - producer
    volumes:
      - ./src:/app/src
    command: npm run start:worker

  worker-2:
    build: .
    environment:
      - WORKER_ID=worker-2
      - QUEUE_URL=http://producer:3000
    depends_on:
      - producer
    volumes:
      - ./src:/app/src
    command: npm run start:worker

  worker-3:
    build: .
    environment:
      - WORKER_ID=worker-3
      - QUEUE_URL=http://producer:3000
    depends_on:
      - producer
    volumes:
      - ./src:/app/src
    command: npm run start:worker

TypeScript Dockerfile

FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .

CMD ["npm", "run", "start:producer"]

Python Version (docker-compose.yml)

version: '3.8'

services:
  producer:
    build: .
    ports:
      - "3000:3000"
    environment:
      - PORT=3000
    volumes:
      - ./src:/app/src
    command: python src/producer.py

  worker-1:
    build: .
    environment:
      - WORKER_ID=worker-1
      - QUEUE_URL=http://producer:3000
    depends_on:
      - producer
    volumes:
      - ./src:/app/src
    command: python src/worker.py

  worker-2:
    build: .
    environment:
      - WORKER_ID=worker-2
      - QUEUE_URL=http://producer:3000
    depends_on:
      - producer
    volumes:
      - ./src:/app/src
    command: python src/worker.py

  worker-3:
    build: .
    environment:
      - WORKER_ID=worker-3
      - QUEUE_URL=http://producer:3000
    depends_on:
      - producer
    volumes:
      - ./src:/app/src
    command: python src/worker.py

Python Dockerfile

FROM python:3.11-alpine

WORKDIR /app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "src/producer.py"]

Running the Example

Step 1: Start the System

cd examples/01-queue
docker-compose up --build

You should see output like:

producer      | Producer API listening on port 3000
worker-1      | [Worker worker-1] Started
worker-2      | [Worker worker-2] Started
worker-3      | [Worker worker-3] Started

Step 2: Submit Tasks

Open a new terminal and submit some tasks:

# Submit an email task
curl -X POST http://localhost:3000/task \
  -H "Content-Type: application/json" \
  -d '{"type": "email", "payload": {"to": "user@example.com", "subject": "Hello"}}'

# Submit an image processing task
curl -X POST http://localhost:3000/task \
  -H "Content-Type: application/json" \
  -d '{"type": "image", "payload": {"url": "https://example.com/image.jpg"}}'

# Submit multiple tasks
for i in {1..10}; do
  curl -X POST http://localhost:3000/task \
    -H "Content-Type: application/json" \
    -d "{\"type\": \"report\", \"payload\": {\"id\": $i}}"
done

Step 3: Watch Processing

In the Docker logs, you'll see:

worker-2      | [Queue] Dequeued task task-1234567890-abc123
worker-2      | [Worker] Processing task task-1234567890-abc123 (report)
worker-2      | [Queue] Completed task task-1234567890-abc123

Step 4: Check Statistics

curl http://localhost:3000/stats

Response:

{
  "pending": 5,
  "processing": 3,
  "completed": 12,
  "failed": 0
}

Step 5: Test Fault Tolerance

Stop one worker:

docker-compose stop worker-1

Tasks continue processing with the remaining workers. Because workers pull from the queue, the load shifts to them automatically with no explicit rebalancing step.

Exercises

Exercise 1: Add Priority Support

Modify the queue to support high/normal/low priority tasks:

  1. Add a priority field to the Task model
  2. Modify enqueue() to sort pending tasks by priority
  3. Test with mixed priority tasks

Exercise 2: Implement Dead Letter Queue

Create a separate queue for permanently failed tasks:

  1. Add a dead_letter queue to store failed tasks
  2. Add an API endpoint to inspect/retry dead letter tasks
  3. Log failed tasks to a file for manual inspection

Exercise 3: Add Task Scheduling

Implement delayed task execution:

  1. Add an executeAt timestamp to tasks
  2. Modify workers to skip tasks scheduled for the future
  3. Use a timer/scheduler to move scheduled tasks to pending queue

Summary

Key Takeaways

  1. Producer-consumer pattern decouples task creation from processing
  2. Queues buffer tasks and handle traffic spikes
  3. Workers scale independently of producers
  4. Retry logic provides fault tolerance
  5. Docker Compose enables easy local deployment

Check Your Understanding

  • How does the queue handle worker failures?
  • What happens when a task fails and max retries is reached?
  • Why is the queue useful for handling traffic spikes?
  • How would you add a new worker type (e.g., a worker that only processes emails)?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What's Next

Now that we've built a queue system, let's explore how to partition data across multiple nodes: Data Partitioning

Data Partitioning

Session 3, Part 1 - 25 minutes

Learning Objectives

  • Understand what data partitioning (sharding) is
  • Compare hash-based vs range-based partitioning
  • Learn how partitioning affects query performance
  • Recognize the trade-offs of different partitioning strategies

What is Partitioning?

Data partitioning (also called sharding) is the process of splitting your data across multiple nodes based on a partitioning key. Each node holds a subset of the total data.

graph TB
    subgraph "Application View"
        App["Your Application"]
        Data[("All Data")]
        App --> Data
    end

    subgraph "Reality: Partitioned Storage"
        Node1["Node 1<br/>Keys: user_1<br/>user_4<br/>user_7"]
        Node2["Node 2<br/>Keys: user_2<br/>user_5<br/>user_8"]
        Node3["Node 3<br/>Keys: user_3<br/>user_6<br/>user_9"]
    end

    App -->|"read/write"| Node1
    App -->|"read/write"| Node2
    App -->|"read/write"| Node3

    style Node1 fill:#e1f5fe
    style Node2 fill:#e1f5fe
    style Node3 fill:#e1f5fe

Why Partition Data?

Benefit       Description
Scalability   Store more data than fits on one machine
Performance   Distribute load across multiple nodes
Availability  One partition failure doesn't affect others

The Partitioning Challenge

The key question is: How do we decide which data goes on which node?

graph LR
    Key["user:12345"] --> Router{Partitioning<br/>Function}
    Router -->|"hash(key) % N"| N1[Node 1]
    Router --> N2[Node 2]
    Router --> N3[Node 3]

    style Router fill:#ff9,stroke:#333,stroke-width:3px

Partitioning Strategies

1. Hash-Based Partitioning

Apply a hash function to the key, then modulo the number of nodes:

node = hash(key) % number_of_nodes

graph TB
    subgraph "Hash-Based Partitioning (3 nodes)"
        Key1["user:alice"] --> H1["hash() % 3"]
        Key2["user:bob"] --> H2["hash() % 3"]
        Key3["user:carol"] --> H3["hash() % 3"]

        H1 -->|"= 1"| N1[Node 1]
        H2 -->|"= 2"| N2[Node 2]
        H3 -->|"= 0"| N0[Node 0]

        style N1 fill:#c8e6c9
        style N2 fill:#c8e6c9
        style N0 fill:#c8e6c9
    end

TypeScript Example:

function getNode(key: string, totalNodes: number): number {
    // Simple hash function
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
        hash = ((hash << 5) - hash) + key.charCodeAt(i);
        hash = hash & hash; // Convert to 32bit integer
    }
    return Math.abs(hash) % totalNodes;
}

// Examples
console.log(getNode('user:alice', 3));  // => 1
console.log(getNode('user:bob', 3));    // => 2
console.log(getNode('user:carol', 3));  // => 0

Python Example:

import hashlib

def get_node(key: str, total_nodes: int) -> int:
    """Determine which node should store this key.

    Python's built-in hash() is randomized per process for strings,
    so a stable hash (here MD5) is used instead.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % total_nodes

# Examples (the exact node numbers depend on the hash values)
print(get_node('user:alice', 3))
print(get_node('user:bob', 3))
print(get_node('user:carol', 3))

Advantages:

  • ✅ Even data distribution
  • ✅ Simple to implement
  • ✅ No hotspots (assuming good hash function)

Disadvantages:

  • ❌ Cannot do efficient range queries
  • ❌ Rebalancing is expensive when adding/removing nodes

2. Range-Based Partitioning

Assign key ranges to each node:

graph TB
    subgraph "Range-Based Partitioning (3 nodes)"
        R1["Node 1<br/>a-m"]
        R2["Node 2<br/>n-s"]
        R3["Node 3<br/>t-z"]

        Key1["alice"] --> R1
        Key2["bob"] --> R1
        Key3["nancy"] --> R2
        Key4["steve"] --> R2
        Key5["tom"] --> R3
        Key6["zoe"] --> R3

        style R1 fill:#c8e6c9
        style R2 fill:#c8e6c9
        style R3 fill:#c8e6c9
    end

TypeScript Example:

interface Range {
    start: string;
    end: string;
    node: number;
}

const ranges: Range[] = [
    { start: 'a', end: 'm', node: 1 },
    { start: 'n', end: 's', node: 2 },
    { start: 't', end: 'z', node: 3 }
];

function getNodeByRange(key: string): number {
    const first = key[0]; // partition on the key's first character
    for (const range of ranges) {
        if (first >= range.start && first <= range.end) {
            return range.node;
        }
    }
    throw new Error(`No range found for key: ${key}`);
}

// Examples
console.log(getNodeByRange('alice'));  // => 1
console.log(getNodeByRange('nancy'));  // => 2
console.log(getNodeByRange('tom'));    // => 3

Python Example:

from typing import List, Tuple

ranges: List[Tuple[str, str, int]] = [
    ('a', 'm', 1),
    ('n', 's', 2),
    ('t', 'z', 3)
]

def get_node_by_range(key: str) -> int:
    """Determine which node based on the key's first character."""
    first = key[0]  # comparing whole keys would misplace e.g. 'zoe' > 'z'
    for start, end, node in ranges:
        if start <= first <= end:
            return node
    raise ValueError(f"No range found for key: {key}")

# Examples
print(get_node_by_range('alice'))  # => 1
print(get_node_by_range('nancy'))  # => 2
print(get_node_by_range('tom'))    # => 3

Advantages:

  • ✅ Efficient range queries
  • ✅ Can optimize for data access patterns

Disadvantages:

  • ❌ Uneven distribution (hotspots)
  • ❌ Complex to load balance

The Rebalancing Problem

What happens when you add or remove nodes?

stateDiagram-v2
    [*] --> Stable: 3 Nodes
    Stable --> Rebalancing: Add Node 4
    Rebalancing --> Stable: Move 25% of data
    Stable --> Rebalancing: Remove Node 2
    Rebalancing --> Stable: Redistribute data

Simple Modulo Hashing Problem

With hash(key) % N, changing N from 3 to 4 means most keys move to different nodes:

Key     hash % 3  hash % 4  Moved?
user:1  1         1         No
user:2  2         2         No
user:3  0         3         Yes
user:4  1         0         Yes
user:5  2         1         Yes
user:6  0         2         Yes

Four of these six sample keys moved. In general, going from 3 to 4 nodes relocates about 75% of all keys!
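
You can check this empirically. The sketch below reuses the hash function from the earlier TypeScript example and counts how many of 10,000 keys land on a different node when N changes from 3 to 4:

// Sketch: measure how many keys move when going from 3 to 4 nodes
function nodeFor(key: string, totalNodes: number): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = ((hash << 5) - hash) + key.charCodeAt(i);
    hash = hash & hash; // convert to 32-bit integer
  }
  return Math.abs(hash) % totalNodes;
}

let moved = 0;
const total = 10000;
for (let i = 0; i < total; i++) {
  const key = `user:${i}`;
  if (nodeFor(key, 3) !== nodeFor(key, 4)) moved++;
}
console.log(`${((100 * moved) / total).toFixed(1)}% of keys moved`); // roughly 75%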

Consistent Hashing (Advanced)

A technique to minimize data movement when nodes change:

graph TB
    subgraph "Hash Ring"
        Ring["Virtual Ring (0 - 2^32)"]

        N1["Node 1<br/>position: 100"]
        N2["Node 2<br/>position: 500"]
        N3["Node 3<br/>position: 900"]

        K1["Key A<br/>hash: 150"]
        K2["Key B<br/>hash: 600"]
        K3["Key C<br/>hash: 950"]
    end

    Ring --> N1
    Ring --> N2
    Ring --> N3

    K1 -->|"clockwise"| N2
    K2 -->|"clockwise"| N3
    K3 -->|"clockwise"| N1

    style Ring fill:#f9f,stroke:#333,stroke-width:2px

Key Idea: Each key is assigned to the first node clockwise from its hash position.

When adding/removing a node, only keys in that node's range move.
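
Here is a minimal sketch of the ring idea. Real implementations place many virtual nodes per physical node and use a stronger hash; this version only shows the "first node clockwise" lookup:

// Minimal consistent-hash ring sketch (no virtual nodes, toy hash function)
type RingNode = { position: number; name: string };

class HashRing {
  private nodes: RingNode[] = []; // kept sorted by ring position

  private hash(s: string): number {
    let h = 0;
    for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0;
    return h % 1024; // small ring size for readability
  }

  addNode(name: string): void {
    this.nodes.push({ position: this.hash(name), name });
    this.nodes.sort((a, b) => a.position - b.position);
  }

  nodeFor(key: string): string {
    const h = this.hash(key);
    // First node clockwise from the key's position, wrapping to the start
    const owner = this.nodes.find(n => n.position >= h) ?? this.nodes[0];
    return owner.name;
  }
}

const ring = new HashRing();
['node-1', 'node-2', 'node-3'].forEach(n => ring.addNode(n));
console.log(ring.nodeFor('user:alice')); // owned by the first node clockwise

Adding a fourth node only claims the keys between its position and its counter-clockwise neighbor; every other key keeps its current owner.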

Query Patterns and Partitioning

Your query patterns should influence your partitioning strategy:

Common Query Patterns

Query Type          Best Partitioning  Example
Key-value lookups   Hash-based         Get user by ID
Range scans         Range-based        Users registered last week
Multi-key access    Composite hash     Orders by customer
Geographic queries  Location-based     Nearby restaurants
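
For the "composite hash" row, the trick is to partition on only part of the key, so related rows co-locate while customers still spread evenly across nodes. A sketch with illustrative names:

// Sketch: partition orders by customer ID so a customer's orders co-locate
function hashCode(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0;
  return h;
}

function partitionForOrder(customerId: string, orderId: string, nodes: number): number {
  // Only the customer ID determines placement; the order ID just
  // identifies the row within that partition
  return hashCode(customerId) % nodes;
}

console.log(partitionForOrder('customer-42', 'order-1', 3));
console.log(partitionForOrder('customer-42', 'order-2', 3)); // same partition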

Example: User Data Partitioning

graph TB
    subgraph "Application: Social Network"
        Query1["Get User Profile<br/>SELECT * FROM users WHERE id = ?"]
        Query2["List Friends<br/>SELECT * FROM friends WHERE user_id = ?"]
        Query3["Timeline Posts<br/>SELECT * FROM posts WHERE created_at > ?"]
    end

    subgraph "Partitioning Decision"
        Query1 -->|"hash(user_id)"| Hash[Hash-Based]
        Query2 -->|"hash(user_id)"| Hash
        Query3 -->|"range(created_at)"| Range[Range-Based]
    end

    subgraph "Result"
        Hash --> H["User data & friends<br/>partitioned by user_id"]
        Range --> R["Posts partitioned<br/>by date range"]
    end

Trade-offs Summary

Strategy            Distribution        Range Queries  Rebalancing  Complexity
Hash-based          Even                Poor           Expensive    Low
Range-based         Potentially uneven  Excellent      Moderate     Medium
Consistent hashing  Even                Poor           Minimal      High

Real-World Examples

System         Partitioning Strategy     Notes
Redis Cluster  Hash slots (16384 slots)  Fixed slot map (CRC16 % 16384), not consistent hashing
Cassandra      Token-aware (hash ring)   Configurable partitioner
MongoDB        Shard key ranges          Range-based on shard key
DynamoDB       Hash + range (composite)  Supports composite keys
PostgreSQL     Not native                Use extensions like Citus

Summary

Key Takeaways

  1. Partitioning splits data across multiple nodes for scalability
  2. Hash-based gives even distribution but poor range queries
  3. Range-based enables range scans but can create hotspots
  4. Rebalancing is a key challenge when nodes change
  5. Query patterns should drive your partitioning strategy

Check Your Understanding

  • Why is hash-based partitioning better for even distribution?
  • When would you choose range-based over hash-based?
  • What happens to data placement when you add a new node with simple modulo hashing?
  • How does consistent hashing minimize data movement?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What's Next

Now that we understand how to partition data, let's explore the fundamental trade-offs in distributed data systems: CAP Theorem

CAP Theorem

Session 3, Part 2 - 30 minutes

Learning Objectives

  • Understand the CAP theorem and its three components
  • Explore the trade-offs between Consistency, Availability, and Partition tolerance
  • Identify real-world systems and their CAP choices
  • Learn how to apply CAP thinking to system design

What is the CAP Theorem?

The CAP theorem states that a distributed data store can only provide two of the following three guarantees:

graph TB
    subgraph "CAP Triangle - Pick Two"
        C["Consistency<br/>Every read receives<br/>the most recent write"]
        A["Availability<br/>Every request receives<br/>a response"]
        P["Partition Tolerance<br/>System operates<br/>despite network failures"]
    end

    C <--> A
    A <--> P
    P <--> C

    style C fill:#ffcdd2
    style A fill:#c8e6c9
    style P fill:#bbdefb

The Three Components

1. Consistency (C)

Every read receives the most recent write or an error.

All nodes see the same data at the same time. If you write a value and immediately read it, you get the value you just wrote.

sequenceDiagram
    participant C as Client
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3

    C->>N1: Write X = 10
    N1->>N2: Replicate X
    N1->>N3: Replicate X
    N2-->>N1: Ack
    N3-->>N1: Ack
    N1-->>C: Write confirmed

    Note over C,N3: Before reading...

    C->>N2: Read X
    N2-->>C: X = 10 (latest)

    Note over C,N3: All nodes agree!

Example: A bank system where your balance must be accurate across all branches.

2. Availability (A)

Every request receives a (non-error) response, without the guarantee that it contains the most recent write.

The system remains operational even when some nodes fail. You can always read and write, even if the data might be stale.

sequenceDiagram
    participant C as Client
    participant N1 as Node 1 (alive)
    participant N2 as Node 2 (dead)

    C->>N1: Write X = 10
    N1-->>C: Write confirmed

    Note over C,N2: N2 is down but N1 responds...

    C->>N1: Read X
    N1-->>C: X = 10

    Note over C,N2: System stays available!

Example: A social media feed where showing slightly old content is acceptable.

3. Partition Tolerance (P)

The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.

Network partitions are inevitable in distributed systems. The system must handle them gracefully.

graph TB
    subgraph "Network Partition"
        N1["Node 1<br/>Can't reach N2, N3"]
        N2["Node 2<br/>Can't reach N1"]
        N3["Node 3<br/>Can't reach N1"]
    end

    N1 -.->|"🔴 Network Partition"| N2
    N1 -.->|"🔴 Network Partition"| N3
    N2 <--> N3

    style N1 fill:#ffcdd2
    style N2 fill:#c8e6c9
    style N3 fill:#c8e6c9

Key Insight: In distributed systems, P is not optional—network partitions WILL happen.

The Trade-offs

Since partitions are inevitable in distributed systems, the real choice is between C and A during a partition:

stateDiagram-v2
    [*] --> Normal
    Normal --> Partitioned: Network Split
    Partitioned --> CP: Choose Consistency
    Partitioned --> AP: Choose Availability
    CP --> Normal: Partition heals
    AP --> Normal: Partition heals

    note right of CP
        Reject writes/reads
        until data syncs
    end note

    note right of AP
        Accept writes/reads
        data may be stale
    end note

CP: Consistency + Partition Tolerance

Sacrifice Availability

During a partition, the system returns errors or blocks until consistency can be guaranteed.

sequenceDiagram
    participant C as Client
    participant N1 as Node 1 (primary)
    participant N2 as Node 2 (isolated)

    Note over N1,N2: 🔴 Network Partition

    C->>N1: Write X = 10
    N1-->>C: ❌ Error: Cannot replicate

    C->>N2: Read X
    N2-->>C: ❌ Error: Data unavailable

    Note over C,N2: System blocks rather<br/>than return stale data

Examples:

  • MongoDB (with majority write concern)
  • HBase
  • Redis (with proper configuration)
  • Traditional RDBMS with synchronous replication

Use when: Data accuracy is critical (financial systems, inventory)

AP: Availability + Partition Tolerance

Sacrifice Consistency

During a partition, the system accepts reads and writes, possibly returning stale data.

sequenceDiagram
    participant C as Client
    participant N1 as Node 1 (accepts writes)
    participant N2 as Node 2 (has old data)

    Note over N1,N2: 🔴 Network Partition

    C->>N1: Write X = 10
    N1-->>C: ✅ OK (written to N1 only)

    C->>N2: Read X
    N2-->>C: ✅ X = 5 (stale!)

    Note over C,N2: System accepts requests<br/>but data is inconsistent

Examples:

  • Cassandra
  • DynamoDB
  • CouchDB
  • Riak

Use when: Always responding is more important than immediate consistency (social media, caching, analytics)

CA: Consistency + Availability

Only possible in single-node systems

Without network partitions (single node or perfectly reliable network), you can have both C and A.

graph TB
    Single["Single Node Database"]
    Client["Client"]

    Client <--> Single

    Note1["No network = No partitions"]
    Note1 --> Single

    style Single fill:#fff9c4

Examples:

  • Single-node PostgreSQL
  • Single-node MongoDB
  • Traditional RDBMS on one server

Reality: In distributed systems, CA is not achievable because networks are not perfectly reliable.

Real-World CAP Examples

| System | CAP Choice | Notes |
|--------|------------|-------|
| Google Spanner | CP | External consistency, always consistent |
| Amazon DynamoDB | AP | Configurable consistency |
| Cassandra | AP | Always writable, tunable consistency |
| MongoDB | CP (default) | Configurable to AP |
| Redis Cluster | AP | Async replication |
| PostgreSQL | CA | Single-node mode |
| CockroachDB | CP | Serializability, handles partitions |
| Couchbase | AP | Cross Data Center Replication |

Consistency Models

The CAP theorem's "Consistency" is actually linearizability (strong consistency). There are many consistency models:

graph TB
    subgraph "Consistency Spectrum"
        Strong["Strong Consistency<br/>Linearizability"]
        Weak["Weak Consistency<br/>Eventual Consistency"]

        Strong --> S1["Sequential<br/>Consistency"]
        S1 --> S2["Causal<br/>Consistency"]
        S2 --> S3["Session<br/>Consistency"]
        S3 --> S4["Read Your<br/>Writes"]
        S4 --> Weak
    end

Strong Consistency Models

| Model | Description | Example |
|-------|-------------|---------|
| Linearizable | Reads always see the most recent write | Bank transfers |
| Sequential | All nodes see operations in the same order | Version control |
| Causal | Causally related operations stay ordered | Chat applications |

Weak Consistency Models

| Model | Description | Example |
|-------|-------------|---------|
| Read Your Writes | User sees their own writes | Social media profile |
| Session Consistency | Consistency within a session | Shopping cart |
| Eventual Consistency | System converges over time | DNS, CDN |
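
Read-your-writes is often enforced on the client side. Here is a minimal sketch (the Store interface and the lag bound are our own assumptions, not course code): after a write, the client routes its own reads to the leader until replication has had time to catch up.

// read-your-writes.ts - client-side sketch (hypothetical Store interface)
interface Store {
  write(key: string, value: string): Promise<void>;
  read(key: string): Promise<string | undefined>;
}

class ReadYourWritesClient {
  private lastWriteAt = 0;          // timestamp of this client's last write
  private readonly maxLagMs = 500;  // assumed upper bound on replication lag

  constructor(private leader: Store, private follower: Store) {}

  async write(key: string, value: string): Promise<void> {
    await this.leader.write(key, value);
    this.lastWriteAt = Date.now();
  }

  async read(key: string): Promise<string | undefined> {
    // Recently wrote? Read from the leader so we are guaranteed to see it.
    if (Date.now() - this.lastWriteAt < this.maxLagMs) {
      return this.leader.read(key);
    }
    // Otherwise a (possibly stale) follower read is acceptable.
    return this.follower.read(key);
  }
}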

Practical Example: Shopping Cart

Let's see how different CAP choices affect a shopping cart system:

CP Approach (Block on Partition)

sequenceDiagram
    participant U as User
    participant S as Service

    Note over U,S: 🔴 Network partition detected

    U->>S: Add item to cart
    S-->>U: ❌ Error: Service unavailable

    Note over U,S: User frustrated,<br/>but cart is always accurate

Trade-off: Lost sales, accurate cart

AP Approach (Accept Writes)

sequenceDiagram
    participant U as User
    participant S as Service

    Note over U,S: 🔴 Network partition detected

    U->>S: Add item to cart
    S-->>U: ✅ OK (written locally)

    Note over U,S: User happy,<br/>but cart might conflict

Trade-off: Happy users, possible merge conflicts later
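
When the partition heals, the AP cart has to merge its divergent replicas. One common approach, sketched below with hypothetical types (a production system would also need tombstones so removed items aren't resurrected), is a union keyed by item ID:

// cart-merge.ts - merging divergent cart replicas (illustrative sketch)
interface CartItem {
  itemId: string;
  quantity: number;
}

// Union of both replicas; on conflict, keep the larger quantity
function mergeCarts(a: CartItem[], b: CartItem[]): CartItem[] {
  const merged = new Map<string, CartItem>();
  for (const item of [...a, ...b]) {
    const existing = merged.get(item.itemId);
    if (!existing || item.quantity > existing.quantity) {
      merged.set(item.itemId, item);
    }
  }
  return Array.from(merged.values());
}

// Both sides of the partition modified the cart independently:
console.log(mergeCarts(
  [{ itemId: 'book', quantity: 1 }],
  [{ itemId: 'book', quantity: 2 }, { itemId: 'pen', quantity: 1 }],
));
// => book x2, pen x1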

The "2 of 3" Simplification

The CAP theorem is often misunderstood. The reality is more nuanced:

graph TB
    subgraph "CAP Reality"
        CAP["CAP Theorem"]

        CAP --> Misconception["You must choose<br/>exactly 2"]
        CAP --> Reality["You can have all 3<br/>in normal operation"]
        CAP --> Truth["During partition,<br/>choose C or A"]
    end

Key Insights:

  1. P is mandatory in distributed systems
  2. During normal operation, you can have C + A + P
  3. During a partition, you choose between C and A
  4. Many systems are configurable (e.g., DynamoDB)

Design Guidelines

Choose CP When:

  • ✅ Financial transactions
  • ✅ Inventory management
  • ✅ Authentication/authorization
  • ✅ Any system where stale data is unacceptable

Choose AP When:

  • ✅ Social media feeds
  • ✅ Product recommendations
  • ✅ Analytics and logging
  • ✅ Any system where availability is critical

Techniques to Balance C and A:

| Technique | Description | Example |
|-----------|-------------|---------|
| Quorum reads/writes | Require majority acknowledgment | DynamoDB |
| Tunable consistency | Let client choose per operation | Cassandra |
| Graceful degradation | Switch modes during partition | Many systems |
| Conflict resolution | Merge divergent data later | CRDTs |
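
The quorum rule is simple arithmetic: with N replicas, writes acknowledged by W nodes and reads served by R nodes are guaranteed to overlap whenever R + W > N. A tiny sketch (our own helper, not tied to any particular product):

// quorum.ts - the R + W > N overlap rule (illustrative sketch)
function quorumOverlaps(n: number, w: number, r: number): boolean {
  // At least one replica must be in both the write set and the read set
  return r + w > n;
}

// Classic configuration: N=3, W=2, R=2 - every read sees the latest write
console.log(quorumOverlaps(3, 2, 2)); // true

// N=3, W=1, R=1 - fast, but a read may miss the latest write
console.log(quorumOverlaps(3, 1, 1)); // false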

Summary

Key Takeaways

  1. CAP theorem: You can't have all three in a partition
  2. Partition tolerance is mandatory in distributed systems
  3. Real choice: Consistency vs Availability during partition
  4. Many systems offer tunable consistency levels
  5. Your use case determines the right trade-off

Check Your Understanding

  • Why is partition tolerance not optional in distributed systems?
  • Give an example where you would choose CP over AP
  • What happens to an AP system during a network partition?
  • How can quorum reads/writes help balance C and A?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What's Next

Now that we understand CAP trade-offs, let's build a simple key-value store: Store Basics

Store System Basics

Session 3, Part 3 - 35 minutes (coding demo + hands-on)

Learning Objectives

  • Understand the key-value store data model
  • Build a single-node key-value store in TypeScript
  • Build the same store in Python
  • Deploy and test the store using Docker Compose
  • Perform basic read/write operations via HTTP

What is a Key-Value Store?

A key-value store is the simplest type of database:

graph LR
    subgraph "Key-Value Store"
        KV[("Data Store")]

        K1["name"] --> V1[""Alice""]
        K2["age"] --> V2["30"]
        K3["city"] --> V3[""NYC""]
        K4["active"] --> V4["true"]

        K1 --> KV
        K2 --> KV
        K3 --> KV
        K4 --> KV
    end

Key Characteristics:

  • Simple data model: key → value
  • Fast lookups by key
  • No complex queries
  • Schema-less

Basic Operations

| Operation | Description | Example |
|-----------|-------------|---------|
| SET | Store a value for a key | SET user:1 Alice |
| GET | Retrieve a value by key | GET user:1 → "Alice" |
| DELETE | Remove a key | DELETE user:1 |

stateDiagram-v2
    [*] --> NotExists
    NotExists --> Exists: SET key
    Exists --> Exists: SET key (update)
    Exists --> NotExists: DELETE key
    Exists --> Exists: GET key (read)
    NotExists --> [*]: GET key (null)

Implementation

We'll build a simple HTTP-based key-value store with REST API endpoints.

API Design

GET    /key/{key}      - Get value by key
PUT    /key/{key}      - Set value for key
DELETE /key/{key}      - Delete key
GET    /keys           - List all keys

TypeScript Implementation

Project Structure

store-basics-ts/
├── package.json
├── tsconfig.json
├── Dockerfile
└── src/
    └── store.ts       # Complete store implementation

Complete TypeScript Code

store-basics-ts/src/store.ts

import http from 'http';

/**
 * Simple in-memory key-value store
 */
class KeyValueStore {
  private data: Map<string, any> = new Map();

  /**
   * Set a key-value pair
   */
  set(key: string, value: any): void {
    this.data.set(key, value);
    console.log(`[Store] SET ${key} = ${JSON.stringify(value)}`);
  }

  /**
   * Get a value by key
   */
  get(key: string): any {
    const value = this.data.get(key);
    console.log(`[Store] GET ${key} => ${value !== undefined ? JSON.stringify(value) : 'null'}`);
    return value;
  }

  /**
   * Delete a key
   */
  delete(key: string): boolean {
    const existed = this.data.delete(key);
    console.log(`[Store] DELETE ${key} => ${existed ? 'success' : 'not found'}`);
    return existed;
  }

  /**
   * Get all keys
   */
  keys(): string[] {
    return Array.from(this.data.keys());
  }

  /**
   * Get store statistics
   */
  stats() {
    return {
      totalKeys: this.data.size,
      keys: this.keys()
    };
  }
}

// Create the store instance
const store = new KeyValueStore();

/**
 * HTTP Server with key-value API
 */
const server = http.createServer((req, res) => {
  // Enable CORS
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Methods', 'GET, PUT, DELETE, OPTIONS');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  if (req.method === 'OPTIONS') {
    res.writeHead(200);
    res.end();
    return;
  }

  // Parse URL
  const url = new URL(req.url || '', `http://${req.headers.host}`);

  // Route: GET /keys - List all keys
  if (req.method === 'GET' && url.pathname === '/keys') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(store.stats()));
    return;
  }

  // Route: GET /key/{key} - Get value
  if (req.method === 'GET' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5); // Remove '/key/'
    const value = store.get(key);

    if (value !== undefined) {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ key, value }));
    } else {
      res.writeHead(404, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'Key not found', key }));
    }
    return;
  }

  // Route: PUT /key/{key} - Set value
  if (req.method === 'PUT' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5); // Remove '/key/'

    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const value = JSON.parse(body);
        store.set(key, value);

        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ success: true, key, value }));
      } catch (error) {
        res.writeHead(400, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ error: 'Invalid JSON' }));
      }
    });
    return;
  }

  // Route: DELETE /key/{key} - Delete key
  if (req.method === 'DELETE' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5); // Remove '/key/'
    const existed = store.delete(key);

    if (existed) {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ success: true, key }));
    } else {
      res.writeHead(404, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'Key not found', key }));
    }
    return;
  }

  // 404 - Not found
  res.writeHead(404, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ error: 'Not found' }));
});

const PORT = parseInt(process.env.PORT || '4000', 10);
server.listen(PORT, () => {
  console.log(`Key-Value Store listening on port ${PORT}`);
  console.log(`\nAvailable endpoints:`);
  console.log(`  GET    /key/{key}    - Get value by key`);
  console.log(`  PUT    /key/{key}    - Set value for key`);
  console.log(`  DELETE /key/{key}    - Delete key`);
  console.log(`  GET    /keys         - List all keys`);
});

store-basics-ts/package.json

{
  "name": "store-basics-ts",
  "version": "1.0.0",
  "description": "Simple key-value store in TypeScript",
  "main": "dist/store.js",
  "scripts": {
    "build": "tsc",
    "start": "node dist/store.js",
    "dev": "ts-node src/store.ts"
  },
  "dependencies": {},
  "devDependencies": {
    "@types/node": "^20.0.0",
    "typescript": "^5.0.0",
    "ts-node": "^10.9.0"
  }
}

store-basics-ts/tsconfig.json

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true
  },
  "include": ["src/**/*"]
}

store-basics-ts/Dockerfile

FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .
RUN npm run build

EXPOSE 4000

CMD ["npm", "start"]

Python Implementation

Project Structure

store-basics-py/
├── requirements.txt
├── Dockerfile
└── src/
    └── store.py       # Complete store implementation

Complete Python Code

store-basics-py/src/store.py

from http.server import HTTPServer, BaseHTTPRequestHandler
import json
from typing import Any, Dict
from urllib.parse import urlparse

class KeyValueStore:
    """Simple in-memory key-value store."""

    def __init__(self):
        self.data: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        """Store a key-value pair."""
        self.data[key] = value
        print(f"[Store] SET {key} = {json.dumps(value)}")

    def get(self, key: str) -> Any:
        """Get value by key."""
        value = self.data.get(key)
        print(f"[Store] GET {key} => {json.dumps(value) if value is not None else 'null'}")
        return value

    def delete(self, key: str) -> bool:
        """Delete a key."""
        existed = key in self.data
        if existed:
            del self.data[key]
        print(f"[Store] DELETE {key} => {'success' if existed else 'not found'}")
        return existed

    def keys(self) -> list:
        """Get all keys."""
        return list(self.data.keys())

    def stats(self) -> dict:
        """Get store statistics."""
        return {
            'totalKeys': len(self.data),
            'keys': self.keys()
        }


# Create the store instance
store = KeyValueStore()


class StoreHandler(BaseHTTPRequestHandler):
    """HTTP request handler for key-value store."""

    def send_json_response(self, status: int, data: dict):
        """Send a JSON response."""
        self.send_response(status)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Access-Control-Allow-Origin', '*')
        self.end_headers()
        self.wfile.write(json.dumps(data).encode())

    def do_OPTIONS(self):
        """Handle CORS preflight requests."""
        self.send_response(200)
        self.send_header('Access-Control-Allow-Origin', '*')
        self.send_header('Access-Control-Allow-Methods', 'GET, PUT, DELETE, OPTIONS')
        self.send_header('Access-Control-Allow-Headers', 'Content-Type')
        self.end_headers()

    def do_GET(self):
        """Handle GET requests."""
        parsed = urlparse(self.path)

        # GET /keys - List all keys
        if parsed.path == '/keys':
            self.send_json_response(200, store.stats())
            return

        # GET /key/{key} - Get value
        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]  # Remove '/key/'
            value = store.get(key)

            if value is not None:
                self.send_json_response(200, {'key': key, 'value': value})
            else:
                self.send_json_response(404, {'error': 'Key not found', 'key': key})
            return

        # 404
        self.send_json_response(404, {'error': 'Not found'})

    def do_PUT(self):
        """Handle PUT requests (set value)."""
        parsed = urlparse(self.path)

        # PUT /key/{key} - Set value
        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]  # Remove '/key/'

            content_length = int(self.headers.get('Content-Length', 0))
            body = self.rfile.read(content_length).decode('utf-8')

            try:
                value = json.loads(body)
                store.set(key, value)
                self.send_json_response(200, {'success': True, 'key': key, 'value': value})
            except json.JSONDecodeError:
                self.send_json_response(400, {'error': 'Invalid JSON'})
            return

        # 404
        self.send_json_response(404, {'error': 'Not found'})

    def do_DELETE(self):
        """Handle DELETE requests."""
        parsed = urlparse(self.path)

        # DELETE /key/{key} - Delete key
        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]  # Remove '/key/'
            existed = store.delete(key)

            if existed:
                self.send_json_response(200, {'success': True, 'key': key})
            else:
                self.send_json_response(404, {'error': 'Key not found', 'key': key})
            return

        # 404
        self.send_json_response(404, {'error': 'Not found'})

    def log_message(self, format, *args):
        """Suppress default logging."""
        pass


def run_server(port: int = 4000):
    """Start the HTTP server."""
    server_address = ('', port)
    httpd = HTTPServer(server_address, StoreHandler)
    print(f"Key-Value Store listening on port {port}")
    print(f"\nAvailable endpoints:")
    print(f"  GET    /key/{{key}}    - Get value by key")
    print(f"  PUT    /key/{{key}}    - Set value for key")
    print(f"  DELETE /key/{{key}}    - Delete key")
    print(f"  GET    /keys         - List all keys")
    httpd.serve_forever()


if __name__ == '__main__':
    import os
    port = int(os.environ.get('PORT', 4000))
    run_server(port)

store-basics-py/requirements.txt

# No external dependencies required - uses standard library only

store-basics-py/Dockerfile

FROM python:3.11-alpine

WORKDIR /app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 4000

CMD ["python", "src/store.py"]

Docker Compose Setup

TypeScript Version

examples/02-store/ts/docker-compose.yml

version: '3.8'

services:
  store:
    build: .
    ports:
      - "4000:4000"
    environment:
      - PORT=4000
    volumes:
      - ./src:/app/src

Python Version

examples/02-store/py/docker-compose.yml

version: '3.8'

services:
  store:
    build: .
    ports:
      - "4000:4000"
    environment:
      - PORT=4000
    volumes:
      - ./src:/app/src

Running the Example

Step 1: Start the Store

TypeScript:

cd examples/02-store/ts
docker-compose up --build

Python:

cd examples/02-store/py
docker-compose up --build

You should see:

store    | Key-Value Store listening on port 4000
store    |
store    | Available endpoints:
store    |   GET    /key/{key}    - Get value by key
store    |   PUT    /key/{key}    - Set value for key
store    |   DELETE /key/{key}    - Delete key
store    |   GET    /keys         - List all keys

Step 2: Store Some Values

# Store a string
curl -X PUT http://localhost:4000/key/name \
  -H "Content-Type: application/json" \
  -d '"Alice"'

# Store a number
curl -X PUT http://localhost:4000/key/age \
  -H "Content-Type: application/json" \
  -d '30'

# Store an object
curl -X PUT http://localhost:4000/key/user:1 \
  -H "Content-Type: application/json" \
  -d '{"name": "Alice", "age": 30, "city": "NYC"}'

# Store a list
curl -X PUT http://localhost:4000/key/tags \
  -H "Content-Type: application/json" \
  -d '["distributed", "systems", "course"]'

Step 3: Retrieve Values

# Get a string
curl http://localhost:4000/key/name
# Response: {"key":"name","value":"Alice"}

# Get a number
curl http://localhost:4000/key/age
# Response: {"key":"age","value":30}

# Get an object
curl http://localhost:4000/key/user:1
# Response: {"key":"user:1","value":{"name":"Alice","age":30,"city":"NYC"}}

# Get a list
curl http://localhost:4000/key/tags
# Response: {"key":"tags","value":["distributed","systems","course"]}

# Try to get non-existent key
curl http://localhost:4000/key/nonexistent
# Response: {"error":"Key not found","key":"nonexistent"}

Step 4: List All Keys

curl http://localhost:4000/keys
# Response: {"totalKeys":4,"keys":["name","age","user:1","tags"]}

Step 5: Delete a Key

# Delete a key
curl -X DELETE http://localhost:4000/key/age
# Response: {"success":true,"key":"age"}

# Verify it's gone
curl http://localhost:4000/key/age
# Response: {"error":"Key not found","key":"age"}

# Check remaining keys
curl http://localhost:4000/keys
# Response: {"totalKeys":3,"keys":["name","user:1","tags"]}

System Architecture

graph TB
    subgraph "Single-Node Key-Value Store"
        Client["Client Applications"]

        API["HTTP API"]

        Store[("In-Memory<br/>Data Store")]

        Client -->|"GET/PUT/DELETE"| API
        API --> Store
    end

    style Store fill:#f9f,stroke:#333,stroke-width:3px

Exercises

Exercise 1: Add TTL (Time-To-Live) Support

Modify the store to automatically expire keys after a specified time:

  1. Add an optional ttl parameter to the SET operation
  2. Track when each key should expire
  3. Return null for expired keys
  4. Implement a cleanup mechanism

Hint: Store metadata alongside values, or use a separate expiration map.
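
One way to approach this, following the hint: keep a separate expiration map and check it lazily on reads. A minimal sketch (the names are ours, not part of the course code):

// ttl-sketch.ts - lazy key expiration (one possible approach)
class TtlStore {
  private data = new Map<string, unknown>();
  private expiresAt = new Map<string, number>(); // key -> expiry time (epoch ms)

  set(key: string, value: unknown, ttlMs?: number): void {
    this.data.set(key, value);
    if (ttlMs !== undefined) {
      this.expiresAt.set(key, Date.now() + ttlMs);
    } else {
      this.expiresAt.delete(key); // plain SET clears any previous TTL
    }
  }

  get(key: string): unknown {
    const deadline = this.expiresAt.get(key);
    if (deadline !== undefined && Date.now() >= deadline) {
      // Expired: clean up lazily and report a miss
      this.data.delete(key);
      this.expiresAt.delete(key);
      return undefined;
    }
    return this.data.get(key);
  }
}

You would still want a periodic sweep (step 4 of the exercise) so expired keys that are never read again don't linger in memory.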

Exercise 2: Add Key Patterns

Add wildcard support for key lookups:

  1. Implement GET /keys?pattern=user:* to list matching keys
  2. Support simple * wildcard matching
  3. Test with patterns like user:*, *:admin, etc.
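
A simple way to support the * wildcard is to translate the pattern into a regular expression, escaping every other character. A sketch:

// pattern-sketch.ts - '*' wildcard matching for key lookups
function patternToRegex(pattern: string): RegExp {
  // Escape regex metacharacters, then turn each '*' into '.*'
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(`^${escaped.replace(/\*/g, '.*')}$`);
}

const keys = ['user:1', 'user:2', 'config:admin', 'user:admin'];
const re = patternToRegex('user:*');
console.log(keys.filter(k => re.test(k))); // ['user:1', 'user:2', 'user:admin']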

Exercise 3: Add Data Persistence

Currently data is lost when the server restarts. Add persistence:

  1. Save data to a JSON file on every write
  2. Load data from file on startup
  3. Handle concurrent writes safely
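
The simplest durable design is to rewrite a JSON snapshot after every write and load it on startup. A sketch (synchronous I/O is safe against interleaved writes in a single-threaded Node process, at the cost of blocking the event loop):

// persist-sketch.ts - JSON snapshot persistence (one possible approach)
import fs from 'fs';

const DATA_FILE = './store-data.json'; // assumed location

function loadData(): Map<string, unknown> {
  try {
    const raw = fs.readFileSync(DATA_FILE, 'utf-8');
    return new Map(Object.entries(JSON.parse(raw)));
  } catch {
    return new Map(); // first run or unreadable file: start empty
  }
}

function saveData(data: Map<string, unknown>): void {
  // Write to a temp file, then rename: a crash mid-write never
  // leaves a half-written data file behind
  const tmp = `${DATA_FILE}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(Object.fromEntries(data)));
  fs.renameSync(tmp, DATA_FILE);
}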

Summary

Key Takeaways

  1. Key-value stores are simple but powerful data storage systems
  2. Basic operations: SET, GET, DELETE
  3. HTTP API provides a simple interface for remote access
  4. Single-node stores are CA (Consistent + Available) from CAP perspective
  5. Next steps: Add replication for fault tolerance (Session 4)

Check Your Understanding

  • What are the four basic operations we implemented?
  • How does our store handle requests for non-existent keys?
  • What happens to the data when the Docker container stops?
  • Why is this single-node store "CA" in CAP terms?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What's Next

Our simple store works, but what happens when the node fails? Let's add replication: Replication (Session 4)

Replication and Leader Election

Session 4 - Full session

Learning Objectives

  • Understand why we replicate data
  • Learn single-leader vs multi-leader replication
  • Implement leader-based replication
  • Build a simple leader election mechanism
  • Deploy a 3-node replicated store

Why Replicate Data?

In our single-node store from Session 3, what happens when the node fails?

Answer: All data is lost and the system becomes unavailable.

graph LR
    subgraph "Single Node - No Fault Tolerance"
        C[Clients] --> N1["Node 1<br/>❌ FAILED"]
        style N1 fill:#f66,stroke:#333,stroke-width:3px
    end

Replication solves this by keeping copies of data on multiple nodes:

graph TB
    subgraph "Replicated Store - Fault Tolerant"
        C[Clients]

        L[Leader<br/>Node 1]

        F1[Follower<br/>Node 2]
        F2[Follower<br/>Node 3]

        C --> L
        L -->|"replicate"| F1
        L -->|"replicate"| F2
    end

    style L fill:#6f6,stroke:#333,stroke-width:3px

Benefits of Replication:

  • Fault tolerance: If one node fails, others have the data
  • Read scaling: Clients can read from any replica
  • Low latency: Place replicas closer to users
  • High availability: System continues during node failures

Replication Strategies

Single-Leader Replication

Also called: primary-replica, master-slave, active-passive

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    Note over C,F2: Write Operation
    C->>L: PUT /key/name "Alice"
    L->>L: Write to local storage
    L->>F1: Replicate: SET name = "Alice"
    L->>F2: Replicate: SET name = "Alice"
    F1->>L: ACK
    F2->>L: ACK
    L->>C: Response: Success

    Note over C,F2: Read Operation
    C->>L: GET /key/name
    L->>C: Response: "Alice"

    Note over C,F2: Or read from follower
    C->>F1: GET /key/name
    F1->>C: Response: "Alice"

Characteristics:

  • Leader handles all writes
  • Followers replicate from leader
  • Reads can go to leader or followers
  • Simple consistency model

Multi-Leader Replication

Also called: multi-master, active-active

graph TB
    subgraph "Multi-Leader Replication"
        C1[Client 1]
        C2[Client 2]

        L1[Leader 1<br/>Datacenter A]
        L2[Leader 2<br/>Datacenter B]

        F1[Follower 1]
        F2[Follower 2]

        C1 --> L1
        C2 --> L2

        L1 <-->|"resolve conflicts"| L2

        L1 --> F1
        L2 --> F2
    end

    style L1 fill:#6f6,stroke:#333,stroke-width:3px
    style L2 fill:#6f6,stroke:#333,stroke-width:3px

Characteristics:

  • Multiple nodes accept writes
  • More complex conflict resolution
  • Better for geo-distributed setups
  • We won't implement this (advanced topic)

Synchronous vs Asynchronous Replication

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F as Follower

    par Synchronous Replication
        L->>F: Replicate write
        F->>L: ACK (must wait)
        L->>C: Success (after replicas confirm)
    and Asynchronous Replication
        L->>C: Success (immediately)
        L--xF: Replicate in background
    end
| Strategy | Pros | Cons |
|----------|------|------|
| Synchronous | Strong consistency, no data loss | Slower writes, blocking |
| Asynchronous | Fast writes, non-blocking | Data loss on leader failure, stale reads |

For this course, we'll use asynchronous replication for simplicity.
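
The difference is visible in a single line of the write path: whether the leader awaits follower acknowledgments before replying to the client. A sketch (replicate is a stand-in for a real network call that resolves on ACK):

// replication-modes.ts - sync vs async write paths (illustrative sketch)
type Entry = { key: string; value: unknown };

// Stand-in for a real network call; resolves when the peer ACKs
async function replicate(peer: string, entry: Entry): Promise<void> {
  await new Promise(resolve => setTimeout(resolve, 50)); // simulated latency
  console.log(`replicated ${entry.key} to ${peer}`);
}

// Synchronous: the client waits until every follower has confirmed
async function writeSync(peers: string[], entry: Entry): Promise<void> {
  await Promise.all(peers.map(peer => replicate(peer, entry)));
  // Only now do we tell the client "success"
}

// Asynchronous: the client is acknowledged immediately
function writeAsync(peers: string[], entry: Entry): void {
  for (const peer of peers) {
    replicate(peer, entry).catch(err =>
      console.error(`replication to ${peer} failed:`, err));
  }
  // Return to the client right away - data is lost if the leader dies now
}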

Leader Election

When the leader fails, followers must elect a new leader:

stateDiagram-v2
    [*] --> Follower: Node starts
    Follower --> Candidate: No heartbeat from leader
    Candidate --> Leader: Wins election (majority votes)
    Candidate --> Follower: Loses election
    Leader --> Follower: Detects higher term/node
    Follower --> [*]: Node stops

The Bully Algorithm

A simple leader election algorithm:

  1. Detect leader failure: no heartbeat within the timeout period
  2. Start election: the detecting node sends an election message to every node with a higher ID
  3. Defer to higher IDs: any higher-ID node that responds takes over the election
  4. Become leader: a node that hears from no higher ID declares itself leader and announces it to the others

sequenceDiagram
    participant N1 as Node 1<br/>(Leader)
    participant N2 as Node 2
    participant N3 as Node 3

    Note over N1,N3: Normal Operation
    N1->>N2: Heartbeat
    N1->>N3: Heartbeat

    Note over N1,N3: Leader Fails
    N1--xN2: Heartbeat timeout!
    N1--xN3: Heartbeat timeout!

    Note over N2,N3: Election Starts
    N2->>N3: Election (my ID = 2)
    N3->>N2: OK - ID 3 > 2, I take over

    Note over N2,N3: N3 hears from no higher ID
    N3->>N2: Coordinator: I am the leader
    N3->>N2: Heartbeat

Our implementation will use an even simpler approach:

  • Lowest node ID becomes leader
  • If leader fails, next lowest becomes leader
  • No voting, just order-based selection

Implementation

TypeScript Implementation

Project Structure:

replicated-store-ts/
├── package.json
├── tsconfig.json
├── Dockerfile
├── docker-compose.yml
└── src/
    └── node.ts       # Replicated node with leader election

replicated-store-ts/src/node.ts

import http from 'http';

/**
 * Node configuration
 */
const config = {
  nodeId: process.env.NODE_ID || 'node-1',
  port: parseInt(process.env.PORT || '4000'),
  peers: (process.env.PEERS || '').split(',').filter(Boolean),
  heartbeatInterval: 2000,  // ms
  electionTimeout: 6000,     // ms
};

type NodeRole = 'leader' | 'follower' | 'candidate';

/**
 * Replicated Store Node
 */
class StoreNode {
  public nodeId: string;
  public role: NodeRole;
  public term: number;
  public data: Map<string, any>;
  public peers: string[];

  public leaderId: string | null; // public so the HTTP handlers can report the current leader
  private lastHeartbeat: number;
  private heartbeatTimer?: NodeJS.Timeout;
  private electionTimer?: NodeJS.Timeout;

  constructor(nodeId: string, peers: string[]) {
    this.nodeId = nodeId;
    this.role = 'follower';
    this.term = 0;
    this.data = new Map();
    this.peers = peers;
    this.leaderId = null;
    this.lastHeartbeat = Date.now();

    this.startElectionTimer();
    this.startHeartbeat();
  }

  /**
   * Start election timeout timer
   */
  private startElectionTimer() {
    this.electionTimer = setTimeout(() => {
      const timeSinceHeartbeat = Date.now() - this.lastHeartbeat;
      if (timeSinceHeartbeat > config.electionTimeout && this.role !== 'leader') {
        console.log(`[${this.nodeId}] Election timeout! Starting election...`);
        this.startElection();
      }
      this.startElectionTimer();
    }, config.electionTimeout);
  }

  /**
   * Start leader election (simplified: lowest live ID wins)
   */
  private async startElection() {
    this.term++;
    this.role = 'candidate';

    // Only consider peers that answer a quick health probe - a dead
    // ex-leader must not win the election
    const alivePeers: string[] = [];
    await Promise.all(this.peers.map(async peerUrl => {
      try {
        await this.pingPeer(peerUrl);
        alivePeers.push(peerUrl);
      } catch {
        // Unreachable - exclude from the election
      }
    }));

    // Peers are URLs (http://node2:4000) while IDs look like "node-2";
    // derive an ID from each hostname so they sort together
    // (assumes hostnames follow the nodeN convention from docker-compose)
    const peerId = (peerUrl: string) =>
      new URL(peerUrl).hostname.replace(/(\d+)$/, '-$1');

    const allNodes = [this.nodeId, ...alivePeers.map(peerId)].sort();
    const lowestNode = allNodes[0];

    if (this.nodeId === lowestNode) {
      this.becomeLeader();
    } else {
      this.role = 'follower';
      this.leaderId = lowestNode;
      console.log(`[${this.nodeId}] Waiting for ${lowestNode} to become leader`);
    }
  }

  /**
   * Quick liveness probe: GET /status with a short timeout
   */
  private pingPeer(peerUrl: string): Promise<void> {
    return new Promise((resolve, reject) => {
      const req = http.get(new URL('/status', peerUrl), res => {
        res.resume(); // drain the response body
        if (res.statusCode === 200) resolve();
        else reject(new Error(`Status ${res.statusCode}`));
      });
      req.setTimeout(1000, () => req.destroy(new Error('timeout')));
      req.on('error', reject);
    });
  }

  /**
   * Become the leader
   */
  private becomeLeader() {
    this.role = 'leader';
    this.leaderId = this.nodeId;
    console.log(`[${this.nodeId}] 👑 Became LEADER for term ${this.term}`);

    // Immediately replicate to followers
    this.replicateToFollowers();
  }

  /**
   * Start heartbeat to followers
   */
  private startHeartbeat() {
    this.heartbeatTimer = setInterval(() => {
      if (this.role === 'leader') {
        this.sendHeartbeat();
      }
    }, config.heartbeatInterval);
  }

  /**
   * Send heartbeat to all followers
   */
  private sendHeartbeat() {
    const heartbeat = {
      type: 'heartbeat',
      leaderId: this.nodeId,
      term: this.term,
      timestamp: Date.now(),
    };

    this.peers.forEach(peerUrl => {
      this.sendToPeer(peerUrl, '/internal/heartbeat', heartbeat)
        .catch(err => console.log(`[${this.nodeId}] Failed to send heartbeat to ${peerUrl}:`, err.message));
    });
  }

  /**
   * Replicate data to all followers
   */
  private replicateToFollowers() {
    // Convert Map to object for replication
    const dataObj = Object.fromEntries(this.data);

    this.peers.forEach(peerUrl => {
      this.sendToPeer(peerUrl, '/internal/replicate', {
        type: 'replicate',
        leaderId: this.nodeId,
        term: this.term,
        data: dataObj,
      }).catch(err => console.log(`[${this.nodeId}] Replication failed to ${peerUrl}:`, err.message));
    });
  }

  /**
   * Handle heartbeat from leader
   */
  handleHeartbeat(heartbeat: any) {
    if (heartbeat.term >= this.term) {
      this.term = heartbeat.term;
      this.lastHeartbeat = Date.now();
      this.leaderId = heartbeat.leaderId;
      // Log the step-down before changing role (checking after the
      // assignment would always be false)
      if (this.role !== 'follower') {
        console.log(`[${this.nodeId}] Stepping down to follower, term ${this.term}`);
      }
      this.role = 'follower';
    }
  }

  /**
   * Handle replication from leader
   */
  handleReplication(message: any) {
    if (message.term >= this.term) {
      this.term = message.term;
      this.leaderId = message.leaderId;
      this.role = 'follower';
      this.lastHeartbeat = Date.now();

      // Merge replicated data
      Object.entries(message.data).forEach(([key, value]) => {
        this.data.set(key, value);
      });

      console.log(`[${this.nodeId}] Replicated ${Object.keys(message.data).length} keys from leader`);
    }
  }

  /**
   * Send data to peer node
   */
  private async sendToPeer(peerUrl: string, path: string, data: any): Promise<void> {
    return new Promise((resolve, reject) => {
      const url = new URL(path, peerUrl);
      const options = {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
      };

      const req = http.request(url, options, (res) => {
        if (res.statusCode === 200) {
          resolve();
        } else {
          reject(new Error(`Status ${res.statusCode}`));
        }
      });

      req.on('error', reject);
      req.write(JSON.stringify(data));
      req.end();
    });
  }

  /**
   * Set a key-value pair (only on leader)
   */
  set(key: string, value: any): boolean {
    if (this.role !== 'leader') {
      return false;
    }

    this.data.set(key, value);
    console.log(`[${this.nodeId}] SET ${key} = ${JSON.stringify(value)}`);

    // Replicate to followers
    this.replicateToFollowers();

    return true;
  }

  /**
   * Get a value by key
   */
  get(key: string): any {
    const value = this.data.get(key);
    console.log(`[${this.nodeId}] GET ${key} => ${value !== undefined ? JSON.stringify(value) : 'null'}`);
    return value;
  }

  /**
   * Delete a key
   */
  delete(key: string): boolean {
    if (this.role !== 'leader') {
      return false;
    }

    const existed = this.data.delete(key);
    console.log(`[${this.nodeId}] DELETE ${key} => ${existed ? 'success' : 'not found'}`);

    // Replicate to followers
    this.replicateToFollowers();

    return existed;
  }

  /**
   * Get node status
   */
  getStatus() {
    return {
      nodeId: this.nodeId,
      role: this.role,
      term: this.term,
      leaderId: this.leaderId,
      totalKeys: this.data.size,
      keys: Array.from(this.data.keys()),
    };
  }
}

// Create the node
const node = new StoreNode(config.nodeId, config.peers);

/**
 * HTTP Server
 */
const server = http.createServer((req, res) => {
  res.setHeader('Content-Type', 'application/json');
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE, OPTIONS');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  if (req.method === 'OPTIONS') {
    res.writeHead(200);
    res.end();
    return;
  }

  const url = new URL(req.url || '', `http://${req.headers.host}`);

  // Route: POST /internal/heartbeat - Leader heartbeat
  if (req.method === 'POST' && url.pathname === '/internal/heartbeat') {
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const heartbeat = JSON.parse(body);
        node.handleHeartbeat(heartbeat);
        res.writeHead(200);
        res.end(JSON.stringify({ success: true }));
      } catch (error) {
        res.writeHead(400);
        res.end(JSON.stringify({ error: 'Invalid request' }));
      }
    });
    return;
  }

  // Route: POST /internal/replicate - Replication from leader
  if (req.method === 'POST' && url.pathname === '/internal/replicate') {
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const message = JSON.parse(body);
        node.handleReplication(message);
        res.writeHead(200);
        res.end(JSON.stringify({ success: true }));
      } catch (error) {
        res.writeHead(400);
        res.end(JSON.stringify({ error: 'Invalid request' }));
      }
    });
    return;
  }

  // Route: GET /status - Node status
  if (req.method === 'GET' && url.pathname === '/status') {
    res.writeHead(200);
    res.end(JSON.stringify(node.getStatus()));
    return;
  }

  // Route: GET /key/{key} - Get value
  if (req.method === 'GET' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5);
    const value = node.get(key);

    if (value !== undefined) {
      res.writeHead(200);
      res.end(JSON.stringify({ key, value, nodeRole: node.role }));
    } else {
      res.writeHead(404);
      res.end(JSON.stringify({ error: 'Key not found', key }));
    }
    return;
  }

  // Route: PUT /key/{key} - Set value (leader only)
  if (req.method === 'PUT' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5);

    if (node.role !== 'leader') {
      res.writeHead(503);
      res.end(JSON.stringify({
        error: 'Not the leader',
        currentRole: node.role,
        leaderId: node.leaderId || 'Unknown',
      }));
      return;
    }

    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const value = JSON.parse(body);
        node.set(key, value);
        res.writeHead(200);
        res.end(JSON.stringify({ success: true, key, value, leaderId: node.nodeId }));
      } catch (error) {
        res.writeHead(400);
        res.end(JSON.stringify({ error: 'Invalid JSON' }));
      }
    });
    return;
  }

  // Route: DELETE /key/{key} - Delete key (leader only)
  if (req.method === 'DELETE' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5);

    if (node.role !== 'leader') {
      res.writeHead(503);
      res.end(JSON.stringify({
        error: 'Not the leader',
        currentRole: node.role,
        leaderId: node.leaderId || 'Unknown',
      }));
      return;
    }

    const existed = node.delete(key);
    if (existed) {
      res.writeHead(200);
      res.end(JSON.stringify({ success: true, key, leaderId: node.nodeId }));
    } else {
      res.writeHead(404);
      res.end(JSON.stringify({ error: 'Key not found', key }));
    }
    return;
  }

  // 404
  res.writeHead(404);
  res.end(JSON.stringify({ error: 'Not found' }));
});

server.listen(config.port, () => {
  console.log(`[${config.nodeId}] Store Node listening on port ${config.port}`);
  console.log(`[${config.nodeId}] Peers: ${config.peers.join(', ') || 'none'}`);
  console.log(`[${config.nodeId}] Available endpoints:`);
  console.log(`  GET  /status          - Node status and role`);
  console.log(`  GET  /key/{key}       - Get value`);
  console.log(`  PUT  /key/{key}       - Set value (leader only)`);
  console.log(`  DEL  /key/{key}       - Delete key (leader only)`);
});

replicated-store-ts/package.json

{
  "name": "replicated-store-ts",
  "version": "1.0.0",
  "description": "Replicated key-value store with leader election in TypeScript",
  "main": "dist/node.js",
  "scripts": {
    "build": "tsc",
    "start": "node dist/node.js",
    "dev": "ts-node src/node.ts"
  },
  "dependencies": {},
  "devDependencies": {
    "@types/node": "^20.0.0",
    "typescript": "^5.0.0",
    "ts-node": "^10.9.0"
  }
}

replicated-store-ts/tsconfig.json

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true
  },
  "include": ["src/**/*"]
}

replicated-store-ts/Dockerfile

FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .
RUN npm run build

EXPOSE 4000

CMD ["npm", "start"]

Python Implementation

replicated-store-py/src/node.py

import os
import json
import re
import time
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler
from typing import Any, Dict, List, Optional
from urllib.parse import urlparse, parse_qs
from urllib.request import Request, urlopen
from urllib.error import URLError

class StoreNode:
    """Replicated store node with leader election."""

    def __init__(self, node_id: str, peers: List[str]):
        self.node_id = node_id
        self.role: str = 'follower'  # leader, follower, candidate
        self.term = 0
        self.data: Dict[str, Any] = {}
        self.peers = peers
        self.leader_id: Optional[str] = None
        self.last_heartbeat = time.time()

        # Configuration
        self.heartbeat_interval = 2.0  # seconds
        self.election_timeout = 6.0     # seconds

        # Start election timer
        self.start_election_timer()

        # Start heartbeat thread
        self.start_heartbeat_thread()

    def start_election_timer(self):
        """Start election timeout timer."""
        def election_timer():
            while True:
                time.sleep(1)
                time_since = time.time() - self.last_heartbeat
                if time_since > self.election_timeout and self.role != 'leader':
                    print(f"[{self.node_id}] Election timeout! Starting election...")
                    self.start_election()

        thread = threading.Thread(target=election_timer, daemon=True)
        thread.start()

    def start_election(self):
        """Start leader election (simplified: lowest live ID wins)."""
        self.term += 1
        self.role = 'candidate'

        # Only consider peers that answer a quick health probe -
        # a dead ex-leader must not win the election
        alive_peers = [p for p in self.peers if self.ping_peer(p)]

        # Peers are URLs (http://node2:4000) while IDs look like "node-2";
        # derive an ID from each hostname so they sort together
        # (assumes hostnames follow the nodeN convention from docker-compose)
        def peer_id(peer_url: str) -> str:
            host = urlparse(peer_url).hostname or peer_url
            return re.sub(r'(\d+)$', r'-\1', host)

        all_nodes = sorted([self.node_id] + [peer_id(p) for p in alive_peers])
        lowest_node = all_nodes[0]

        if self.node_id == lowest_node:
            self.become_leader()
        else:
            self.role = 'follower'
            self.leader_id = lowest_node
            print(f"[{self.node_id}] Waiting for {lowest_node} to become leader")

    def ping_peer(self, peer_url: str) -> bool:
        """Quick liveness probe: GET /status with a short timeout."""
        try:
            with urlopen(f"{peer_url}/status", timeout=1) as resp:
                return resp.status == 200
        except Exception:
            return False

    def become_leader(self):
        """Become the leader."""
        self.role = 'leader'
        self.leader_id = self.node_id
        print(f"[{self.node_id}] 👑 Became LEADER for term {self.term}")

        # Immediately replicate to followers
        self.replicate_to_followers()

    def start_heartbeat_thread(self):
        """Start heartbeat to followers."""
        def heartbeat_loop():
            while True:
                time.sleep(self.heartbeat_interval)
                if self.role == 'leader':
                    self.send_heartbeat()

        thread = threading.Thread(target=heartbeat_loop, daemon=True)
        thread.start()

    def send_heartbeat(self):
        """Send heartbeat to all followers."""
        heartbeat = {
            'type': 'heartbeat',
            'leader_id': self.node_id,
            'term': self.term,
            'timestamp': int(time.time() * 1000),
        }

        for peer in self.peers:
            try:
                self.send_to_peer(peer, '/internal/heartbeat', heartbeat)
            except Exception as e:
                print(f"[{self.node_id}] Failed to send heartbeat to {peer}: {e}")

    def replicate_to_followers(self):
        """Replicate data to all followers."""
        message = {
            'type': 'replicate',
            'leader_id': self.node_id,
            'term': self.term,
            'data': self.data,
        }

        for peer in self.peers:
            try:
                self.send_to_peer(peer, '/internal/replicate', message)
            except Exception as e:
                print(f"[{self.node_id}] Replication failed to {peer}: {e}")

    def handle_heartbeat(self, heartbeat: dict):
        """Handle heartbeat from leader."""
        if heartbeat['term'] >= self.term:
            self.term = heartbeat['term']
            self.last_heartbeat = time.time()
            self.leader_id = heartbeat['leader_id']

            if self.role != 'follower':
                print(f"[{self.node_id}] Stepping down to follower, term {self.term}")
            self.role = 'follower'

    def handle_replication(self, message: dict):
        """Handle replication from leader."""
        if message['term'] >= self.term:
            self.term = message['term']
            self.leader_id = message['leader_id']
            self.role = 'follower'
            self.last_heartbeat = time.time()

            # Merge replicated data
            self.data.update(message['data'])
            print(f"[{self.node_id}] Replicated {len(message['data'])} keys from leader")

    def send_to_peer(self, peer_url: str, path: str, data: dict) -> None:
        """Send data to peer node."""
        url = f"{peer_url}{path}"
        body = json.dumps(data).encode('utf-8')

        req = Request(url, data=body, headers={'Content-Type': 'application/json'}, method='POST')
        with urlopen(req, timeout=1) as response:
            if response.status != 200:
                raise Exception(f"Status {response.status}")

    def set(self, key: str, value: Any) -> bool:
        """Set a key-value pair (only on leader)."""
        if self.role != 'leader':
            return False

        self.data[key] = value
        print(f"[{self.node_id}] SET {key} = {json.dumps(value)}")

        # Replicate to followers
        self.replicate_to_followers()

        return True

    def get(self, key: str) -> Any:
        """Get a value by key."""
        value = self.data.get(key)
        print(f"[{self.node_id}] GET {key} => {json.dumps(value) if value is not None else 'null'}")
        return value

    def delete(self, key: str) -> bool:
        """Delete a key (only on leader)."""
        if self.role != 'leader':
            return False

        existed = key in self.data
        if existed:
            del self.data[key]

        print(f"[{self.node_id}] DELETE {key} => {'success' if existed else 'not found'}")

        # Replicate to followers
        self.replicate_to_followers()

        return existed

    def get_status(self) -> dict:
        """Get node status."""
        return {
            'node_id': self.node_id,
            'role': self.role,
            'term': self.term,
            'leader_id': self.leader_id,
            'total_keys': len(self.data),
            'keys': list(self.data.keys()),
        }


# Create the node
config = {
    'node_id': os.environ.get('NODE_ID', 'node-1'),
    'port': int(os.environ.get('PORT', '4000')),
    'peers': [p for p in os.environ.get('PEERS', '').split(',') if p],
}

node = StoreNode(config['node_id'], config['peers'])


class NodeHandler(BaseHTTPRequestHandler):
    """HTTP request handler for store node."""

    def send_json_response(self, status: int, data: dict):
        """Send a JSON response."""
        self.send_response(status)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Access-Control-Allow-Origin', '*')
        self.end_headers()
        self.wfile.write(json.dumps(data).encode())

    def do_OPTIONS(self):
        """Handle CORS preflight."""
        self.send_response(200)
        self.send_header('Access-Control-Allow-Origin', '*')
        self.send_header('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE, OPTIONS')
        self.send_header('Access-Control-Allow-Headers', 'Content-Type')
        self.end_headers()

    def do_POST(self):
        """Handle POST requests."""
        parsed = urlparse(self.path)

        # POST /internal/heartbeat
        if parsed.path == '/internal/heartbeat':
            content_length = int(self.headers.get('Content-Length', 0))
            body = self.rfile.read(content_length).decode('utf-8')

            try:
                heartbeat = json.loads(body)
                node.handle_heartbeat(heartbeat)
                self.send_json_response(200, {'success': True})
            except (json.JSONDecodeError, KeyError):
                self.send_json_response(400, {'error': 'Invalid request'})
            return

        # POST /internal/replicate
        if parsed.path == '/internal/replicate':
            content_length = int(self.headers.get('Content-Length', 0))
            body = self.rfile.read(content_length).decode('utf-8')

            try:
                message = json.loads(body)
                node.handle_replication(message)
                self.send_json_response(200, {'success': True})
            except (json.JSONDecodeError, KeyError):
                self.send_json_response(400, {'error': 'Invalid request'})
            return

        self.send_json_response(404, {'error': 'Not found'})

    def do_GET(self):
        """Handle GET requests."""
        parsed = urlparse(self.path)

        # GET /status
        if parsed.path == '/status':
            self.send_json_response(200, node.get_status())
            return

        # GET /key/{key}
        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]  # Remove '/key/'
            value = node.get(key)

            if value is not None:
                self.send_json_response(200, {'key': key, 'value': value, 'node_role': node.role})
            else:
                self.send_json_response(404, {'error': 'Key not found', 'key': key})
            return

        self.send_json_response(404, {'error': 'Not found'})

    def do_PUT(self):
        """Handle PUT requests (set value)."""
        parsed = urlparse(self.path)

        # PUT /key/{key}
        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]

            if node.role != 'leader':
                self.send_json_response(503, {
                    'error': 'Not the leader',
                    'current_role': node.role,
                    'leader_id': node.leader_id or 'Unknown',
                })
                return

            content_length = int(self.headers.get('Content-Length', 0))
            body = self.rfile.read(content_length).decode('utf-8')

            try:
                value = json.loads(body)
                node.set(key, value)
                self.send_json_response(200, {'success': True, 'key': key, 'value': value, 'leader_id': node.node_id})
            except json.JSONDecodeError:
                self.send_json_response(400, {'error': 'Invalid JSON'})
            return

        self.send_json_response(404, {'error': 'Not found'})

    def do_DELETE(self):
        """Handle DELETE requests."""
        parsed = urlparse(self.path)

        # DELETE /key/{key}
        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]

            if node.role != 'leader':
                self.send_json_response(503, {
                    'error': 'Not the leader',
                    'current_role': node.role,
                    'leader_id': node.leader_id or 'Unknown',
                })
                return

            existed = node.delete(key)
            if existed:
                self.send_json_response(200, {'success': True, 'key': key, 'leader_id': node.node_id})
            else:
                self.send_json_response(404, {'error': 'Key not found', 'key': key})
            return

        self.send_json_response(404, {'error': 'Not found'})

    def log_message(self, format, *args):
        """Suppress default logging."""
        pass


def run_server(port: int):
    """Start the HTTP server."""
    server_address = ('', port)
    httpd = HTTPServer(server_address, NodeHandler)
    print(f"[{config['node_id']}] Store Node listening on port {port}")
    print(f"[{config['node_id']}] Peers: {', '.join(config['peers']) or 'none'}")
    print(f"[{config['node_id']}] Available endpoints:")
    print(f"  GET  /status          - Node status and role")
    print(f"  GET  /key/{{key}}       - Get value")
    print(f"  PUT  /key/{{key}}       - Set value (leader only)")
    print(f"  DEL  /key/{{key}}       - Delete key (leader only)")
    httpd.serve_forever()


if __name__ == '__main__':
    run_server(config['port'])

replicated-store-py/requirements.txt

# No external dependencies - uses standard library only

replicated-store-py/Dockerfile

FROM python:3.11-alpine

WORKDIR /app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 4000

CMD ["python", "src/node.py"]

Docker Compose Setup

TypeScript Version

examples/02-store/ts/docker-compose.yml

version: '3.8'

services:
  node1:
    build: .
    container_name: store-ts-node1
    ports:
      - "4001:4000"
    environment:
      - NODE_ID=node-1
      - PORT=4000
      - PEERS=http://node2:4000,http://node3:4000
    networks:
      - store-network

  node2:
    build: .
    container_name: store-ts-node2
    ports:
      - "4002:4000"
    environment:
      - NODE_ID=node-2
      - PORT=4000
      - PEERS=http://node1:4000,http://node3:4000
    networks:
      - store-network

  node3:
    build: .
    container_name: store-ts-node3
    ports:
      - "4003:4000"
    environment:
      - NODE_ID=node-3
      - PORT=4000
      - PEERS=http://node1:4000,http://node2:4000
    networks:
      - store-network

networks:
  store-network:
    driver: bridge

Python Version

examples/02-store/py/docker-compose.yml

version: '3.8'

services:
  node1:
    build: .
    container_name: store-py-node1
    ports:
      - "4001:4000"
    environment:
      - NODE_ID=node-1
      - PORT=4000
      - PEERS=http://node2:4000,http://node3:4000
    networks:
      - store-network

  node2:
    build: .
    container_name: store-py-node2
    ports:
      - "4002:4000"
    environment:
      - NODE_ID=node-2
      - PORT=4000
      - PEERS=http://node1:4000,http://node3:4000
    networks:
      - store-network

  node3:
    build: .
    container_name: store-py-node3
    ports:
      - "4003:4000"
    environment:
      - NODE_ID=node-3
      - PORT=4000
      - PEERS=http://node1:4000,http://node2:4000
    networks:
      - store-network

networks:
  store-network:
    driver: bridge

Running the Example

Step 1: Start the 3-Node Cluster

TypeScript:

cd distributed-systems-course/examples/02-store/ts
docker-compose up --build

Python:

cd distributed-systems-course/examples/02-store/py
docker-compose up --build

You should see leader election happen automatically:

store-ts-node1 | [node-1] Store Node listening on port 4000
store-ts-node2 | [node-2] Store Node listening on port 4000
store-ts-node3 | [node-3] Store Node listening on port 4000
store-ts-node1 | [node-1] 👑 Became LEADER for term 1
store-ts-node2 | [node-2] Waiting for node-1 to become leader
store-ts-node3 | [node-3] Waiting for node-1 to become leader

Step 2: Check Node Status

# Check all nodes
curl http://localhost:4001/status
curl http://localhost:4002/status
curl http://localhost:4003/status

Response from node-1 (leader):

{
  "nodeId": "node-1",
  "role": "leader",
  "term": 1,
  "leaderId": "node-1",
  "totalKeys": 0,
  "keys": []
}

Response from node-2 (follower):

{
  "nodeId": "node-2",
  "role": "follower",
  "term": 1,
  "leaderId": "node-1",
  "totalKeys": 0,
  "keys": []
}

Step 3: Write to Leader

# Write to leader (node-1)
curl -X PUT http://localhost:4001/key/name \
  -H "Content-Type: application/json" \
  -d '"Alice"'

curl -X PUT http://localhost:4001/key/age \
  -H "Content-Type: application/json" \
  -d '30'

curl -X PUT http://localhost:4001/key/city \
  -H "Content-Type: application/json" \
  -d '"NYC"'

Response:

{
  "success": true,
  "key": "name",
  "value": "Alice",
  "leaderId": "node-1"
}

Step 4: Read from Followers

Data should be replicated to all followers:

curl http://localhost:4002/key/name
curl http://localhost:4003/key/city

Response:

{
  "key": "name",
  "value": "Alice",
  "nodeRole": "follower"
}

Step 5: Try Writing to Follower (Should Fail)

curl -X PUT http://localhost:4002/key/test \
  -H "Content-Type: application/json" \
  -d '"should fail"'

Response:

{
  "error": "Not the leader",
  "currentRole": "follower",
  "leaderId": "node-1"
}

Step 6: Simulate Leader Failure

# In a separate terminal, stop the leader
docker-compose stop node1

# Check node-2 status - should become new leader
curl http://localhost:4002/status

After a few seconds:

store-ts-node2 | [node-2] Election timeout! Starting election...
store-ts-node2 | [node-2] 👑 Became LEADER for term 2
store-ts-node3 | [node-3] Waiting for node-2 to become leader

Step 7: Write to New Leader

# Now node-2 is the leader
curl -X PUT http://localhost:4002/key/newleader \
  -H "Content-Type: application/json" \
  -d '"node-2"'

Step 8: Restart Old Leader

# Restart node-1
docker-compose start node1

# Check status - should become follower
curl http://localhost:4001/status

Response:

{
  "nodeId": "node-1",
  "role": "follower",
  "term": 2,
  "leaderId": "node-2",
  ...
}

System Architecture

graph TB
    subgraph "3-Node Replicated Store"
        Clients["Clients"]

        N1["Node 1<br/>👑 Leader"]
        N2["Node 2<br/>Follower"]
        N3["Node 3<br/>Follower"]

        Clients -->|"Write"| N1
        Clients -->|"Read"| N1
        Clients -->|"Read"| N2
        Clients -->|"Read"| N3

        N1 <-->|"Heartbeat<br/>Replication"| N2
        N1 <-->|"Heartbeat<br/>Replication"| N3
    end

    style N1 fill:#6f6,stroke:#333,stroke-width:3px

Exercises

Exercise 1: Test Fault Tolerance

  1. Start the cluster and write some data
  2. Stop different nodes one at a time
  3. Verify the system continues operating
  4. What happens when you stop 2 out of 3 nodes?

Exercise 2: Observe Replication Lag

  1. Add a small delay (e.g., 100ms) to replication (see the sketch after this list)
  2. Write data to leader
  3. Immediately read from follower
  4. What do you see? This demonstrates eventual consistency.
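
One way to add the delay, assuming the Python replication loop from Session 4 (method and variable names may differ slightly in your version):

import time

for peer in self.peers:
    time.sleep(0.1)  # 100 ms artificial replication lag (for this exercise only)
    self.send_to_peer(peer, '/internal/replicate', message)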

Exercise 3: Improve Leader Election

The current election is very simple. Try improving it (a sketch of the first idea follows the list):

  1. Add random election timeouts (like Raft)
  2. Implement actual voting (not just lowest ID)
  3. Add pre-vote to prevent disrupting current leader
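
A minimal Python sketch of the first idea; Raft draws each timeout uniformly at random so that nodes rarely time out simultaneously (the helper name here is ours, not part of the session code):

import random

def next_election_timeout(base: float = 6.0) -> float:
    """Pick a randomized election timeout in [base, 2*base), as Raft does."""
    return random.uniform(base, 2 * base)

# Hypothetical use inside the election-timer loop:
#   if time.time() - self.last_heartbeat > next_election_timeout(self.election_timeout):
#       self.start_election()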

Summary

Key Takeaways

  1. Replication copies data across multiple nodes for fault tolerance
  2. Single-leader replication is simple but all writes go through leader
  3. Leader election ensures a new leader is chosen when current leader fails
  4. Asynchronous replication is fast but can lose data on leader failure
  5. Read-your-writes consistency is NOT guaranteed when reading from followers

Trade-offs

| Approach | Pros | Cons |
|---|---|---|
| Single-leader | Simple, strong consistency | Leader is bottleneck, single point of failure |
| Multi-leader | No bottleneck, writes anywhere | Complex conflict resolution |
| Sync replication | No data loss | Slow writes, blocking |
| Async replication | Fast writes | Data loss possible, stale reads |

Check Your Understanding

  • Why do we replicate data?
  • What's the difference between leader and follower?
  • What happens when a client tries to write to a follower?
  • How does leader election work in our implementation?
  • What's the trade-off between sync and async replication?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What's Next

We have replication working, but our consistency model is basic. Let's explore consistency levels: Consistency Models (Session 5)

Consistency Models

Session 5 - Full session

Learning Objectives

  • Understand different consistency models in distributed systems
  • Learn the trade-offs between strong and eventual consistency
  • Implement configurable consistency levels in a replicated store
  • Experience the effects of consistency levels through hands-on exercises

What is Consistency?

In a replicated store, consistency defines what guarantees you have about the data you read. When data is copied across multiple nodes, you might not always see the latest write immediately.

graph TB
    subgraph "Write Happens"
        C[Client]
        L[Leader]
        C -->|"Write name = Alice"| L
    end

    subgraph "But What Do You Read?"
        F1[Follower 1<br/>name = Alice]
        F2[Follower 2<br/>name = ???]
        F3[Follower 3<br/>name = ???]

        C -->|Read| F1
        C -->|Read| F2
        C -->|Read| F3
    end

The question: If you read from a follower, will you see "Alice" or the old value?

The answer depends on your consistency model.

Consistency Spectrum

Consistency models exist on a spectrum from strongest to weakest:

graph LR
    A[Strong<br/>Consistency]
    B[Read Your Writes]
    C[Monotonic Reads]
    D[Causal Consistency]
    E[Eventual<br/>Consistency]

    A ====> B ====> C ====> D ====> E

    style A fill:#6f6
    style B fill:#9f6
    style C fill:#cf6
    style D fill:#ff6
    style E fill:#f96

Strong Consistency

Definition: Every read receives the most recent write or an error.

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F as Follower

    Note over C,F: Time flows downward

    C->>L: SET name = "Alice"
    L->>L: Write confirmed

    Note over C,F: Strong consistency requires:
    Note over C,F: Waiting for replication...

    L->>F: Replicate: name = "Alice"
    F->>L: ACK

    L->>C: Response: Success

    C->>F: GET name
    F->>C: "Alice" (always latest!)

Characteristics:

  • Readers always see the latest data
  • No stale reads possible
  • Slower performance (must wait for replication)
  • Simple mental model

When to use: Financial transactions, inventory management, critical operations

Eventual Consistency

Definition: If no new updates are made, eventually all accesses will return the last updated value.

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    Note over C,F2: Time flows downward

    C->>L: SET name = "Alice"
    L->>C: Response: Success (immediate!)

    Note over C,F2: Leader hasn't replicated yet...

    C->>F1: GET name
    F1->>C: "Alice" (replicated!)

    C->>F2: GET name
    F2->>C: "Bob" (stale!)

    Note over C,F2: A moment later...

    L->>F2: Replicate: name = "Alice"

    C->>F2: GET name
    F2->>C: "Alice" (updated!)

Characteristics:

  • Reads are fast (no waiting for replication)
  • You might see stale data
  • Eventually, all nodes converge
  • More complex mental model

When to use: Social media feeds, product recommendations, analytics

Read-Your-Writes Consistency

A middle ground: you always see your own writes, but might not see others' writes immediately.

sequenceDiagram
    participant C1 as Client 1
    participant C2 as Client 2
    participant L as Leader
    participant F as Follower

    C1->>L: SET name = "Alice"
    L->>C1: Success

    C1->>F: GET name
    Note over C1,F: Read-your-writes:<br/>C1 sees "Alice"
    F->>C1: "Alice"

    C2->>F: GET name
    Note over C2,F: Might see stale data
    F->>C2: "Bob" (stale!)
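
Read-your-writes is often enforced client-side: the client remembers its own recent writes and prefers them over possibly stale follower reads. A minimal Python sketch, where RYWClient and the leader/follower handles are hypothetical:

class RYWClient:
    """Client that overlays its own recent writes on follower reads."""

    def __init__(self):
        self.my_writes = {}  # key -> last value this client wrote

    def write(self, key, value, leader):
        leader.set(key, value)       # send the write to the leader
        self.my_writes[key] = value  # remember it locally

    def read(self, key, follower):
        if key in self.my_writes:    # always see our own write
            return self.my_writes[key]
        return follower.get(key)     # otherwise accept a possibly stale read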

The CAP Theorem Revisited

You learned about CAP in Session 4. Let's connect it to consistency:

| Combination | Consistency Model | Example Systems |
|---|---|---|
| CP | Strong consistency | ZooKeeper, etcd, MongoDB (with w:majority) |
| AP | Eventual consistency | Cassandra, DynamoDB, CouchDB |
| CA (impossible at scale) | Strong consistency | Single-node databases (RDBMS) |

Quorum-Based Consistency

A practical way to control consistency is using quorums. A quorum is a majority of nodes.

graph TB
    subgraph "3-Node Cluster"
        N1[Node 1]
        N2[Node 2]
        N3[Node 3]

        Q[Quorum = 2<br/>⌈3/2⌉ = 2]
    end

    N1 -.-> Q
    N2 -.-> Q
    N3 -.-> Q

    style Q fill:#6f6,stroke:#333,stroke-width:3px

Write Quorum (W)

Number of nodes that must acknowledge a write:

W > N/2  → Strong consistency (majority)
W = 1    → Fast but weak consistency
W = N    → Strongest but slowest

Read Quorum (R)

Number of nodes to query and compare for a read:

R + W > N  → Strong consistency guaranteed
R + W ≤ N  → Eventual consistency
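
A quick worked check of the overlap rule for N = 3 (a throwaway helper, not part of the store):

def is_strong(n: int, w: int, r: int) -> bool:
    """R + W > N means every read set overlaps every write set on >= 1 node."""
    return r + w > n

print(is_strong(3, 2, 2))  # True:  4 > 3, reads always see the latest write
print(is_strong(3, 1, 1))  # False: 2 <= 3, a read can miss the latest write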

Consistency Levels

| R + W | Consistency | Performance | Use Case |
|---|---|---|---|
| R + W = 2N (read all, write all) | Strongest | Slow | Critical data |
| R + W > N | Strong | Medium | Banking, inventory |
| R + W ≤ N | Eventual | Fast | Social media, cache |

Implementation

We'll extend our replicated store from Session 4 to support configurable consistency levels.

TypeScript Implementation

Project Structure:

consistent-store-ts/
├── package.json
├── tsconfig.json
├── Dockerfile
├── docker-compose.yml
└── src/
    └── node.ts       # Node with configurable consistency

consistent-store-ts/src/node.ts

import http from 'http';

/**
 * Node configuration
 */
const config = {
  nodeId: process.env.NODE_ID || 'node-1',
  port: parseInt(process.env.PORT || '4000'),
  peers: (process.env.PEERS || '').split(',').filter(Boolean),
  heartbeatInterval: 2000,
  electionTimeout: 6000,

  // Consistency settings
  writeQuorum: parseInt(process.env.WRITE_QUORUM || '2'),  // W
  readQuorum: parseInt(process.env.READ_QUORUM || '1'),    // R
};

type NodeRole = 'leader' | 'follower' | 'candidate';
type ConsistencyLevel = 'strong' | 'eventual' | 'read_your_writes';

/**
 * Replicated Store Node with Configurable Consistency
 */
class StoreNode {
  public nodeId: string;
  public role: NodeRole;
  public term: number;
  public data: Map<string, any>;
  public peers: string[];

  public leaderId: string | null;  // public: read by the HTTP route handlers below
  private lastHeartbeat: number;
  private heartbeatTimer?: NodeJS.Timeout;
  private electionTimer?: NodeJS.Timeout;
  private pendingWrites: Map<string, any[]>;  // For read-your-writes (simplified: never populated in this example)

  constructor(nodeId: string, peers: string[]) {
    this.nodeId = nodeId;
    this.role = 'follower';
    this.term = 0;
    this.data = new Map();
    this.peers = peers;
    this.leaderId = null;
    this.lastHeartbeat = Date.now();
    this.pendingWrites = new Map();

    this.startElectionTimer();
    this.startHeartbeat();
  }

  /**
   * Start election timeout timer
   */
  private startElectionTimer() {
    this.electionTimer = setTimeout(() => {
      const timeSinceHeartbeat = Date.now() - this.lastHeartbeat;
      if (timeSinceHeartbeat > config.electionTimeout && this.role !== 'leader') {
        console.log(`[${this.nodeId}] Election timeout! Starting election...`);
        this.startElection();
      }
      this.startElectionTimer();
    }, config.electionTimeout);
  }

  /**
   * Start leader election
   */
  private startElection() {
    this.term++;
    this.role = 'candidate';

    const allNodes = [this.nodeId, ...this.peers].sort();
    const lowestNode = allNodes[0];

    if (this.nodeId === lowestNode) {
      this.becomeLeader();
    } else {
      this.role = 'follower';
      this.leaderId = lowestNode;
      console.log(`[${this.nodeId}] Waiting for ${lowestNode} to become leader`);
    }
  }

  /**
   * Become the leader
   */
  private becomeLeader() {
    this.role = 'leader';
    this.leaderId = this.nodeId;
    console.log(`[${this.nodeId}] 👑 Became LEADER for term ${this.term}`);
    this.replicateToFollowers();
  }

  /**
   * Start heartbeat to followers
   */
  private startHeartbeat() {
    this.heartbeatTimer = setInterval(() => {
      if (this.role === 'leader') {
        this.sendHeartbeat();
      }
    }, config.heartbeatInterval);
  }

  /**
   * Send heartbeat to all followers
   */
  private sendHeartbeat() {
    const heartbeat = {
      type: 'heartbeat',
      leaderId: this.nodeId,
      term: this.term,
      timestamp: Date.now(),
    };

    this.peers.forEach(peerUrl => {
      this.sendToPeer(peerUrl, '/internal/heartbeat', heartbeat)
        .catch(err => console.log(`[${this.nodeId}] Failed heartbeat to ${peerUrl}`));
    });
  }

  /**
   * Replicate data to followers with quorum acknowledgment
   */
  private async replicateToFollowers(): Promise<boolean> {
    const dataObj = Object.fromEntries(this.data);

    // Send to all followers in parallel
    const promises = this.peers.map(peerUrl =>
      this.sendToPeer(peerUrl, '/internal/replicate', {
        type: 'replicate',
        leaderId: this.nodeId,
        term: this.term,
        data: dataObj,
      }).catch(err => {
        console.log(`[${this.nodeId}] Replication failed to ${peerUrl}`);
        return false;
      })
    );

    // Wait for all to complete
    const results = await Promise.all(promises);

    // Count successes (this node counts as 1)
    const successes = results.filter(r => r !== false).length + 1;

    // Check if we achieved write quorum
    const achievedQuorum = successes >= config.writeQuorum;
    console.log(`[${this.nodeId}] Replication: ${successes}/${this.peers.length + 1} nodes (W=${config.writeQuorum})`);

    return achievedQuorum;
  }

  /**
   * Handle heartbeat from leader
   */
  handleHeartbeat(heartbeat: any) {
    if (heartbeat.term >= this.term) {
      this.term = heartbeat.term;
      this.lastHeartbeat = Date.now();
      this.leaderId = heartbeat.leaderId;
      if (this.role !== 'follower') {
        this.role = 'follower';
      }
    }
  }

  /**
   * Handle replication from leader
   */
  handleReplication(message: any) {
    if (message.term >= this.term) {
      this.term = message.term;
      this.leaderId = message.leaderId;
      this.role = 'follower';
      this.lastHeartbeat = Date.now();

      Object.entries(message.data).forEach(([key, value]) => {
        this.data.set(key, value);
      });
    }
  }

  /**
   * Send data to peer node
   */
  private async sendToPeer(peerUrl: string, path: string, data: any): Promise<void> {
    return new Promise((resolve, reject) => {
      const url = new URL(path, peerUrl);
      const options = {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
      };

      const req = http.request(url, options, (res) => {
        if (res.statusCode === 200) {
          resolve();
        } else {
          reject(new Error(`Status ${res.statusCode}`));
        }
      });

      req.on('error', reject);
      req.write(JSON.stringify(data));
      req.end();
    });
  }

  /**
   * Set a key-value pair with quorum acknowledgment
   */
  async set(key: string, value: any): Promise<{ success: boolean; achievedQuorum: boolean }> {
    if (this.role !== 'leader') {
      return { success: false, achievedQuorum: false };
    }

    this.data.set(key, value);
    console.log(`[${this.nodeId}] SET ${key} = ${JSON.stringify(value)}`);

    // Replicate to followers
    const achievedQuorum = await this.replicateToFollowers();

    return { success: true, achievedQuorum };
  }

  /**
   * Get a value with configurable consistency
   */
  async get(key: string, consistency: ConsistencyLevel = 'eventual'): Promise<any> {
    const localValue = this.data.get(key);

    // For eventual consistency, return local value immediately
    if (consistency === 'eventual') {
      console.log(`[${this.nodeId}] GET ${key} => ${JSON.stringify(localValue)} (eventual)`);
      return localValue;
    }

    // For strong consistency, query quorum of nodes
    if (consistency === 'strong') {
      const values = await this.getFromQuorum(key);
      console.log(`[${this.nodeId}] GET ${key} => ${JSON.stringify(values.latest)} (strong from ${values.responses} nodes)`);
      return values.latest;
    }

    // For read-your-writes, check pending writes
    if (consistency === 'read_your_writes') {
      const pending = this.pendingWrites.get(key);
      const valueToReturn = pending && pending.length > 0 ? pending[pending.length - 1] : localValue;
      console.log(`[${this.nodeId}] GET ${key} => ${JSON.stringify(valueToReturn)} (read-your-writes)`);
      return valueToReturn;
    }

    return localValue;
  }

  /**
   * Query quorum of nodes and return most recent value
   */
  private async getFromQuorum(key: string): Promise<{ latest: any; responses: number }> {
    // Query all peers
    const promises = this.peers.map(peerUrl =>
      this.queryPeer(peerUrl, '/internal/get', { key })
        .then(result => ({ success: true, value: result.value, version: result.version || 0 }))
        .catch(err => {
          console.log(`[${this.nodeId}] Query failed to ${peerUrl}`);
          return { success: false, value: null, version: 0 };
        })
    );

    const results = await Promise.all(promises);

    // Add local value
    results.push({
      success: true,
      value: this.data.get(key),
      version: this.data.has(key) ? 1 : 0,
    });

    // Count successful responses
    const successful = results.filter(r => r.success);

    // Return if we have read quorum
    if (successful.length >= config.readQuorum) {
      // Return most recent value (simple version: first non-null)
      const latest = successful.find(r => r.value !== undefined)?.value;
      return { latest, responses: successful.length };
    }

    // Fallback to local value
    return { latest: this.data.get(key), responses: successful.length };
  }

  /**
   * Query a peer for a key
   */
  private async queryPeer(peerUrl: string, path: string, data: any): Promise<any> {
    return new Promise((resolve, reject) => {
      const url = new URL(path, peerUrl);
      const options = {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
      };

      const req = http.request(url, options, (res) => {
        let body = '';
        res.on('data', chunk => body += chunk);
        res.on('end', () => {
          if (res.statusCode === 200) {
            resolve(JSON.parse(body));
          } else {
            reject(new Error(`Status ${res.statusCode}`));
          }
        });
      });

      req.on('error', reject);
      req.write(JSON.stringify(data));
      req.end();
    });
  }

  /**
   * Delete a key
   */
  async delete(key: string): Promise<{ success: boolean; achievedQuorum: boolean }> {
    if (this.role !== 'leader') {
      return { success: false, achievedQuorum: false };
    }

    const existed = this.data.delete(key);
    console.log(`[${this.nodeId}] DELETE ${key}`);

    await this.replicateToFollowers();

    return { success: existed, achievedQuorum: true };
  }

  /**
   * Get node status
   */
  getStatus() {
    return {
      nodeId: this.nodeId,
      role: this.role,
      term: this.term,
      leaderId: this.leaderId,
      totalKeys: this.data.size,
      keys: Array.from(this.data.keys()),
      config: {
        writeQuorum: config.writeQuorum,
        readQuorum: config.readQuorum,
        totalNodes: this.peers.length + 1,
      },
    };
  }
}

// Create the node
const node = new StoreNode(config.nodeId, config.peers);

/**
 * HTTP Server
 */
const server = http.createServer((req, res) => {
  res.setHeader('Content-Type', 'application/json');
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE, OPTIONS');
  res.setHeader('Access-Control-Allow-Headers', 'Content-Type');

  if (req.method === 'OPTIONS') {
    res.writeHead(200);
    res.end();
    return;
  }

  const url = new URL(req.url || '', `http://${req.headers.host}`);

  // Route: POST /internal/heartbeat
  if (req.method === 'POST' && url.pathname === '/internal/heartbeat') {
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const heartbeat = JSON.parse(body);
        node.handleHeartbeat(heartbeat);
        res.writeHead(200);
        res.end(JSON.stringify({ success: true }));
      } catch (error) {
        res.writeHead(400);
        res.end(JSON.stringify({ error: 'Invalid request' }));
      }
    });
    return;
  }

  // Route: POST /internal/replicate
  if (req.method === 'POST' && url.pathname === '/internal/replicate') {
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const message = JSON.parse(body);
        node.handleReplication(message);
        res.writeHead(200);
        res.end(JSON.stringify({ success: true }));
      } catch (error) {
        res.writeHead(400);
        res.end(JSON.stringify({ error: 'Invalid request' }));
      }
    });
    return;
  }

  // Route: POST /internal/get - Internal query for quorum reads
  if (req.method === 'POST' && url.pathname === '/internal/get') {
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const { key } = JSON.parse(body);
        const value = node.data.get(key);
        res.writeHead(200);
        res.end(JSON.stringify({ value, version: value !== undefined ? 1 : 0 }));
      } catch (error) {
        res.writeHead(400);
        res.end(JSON.stringify({ error: 'Invalid request' }));
      }
    });
    return;
  }

  // Route: GET /status
  if (req.method === 'GET' && url.pathname === '/status') {
    res.writeHead(200);
    res.end(JSON.stringify(node.getStatus()));
    return;
  }

  // Route: GET /key/{key}?consistency=strong|eventual|read_your_writes
  if (req.method === 'GET' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5);
    const consistency = (url.searchParams.get('consistency') || 'eventual') as ConsistencyLevel;

    node.get(key, consistency).then(value => {
      if (value !== undefined) {
        res.writeHead(200);
        res.end(JSON.stringify({ key, value, nodeRole: node.role, consistency }));
      } else {
        res.writeHead(404);
        res.end(JSON.stringify({ error: 'Key not found', key }));
      }
    });
    return;
  }

  // Route: PUT /key/{key}
  if (req.method === 'PUT' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5);

    if (node.role !== 'leader') {
      res.writeHead(503);
      res.end(JSON.stringify({
        error: 'Not the leader',
        currentRole: node.role,
        leaderId: node.leaderId || 'Unknown',
      }));
      return;
    }

    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      try {
        const value = JSON.parse(body);
        node.set(key, value).then(result => {
          res.writeHead(200);
          res.end(JSON.stringify({
            success: result.success,
            key,
            value,
            leaderId: node.nodeId,
            achievedQuorum: result.achievedQuorum,
            writeQuorum: config.writeQuorum,
          }));
        });
      } catch (error) {
        res.writeHead(400);
        res.end(JSON.stringify({ error: 'Invalid JSON' }));
      }
    });
    return;
  }

  // Route: DELETE /key/{key}
  if (req.method === 'DELETE' && url.pathname.startsWith('/key/')) {
    const key = url.pathname.slice(5);

    if (node.role !== 'leader') {
      res.writeHead(503);
      res.end(JSON.stringify({
        error: 'Not the leader',
        currentRole: node.role,
        leaderId: node.leaderId || 'Unknown',
      }));
      return;
    }

    node.delete(key).then(result => {
      if (result.success) {
        res.writeHead(200);
        res.end(JSON.stringify({ success: true, key, leaderId: node.nodeId }));
      } else {
        res.writeHead(404);
        res.end(JSON.stringify({ error: 'Key not found', key }));
      }
    });
    return;
  }

  // 404
  res.writeHead(404);
  res.end(JSON.stringify({ error: 'Not found' }));
});

server.listen(config.port, () => {
  console.log(`[${config.nodeId}] Consistent Store listening on port ${config.port}`);
  console.log(`[${config.nodeId}] Write Quorum (W): ${config.writeQuorum}, Read Quorum (R): ${config.readQuorum}`);
  console.log(`[${config.nodeId}] Peers: ${config.peers.join(', ') || 'none'}`);
  console.log(`[${config.nodeId}] Available endpoints:`);
  console.log(`  GET  /status                         - Node status`);
  console.log(`  GET  /key/{key}?consistency=level   - Get with consistency level`);
  console.log(`  PUT  /key/{key}                      - Set value (leader only)`);
  console.log(`  DEL  /key/{key}                      - Delete key (leader only)`);
});

consistent-store-ts/package.json

{
  "name": "consistent-store-ts",
  "version": "1.0.0",
  "description": "Replicated key-value store with configurable consistency",
  "main": "dist/node.js",
  "scripts": {
    "build": "tsc",
    "start": "node dist/node.js",
    "dev": "ts-node src/node.ts"
  },
  "dependencies": {},
  "devDependencies": {
    "@types/node": "^20.0.0",
    "typescript": "^5.0.0",
    "ts-node": "^10.9.0"
  }
}

consistent-store-ts/tsconfig.json

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true
  },
  "include": ["src/**/*"]
}

consistent-store-ts/Dockerfile

FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .
RUN npm run build

EXPOSE 4000

CMD ["npm", "start"]

Python Implementation

consistent-store-py/src/node.py

import os
import json
import time
import threading
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler
from typing import Any, Dict, List, Optional, Literal
from urllib.parse import urlparse, parse_qs
from urllib.request import Request, urlopen
from urllib.error import URLError

ConsistencyLevel = Literal['strong', 'eventual', 'read_your_writes']

class StoreNode:
    """Replicated store node with configurable consistency."""

    def __init__(self, node_id: str, peers: List[str]):
        self.node_id = node_id
        self.role: str = 'follower'
        self.term = 0
        self.data: Dict[str, Any] = {}
        self.peers = peers
        self.leader_id: Optional[str] = None
        self.last_heartbeat = time.time()
        self.pending_writes: Dict[str, List[Any]] = {}  # for read-your-writes (simplified: never populated in this example)

        # Configuration
        self.heartbeat_interval = 2.0
        self.election_timeout = 6.0
        self.write_quorum = int(os.environ.get('WRITE_QUORUM', '2'))
        self.read_quorum = int(os.environ.get('READ_QUORUM', '1'))

        # Start timers
        self.start_election_timer()
        self.start_heartbeat_thread()

    def start_election_timer(self):
        """Start election timeout timer."""
        def election_timer():
            while True:
                time.sleep(1)
                time_since = time.time() - self.last_heartbeat
                if time_since > self.election_timeout and self.role != 'leader':
                    print(f"[{self.node_id}] Election timeout! Starting election...")
                    self.start_election()

        thread = threading.Thread(target=election_timer, daemon=True)
        thread.start()

    def start_election(self):
        """Start leader election."""
        self.term += 1
        self.role = 'candidate'

        all_nodes = sorted([self.node_id] + self.peers)
        lowest_node = all_nodes[0]

        if self.node_id == lowest_node:
            self.become_leader()
        else:
            self.role = 'follower'
            self.leader_id = lowest_node
            print(f"[{self.node_id}] Waiting for {lowest_node} to become leader")

    def become_leader(self):
        """Become the leader."""
        self.role = 'leader'
        self.leader_id = self.node_id
        print(f"[{self.node_id}] 👑 Became LEADER for term {self.term}")
        self.replicate_to_followers()

    def start_heartbeat_thread(self):
        """Start heartbeat to followers."""
        def heartbeat_loop():
            while True:
                time.sleep(self.heartbeat_interval)
                if self.role == 'leader':
                    self.send_heartbeat()

        thread = threading.Thread(target=heartbeat_loop, daemon=True)
        thread.start()

    def send_heartbeat(self):
        """Send heartbeat to all followers."""
        heartbeat = {
            'type': 'heartbeat',
            'leader_id': self.node_id,
            'term': self.term,
            'timestamp': int(time.time() * 1000),
        }

        for peer in self.peers:
            try:
                self.send_to_peer(peer, '/internal/heartbeat', heartbeat)
            except Exception as e:
                print(f"[{self.node_id}] Failed heartbeat to {peer}: {e}")

    def replicate_to_followers(self) -> bool:
        """Replicate data to followers and check quorum."""
        message = {
            'type': 'replicate',
            'leader_id': self.node_id,
            'term': self.term,
            'data': self.data,
        }

        successes = 1  # This node counts

        for peer in self.peers:
            try:
                self.send_to_peer(peer, '/internal/replicate', message)
                successes += 1
            except Exception as e:
                print(f"[{self.node_id}] Replication failed to {peer}: {e}")

        achieved_quorum = successes >= self.write_quorum
        print(f"[{self.node_id}] Replication: {successes}/{len(self.peers) + 1} nodes (W={self.write_quorum})")

        return achieved_quorum

    def handle_heartbeat(self, heartbeat: dict):
        """Handle heartbeat from leader."""
        if heartbeat['term'] >= self.term:
            self.term = heartbeat['term']
            self.last_heartbeat = time.time()
            self.leader_id = heartbeat['leader_id']
            if self.role != 'follower':
                self.role = 'follower'

    def handle_replication(self, message: dict):
        """Handle replication from leader."""
        if message['term'] >= self.term:
            self.term = message['term']
            self.leader_id = message['leader_id']
            self.role = 'follower'
            self.last_heartbeat = time.time()
            self.data.update(message['data'])

    def send_to_peer(self, peer_url: str, path: str, data: dict) -> None:
        """Send data to peer node."""
        url = f"{peer_url}{path}"
        body = json.dumps(data).encode('utf-8')

        req = Request(url, data=body, headers={'Content-Type': 'application/json'}, method='POST')
        with urlopen(req, timeout=1) as response:
            if response.status != 200:
                raise Exception(f"Status {response.status}")

    def set(self, key: str, value: Any) -> Dict[str, Any]:
        """Set a key-value pair with quorum acknowledgment."""
        if self.role != 'leader':
            return {'success': False, 'achieved_quorum': False}

        self.data[key] = value
        print(f"[{self.node_id}] SET {key} = {json.dumps(value)}")

        achieved_quorum = self.replicate_to_followers()

        return {'success': True, 'achieved_quorum': achieved_quorum}

    def get(self, key: str, consistency: ConsistencyLevel = 'eventual') -> Any:
        """Get a value with configurable consistency."""
        local_value = self.data.get(key)

        if consistency == 'eventual':
            print(f"[{self.node_id}] GET {key} => {json.dumps(local_value)} (eventual)")
            return local_value

        if consistency == 'strong':
            latest, responses = self.get_from_quorum(key)
            print(f"[{self.node_id}] GET {key} => {json.dumps(latest)} (strong from {responses} nodes)")
            return latest

        if consistency == 'read_your_writes':
            pending = self.pending_writes.get(key, [])
            value_to_return = pending[-1] if pending else local_value
            print(f"[{self.node_id}] GET {key} => {json.dumps(value_to_return)} (read-your-writes)")
            return value_to_return

        return local_value

    def get_from_quorum(self, key: str) -> tuple:
        """Query quorum of nodes and return most recent value."""
        results = []

        # Query all peers
        for peer in self.peers:
            try:
                result = self.query_peer(peer, '/internal/get', {'key': key})
                results.append({
                    'success': True,
                    'value': result.get('value'),
                    'version': result.get('version', 0),
                })
            except Exception as e:
                print(f"[{self.node_id}] Query failed to {peer}: {e}")
                results.append({'success': False, 'value': None, 'version': 0})

        # Add local value
        results.append({
            'success': True,
            'value': self.data.get(key),
            'version': 1 if key in self.data else 0,
        })

        # Filter successful responses
        successful = [r for r in results if r['success']]

        if len(successful) >= self.read_quorum:
            # Return first non-null value
            for r in successful:
                if r['value'] is not None:
                    return r['value'], len(successful)

        return self.data.get(key), len(successful)

    def query_peer(self, peer_url: str, path: str, data: dict) -> dict:
        """Query a peer for a key."""
        url = f"{peer_url}{path}"
        body = json.dumps(data).encode('utf-8')

        req = Request(url, data=body, headers={'Content-Type': 'application/json'}, method='POST')
        with urlopen(req, timeout=1) as response:
            if response.status == 200:
                return json.loads(response.read().decode('utf-8'))
            raise Exception(f"Status {response.status}")

    def delete(self, key: str) -> Dict[str, Any]:
        """Delete a key."""
        if self.role != 'leader':
            return {'success': False, 'achieved_quorum': False}

        existed = key in self.data
        if existed:
            del self.data[key]

        print(f"[{self.node_id}] DELETE {key}")
        self.replicate_to_followers()

        return {'success': existed, 'achieved_quorum': True}

    def get_status(self) -> dict:
        """Get node status."""
        return {
            'node_id': self.node_id,
            'role': self.role,
            'term': self.term,
            'leader_id': self.leader_id,
            'total_keys': len(self.data),
            'keys': list(self.data.keys()),
            'config': {
                'write_quorum': self.write_quorum,
                'read_quorum': self.read_quorum,
                'total_nodes': len(self.peers) + 1,
            },
        }


# Create the node
config = {
    'node_id': os.environ.get('NODE_ID', 'node-1'),
    'port': int(os.environ.get('PORT', '4000')),
    'peers': [p for p in os.environ.get('PEERS', '').split(',') if p],
}

node = StoreNode(config['node_id'], config['peers'])


class NodeHandler(BaseHTTPRequestHandler):
    """HTTP request handler for store node."""

    def send_json_response(self, status: int, data: dict):
        """Send a JSON response."""
        self.send_response(status)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Access-Control-Allow-Origin', '*')
        self.end_headers()
        self.wfile.write(json.dumps(data).encode())

    def do_OPTIONS(self):
        """Handle CORS preflight."""
        self.send_response(200)
        self.send_header('Access-Control-Allow-Origin', '*')
        self.send_header('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE, OPTIONS')
        self.send_header('Access-Control-Allow-Headers', 'Content-Type')
        self.end_headers()

    def do_POST(self):
        """Handle POST requests."""
        parsed = urlparse(self.path)

        if parsed.path == '/internal/heartbeat':
            content_length = int(self.headers.get('Content-Length', 0))
            body = self.rfile.read(content_length).decode('utf-8')
            try:
                heartbeat = json.loads(body)
                node.handle_heartbeat(heartbeat)
                self.send_json_response(200, {'success': True})
            except (json.JSONDecodeError, KeyError):
                self.send_json_response(400, {'error': 'Invalid request'})
            return

        if parsed.path == '/internal/replicate':
            content_length = int(self.headers.get('Content-Length', 0))
            body = self.rfile.read(content_length).decode('utf-8')
            try:
                message = json.loads(body)
                node.handle_replication(message)
                self.send_json_response(200, {'success': True})
            except (json.JSONDecodeError, KeyError):
                self.send_json_response(400, {'error': 'Invalid request'})
            return

        if parsed.path == '/internal/get':
            content_length = int(self.headers.get('Content-Length', 0))
            body = self.rfile.read(content_length).decode('utf-8')
            try:
                data = json.loads(body)
                key = data.get('key')
                value = node.data.get(key)
                self.send_json_response(200, {'value': value, 'version': 1 if value is not None else 0})
            except (json.JSONDecodeError, KeyError):
                self.send_json_response(400, {'error': 'Invalid request'})
            return

        self.send_json_response(404, {'error': 'Not found'})

    def do_GET(self):
        """Handle GET requests."""
        parsed = urlparse(self.path)

        if parsed.path == '/status':
            self.send_json_response(200, node.get_status())
            return

        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]
            params = parse_qs(parsed.query)
            consistency = params.get('consistency', ['eventual'])[0]

            if consistency not in ('strong', 'eventual', 'read_your_writes'):
                consistency = 'eventual'

            value = node.get(key, consistency)
            if value is not None:
                self.send_json_response(200, {
                    'key': key,
                    'value': value,
                    'node_role': node.role,
                    'consistency': consistency,
                })
            else:
                self.send_json_response(404, {'error': 'Key not found', 'key': key})
            return

        self.send_json_response(404, {'error': 'Not found'})

    def do_PUT(self):
        """Handle PUT requests."""
        parsed = urlparse(self.path)

        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]

            if node.role != 'leader':
                self.send_json_response(503, {
                    'error': 'Not the leader',
                    'current_role': node.role,
                    'leader_id': node.leader_id or 'Unknown',
                })
                return

            content_length = int(self.headers.get('Content-Length', 0))
            body = self.rfile.read(content_length).decode('utf-8')

            try:
                value = json.loads(body)
                result = node.set(key, value)
                self.send_json_response(200, {
                    'success': result['success'],
                    'key': key,
                    'value': value,
                    'leader_id': node.node_id,
                    'achieved_quorum': result['achieved_quorum'],
                    'write_quorum': node.write_quorum,
                })
            except json.JSONDecodeError:
                self.send_json_response(400, {'error': 'Invalid JSON'})
            return

        self.send_json_response(404, {'error': 'Not found'})

    def do_DELETE(self):
        """Handle DELETE requests."""
        parsed = urlparse(self.path)

        if parsed.path.startswith('/key/'):
            key = parsed.path[5:]

            if node.role != 'leader':
                self.send_json_response(503, {
                    'error': 'Not the leader',
                    'current_role': node.role,
                    'leader_id': node.leader_id or 'Unknown',
                })
                return

            result = node.delete(key)
            if result['success']:
                self.send_json_response(200, {'success': True, 'key': key, 'leader_id': node.node_id})
            else:
                self.send_json_response(404, {'error': 'Key not found', 'key': key})
            return

        self.send_json_response(404, {'error': 'Not found'})

    def log_message(self, format, *args):
        """Suppress default logging."""
        pass


def run_server(port: int):
    """Start the HTTP server."""
    server_address = ('', port)
    # Threaded server: a slow quorum read must not block heartbeat handling
    httpd = ThreadingHTTPServer(server_address, NodeHandler)
    print(f"[{config['node_id']}] Consistent Store listening on port {port}")
    print(f"[{config['node_id']}] Write Quorum (W): {node.write_quorum}, Read Quorum (R): {node.read_quorum}")
    print(f"[{config['node_id']}] Peers: {', '.join(config['peers']) or 'none'}")
    print(f"[{config['node_id']}] Available endpoints:")
    print(f"  GET  /status                         - Node status")
    print(f"  GET  /key/{{key}}?consistency=level   - Get with consistency level")
    print(f"  PUT  /key/{{key}}                      - Set value (leader only)")
    print(f"  DEL  /key/{{key}}                      - Delete key (leader only)")
    httpd.serve_forever()


if __name__ == '__main__':
    run_server(config['port'])

consistent-store-py/requirements.txt

# No external dependencies - uses standard library only

consistent-store-py/Dockerfile

FROM python:3.11-alpine

WORKDIR /app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 4000

CMD ["python", "src/node.py"]

Docker Compose Setup

TypeScript Version

examples/03-consistent-store/ts/docker-compose.yml

version: '3.8'

services:
  node1:
    build: .
    container_name: consistent-ts-node1
    ports:
      - "4001:4000"
    environment:
      - NODE_ID=node-1
      - PORT=4000
      - PEERS=http://node2:4000,http://node3:4000
      - WRITE_QUORUM=2
      - READ_QUORUM=1
    networks:
      - consistent-network

  node2:
    build: .
    container_name: consistent-ts-node2
    ports:
      - "4002:4000"
    environment:
      - NODE_ID=node-2
      - PORT=4000
      - PEERS=http://node1:4000,http://node3:4000
      - WRITE_QUORUM=2
      - READ_QUORUM=1
    networks:
      - consistent-network

  node3:
    build: .
    container_name: consistent-ts-node3
    ports:
      - "4003:4000"
    environment:
      - NODE_ID=node-3
      - PORT=4000
      - PEERS=http://node1:4000,http://node2:4000
      - WRITE_QUORUM=2
      - READ_QUORUM=1
    networks:
      - consistent-network

networks:
  consistent-network:
    driver: bridge

Python Version

examples/03-consistent-store/py/docker-compose.yml

version: '3.8'

services:
  node1:
    build: .
    container_name: consistent-py-node1
    ports:
      - "4001:4000"
    environment:
      - NODE_ID=node-1
      - PORT=4000
      - PEERS=http://node2:4000,http://node3:4000
      - WRITE_QUORUM=2
      - READ_QUORUM=1
    networks:
      - consistent-network

  node2:
    build: .
    container_name: consistent-py-node2
    ports:
      - "4002:4000"
    environment:
      - NODE_ID=node-2
      - PORT=4000
      - PEERS=http://node1:4000,http://node3:4000
      - WRITE_QUORUM=2
      - READ_QUORUM=1
    networks:
      - consistent-network

  node3:
    build: .
    container_name: consistent-py-node3
    ports:
      - "4003:4000"
    environment:
      - NODE_ID=node-3
      - PORT=4000
      - PEERS=http://node1:4000,http://node2:4000
      - WRITE_QUORUM=2
      - READ_QUORUM=1
    networks:
      - consistent-network

networks:
  consistent-network:
    driver: bridge

Running the Example

Step 1: Start the Cluster

TypeScript:

cd distributed-systems-course/examples/03-consistent-store/ts
docker-compose up --build

Python:

cd distributed-systems-course/examples/03-consistent-store/py
docker-compose up --build

You should see:

consistent-ts-node1 | [node-1] 👑 Became LEADER for term 1
consistent-ts-node1 | [node-1] Write Quorum (W): 2, Read Quorum (R): 1
consistent-ts-node2 | [node-2] Waiting for node-1 to become leader
consistent-ts-node3 | [node-3] Waiting for node-1 to become leader

Step 2: Test Eventual Consistency (Default)

# Write to leader
curl -X PUT http://localhost:4001/key/name \
  -H "Content-Type: application/json" \
  -d '"Alice"'

# Immediately read from follower (eventual consistency)
curl http://localhost:4002/key/name

You might see:

  • Immediately after the write: a 404 "Key not found" response (the follower hasn't received the replication yet)
  • A moment later: "Alice" (the follower has caught up)

Step 3: Test Strong Consistency

# Read with strong consistency (waits for quorum)
curl "http://localhost:4002/key/name?consistency=strong"

This queries multiple nodes and returns the latest confirmed value.
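
Based on the GET handler above, the response also echoes the consistency level that was used:

{
  "key": "name",
  "value": "Alice",
  "nodeRole": "follower",
  "consistency": "strong"
}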

Step 4: Observe Quorum Behavior

Check the status to see your quorum settings:

curl http://localhost:4001/status

Response:

{
  "nodeId": "node-1",
  "role": "leader",
  "config": {
    "writeQuorum": 2,
    "readQuorum": 1,
    "totalNodes": 3
  }
}

Step 5: Test Different Quorum Settings

Stop the cluster (Ctrl+C or docker-compose down) and modify the environment variables:

Try W=3 (Strongest):

environment:
  - WRITE_QUORUM=3
  - READ_QUORUM=1

Try W=1 (Weakest):

environment:
  - WRITE_QUORUM=1
  - READ_QUORUM=1

Observe how the system behaves differently with each setting.
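
For example, with W=3 the cluster can no longer reach write quorum once any node is down. Based on the PUT handler above, the write still succeeds locally but reports the missed quorum:

# With WRITE_QUORUM=3, stop one node and write
docker-compose stop node3

curl -X PUT http://localhost:4001/key/test \
  -H "Content-Type: application/json" \
  -d '"value"'

# Expect "achievedQuorum": false and "writeQuorum": 3 in the response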

Consistency Comparison

graph TB
    subgraph "Same Data, Different Consistency Levels"
        W[Write: name = Alice]

        subgraph "Strong Consistency<br/>Slow but Accurate"
            S1[Node 1: Alice]
            S2[Node 2: Alice]
            S3[Node 3: Alice]
            R1[Read → Alice]
        end

        subgraph "Eventual Consistency<br/>Fast but Maybe Stale"
            E1[Node 1: Alice]
            E2[Node 2: Bob]
            E3[Node 3: ???]
            R2[Read → Bob or ???]
        end
    end

    W --> S1
    W --> S2
    W --> S3
    W --> E1
    W -.->|delayed| E2
    W -.->|delayed| E3

    style R1 fill:#6f6
    style R2 fill:#f96

Exercises

Exercise 1: Experience Eventual Consistency

  1. Start the cluster
  2. Write a value to the leader
  3. Immediately read from a follower (within 100ms)
  4. What do you see? Is it the new value or old?

Exercise 2: Compare Consistency Levels

Write a script that does the following (a Python sketch follows the list):

  1. Sets a key to a new value
  2. Immediately reads it with consistency=eventual
  3. Immediately reads it with consistency=strong
  4. Compare the results
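
A starting point using only the Python standard library (the ports and the key name are assumptions based on the compose file above):

import json
import urllib.request

def put(port: int, key: str, value) -> dict:
    """Write a key through the node on the given port (must be the leader)."""
    req = urllib.request.Request(
        f"http://localhost:{port}/key/{key}",
        data=json.dumps(value).encode(),
        headers={'Content-Type': 'application/json'},
        method='PUT',
    )
    return json.load(urllib.request.urlopen(req))

def get(port: int, key: str, consistency: str) -> dict:
    """Read a key with the given consistency level (a 404 raises HTTPError)."""
    url = f"http://localhost:{port}/key/{key}?consistency={consistency}"
    return json.load(urllib.request.urlopen(url))

print(put(4001, 'color', 'blue'))      # write via the leader
print(get(4002, 'color', 'eventual'))  # local read on a follower
print(get(4002, 'color', 'strong'))    # quorum read on the same follower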

Exercise 3: Adjust Quorum for Different Use Cases

For each scenario, what quorum settings would you choose?

| Scenario | W (Write) | R (Read) | R + W | Consistency | Why? |
|---|---|---|---|---|---|
| Bank balance transfer | ? | ? | ? | ? | ? |
| Social media like | ? | ? | ? | ? | ? |
| Shopping cart | ? | ? | ? | ? | ? |
| User profile view | ? | ? | ? | ? | ? |

Exercise 4: Implement Read Repair

When a stale read is detected, update the stale node with the latest value. Hint: In the strong consistency read, if you find a newer value on one node, send it to nodes with older values.
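
One naive way to bolt read repair onto the Python get_from_quorum above; it re-queries every peer, so treat it as a sketch rather than an efficient design:

def get_from_quorum_with_repair(self, key):
    """Quorum read that pushes the winning value back to stale peers."""
    latest, responses = self.get_from_quorum(key)
    for peer in self.peers:
        try:
            result = self.query_peer(peer, '/internal/get', {'key': key})
            if result.get('value') != latest:
                # Repair: re-send the latest value via the replication endpoint
                self.send_to_peer(peer, '/internal/replicate', {
                    'type': 'replicate',
                    'leader_id': self.leader_id,
                    'term': self.term,
                    'data': {key: latest},
                })
        except Exception:
            pass  # unreachable peers catch up through normal replication
    return latest, responses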

Summary

Key Takeaways

  1. Consistency is a spectrum from strong to eventual
  2. Strong consistency = always see latest data, but slower
  3. Eventual consistency = fast reads, but might see stale data
  4. Quorum configuration (W + R) controls consistency level:
    • R + W > N → Strong consistency
    • R + W ≤ N → Eventual consistency
  5. Trade-off: You can't have both strong consistency AND high availability (CAP theorem)

Consistency Decision Tree

Need to read latest data immediately?
├─ Yes → Use strong consistency (R + W > N)
│  └─ Accept slower performance
└─ No → Use eventual consistency (R + W ≤ N)
   └─ Get faster reads, accept some staleness

Real-World Examples

| System | Default Consistency | Configurable? |
|---|---|---|
| DynamoDB | Eventually consistent | Yes (ConsistentRead parameter) |
| Cassandra | Eventually consistent | Yes (CONSISTENCY level) |
| MongoDB | Strong (w:majority) | Yes (writeConcern, readConcern) |
| CouchDB | Eventually consistent | Yes (r, w parameters) |
| etcd | Strong | No (always strong) |

Check Your Understanding

  • What's the difference between strong and eventual consistency?
  • How does quorum configuration (R, W) affect consistency?
  • When would you choose eventual consistency over strong?
  • What does R + W > N guarantee?
  • Why can't we have both strong consistency and high availability during partitions?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What's Next

We've built a replicated store with configurable consistency. Now let's add real-time communication: WebSockets (Session 6)

WebSockets

Session 6, Part 1 - 20 minutes

Learning Objectives

  • Understand the WebSocket protocol and its advantages over HTTP
  • Learn the WebSocket connection lifecycle
  • Implement WebSocket servers and clients in TypeScript and Python
  • Handle connection management and error scenarios

Introduction

In previous sessions, we built systems using HTTP—a request-response protocol. The client asks, the server answers. But what if we need real-time, bidirectional communication?

Enter WebSockets: a protocol that enables full-duplex communication over a single TCP connection.

sequenceDiagram
    participant Client
    participant Server

    Note over Client,Server: HTTP Request-Response (Traditional)
    Client->>Server: GET /data
    Server-->>Client: Response
    Client->>Server: GET /data
    Server-->>Client: Response

    Note over Client,Server: WebSocket (Real-Time)
    Client->>Server: HTTP Upgrade Request
    Server-->>Client: 101 Switching Protocols
    Client->>Server: Message 1
    Server-->>Client: Message 2
    Client->>Server: Message 3
    Server-->>Client: Message 4
    Client->>Server: Message 5

WebSocket vs HTTP

| Aspect | HTTP | WebSocket |
|---|---|---|
| Communication | Half-duplex (request-response) | Full-duplex (bidirectional) |
| Connection | New connection per request | Persistent connection |
| Latency | Higher (headers on every request) | Lower (lightweight frames) |
| State | Stateless | Stateful connection |
| Server Push | Requires polling/SSE | Native push support |

When to Use WebSockets

Great for:

  • Chat applications
  • Real-time collaboration (editing, gaming)
  • Live dashboards and monitoring
  • Multiplayer games

Not ideal for:

  • Simple CRUD operations (use REST)
  • One-time data fetching
  • Stateless resource access

The WebSocket Protocol

Connection Handshake

WebSockets start as HTTP, then upgrade to the WebSocket protocol:

stateDiagram-v2
    [*] --> HTTP: Client sends HTTP request
    HTTP --> Handshake: Server receives
    Handshake --> WebSocket: 101 Switching Protocols
    WebSocket --> Connected: Full-duplex established
    Connected --> Messaging: Send/receive frames
    Messaging --> Closing: Close frame sent
    Closing --> [*]: Connection terminated

HTTP Request (Upgrade):

GET /chat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

HTTP Response (Accept):

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
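
The Sec-WebSocket-Accept value is derived, not random: RFC 6455 says the server appends a fixed GUID to the client's key, SHA-1 hashes the result, and Base64-encodes the digest. A quick Python check against the handshake above:

import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed GUID from RFC 6455

def accept_key(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value for a handshake."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=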

Frame Structure

WebSocket messages are sent as frames, not HTTP packets:

+--------+--------+--------+--------+     +--------+
| FIN    | RSV1-3 | Opcode | Mask   | ... | Payload|
| 1 bit  | 3 bits | 4 bits | 1 bit  |     |        |
+--------+--------+--------+--------+     +--------+

Common Opcodes:
- 0x1: Text frame
- 0x2: Binary frame
- 0x8: Close connection
- 0x9: Ping
- 0xA: Pong
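
A minimal Python sketch of decoding the first two header bytes of a frame; extended lengths (126/127) and the 4-byte masking key follow in later bytes:

def decode_frame_header(b0: int, b1: int) -> dict:
    """Decode the first two bytes of a WebSocket frame header."""
    return {
        'fin':    bool(b0 & 0x80),  # final fragment of the message?
        'opcode': b0 & 0x0F,        # 0x1 text, 0x2 binary, 0x8 close, 0x9 ping, 0xA pong
        'masked': bool(b1 & 0x80),  # client-to-server frames must be masked
        'length': b1 & 0x7F,        # 126/127 signal an extended length field
    }

print(decode_frame_header(0x81, 0x85))  # FIN text frame, masked, 5-byte payload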

WebSocket Lifecycle

stateDiagram-v2
    [*] --> Connecting: ws://localhost:8080
    Connecting --> Open: Handshake complete (101)
    Open --> Message: Send/Receive data
    Message --> Open: Continue
    Open --> Closing: Normal close or error
    Closing --> Closed: TCP connection ends
    Closed --> [*]

    note right of Connecting
        Client sends HTTP Upgrade
        Server responds with 101
    end note

    note right of Message
        Full-duplex messaging
        No overhead per message
    end note

    note right of Closing
        Close frame exchange
        Graceful shutdown
    end note

Implementation: TypeScript

We'll use the ws library—the de facto standard for WebSockets in Node.js.

Server Implementation

// examples/03-chat/ts/ws-server.ts
import { WebSocketServer, WebSocket } from 'ws';

interface ChatMessage {
  type: 'message' | 'join' | 'leave';
  username: string;
  content: string;
  timestamp: number;
}

const wss = new WebSocketServer({ port: 8080 });

const clients = new Map<WebSocket, string>();

console.log('WebSocket server running on ws://localhost:8080');

wss.on('connection', (ws: WebSocket) => {
  console.log('New client connected');

  // Track liveness for the heartbeat check at the bottom of this file
  (ws as any).isAlive = true;
  ws.on('pong', () => { (ws as any).isAlive = true; });

  // Welcome message
  ws.send(JSON.stringify({
    type: 'message',
    username: 'System',
    content: 'Welcome! Please identify yourself.',
    timestamp: Date.now()
  } as ChatMessage));

  // Handle incoming messages
  ws.on('message', (data: Buffer) => {
    try {
      const message: ChatMessage = JSON.parse(data.toString());

      if (message.type === 'join') {
        // Register username
        clients.set(ws, message.username);
        console.log(`${message.username} joined`);

        // Broadcast to all clients
        broadcast({
          type: 'message',
          username: 'System',
          content: `${message.username} has joined the chat`,
          timestamp: Date.now()
        });
      } else if (message.type === 'message') {
        const username = clients.get(ws) || 'Anonymous';
        console.log(`${username}: ${message.content}`);

        // Broadcast the message
        broadcast({
          type: 'message',
          username,
          content: message.content,
          timestamp: Date.now()
        });
      }
    } catch (error) {
      console.error('Invalid message:', error);
    }
  });

  // Handle disconnection
  ws.on('close', () => {
    const username = clients.get(ws);
    if (username) {
      console.log(`${username} disconnected`);
      clients.delete(ws);

      broadcast({
        type: 'message',
        username: 'System',
        content: `${username} has left the chat`,
        timestamp: Date.now()
      });
    }
  });

  // Handle errors
  ws.on('error', (error) => {
    console.error('WebSocket error:', error);
  });
});

function broadcast(message: ChatMessage): void {
  const data = JSON.stringify(message);

  wss.clients.forEach((client) => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(data);
    }
  });
}

// Heartbeat to detect stale connections
const interval = setInterval(() => {
  wss.clients.forEach((ws) => {
    // Terminate clients that never answered the previous ping
    if ((ws as any).isAlive === false) {
      return ws.terminate();
    }

    (ws as any).isAlive = false;
    ws.ping();
  });
}, 30000);

wss.on('close', () => {
  clearInterval(interval);
});

Client Implementation

// examples/03-chat/ts/ws-client.ts
import { WebSocket } from 'ws';

interface ChatMessage {
  type: 'message' | 'join' | 'leave';
  username: string;
  content: string;
  timestamp: number;
}

class ChatClient {
  private ws: WebSocket;
  private username: string;
  private reconnectAttempts = 0;
  private readonly maxReconnectAttempts = 5;

  constructor(url: string, username: string) {
    this.username = username;
    this.ws = this.connect(url);
  }

  private connect(url: string): WebSocket {
    const ws = new WebSocket(url);

    ws.on('open', () => {
      console.log('Connected to chat server');
      this.reconnectAttempts = 0;

      // Identify ourselves
      this.send({
        type: 'join',
        username: this.username,
        content: '',
        timestamp: Date.now()
      });
    });

    ws.on('message', (data: Buffer) => {
      const message: ChatMessage = JSON.parse(data.toString());
      this.displayMessage(message);
    });

    ws.on('close', () => {
      console.log('Disconnected from server');

      // Attempt reconnection
      if (this.reconnectAttempts < this.maxReconnectAttempts) {
        this.reconnectAttempts++;
        const delay = Math.min(1000 * Math.pow(2, this.reconnectAttempts), 30000);

        console.log(`Reconnecting in ${delay}ms... (attempt ${this.reconnectAttempts})`);

        setTimeout(() => {
          this.ws = this.connect(url);
        }, delay);
      }
    });

    ws.on('error', (error) => {
      console.error('WebSocket error:', error.message);
    });

    // Respond to pings
    ws.on('ping', () => {
      ws.pong();
    });

    return ws;
  }

  public send(message: ChatMessage): void {
    if (this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify(message));
    } else {
      console.error('Cannot send message: connection not open');
    }
  }

  public sendMessage(content: string): void {
    this.send({
      type: 'message',
      username: this.username,
      content,
      timestamp: Date.now()
    });
  }

  private displayMessage(message: ChatMessage): void {
    const time = new Date(message.timestamp).toLocaleTimeString();
    console.log(`[${time}] ${message.username}: ${message.content}`);
  }

  public close(): void {
    this.ws.close();
  }
}

// CLI interface
const username = process.argv[2] || `User${Math.floor(Math.random() * 1000)}`;
const client = new ChatClient('ws://localhost:8080', username);

console.log(`You are logged in as: ${username}`);
console.log('Type a message and press Enter to send. Press Ctrl+C to exit.');

// Read from stdin
process.stdin.setEncoding('utf8');
process.stdin.on('data', (chunk: Buffer) => {
  const text = chunk.toString().trim();
  if (text) {
    client.sendMessage(text);
  }
});

// Handle graceful shutdown
process.on('SIGINT', () => {
  console.log('\nShutting down...');
  client.close();
  process.exit(0);
});

Package Configuration

// examples/03-chat/ts/package.json
{
  "name": "chat-websocket-example",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "server": "node --loader ts-node/esm ws-server.ts",
    "client": "node --loader ts-node/esm ws-client.ts"
  },
  "dependencies": {
    "ws": "^8.18.0"
  },
  "devDependencies": {
    "@types/ws": "^8.5.12",
    "ts-node": "^10.9.2",
    "typescript": "^5.6.3"
  }
}

Implementation: Python

We'll use the websockets library, a Python implementation of the WebSocket protocol (RFC 6455).

Server Implementation

# examples/03-chat/py/ws_server.py
import asyncio
import json
import logging
from datetime import datetime
from typing import Set
import websockets
from websockets.server import WebSocketServerProtocol

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Track connected clients
clients: Set[WebSocketServerProtocol] = set()
usernames: dict[WebSocketServerProtocol, str] = {}


async def broadcast(message: dict) -> None:
    """Send a message to all connected clients."""
    if clients:
        await asyncio.gather(
            *[client.send(json.dumps(message)) for client in clients if client.open],
            return_exceptions=True
        )


async def handle_client(websocket: WebSocketServerProtocol) -> None:
    """Handle a client connection."""
    clients.add(websocket)
    logger.info(f"New client connected. Total clients: {len(clients)}")

    try:
        # Send welcome message
        welcome_msg = {
            "type": "message",
            "username": "System",
            "content": "Welcome! Please identify yourself.",
            "timestamp": datetime.now().timestamp()
        }
        await websocket.send(json.dumps(welcome_msg))

        # Handle messages
        async for message in websocket:
            try:
                data = json.loads(message)

                if data.get("type") == "join":
                    # Register username
                    username = data.get("username", "Anonymous")
                    usernames[websocket] = username
                    logger.info(f"{username} joined")

                    # Broadcast join notification
                    await broadcast({
                        "type": "message",
                        "username": "System",
                        "content": f"{username} has joined the chat",
                        "timestamp": datetime.now().timestamp()
                    })

                elif data.get("type") == "message":
                    # Broadcast the message
                    username = usernames.get(websocket, "Anonymous")
                    content = data.get("content", "")
                    logger.info(f"{username}: {content}")

                    await broadcast({
                        "type": "message",
                        "username": username,
                        "content": content,
                        "timestamp": datetime.now().timestamp()
                    })

            except json.JSONDecodeError:
                logger.error("Invalid JSON received")
            except Exception as e:
                logger.error(f"Error handling message: {e}")

    except websockets.exceptions.ConnectionClosed:
        logger.info("Client disconnected unexpectedly")
    finally:
        # Cleanup
        username = usernames.get(websocket)
        if username:
            del usernames[websocket]
            await broadcast({
                "type": "message",
                "username": "System",
                "content": f"{username} has left the chat",
                "timestamp": datetime.now().timestamp()
            })

        clients.discard(websocket)
        logger.info(f"Client removed. Total clients: {len(clients)}")


async def main():
    """Start the WebSocket server."""
    async with websockets.serve(handle_client, "localhost", 8080):
        logger.info("WebSocket server running on ws://localhost:8080")
        await asyncio.Future()  # Run forever


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        logger.info("Server stopped")

Client Implementation

# examples/03-chat/py/ws_client.py
import asyncio
import json
import sys
from datetime import datetime
import websockets
from websockets.client import WebSocketClientProtocol


class ChatClient:
    def __init__(self, url: str, username: str):
        self.url = url
        self.username = username
        self.websocket: WebSocketClientProtocol | None = None
        self.reconnect_attempts = 0
        self.max_reconnect_attempts = 5

    async def connect(self) -> None:
        """Connect to the WebSocket server."""
        backoff = 1

        while self.reconnect_attempts < self.max_reconnect_attempts:
            try:
                async with websockets.connect(self.url) as ws:
                    self.websocket = ws
                    self.reconnect_attempts = 0
                    print(f"Connected to {self.url}")

                    # Identify ourselves
                    await self.send({
                        "type": "join",
                        "username": self.username,
                        "content": "",
                        "timestamp": datetime.now().timestamp()
                    })

                    # Start receiving messages
                    receive_task = asyncio.create_task(self.receive_messages())

                    # Wait for connection to close
                    await ws.wait_closed()

                    # Cancel receive task
                    receive_task.cancel()
                    try:
                        await receive_task
                    except asyncio.CancelledError:
                        pass

                    print("Disconnected from server")

            except (ConnectionRefusedError, OSError) as e:
                self.reconnect_attempts += 1
                print(f"Connection failed: {e}")
                print(f"Reconnecting in {backoff}s... (attempt {self.reconnect_attempts})")

                await asyncio.sleep(backoff)
                backoff = min(backoff * 2, 30)

        print("Max reconnection attempts reached. Giving up.")

    async def receive_messages(self) -> None:
        """Receive and display messages from the server."""
        if not self.websocket:
            return

        try:
            async for message in self.websocket:
                data = json.loads(message)
                self.display_message(data)
        except asyncio.CancelledError:
            pass
        except Exception as e:
            print(f"Error receiving message: {e}")

    async def send(self, message: dict) -> None:
        """Send a message to the server."""
        if self.websocket and not self.websocket.closed:
            await self.websocket.send(json.dumps(message))
        else:
            print("Cannot send message: connection not open")

    def display_message(self, message: dict) -> None:
        """Display a received message."""
        timestamp = datetime.fromtimestamp(message["timestamp"]).strftime("%H:%M:%S")
        print(f"[{timestamp}] {message['username']}: {message['content']}")


async def stdin_reader(client: ChatClient):
    """Read from stdin and send messages."""
    loop = asyncio.get_running_loop()

    while True:
        line = await loop.run_in_executor(None, sys.stdin.readline)
        text = line.strip()

        if text:
            await client.send({
                "type": "message",
                "username": client.username,
                "content": text,
                "timestamp": datetime.now().timestamp()
            })


async def main():
    """Run the chat client."""
    username = sys.argv[1] if len(sys.argv) > 1 else f"User{int(datetime.now().timestamp()) % 1000}"
    client = ChatClient("ws://localhost:8080", username)

    print(f"You are logged in as: {username}")
    print("Type a message and press Enter to send. Press Ctrl+C to exit.")

    # Run connection and stdin reader concurrently
    connect_task = asyncio.create_task(client.connect())

    # Give connection time to establish
    await asyncio.sleep(0.5)

    stdin_task = asyncio.create_task(stdin_reader(client))

    try:
        await asyncio.gather(connect_task, stdin_task)
    except KeyboardInterrupt:
        print("\nShutting down...")
    finally:
        connect_task.cancel()
        stdin_task.cancel()


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        pass

Requirements

# examples/03-chat/py/requirements.txt
websockets==13.1

Docker Compose Setup

TypeScript Version

# examples/03-chat/ts/docker-compose.yml
version: '3.8'

services:
  server:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - NODE_ENV=production
    restart: unless-stopped

# examples/03-chat/ts/Dockerfile
FROM node:20-alpine

WORKDIR /app

COPY package.json package-lock.json ./
# Install all dependencies: tsc and ts-node are devDependencies,
# so installing with --omit=dev here would break the build step below
RUN npm ci

COPY . .
RUN npx tsc
# Drop dev dependencies from the final image
RUN npm prune --omit=dev

EXPOSE 8080

CMD ["node", "dist/ws-server.js"]

Python Version

# examples/03-chat/py/docker-compose.yml
version: '3.8'

services:
  server:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    restart: unless-stopped

# examples/03-chat/py/Dockerfile
FROM python:3.12-alpine

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8080

CMD ["python", "ws_server.py"]

Running the Examples

TypeScript

# Install dependencies
cd examples/03-chat/ts
npm install

# Start the server
npm run server

# In another terminal, start a client
npm run client -- Alice

# In another terminal, start another client
npm run client -- Bob

Python

# Install dependencies
cd examples/03-chat/py
pip install -r requirements.txt

# Start the server
python ws_server.py

# In another terminal, start a client
python ws_client.py Alice

# In another terminal, start another client
python ws_client.py Bob

With Docker

# Start the server
docker-compose up -d

# Check logs
docker-compose logs -f

# Connect with a client (run from host)
npm run client -- Alice  # or python ws_client.py Alice

Connection Management Best Practices

1. Heartbeat/Ping-Pong

Detect stale connections before they cause issues:

// Server sends ping every 30 seconds
setInterval(() => {
  wss.clients.forEach((ws) => {
    if (ws.isAlive === false) return ws.terminate();
    ws.isAlive = false;
    ws.ping();
  });
}, 30000);

// Client responds automatically
ws.on('ping', () => ws.pong());

2. Exponential Backoff Reconnection

Don't hammer the server when it's down:

function reconnect(attempts: number) {
  // Cap the delay, and add jitter so many clients don't retry in lockstep
  const delay = Math.min(1000 * Math.pow(2, attempts), 30000) * (0.5 + Math.random() / 2);
  setTimeout(() => connect(), delay);
}

3. Graceful Shutdown

// Send close frame before terminating
ws.close(1000, 'Normal closure');

// Wait for close frame acknowledgement
ws.on('close', () => {
  console.log('Connection closed cleanly');
});

4. Message Serialization

Always validate incoming messages:

function safeParse(data: string): Message | null {
  try {
    const msg = JSON.parse(data);
    if (msg.type && msg.username) {
      return msg;
    }
  } catch {}
  return null;
}

Common Pitfalls

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Not handling reconnection | Client stops working on network blip | Implement exponential backoff reconnection |
| Ignoring the close event | Memory leaks from stale clients | Always clean up on disconnect |
| Blocking the event loop | Messages delayed | Use async/await properly, avoid CPU-heavy work |
| Missing heartbeat | Stale connections remain | Implement ping/pong |
| Not validating messages | Crashes on malformed data | Always try/catch JSON parsing |

Testing Your WebSocket Implementation

# Using websocat (like curl for WebSockets)
# Install: cargo install websocat

# Connect and send/receive messages
echo '{"type":"join","username":"TestUser","content":"","timestamp":123456}' | \
  websocat ws://localhost:8080

# Interactive mode
websocat ws://localhost:8080

Summary

WebSockets enable real-time, bidirectional communication between clients and servers:

  • Protocol: HTTP upgrade handshake → persistent TCP connection
  • Communication: Full-duplex messaging with minimal overhead
  • Lifecycle: Connecting → Open → Messaging → Closing → Closed
  • Best practices: Heartbeats, graceful shutdown, reconnection handling

In the next section, we'll build on this foundation to implement pub/sub messaging for multi-room chat systems.

Exercises

Exercise 1: Add Private Messaging

Extend the chat system to support private messages between users:

// Message format for private messages
{
  type: 'private',
  from: 'Alice',
  to: 'Bob',
  content: 'Hey Bob, are you there?',
  timestamp: 1234567890
}

Requirements:

  1. Add a private message type
  2. Route private messages only to the intended recipient
  3. Show "private message" indicator in the UI

Exercise 2: Typing Indicators

Show when a user is typing:

// Typing indicator message
{
  type: 'typing',
  username: 'Alice',
  isTyping: true,
  timestamp: 1234567890
}

Requirements:

  1. Send a typing message with isTyping: true when the user starts typing
  2. Send a typing message with isTyping: false after 2 seconds of inactivity (see the sketch below)
  3. Display "Alice is typing..." to other users
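
A common way to satisfy requirements 1 and 2 is a reset-on-keystroke timer. A minimal sketch, assuming a send helper like ChatClient.send above (names here are illustrative, not part of the course code):

// Sketch: debounce typing notifications on the client
let typingTimer: NodeJS.Timeout | null = null;
let isTyping = false;

function onKeystroke(username: string, send: (msg: object) => void) {
  if (!isTyping) {
    isTyping = true;
    send({ type: 'typing', username, isTyping: true, timestamp: Date.now() });
  }
  // Reset the inactivity timer on every keystroke
  if (typingTimer) clearTimeout(typingTimer);
  typingTimer = setTimeout(() => {
    isTyping = false;
    send({ type: 'typing', username, isTyping: false, timestamp: Date.now() });
  }, 2000); // 2 seconds of inactivity => typing stopped
}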

Exercise 3: Connection Status

Display real-time connection status to the user:

Requirements:

  1. Show: Connecting → Connected → Disconnected → Reconnecting
  2. Use visual indicators (green dot, red dot, spinner)
  3. Display ping/pong latency in milliseconds

Exercise 4: Message History with Reconnection

When a client reconnects, send them messages they missed:

Requirements:

  1. Store last 100 messages on the server
  2. When client reconnects, send messages since their last timestamp
  3. Deduplicate messages the client already has
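
One possible shape for the server side, assuming messages carry a timestamp field like ChatMessage above (names are illustrative):

// Sketch: server-side ring buffer plus replay-on-reconnect
interface StoredMessage { username: string; content: string; timestamp: number; }

const HISTORY_LIMIT = 100;
const history: StoredMessage[] = [];

function record(msg: StoredMessage): void {
  history.push(msg);
  if (history.length > HISTORY_LIMIT) history.shift(); // keep only the last 100
}

// The reconnecting client reports the timestamp of the last message it saw;
// replaying only newer messages also handles deduplication.
function replaySince(lastSeen: number): StoredMessage[] {
  return history.filter((m) => m.timestamp > lastSeen);
}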

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

Pub/Sub Messaging and Message Ordering

Session 7, Part 1 - 45 minutes

Learning Objectives

  • Understand the publish-subscribe messaging pattern
  • Learn about topic-based and content-based routing
  • Implement presence tracking and subscriptions
  • Understand message ordering challenges in distributed systems
  • Implement sequence numbers for causal ordering

What is Pub/Sub?

The publish-subscribe pattern is a messaging pattern where senders (publishers) send messages to an intermediate system, and the system routes messages to interested receivers (subscribers). Publishers and subscribers are decoupled—they don't know about each other.

Key Benefits

  1. Decoupling: Publishers don't need to know who subscribes
  2. Scalability: Add subscribers without changing publishers
  3. Flexibility: Dynamic subscription management
  4. Asynchrony: Publishers send and continue; subscribers process when ready

Pub/Sub vs Direct Messaging

graph TB
    subgraph "Direct Messaging"
        P1[Producer] -->|Direct| C1[Consumer 1]
        P1 -->|Direct| C2[Consumer 2]
        P1 -->|Direct| C3[Consumer 3]
    end

    subgraph "Pub/Sub Messaging"
        P2[Publisher] -->|Publish| B[Broker]
        S1[Subscriber 1] -->|Subscribe| B
        S2[Subscriber 2] -->|Subscribe| B
        S3[Subscriber 3] -->|Subscribe| B
    end

| Aspect | Direct Messaging | Pub/Sub |
|--------|------------------|---------|
| Coupling | Tight (producer knows consumers) | Loose (producer doesn't know consumers) |
| Flexibility | Low (changes affect producer) | High (dynamic subscriptions) |
| Complexity | Simple | Moderate (requires broker) |
| Use Case | Point-to-point, request-response | Broadcast, events, notifications |

Pub/Sub Patterns

1. Topic-Based Routing

Subscribers express interest in topics (channels). Messages are routed based on the topic they're published to.

sequenceDiagram
    participant S1 as Subscriber 1
    participant S2 as Subscriber 2
    participant S3 as Subscriber 3
    participant B as Broker
    participant P as Publisher

    Note over S1,S3: Subscription Phase
    S1->>B: subscribe("sports")
    S2->>B: subscribe("sports")
    S3->>B: subscribe("news")

    Note over S1,S3: Publishing Phase
    P->>B: publish("sports", "Game starts!")
    B->>S1: deliver("Game starts!")
    B->>S2: deliver("Game starts!")

    P->>B: publish("news", "Breaking story!")
    B->>S3: deliver("Breaking story!")

Use cases: Chat rooms, notification categories, event streams
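
Stripped to its essentials, a topic-based broker is just a map from topic name to a set of subscriber callbacks. Here is a minimal in-memory sketch (illustrative names; the WebSocket-based server later in this session adds sequencing, presence, and history on top of the same idea):

// Minimal in-memory topic broker (sketch, not the course server)
type Handler = (payload: string) => void;

class TopicBroker {
  private topics = new Map<string, Set<Handler>>();

  subscribe(topic: string, handler: Handler): void {
    if (!this.topics.has(topic)) this.topics.set(topic, new Set());
    this.topics.get(topic)!.add(handler);
  }

  publish(topic: string, payload: string): void {
    // Only subscribers of this topic receive the message
    for (const handler of this.topics.get(topic) ?? []) {
      handler(payload);
    }
  }
}

// Usage mirroring the diagram above
const broker = new TopicBroker();
broker.subscribe('sports', (m) => console.log('S1:', m));
broker.subscribe('sports', (m) => console.log('S2:', m));
broker.subscribe('news', (m) => console.log('S3:', m));
broker.publish('sports', 'Game starts!');   // delivered to S1 and S2
broker.publish('news', 'Breaking story!');  // delivered to S3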

2. Content-Based Routing

Subscribers specify filter criteria. Messages are routed based on their content.

graph LR
    P[Publisher] -->|"type: order, value > 100"| B[Content Router]
    B -->|Matches filter| S1[High-Value Handler]
    B -->|Matches filter| S2[Order Logger]
    B -.->|No match| S3[Low-Value Handler]

Use cases: Event filtering, complex routing rules, IoT sensor data
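
Content-based routing replaces the topic map with a list of subscriber predicates evaluated against each message. A minimal sketch matching the diagram above (the Order shape and helper names are illustrative):

// Content-based routing sketch: subscribers register predicates over message content
interface Order { type: string; value: number; }
type Predicate = (msg: Order) => boolean;

const routes: Array<{ match: Predicate; deliver: (msg: Order) => void }> = [];

function subscribeWhere(match: Predicate, deliver: (msg: Order) => void): void {
  routes.push({ match, deliver });
}

function route(msg: Order): void {
  for (const r of routes) {
    if (r.match(msg)) r.deliver(msg); // only matching subscribers get it
  }
}

subscribeWhere((m) => m.type === 'order' && m.value > 100,
               (m) => console.log('High-value handler:', m));
route({ type: 'order', value: 250 }); // matches the filter
route({ type: 'order', value: 40 });  // no match, dropped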

3. Presence Tracking

In real-time systems, knowing who is online (presence) is essential for:

  • Showing online/offline status
  • Delivering messages only to active users
  • Managing connections and reconnections
  • Handling user disconnections gracefully

stateDiagram-v2
    [*] --> Offline: User created
    Offline --> Connecting: Connect request
    Connecting --> Online: Auth success
    Connecting --> Offline: Auth fail
    Online --> Away: No activity
    Online --> Offline: Disconnect
    Away --> Online: Activity detected
    Online --> [*]: User deleted
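
A map of last-activity timestamps per user is enough to drive the state machine above: any message or pong refreshes the timestamp, and status is derived from how long the user has been idle. A minimal sketch with illustrative thresholds:

// Presence tracking sketch: derive status from last-activity timestamps
type Status = 'online' | 'away' | 'offline';

const lastSeen = new Map<string, number>();
const AWAY_AFTER = 60_000;     // illustrative: 1 minute idle => away
const OFFLINE_AFTER = 300_000; // illustrative: 5 minutes idle => offline

function touch(user: string): void {
  lastSeen.set(user, Date.now()); // call on any message or pong
}

function statusOf(user: string): Status {
  const seen = lastSeen.get(user);
  if (seen === undefined) return 'offline';
  const idle = Date.now() - seen;
  if (idle > OFFLINE_AFTER) return 'offline';
  if (idle > AWAY_AFTER) return 'away';
  return 'online';
}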

Message Ordering

The Ordering Problem

In distributed systems, messages may arrive out of order due to:

  • Network latency variations
  • Multiple servers processing messages
  • Message retries and retransmissions
  • Concurrent publishers

Types of Ordering

| Ordering Type | Description | Difficulty |
|---------------|-------------|------------|
| FIFO | Messages from same sender arrive in order sent | Easy |
| Causal | Causally related messages are ordered | Moderate |
| Total | All messages ordered globally | Hard |

Why Ordering Matters

Consider a chat application:

sequenceDiagram
    participant A as Alice
    participant S as Server
    participant B as Bob

    Note over A,B: Without ordering - confusion!
    A->>S: "Let's meet at 5pm"
    A->>S: "Never mind, 6pm instead"
    S-->>B: "Never mind, 6pm instead"
    S-->>B: "Let's meet at 5pm"

    Note over B: Bob sees messages out of order!

With proper ordering using sequence numbers:

sequenceDiagram
    participant A as Alice
    participant S as Server
    participant B as Bob

    Note over A,B: With sequence numbers - correct!
    A->>S: [msg#1] "Let's meet at 5pm"
    A->>S: [msg#2] "Never mind, 6pm instead"

    S-->>B: [msg#1] "Let's meet at 5pm"
    S-->>B: [msg#2] "Never mind, 6pm instead"

    Note over B: Bob delivers in order by sequence number
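
On the receiving side this is enforced with a small reorder buffer: deliver a message only if its sequence number is the next one expected, otherwise hold it until the gap fills. A distilled sketch of the logic (the full clients below implement the same idea per room):

// Reorder-buffer sketch: deliver in sequence order, buffer gaps
let expected = 1;
const held = new Map<number, string>();

function onMessage(seq: number, content: string): void {
  if (seq < expected) return;        // duplicate or old message: drop it
  held.set(seq, content);
  while (held.has(expected)) {       // drain any now-contiguous run
    console.log(`deliver #${expected}:`, held.get(expected));
    held.delete(expected);
    expected++;
  }
}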

Implementation: Pub/Sub Chat with Ordering

Let's build a pub/sub chat system with:

  • Topic-based routing (chat rooms)
  • Presence tracking
  • Message ordering with sequence numbers

TypeScript Implementation

pubsub-server.ts - Pub/Sub server with ordering:

// src: examples/03-chat/ts/pubsub-server.ts
import { WebSocketServer, WebSocket } from 'ws';

interface Message {
  id: string;
  room: string;
  sender: string;
  content: string;
  sequence: number;
  timestamp: number;
}

interface Subscriber {
  id: string;
  userId: string;
  rooms: Set<string>;
  ws: WebSocket;
}

class PubSubServer {
  private subscribers: Map<string, Subscriber> = new Map();
  private roomSequences: Map<string, number> = new Map();
  private messageHistory: Map<string, Message[]> = new Map();
  private server: WebSocketServer;

  constructor(port: number = 8080) {
    this.server = new WebSocketServer({ port });
    this.setupHandlers();
    console.log(`Pub/Sub server running on port ${port}`);
  }

  private setupHandlers() {
    this.server.on('connection', (ws: WebSocket) => {
      const subscriberId = this.generateId();

      ws.on('message', (data: string) => {
        try {
          const msg = JSON.parse(data.toString());
          this.handleMessage(subscriberId, msg, ws);
        } catch (err) {
          ws.send(JSON.stringify({ error: 'Invalid message format' }));
        }
      });

      ws.on('close', () => {
        this.handleDisconnect(subscriberId);
      });
    });
  }

  private handleMessage(subscriberId: string, msg: any, ws: WebSocket) {
    switch (msg.type) {
      case 'subscribe':
        this.handleSubscribe(subscriberId, msg.room, msg.userId, ws);
        break;
      case 'unsubscribe':
        this.handleUnsubscribe(subscriberId, msg.room);
        break;
      case 'publish':
        this.handlePublish(msg);
        break;
      case 'get_history':
        this.handleGetHistory(msg.room, ws);
        break;
    }
  }

  private handleSubscribe(
    subscriberId: string,
    room: string,
    userId: string,
    ws: WebSocket
  ) {
    if (!this.subscribers.has(subscriberId)) {
      this.subscribers.set(subscriberId, {
        id: subscriberId,
        userId,
        rooms: new Set(),
        ws,
      });
    }

    const subscriber = this.subscribers.get(subscriberId)!;
    subscriber.rooms.add(room);

    // Initialize room state if needed
    if (!this.roomSequences.has(room)) {
      this.roomSequences.set(room, 0);
      this.messageHistory.set(room, []);
    }

    // Send presence notification
    this.broadcast(room, {
      type: 'presence',
      userId,
      action: 'join',
      timestamp: Date.now(),
    });

    // Send current sequence number
    ws.send(JSON.stringify({
      type: 'subscribed',
      room,
      sequence: this.roomSequences.get(room),
    }));

    console.log(`${userId} subscribed to ${room}`);
  }

  private handleUnsubscribe(subscriberId: string, room: string) {
    const subscriber = this.subscribers.get(subscriberId);
    if (subscriber) {
      subscriber.rooms.delete(room);

      // Send presence notification
      this.broadcast(room, {
        type: 'presence',
        userId: subscriber.userId,
        action: 'leave',
        timestamp: Date.now(),
      });
    }
  }

  private handlePublish(msg: any) {
    const { room, sender, content } = msg;
    const sequence = (this.roomSequences.get(room) || 0) + 1;
    this.roomSequences.set(room, sequence);

    const message: Message = {
      id: this.generateId(),
      room,
      sender,
      content,
      sequence,
      timestamp: Date.now(),
    };

    // Store in history
    const history = this.messageHistory.get(room) || [];
    history.push(message);
    this.messageHistory.set(room, history.slice(-100)); // Keep last 100

    // Broadcast to all subscribers
    this.broadcast(room, {
      type: 'message',
      ...message,
    });
  }

  private handleGetHistory(room: string, ws: WebSocket) {
    const history = this.messageHistory.get(room) || [];
    ws.send(JSON.stringify({
      type: 'history',
      room,
      messages: history,
    }));
  }

  private broadcast(room: string, payload: any) {
    const payloadStr = JSON.stringify(payload);

    for (const [_, subscriber] of this.subscribers) {
      if (subscriber.rooms.has(room) && subscriber.ws.readyState === WebSocket.OPEN) {
        subscriber.ws.send(payloadStr);
      }
    }
  }

  private handleDisconnect(subscriberId: string) {
    const subscriber = this.subscribers.get(subscriberId);
    if (subscriber) {
      // Notify all rooms the user was in
      for (const room of subscriber.rooms) {
        this.broadcast(room, {
          type: 'presence',
          userId: subscriber.userId,
          action: 'leave',
          timestamp: Date.now(),
        });
      }
      this.subscribers.delete(subscriberId);
    }
  }

  private generateId(): string {
    return Math.random().toString(36).substring(2, 15);
  }
}

const PORT = parseInt(process.env.PORT || '8080');
new PubSubServer(PORT);

pubsub-client.ts - Client with ordering buffer:

// src: examples/03-chat/ts/pubsub-client.ts
import { WebSocket } from 'ws';
import * as readline from 'node:readline';

interface ClientMessage {
  type: string;
  sequence?: number;
  [key: string]: any;
}

class PubSubClient {
  private ws: WebSocket | null = null;
  private userId: string;
  private messageBuffer: Map<string, Map<number, ClientMessage>> = new Map();
  private expectedSequence: Map<string, number> = new Map();
  private reconnectAttempts = 0;
  private maxReconnectAttempts = 5;

  constructor(
    private url: string,
    userId?: string
  ) {
    this.userId = userId || `user-${Math.random().toString(36).substring(7)}`;
  }

  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.on('open', () => {
      console.log(`Connected as ${this.userId}`);
      this.reconnectAttempts = 0;
    });

    this.ws.on('message', (data: string) => {
      const msg: ClientMessage = JSON.parse(data.toString());
      this.handleMessage(msg);
    });

    this.ws.on('close', () => {
      console.log('Disconnected. Attempting to reconnect...');
      this.reconnect();
    });

    this.ws.on('error', (err) => {
      console.error('WebSocket error:', err);
    });
  }

  private handleMessage(msg: ClientMessage) {
    switch (msg.type) {
      case 'subscribed':
        this.expectedSequence.set(msg.room, (msg.sequence || 0) + 1);
        console.log(`Subscribed to ${msg.room} at sequence ${msg.sequence}`);
        break;

      case 'message':
        this.handleOrderedMessage(msg.room, msg);
        break;

      case 'presence':
        console.log(`${msg.userId} ${msg.action === 'join' ? 'joined' : 'left'}`);
        break;

      case 'history':
        console.log(`Received ${msg.messages.length} historical messages`);
        msg.messages.forEach((m: ClientMessage) => this.displayMessage(m));
        break;
    }
  }

  private handleOrderedMessage(room: string, msg: ClientMessage) {
    const seq = msg.sequence!;

    // Initialize buffer if needed
    if (!this.messageBuffer.has(room)) {
      this.messageBuffer.set(room, new Map());
    }
    const buffer = this.messageBuffer.get(room)!;
    const expected = this.expectedSequence.get(room) || 1;

    if (seq === expected) {
      // Expected message - deliver immediately
      this.displayMessage(msg);
      this.expectedSequence.set(room, seq + 1);

      // Check buffer for next messages
      this.deliverBufferedMessages(room);
    } else if (seq > expected) {
      // Future message - buffer it
      buffer.set(seq, msg);
      console.log(`Buffered message ${seq} (expecting ${expected})`);
    }
    // seq < expected: old message, ignore
  }

  private deliverBufferedMessages(room: string) {
    const buffer = this.messageBuffer.get(room);
    if (!buffer) return;

    const expected = this.expectedSequence.get(room) || 1;

    while (buffer.has(expected)) {
      const msg = buffer.get(expected)!;
      this.displayMessage(msg);
      buffer.delete(expected);
      this.expectedSequence.set(room, expected + 1);
    }
  }

  private displayMessage(msg: ClientMessage) {
    console.log(`[${msg.sequence}] ${msg.sender}: ${msg.content}`);
  }

  subscribe(room: string) {
    this.send({ type: 'subscribe', room, userId: this.userId });
  }

  unsubscribe(room: string) {
    this.send({ type: 'unsubscribe', room });
  }

  publish(room: string, content: string) {
    this.send({
      type: 'publish',
      room,
      sender: this.userId,
      content,
    });
  }

  getHistory(room: string) {
    this.send({ type: 'get_history', room });
  }

  private send(payload: any) {
    if (this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify(payload));
    } else {
      console.error('WebSocket not connected');
    }
  }

  // Public close so the CLI below doesn't reach into the private socket
  close() {
    this.ws?.close();
  }

  private reconnect() {
    if (this.reconnectAttempts < this.maxReconnectAttempts) {
      this.reconnectAttempts++;
      const delay = Math.min(1000 * Math.pow(2, this.reconnectAttempts), 30000);
      setTimeout(() => this.connect(), delay);
    } else {
      console.error('Max reconnection attempts reached');
    }
  }
}

// CLI usage
const args = process.argv.slice(2);
const url = args[0] || 'ws://localhost:8080';
const client = new PubSubClient(url);

client.connect();

// Simple readline interface (imported at the top; require() isn't available
// because package.json sets "type": "module")
const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
});

console.log('Commands: /join <room>, /leave <room>, /history <room>, /quit');
console.log('Any other input will be sent to the current room');

let currentRoom = '';

const showPrompt = () => {
  if (currentRoom) {
    rl.question(`[${currentRoom}]> `, (input) => {
      if (input === '/quit') {
        client.close();
        rl.close();
        process.exit(0);
      } else if (input.startsWith('/join ')) {
        currentRoom = input.substring(6);
        client.subscribe(currentRoom);
      } else if (input.startsWith('/leave ')) {
        const room = input.substring(7);
        client.unsubscribe(room);
        if (room === currentRoom) currentRoom = '';
      } else if (input.startsWith('/history ')) {
        const room = input.substring(9);
        client.getHistory(room);
      } else if (input && currentRoom) {
        client.publish(currentRoom, input);
      }
      showPrompt();
    });
  } else {
    rl.question('(no room)> ', (input) => {
      if (input.startsWith('/join ')) {
        currentRoom = input.substring(6);
        client.subscribe(currentRoom);
      }
      showPrompt();
    });
  }
};

showPrompt();

Python Implementation

pubsub_server.py - Pub/Sub server with ordering:

# src: examples/03-chat/py/pubsub_server.py

import asyncio
import json
import time
from typing import Dict, Set, List
from dataclasses import dataclass, asdict
import websockets
from websockets.server import WebSocketServerProtocol

@dataclass
class Message:
    id: str
    room: str
    sender: str
    content: str
    sequence: int
    timestamp: int

class PubSubServer:
    def __init__(self, port: int = 8080):
        self.port = port
        self.subscribers: Dict[str, dict] = {}
        self.room_sequences: Dict[str, int] = {}
        self.message_history: Dict[str, List[Message]] = {}

    async def handle_connection(self, ws: WebSocketServerProtocol):
        subscriber_id = self._generate_id()

        try:
            async for message in ws:
                try:
                    data = json.loads(message)
                    await self.handle_message(subscriber_id, data, ws)
                except json.JSONDecodeError:
                    await ws.send(json.dumps({"error": "Invalid message format"}))
        finally:
            await self.handle_disconnect(subscriber_id)

    async def handle_message(self, subscriber_id: str, msg: dict, ws: WebSocketServerProtocol):
        msg_type = msg.get("type")

        if msg_type == "subscribe":
            await self.handle_subscribe(subscriber_id, msg["room"], msg["userId"], ws)
        elif msg_type == "unsubscribe":
            await self.handle_unsubscribe(subscriber_id, msg["room"])
        elif msg_type == "publish":
            await self.handle_publish(msg)
        elif msg_type == "get_history":
            await self.handle_get_history(msg["room"], ws)

    async def handle_subscribe(
        self, subscriber_id: str, room: str, user_id: str, ws: WebSocketServerProtocol
    ):
        if subscriber_id not in self.subscribers:
            self.subscribers[subscriber_id] = {
                "id": subscriber_id,
                "userId": user_id,
                "rooms": set(),
                "ws": ws,
            }

        subscriber = self.subscribers[subscriber_id]
        subscriber["rooms"].add(room)

        # Initialize room state
        if room not in self.room_sequences:
            self.room_sequences[room] = 0
            self.message_history[room] = []

        # Send presence notification
        await self.broadcast(room, {
            "type": "presence",
            "userId": user_id,
            "action": "join",
            "timestamp": int(time.time() * 1000),
        })

        # Send current sequence number
        await ws.send(json.dumps({
            "type": "subscribed",
            "room": room,
            "sequence": self.room_sequences[room],
        }))

        print(f"{user_id} subscribed to {room}")

    async def handle_unsubscribe(self, subscriber_id: str, room: str):
        subscriber = self.subscribers.get(subscriber_id)
        if subscriber:
            subscriber["rooms"].discard(room)

            await self.broadcast(room, {
                "type": "presence",
                "userId": subscriber["userId"],
                "action": "leave",
                "timestamp": int(time.time() * 1000),
            })

    async def handle_publish(self, msg: dict):
        room = msg["room"]
        sender = msg["sender"]
        content = msg["content"]

        sequence = self.room_sequences.get(room, 0) + 1
        self.room_sequences[room] = sequence

        message = Message(
            id=self._generate_id(),
            room=room,
            sender=sender,
            content=content,
            sequence=sequence,
            timestamp=int(time.time() * 1000),
        )

        # Store in history
        history = self.message_history[room]
        history.append(message)
        self.message_history[room] = history[-100:]  # Keep last 100

        # Broadcast
        await self.broadcast(room, {
            "type": "message",
            **asdict(message),
        })

    async def handle_get_history(self, room: str, ws: WebSocketServerProtocol):
        history = self.message_history.get(room, [])
        await ws.send(json.dumps({
            "type": "history",
            "room": room,
            "messages": [asdict(m) for m in history],
        }))

    async def broadcast(self, room: str, payload: dict):
        payload_str = json.dumps(payload)
        tasks = []

        for subscriber in self.subscribers.values():
            if room in subscriber["rooms"]:
                ws = subscriber["ws"]
                if not ws.closed:
                    tasks.append(ws.send(payload_str))

        if tasks:
            await asyncio.gather(*tasks, return_exceptions=True)

    async def handle_disconnect(self, subscriber_id: str):
        subscriber = self.subscribers.get(subscriber_id)
        if subscriber:
            # Notify all rooms
            for room in list(subscriber["rooms"]):
                await self.broadcast(room, {
                    "type": "presence",
                    "userId": subscriber["userId"],
                    "action": "leave",
                    "timestamp": int(time.time() * 1000),
                })

            del self.subscribers[subscriber_id]

    def _generate_id(self) -> str:
        import random
        import string
        return ''.join(random.choices(string.ascii_lowercase + string.digits, k=12))

    async def start(self):
        print(f"Pub/Sub server running on port {self.port}")
        async with websockets.serve(self.handle_connection, "", self.port):
            await asyncio.Future()  # Run forever

if __name__ == "__main__":
    import os
    port = int(os.environ.get("PORT", "8080"))
    server = PubSubServer(port)
    asyncio.run(server.start())

pubsub_client.py - Client with ordering buffer:

# src: examples/03-chat/py/pubsub_client.py

import asyncio
import json
import time
from typing import Dict, Optional
import websockets
from websockets.client import WebSocketClientProtocol

class PubSubClient:
    def __init__(self, url: str, user_id: Optional[str] = None):
        self.url = url
        self.user_id = user_id or f"user-{int(time.time())}"
        self.ws: Optional[WebSocketClientProtocol] = None
        self.message_buffer: Dict[str, Dict[int, dict]] = {}
        self.expected_sequence: Dict[str, int] = {}
        self.reconnect_attempts = 0
        self.max_reconnect_attempts = 5

    async def connect(self):
        try:
            self.ws = await websockets.connect(self.url)
            print(f"Connected as {self.user_id}")
            self.reconnect_attempts = 0
            asyncio.create_task(self.listen())
        except Exception as e:
            print(f"Connection failed: {e}")
            await self.reconnect()

    async def listen(self):
        if not self.ws:
            return

        try:
            async for message in self.ws:
                data = json.loads(message)
                await self.handle_message(data)
        except websockets.exceptions.ConnectionClosed:
            print("Disconnected. Attempting to reconnect...")
            await self.reconnect()

    async def handle_message(self, msg: dict):
        msg_type = msg.get("type")

        if msg_type == "subscribed":
            room = msg["room"]
            self.expected_sequence[room] = msg.get("sequence", 0) + 1
            print(f"Subscribed to {room} at sequence {msg.get('sequence', 0)}")

        elif msg_type == "message":
            await self.handle_ordered_message(msg["room"], msg)

        elif msg_type == "presence":
            print(f"{msg['userId']} {msg['action']}ed")

        elif msg_type == "history":
            print(f"Received {len(msg['messages'])} historical messages")
            for m in msg["messages"]:
                self.display_message(m)

    async def handle_ordered_message(self, room: str, msg: dict):
        seq = msg["sequence"]

        if room not in self.message_buffer:
            self.message_buffer[room] = {}

        buffer = self.message_buffer[room]
        expected = self.expected_sequence.get(room, 1)

        if seq == expected:
            # Expected message - deliver immediately
            self.display_message(msg)
            self.expected_sequence[room] = seq + 1

            # Check buffer for next messages
            await self.deliver_buffered_messages(room)

        elif seq > expected:
            # Future message - buffer it
            buffer[seq] = msg
            print(f"Buffered message {seq} (expecting {expected})")

    async def deliver_buffered_messages(self, room: str):
        buffer = self.message_buffer.get(room, {})
        expected = self.expected_sequence.get(room, 1)

        while expected in buffer:
            msg = buffer[expected]
            self.display_message(msg)
            del buffer[expected]
            self.expected_sequence[room] = expected + 1
            expected += 1

    def display_message(self, msg: dict):
        print(f"[{msg['sequence']}] {msg['sender']}: {msg['content']}")

    async def subscribe(self, room: str):
        await self.send({"type": "subscribe", "room": room, "userId": self.user_id})

    async def unsubscribe(self, room: str):
        await self.send({"type": "unsubscribe", "room": room})

    async def publish(self, room: str, content: str):
        await self.send({
            "type": "publish",
            "room": room,
            "sender": self.user_id,
            "content": content,
        })

    async def get_history(self, room: str):
        await self.send({"type": "get_history", "room": room})

    async def send(self, payload: dict):
        if self.ws and not self.ws.closed:
            await self.ws.send(json.dumps(payload))
        else:
            print("WebSocket not connected")

    async def reconnect(self):
        if self.reconnect_attempts < self.max_reconnect_attempts:
            self.reconnect_attempts += 1
            delay = min(1000 * (2 ** self.reconnect_attempts), 30000) / 1000
            await asyncio.sleep(delay)
            await self.connect()
        else:
            print("Max reconnection attempts reached")

async def main():
    import sys
    url = sys.argv[1] if len(sys.argv) > 1 else "ws://localhost:8080"
    client = PubSubClient(url)
    await client.connect()

    # Simple CLI
    current_room = ""

    print('Commands: /join <room>, /leave <room>, /history <room>, /quit')

    while True:
        try:
            prompt = f"[{current_room}]> " if current_room else "(no room)> "
            line = await asyncio.get_running_loop().run_in_executor(None, input, prompt)

            if line == "/quit":
                break
            elif line.startswith("/join "):
                current_room = line[6:]
                await client.subscribe(current_room)
            elif line.startswith("/leave "):
                room = line[7:]
                await client.unsubscribe(room)
                if room == current_room:
                    current_room = ""
            elif line.startswith("/history "):
                room = line[9:]
                await client.get_history(room)
            elif line and current_room:
                await client.publish(current_room, line)

        except EOFError:
            break

    if client.ws:
        await client.ws.close()

if __name__ == "__main__":
    asyncio.run(main())

Running the Examples

TypeScript Version

cd distributed-systems-course/examples/03-chat/ts

# Install dependencies
npm install

# Start the server
PORT=8080 node --loader ts-node/esm pubsub-server.ts

# In another terminal, start a client
node --loader ts-node/esm pubsub-client.ts

Python Version

cd distributed-systems-course/examples/03-chat/py

# Install dependencies
pip install -r requirements.txt

# Start the server
PORT=8080 python pubsub_server.py

# In another terminal, start a client
python pubsub_client.py

Docker Compose

docker-compose.yml (TypeScript):

services:
  pubsub-server:
    build: .
    ports:
      - "8080:8080"
    environment:
      - PORT=8080

docker-compose up

Testing the Pub/Sub System

Test 1: Basic Pub/Sub

  1. Start three clients in separate terminals
  2. Client 1: /join general
  3. Client 2: /join general
  4. Client 1: Hello everyone!
  5. Client 2 should receive the message
  6. Client 3: /join general
  7. Client 3: /history general - should see previous messages

Test 2: Multiple Rooms

  1. Client 1: /join sports
  2. Client 2: /join news
  3. Client 1: Game starting! (only in sports)
  4. Client 2: Breaking news! (only in news)
  5. Client 3: /join sports and /join news (receives both)

Test 3: Message Ordering

  1. Start a client and join a room
  2. Send messages rapidly: msg1, msg2, msg3
  3. Observe sequence numbers: [1], [2], [3]
  4. Note the order is preserved

Test 4: Presence Tracking

  1. Start two clients
  2. Both join the same room
  3. Observe presence notifications (user joined/left)
  4. Disconnect one client (Ctrl+C)
  5. Other client receives leave notification

Exercises

Exercise 1: Implement Last-Message Cache

The server already keeps only the last 100 messages per room. Extend this cache:

Tasks:

  • Make the history size configurable via environment variable
  • Add a /clear_history command for admins
  • Add TTL (time-to-live) for old messages
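
A possible starting point for the first task, assuming an environment variable named HISTORY_SIZE (the name is illustrative):

// Sketch: make the history limit configurable
const HISTORY_SIZE = parseInt(process.env.HISTORY_SIZE || '100', 10);

// ...then in handlePublish, replace the hard-coded limit:
// this.messageHistory.set(room, history.slice(-HISTORY_SIZE));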

Exercise 2: Implement Private Messages

Extend the system to support direct messages between users.

Requirements:

  • Private messages should only be delivered to the recipient
  • Use a special topic format: @username
  • Include sender authentication

Hint: You'll need to modify the handlePublish method to check for @ prefix.
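
One way the hinted check could look inside handlePublish, using the server's existing subscribers map (a sketch, not a complete solution):

// Sketch: inside handlePublish, before the normal room broadcast
if (msg.room.startsWith('@')) {
  const recipient = msg.room.slice(1);
  for (const sub of this.subscribers.values()) {
    // Deliver only to connections belonging to the recipient
    if (sub.userId === recipient && sub.ws.readyState === WebSocket.OPEN) {
      sub.ws.send(JSON.stringify({ type: 'private', from: msg.sender, content: msg.content }));
    }
  }
  return; // skip the room broadcast entirely
}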

Exercise 3: Add Message Acknowledgments

Implement acknowledgments to guarantee message delivery.

Requirements:

  • Clients must ACK received messages
  • Server tracks unacknowledged messages
  • On reconnect, server resends unacknowledged messages

Hint: Add an ack message type and track pending messages per subscriber.
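
One possible shape for the bookkeeping, reusing the server's Message interface (names are illustrative):

// Sketch: per-subscriber pending-message tracking
const pending = new Map<string, Map<string, Message>>(); // subscriberId -> messageId -> message

function trackDelivery(subscriberId: string, msg: Message): void {
  if (!pending.has(subscriberId)) pending.set(subscriberId, new Map());
  pending.get(subscriberId)!.set(msg.id, msg); // held until the client ACKs
}

function handleAck(subscriberId: string, messageId: string): void {
  pending.get(subscriberId)?.delete(messageId);
}

// On reconnect: resend everything still left in pending.get(subscriberId)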

Common Pitfalls

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Sequence number desync | Messages not displayed | Re-subscribe to reset sequence |
| Memory leak from history | Growing memory usage | Implement history size limits |
| Missing presence updates | Stale online status | Add heartbeat/ping messages |
| Race conditions | Messages lost during reconnect | Buffer messages during disconnection |

Real-World Examples

| System | Pub/Sub Implementation | Ordering Strategy |
|--------|------------------------|-------------------|
| Redis Pub/Sub | Topic-based channels | No ordering guarantees |
| Apache Kafka | Partitioned topics | Per-partition ordering |
| Google Cloud Pub/Sub | Topic-based with subscriptions | Per-key ordering (opt-in) |
| AWS SNS | Topic-based fanout | Best-effort ordering |
| RabbitMQ | Exchange/queue binding | FIFO within queue |

Summary

  • Pub/Sub decouples publishers from subscribers through an intermediary broker
  • Topic-based routing is the simplest and most common pattern
  • Presence tracking enables online/offline status in real-time systems
  • Message ordering requires sequence numbers and buffering
  • Causal ordering is achievable with modest complexity
  • Total ordering is expensive and often unnecessary

Next: Chat System Implementation →

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

Chat System Implementation

Session 7 - Full session (90 minutes)

Learning Objectives

  • Build a complete real-time chat system with WebSockets
  • Implement message ordering with sequence numbers
  • Handle presence management (online/offline users)
  • Add message persistence for history
  • Deploy multiple chat rooms using Docker Compose

System Architecture

Our chat system brings together all the concepts from Sessions 6-7:

graph TB
    subgraph "Clients"
        C1[User 1 Browser]
        C2[User 2 Browser]
        C3[User 3 Browser]
    end

    subgraph "Chat Server"
        WS[WebSocket Handler]
        PS[Pub/Sub Engine]
        SM[Sequence Manager]
        PM[Presence Manager]
        MS[Message Store]

        WS --> PS
        WS --> SM
        WS --> PM
        PS --> SM
        SM --> MS
    end

    C1 -->|WebSocket| WS
    C2 -->|WebSocket| WS
    C3 -->|WebSocket| WS

    subgraph "Persistence"
        DB[(Messages DB)]
    end

    MS --> DB

    style WS fill:#e3f2fd
    style PS fill:#fff3e0
    style SM fill:#f3e5f5

Key Components

| Component | Responsibility |
|-----------|----------------|
| WebSocket Handler | Manages client connections, sends/receives messages |
| Pub/Sub Engine | Routes messages to rooms, handles subscriptions |
| Sequence Manager | Assigns sequence numbers, ensures ordering |
| Presence Manager | Tracks online/offline status, heartbeat |
| Message Store | Persists messages for history and replay |

Message Flow

sequenceDiagram
    participant U1 as User 1
    participant WS as WebSocket Handler
    participant PS as Pub/Sub
    participant SM as Sequencer
    participant DB as Message Store
    participant U2 as User 2

    U1->>WS: CONNECT("general")
    WS->>PS: subscribe("general", U1)
    WS->>PM: mark_online(U1)
    PS->>U2: BROADCAST("User 1 joined")

    Note over U1,U2: Sending a message
    U1->>WS: SEND("general", "Hello!")
    WS->>PS: publish("general", msg)
    PS->>SM: get_sequence(msg)
    SM->>DB: save(msg, seq=1)
    SM->>PS: return seq=1
    PS->>U1: BROADCAST(msg, seq=1)
    PS->>U2: BROADCAST(msg, seq=1)

    Note over U1,U2: User 2 reconnects
    U2->>WS: CONNECT("general", last_seq=0)
    WS->>DB: get_messages(since=0)
    DB->>U2: REPLAY([msg1, msg2, ...])

TypeScript Implementation

Project Structure

chat-system/
├── package.json
├── tsconfig.json
├── src/
│   ├── types.ts          # Type definitions
│   ├── pub-sub.ts        # Pub/Sub engine
│   ├── sequencer.ts      # Sequence number manager
│   ├── presence.ts       # Presence management
│   ├── store.ts          # Message persistence
│   ├── server.ts         # WebSocket server
│   └── index.ts          # Entry point
├── public/
│   └── client.html       # Demo client
├── Dockerfile
└── docker-compose.yml

1. Type Definitions

// src/types.ts
export interface Message {
    id: string;
    room: string;
    user: string;
    content: string;
    sequence: number;
    timestamp: number;
}

export interface Client {
    id: string;
    user: string;
    rooms: Set<string>;
    ws: WebSocket;
    lastSeen: number;
}

export interface Presence {
    user: string;
    status: 'online' | 'offline' | 'away';
    lastSeen: number;
}

export type MessageHandler = (client: Client, message: Message) => void;

2. Pub/Sub Engine

// src/pub-sub.ts
import { Message, Client, MessageHandler } from './types';

export class PubSub {
    private subscriptions: Map<string, Set<Client>> = new Map();
    private handlers: Map<string, MessageHandler[]> = new Map();

    subscribe(room: string, client: Client): void {
        if (!this.subscriptions.has(room)) {
            this.subscriptions.set(room, new Set());
        }
        this.subscriptions.get(room)!.add(client);
        client.rooms.add(room);
    }

    unsubscribe(room: string, client: Client): void {
        const subs = this.subscriptions.get(room);
        if (subs) {
            subs.delete(client);
            if (subs.size === 0) {
                this.subscriptions.delete(room);
            }
        }
        client.rooms.delete(room);
    }

    publish(room: string, message: Message): void {
        const subs = this.subscriptions.get(room);
        if (subs) {
            for (const client of subs) {
                this.sendToClient(client, message);
            }
        }
        this.emit('message', message);
    }

    on(event: string, handler: MessageHandler): void {
        if (!this.handlers.has(event)) {
            this.handlers.set(event, []);
        }
        this.handlers.get(event)!.push(handler);
    }

    private emit(event: string, message: Message): void {
        const handlers = this.handlers.get(event) || [];
        handlers.forEach(h => h(null!, message));
    }

    private sendToClient(client: Client, message: Message): void {
        if (client.ws.readyState === client.ws.OPEN) {
            client.ws.send(JSON.stringify({
                type: 'message',
                data: message
            }));
        }
    }

    getSubscribers(room: string): Client[] {
        return Array.from(this.subscriptions.get(room) || []);
    }

    getRooms(): string[] {
        return Array.from(this.subscriptions.keys());
    }
}

3. Sequence Manager

// src/sequencer.ts
import { Message } from './types';

export class Sequencer {
    private sequences: Map<string, number> = new Map();

    getNext(room: string): number {
        const current = this.sequences.get(room) || 0;
        const next = current + 1;
        this.sequences.set(room, next);
        return next;
    }

    setCurrent(room: string, sequence: number): void {
        this.sequences.set(room, sequence);
    }

    getCurrent(room: string): number {
        return this.sequences.get(room) || 0;
    }

    sequenceMessage(message: Message): Message {
        const seq = this.getNext(message.room);
        return { ...message, sequence: seq };
    }
}

4. Presence Manager

// src/presence.ts
import { Client, Presence } from './types';

const HEARTBEAT_INTERVAL = 30000; // 30 seconds
const OFFLINE_TIMEOUT = 60000; // 60 seconds

export class PresenceManager {
    private users: Map<string, Presence> = new Map();
    private clients: Map<string, Client> = new Map();
    private intervals: Map<string, NodeJS.Timeout> = new Map();

    register(client: Client): void {
        this.clients.set(client.id, client);
        this.updatePresence(client.user, 'online');
        this.startHeartbeat(client);
    }

    unregister(client: Client): void {
        this.stopHeartbeat(client);
        this.clients.delete(client.id);
        this.updatePresence(client.user, 'offline');
    }

    updatePresence(user: string, status: 'online' | 'offline' | 'away'): void {
        this.users.set(user, {
            user,
            status,
            lastSeen: Date.now()
        });
    }

    getPresence(user: string): Presence | undefined {
        return this.users.get(user);
    }

    getOnlineUsers(): string[] {
        const now = Date.now();
        return Array.from(this.users.values())
            .filter(p => p.status === 'online' && (now - p.lastSeen) < OFFLINE_TIMEOUT)
            .map(p => p.user);
    }

    getPresenceInRoom(room: string): Presence[] {
        const now = Date.now();
        const usersInRoom = new Set<string>();

        for (const client of this.clients.values()) {
            if (client.rooms.has(room)) {
                usersInRoom.add(client.user);
            }
        }

        return Array.from(usersInRoom)
            .map(user => this.users.get(user)!)
            .filter(p => p && (now - p.lastSeen) < OFFLINE_TIMEOUT);
    }

    private startHeartbeat(client: Client): void {
        const interval = setInterval(() => {
            if (client.ws.readyState === client.ws.OPEN) {
                client.ws.send(JSON.stringify({ type: 'heartbeat' }));
                this.updatePresence(client.user, 'online');
            }
        }, HEARTBEAT_INTERVAL);

        this.intervals.set(client.id, interval);
    }

    private stopHeartbeat(client: Client): void {
        const interval = this.intervals.get(client.id);
        if (interval) {
            clearInterval(interval);
            this.intervals.delete(client.id);
        }
    }

    cleanup(): void {
        for (const interval of this.intervals.values()) {
            clearInterval(interval);
        }
        this.intervals.clear();
    }
}

5. Message Store

// src/store.ts
import { Message } from './types';
import fs from 'fs/promises';
import path from 'path';

export class MessageStore {
    private basePath: string;

    constructor(basePath: string = './data/messages') {
        this.basePath = basePath;
    }

    async save(message: Message): Promise<void> {
        const roomPath = path.join(this.basePath, message.room);
        await fs.mkdir(roomPath, { recursive: true });

        const filename = path.join(roomPath, `${message.sequence}.json`);
        await fs.writeFile(filename, JSON.stringify(message, null, 2));
    }

    async getMessages(room: string, since: number = 0, limit: number = 100): Promise<Message[]> {
        const roomPath = path.join(this.basePath, room);
        const messages: Message[] = [];

        try {
            const files = await fs.readdir(roomPath);
            const jsonFiles = files
                .filter(f => f.endsWith('.json'))
                .map(f => parseInt(f.replace('.json', '')))
                .filter(seq => seq > since)
                .sort((a, b) => a - b)
                .slice(0, limit);

            for (const seq of jsonFiles) {
                const content = await fs.readFile(path.join(roomPath, `${seq}.json`), 'utf-8');
                messages.push(JSON.parse(content));
            }
        } catch (err) {
            // Room doesn't exist yet
        }

        return messages;
    }

    async getLastSequence(room: string): Promise<number> {
        const roomPath = path.join(this.basePath, room);
        try {
            const files = await fs.readdir(roomPath);
            const sequences = files
                .filter(f => f.endsWith('.json'))
                .map(f => parseInt(f.replace('.json', '')));

            return sequences.length > 0 ? Math.max(...sequences) : 0;
        } catch {
            return 0;
        }
    }
}

6. WebSocket Server

// src/server.ts
import { WebSocketServer, WebSocket } from 'ws';
import { createServer } from 'http';
import { v4 as uuidv4 } from 'uuid';
import { PubSub } from './pub-sub';
import { Sequencer } from './sequencer';
import { PresenceManager } from './presence';
import { MessageStore } from './store';
import { Client, Message } from './types';

const PORT = process.env.PORT || 8080;

export class ChatServer {
    private wss: WebSocketServer;
    private pubSub: PubSub;
    private sequencer: Sequencer;
    private presence: PresenceManager;
    private store: MessageStore;

    constructor() {
        const server = createServer();
        this.wss = new WebSocketServer({ server });
        this.pubSub = new PubSub();
        this.sequencer = new Sequencer();
        this.presence = new PresenceManager();
        this.store = new MessageStore();

        this.setupHandlers();
    }

    private setupHandlers(): void {
        this.wss.on('connection', (ws: WebSocket) => {
            const clientId = uuidv4();
            const client: Client = {
                id: clientId,
                user: `user_${clientId.slice(0, 8)}`,
                rooms: new Set(),
                ws,
                lastSeen: Date.now()
            };

            console.log(`Client connected: ${client.id}`);

            ws.on('message', async (data: string) => {
                try {
                    const msg = JSON.parse(data);
                    await this.handleMessage(client, msg);
                } catch (err) {
                    console.error('Error handling message:', err);
                }
            });

            ws.on('close', () => {
                console.log(`Client disconnected: ${client.id}`);
                for (const room of client.rooms) {
                    this.pubSub.publish(room, {
                        id: uuidv4(),
                        room,
                        user: 'system',
                        content: `${client.user} left the room`,
                        sequence: this.sequencer.getCurrent(room),
                        timestamp: Date.now()
                    });
                    this.pubSub.unsubscribe(room, client);
                }
                this.presence.unregister(client);
            });

            // Send welcome message
            this.sendToClient(client, {
                type: 'connected',
                data: { clientId: client.id, user: client.user }
            });

            this.presence.register(client);
        });
    }

    private async handleMessage(client: Client, msg: any): Promise<void> {
        switch (msg.type) {
            case 'join':
                await this.handleJoin(client, msg.room);
                break;
            case 'leave':
                this.handleLeave(client, msg.room);
                break;
            case 'message':
                await this.handleChatMessage(client, msg.data);
                break;
            case 'presence':
                this.handlePresenceRequest(client, msg.room);
                break;
            case 'history':
                await this.handleHistoryRequest(client, msg.room, msg.since);
                break;
            default:
                console.log('Unknown message type:', msg.type);
        }
    }

    private async handleJoin(client: Client, room: string): Promise<void> {
        console.log(`${client.user} joining room: ${room}`);

        // Subscribe to room
        this.pubSub.subscribe(room, client);

        // Send current presence
        const presence = this.presence.getPresenceInRoom(room);
        this.sendToClient(client, {
            type: 'presence',
            data: { room, users: presence }
        });

        // Announce join
        this.pubSub.publish(room, {
            id: uuidv4(),
            room,
            user: 'system',
            content: `${client.user} joined the room`,
            sequence: this.sequencer.getCurrent(room),
            timestamp: Date.now()
        });

        // Send recent messages
        const history = await this.store.getMessages(room, 0, 50);
        if (history.length > 0) {
            this.sendToClient(client, {
                type: 'history',
                data: { room, messages: history }
            });
        }
    }

    private handleLeave(client: Client, room: string): void {
        console.log(`${client.user} leaving room: ${room}`);
        this.pubSub.unsubscribe(room, client);

        this.pubSub.publish(room, {
            id: uuidv4(),
            room,
            user: 'system',
            content: `${client.user} left the room`,
            sequence: this.sequencer.getCurrent(room),
            timestamp: Date.now()
        });
    }

    private async handleChatMessage(client: Client, data: any): Promise<void> {
        const { room, content } = data;

        if (!client.rooms.has(room)) {
            this.sendError(client, 'Not subscribed to room');
            return;
        }

        const message: Message = {
            id: uuidv4(),
            room,
            user: client.user,
            content,
            sequence: 0, // Will be assigned
            timestamp: Date.now()
        };

        // Assign sequence number
        const sequenced = this.sequencer.sequenceMessage(message);

        // Save to store
        await this.store.save(sequenced);

        // Publish to all subscribers
        this.pubSub.publish(room, sequenced);

        console.log(`[${room}] ${client.user}: ${content} (seq: ${sequenced.sequence})`);
    }

    private handlePresenceRequest(client: Client, room: string): void {
        const presence = this.presence.getPresenceInRoom(room);
        this.sendToClient(client, {
            type: 'presence',
            data: { room, users: presence }
        });
    }

    private async handleHistoryRequest(client: Client, room: string, since: number = 0): Promise<void> {
        const messages = await this.store.getMessages(room, since);
        this.sendToClient(client, {
            type: 'history',
            data: { room, messages }
        });
    }

    private sendToClient(client: Client, data: any): void {
        if (client.ws.readyState === client.ws.OPEN) {
            client.ws.send(JSON.stringify(data));
        }
    }

    private sendError(client: Client, message: string): void {
        this.sendToClient(client, {
            type: 'error',
            data: { message }
        });
    }

    listen(): void {
        this.httpServer.listen(PORT, () => {
            console.log(`Chat server listening on port ${PORT}`);
        });
    }
}

7. Entry Point

// src/index.ts
import { ChatServer } from './server';

const server = new ChatServer();
server.listen();

8. Package.json

{
  "name": "chat-system",
  "version": "1.0.0",
  "description": "Real-time chat system with WebSockets",
  "main": "dist/index.js",
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "ts-node src/index.ts"
  },
  "dependencies": {
    "ws": "^8.18.0",
    "uuid": "^11.0.3"
  },
  "devDependencies": {
    "@types/node": "^22.10.2",
    "@types/ws": "^8.5.13",
    "@types/uuid": "^10.0.0",
    "ts-node": "^10.9.2",
    "typescript": "^5.7.2"
  }
}

9. Dockerfile

FROM node:20-alpine

WORKDIR /app

COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

EXPOSE 8080

CMD ["npm", "start"]

10. Docker Compose

version: '3.8'

services:
  chat:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - ./data:/app/data
    environment:
      - PORT=8080
    restart: unless-stopped

Python Implementation

Project Structure

chat-system/
├── requirements.txt
├── src/
│   ├── __init__.py
│   ├── types.py
│   ├── pub_sub.py
│   ├── sequencer.py
│   ├── presence.py
│   ├── store.py
│   ├── server.py
│   └── main.py
├── public/
│   └── client.html
├── Dockerfile
└── docker-compose.yml

1. Type Definitions

# src/types.py
from dataclasses import dataclass, field
from typing import Optional, Set
import datetime

import websockets.server

@dataclass
class Message:
    id: str
    room: str
    user: str
    content: str
    sequence: int
    timestamp: float

@dataclass(eq=False)  # keep identity-based hashing so clients can live in sets
class Client:
    id: str
    user: str
    rooms: Set[str] = field(default_factory=set)
    websocket: Optional[websockets.server.WebSocketServerProtocol] = None
    last_seen: float = field(default_factory=lambda: datetime.datetime.now().timestamp())

@dataclass
class Presence:
    user: str
    status: str  # 'online', 'offline', 'away'
    last_seen: float

2. Pub/Sub Engine

# src/pub_sub.py
import json
from typing import Dict, Set, List, Callable
from .types import Message, Client

class PubSub:
    def __init__(self):
        self.subscriptions: Dict[str, Set[Client]] = {}
        self.handlers: Dict[str, List[Callable]] = {}

    def subscribe(self, room: str, client: Client) -> None:
        if room not in self.subscriptions:
            self.subscriptions[room] = set()
        self.subscriptions[room].add(client)
        client.rooms.add(room)

    def unsubscribe(self, room: str, client: Client) -> None:
        if room in self.subscriptions:
            self.subscriptions[room].discard(client)
            if not self.subscriptions[room]:
                del self.subscriptions[room]
        client.rooms.discard(room)

    async def publish(self, room: str, message: Message) -> None:
        if room in self.subscriptions:
            for client in self.subscriptions[room]:
                await self._send_to_client(client, message)
        await self._emit('message', message)

    async def _send_to_client(self, client: Client, message: Message) -> None:
        if client.websocket and not client.websocket.closed:
            await client.websocket.send(json.dumps({
                'type': 'message',
                'data': message.__dict__
            }))

    async def _emit(self, event: str, message: Message) -> None:
        for handler in self.handlers.get(event, []):
            await handler(message)

    def get_subscribers(self, room: str) -> List[Client]:
        return list(self.subscriptions.get(room, set()))

    def get_rooms(self) -> List[str]:
        return list(self.subscriptions.keys())

3. Sequence Manager

# src/sequencer.py
from typing import Dict
from .types import Message

class Sequencer:
    def __init__(self):
        self.sequences: Dict[str, int] = {}

    def get_next(self, room: str) -> int:
        current = self.sequences.get(room, 0)
        next_seq = current + 1
        self.sequences[room] = next_seq
        return next_seq

    def set_current(self, room: str, sequence: int) -> None:
        self.sequences[room] = sequence

    def get_current(self, room: str) -> int:
        return self.sequences.get(room, 0)

    def sequence_message(self, message: Message) -> Message:
        seq = self.get_next(message.room)
        message.sequence = seq
        return message

4. Presence Manager

# src/presence.py
import asyncio
import datetime
import json
from typing import Dict, List
from .types import Client, Presence

HEARTBEAT_INTERVAL = 30  # seconds
OFFLINE_TIMEOUT = 60  # seconds

class PresenceManager:
    def __init__(self):
        self.users: Dict[str, Presence] = {}
        self.clients: Dict[str, Client] = {}
        self.tasks: Dict[str, asyncio.Task] = {}

    def register(self, client: Client) -> None:
        self.clients[client.id] = client
        self.update_presence(client.user, 'online')
        self.tasks[client.id] = asyncio.create_task(self._heartbeat(client))

    def unregister(self, client: Client) -> None:
        if client.id in self.tasks:
            self.tasks[client.id].cancel()
            del self.tasks[client.id]
        if client.id in self.clients:
            del self.clients[client.id]
        self.update_presence(client.user, 'offline')

    def update_presence(self, user: str, status: str) -> None:
        self.users[user] = Presence(
            user=user,
            status=status,
            last_seen=datetime.datetime.now().timestamp()
        )

    def get_presence(self, user: str) -> Presence | None:
        return self.users.get(user)

    def get_online_users(self) -> List[str]:
        now = datetime.datetime.now().timestamp()
        return [
            p.user for p in self.users.values()
            if p.status == 'online' and (now - p.last_seen) < OFFLINE_TIMEOUT
        ]

    def get_presence_in_room(self, room: str) -> List[Presence]:
        now = datetime.datetime.now().timestamp()
        users_in_room = set()

        for client in self.clients.values():
            if room in client.rooms:
                users_in_room.add(client.user)

        return [
            self.users.get(user)
            for user in users_in_room
            if user in self.users and (now - self.users[user].last_seen) < OFFLINE_TIMEOUT
        ]

    async def _heartbeat(self, client: Client) -> None:
        while True:
            try:
                if client.websocket and not client.websocket.closed:
                    await client.websocket.send(json.dumps({'type': 'heartbeat'}))
                    self.update_presence(client.user, 'online')
                await asyncio.sleep(HEARTBEAT_INTERVAL)
            except asyncio.CancelledError:
                break
            except Exception:
                # Ignore transient send failures; the next beat retries
                await asyncio.sleep(HEARTBEAT_INTERVAL)

    def cleanup(self) -> None:
        for task in self.tasks.values():
            task.cancel()
        self.tasks.clear()

5. Message Store

# src/store.py
import os
import json
from pathlib import Path
from typing import List

import aiofiles

from .types import Message

class MessageStore:
    def __init__(self, base_path: str = './data/messages'):
        self.base_path = Path(base_path)

    async def save(self, message: Message) -> None:
        room_path = self.base_path / message.room
        room_path.mkdir(parents=True, exist_ok=True)

        # Write asynchronously so we don't block the event loop
        filename = room_path / f'{message.sequence}.json'
        async with aiofiles.open(filename, 'w') as f:
            await f.write(json.dumps(message.__dict__, indent=2))

    async def get_messages(self, room: str, since: int = 0, limit: int = 100) -> List[Message]:
        room_path = self.base_path / room
        messages = []

        if not room_path.exists():
            return messages

        try:
            files = [f for f in os.listdir(room_path) if f.endswith('.json')]
            sequences = sorted([
                int(f.replace('.json', ''))
                for f in files
                if int(f.replace('.json', '')) > since
            ])[:limit]

            for seq in sequences:
                async with aiofiles.open(room_path / f'{seq}.json', 'r') as f:
                    data = json.loads(await f.read())
                    messages.append(Message(**data))
        except FileNotFoundError:
            pass

        return messages

    async def get_last_sequence(self, room: str) -> int:
        room_path = self.base_path / room
        if not room_path.exists():
            return 0

        try:
            files = [f for f in os.listdir(room_path) if f.endswith('.json')]
            sequences = [int(f.replace('.json', '')) for f in files]
            return max(sequences) if sequences else 0
        except FileNotFoundError:
            return 0

6. WebSocket Server

# src/server.py
import asyncio
import json
import os
import time
import uuid
from typing import Any

import websockets

from .pub_sub import PubSub
from .sequencer import Sequencer
from .presence import PresenceManager
from .store import MessageStore
from .types import Client, Message

PORT = int(os.getenv('PORT', 8080))

class ChatServer:
    def __init__(self):
        self.pub_sub = PubSub()
        self.sequencer = Sequencer()
        self.presence = PresenceManager()
        self.store = MessageStore()

    async def handle_client(self, websocket, path):
        client_id = str(uuid.uuid4())
        client = Client(
            id=client_id,
            user=f"user_{client_id[:8]}",
            websocket=websocket,
            rooms=set()
        )

        print(f"Client connected: {client.id}")

        await self._send_to_client(client, {
            'type': 'connected',
            'data': {'clientId': client.id, 'user': client.user}
        })

        self.presence.register(client)

        try:
            async for message in websocket:
                msg = json.loads(message)
                await self.handle_message(client, msg)
        except websockets.exceptions.ConnectionClosed:
            print(f"Client disconnected: {client.id}")
        finally:
            for room in list(client.rooms):
                await self.pub_sub.publish(room, Message(
                    id=str(uuid.uuid4()),
                    room=room,
                    user='system',
                    content=f"{client.user} left the room",
                    sequence=self.sequencer.get_current(room),
                    timestamp=time.time()
                ))
                self.pub_sub.unsubscribe(room, client)
            self.presence.unregister(client)

    async def handle_message(self, client: Client, msg: Any) -> None:
        handlers = {
            'join': self.handle_join,
            'leave': self.handle_leave,
            'message': self.handle_chat_message,
            'presence': self.handle_presence_request,
            'history': self.handle_history_request
        }

        handler = handlers.get(msg.get('type'))
        if handler:
            await handler(client, msg)
        else:
            print(f"Unknown message type: {msg.get('type')}")

    async def handle_join(self, client: Client, msg: Any) -> None:
        room = msg.get('room')
        print(f"{client.user} joining room: {room}")

        self.pub_sub.subscribe(room, client)

        presence = self.presence.get_presence_in_room(room)
        await self._send_to_client(client, {
            'type': 'presence',
            'data': {'room': room, 'users': [p.__dict__ for p in presence]}
        })

        await self.pub_sub.publish(room, Message(
            id=str(uuid.uuid4()),
            room=room,
            user='system',
            content=f"{client.user} joined the room",
            sequence=self.sequencer.get_current(room),
            timestamp=time.time()
        ))

        history = await self.store.get_messages(room, 0, 50)
        if history:
            await self._send_to_client(client, {
                'type': 'history',
                'data': {'room': room, 'messages': [m.__dict__ for m in history]}
            })

    async def handle_leave(self, client: Client, msg: Any) -> None:
        room = msg.get('room')
        print(f"{client.user} leaving room: {room}")
        self.pub_sub.unsubscribe(room, client)

    async def handle_chat_message(self, client: Client, msg: Any) -> None:
        data = msg.get('data', {})
        room = data.get('room')

        if room not in client.rooms:
            await self._send_error(client, 'Not subscribed to room')
            return

        message = Message(
            id=str(uuid.uuid4()),
            room=room,
            user=client.user,
            content=data.get('content', ''),
            sequence=0,
            timestamp=time.time()
        )

        sequenced = self.sequencer.sequence_message(message)
        await self.store.save(sequenced)
        await self.pub_sub.publish(room, sequenced)

        print(f"[{room}] {client.user}: {sequenced.content} (seq: {sequenced.sequence})")

    async def handle_presence_request(self, client: Client, msg: Any) -> None:
        room = msg.get('room')
        presence = self.presence.get_presence_in_room(room)
        await self._send_to_client(client, {
            'type': 'presence',
            'data': {'room': room, 'users': [p.__dict__ for p in presence]}
        })

    async def handle_history_request(self, client: Client, msg: Any) -> None:
        room = msg.get('room')
        since = msg.get('since', 0)
        messages = await self.store.get_messages(room, since)
        await self._send_to_client(client, {
            'type': 'history',
            'data': {'room': room, 'messages': [m.__dict__ for m in messages]}
        })

    async def _send_to_client(self, client: Client, data: Any) -> None:
        if client.websocket and not client.websocket.closed:
            await client.websocket.send(json.dumps(data))

    async def _send_error(self, client: Client, message: str) -> None:
        await self._send_to_client(client, {
            'type': 'error',
            'data': {'message': message}
        })

    async def start(self):
        print(f"Chat server listening on port {PORT}")
        async with websockets.serve(self.handle_client, "", PORT):
            await asyncio.Future()  # Run forever

7. Entry Point

# src/main.py
import asyncio

from src.server import ChatServer

async def main():
    server = ChatServer()
    await server.start()

if __name__ == '__main__':
    asyncio.run(main())

8. Requirements

websockets==13.1
aiofiles==24.1.0

9. Dockerfile

FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8080

CMD ["python", "src/main.py"]

10. Docker Compose

version: '3.8'

services:
  chat:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - ./data:/app/data
    environment:
      - PORT=8080
    restart: unless-stopped

Running the Chat System

TypeScript

# Install dependencies
npm install

# Build
npm run build

# Start server
npm start

# With Docker Compose
docker-compose up

Python

# Install dependencies
pip install -r requirements.txt

# Start server
python -m src.main

# With Docker Compose
docker-compose up

Exercises

Exercise 1: Basic Chat Operations

  1. Start the chat server
  2. Connect two WebSocket clients (a minimal client sketch follows below)
  3. Join the same room
  4. Send messages and verify both clients receive them
  5. Leave the room and verify the broadcast
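
For step 2, a minimal test client helps. Here is a hedged sketch using the same ws package the server depends on; the file name test-client.ts and the room name general are just placeholders:

// test-client.ts — run two copies, e.g. with `npx ts-node test-client.ts`
import WebSocket from 'ws';

const ws = new WebSocket('ws://localhost:8080');

ws.on('open', () => {
    // Join a room, then send one chat message
    ws.send(JSON.stringify({ type: 'join', room: 'general' }));
    ws.send(JSON.stringify({
        type: 'message',
        data: { room: 'general', content: 'hello from a test client' }
    }));
});

ws.on('message', (data) => {
    console.log('received:', data.toString());
});

With two copies running, each client should log the other's chat message along with the system join broadcasts.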

Exercise 2: Message Ordering

  1. Send multiple messages rapidly from different clients
  2. Verify all messages have unique, sequential sequence numbers
  3. Disconnect and reconnect a client
  4. Request message history and verify ordering is preserved

Exercise 3: Presence Management

  1. Connect multiple clients to different rooms
  2. Join a room and verify presence broadcasts
  3. Simulate a network failure (kill a client without proper leave)
  4. Verify offline detection kicks in after timeout

Exercise 4: Message Persistence

  1. Send messages to a room
  2. Stop the server
  3. Verify messages are saved to disk
  4. Restart the server
  5. Connect a new client and verify it receives message history

Common Pitfalls

| Issue | Cause | Fix |
|---|---|---|
| Messages not ordered | Missing sequence numbers | Always sequence before publishing |
| Old messages not received | Not requesting history on join | Implement replay on connect |
| Presence shows offline | Heartbeat not sent | Ensure heartbeat loop is running |
| Duplicate messages | Re-publishing saved messages | Only publish new messages, not history |

Key Takeaways

  • Pub/Sub enables scalable multi-room communication
  • Sequence numbers guarantee message ordering across all clients
  • Presence management requires both active heartbeats and passive timeout detection
  • Message persistence allows clients to reconnect and receive history
  • Docker Compose simplifies deployment and testing of the complete system

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

What is Consensus?

Session 8 - Full session

Learning Objectives

  • Understand the consensus problem in distributed systems
  • Learn the difference between safety and liveness properties
  • Explore the FLP impossibility result
  • Understand why consensus algorithms are necessary
  • Compare Raft and Paxos approaches

The Consensus Problem

In distributed systems, consensus is the problem of getting multiple nodes to agree on a single value. This sounds simple, but it's fundamental to building reliable distributed systems.

Why Do We Need Consensus?

Consider these scenarios:

  • Leader Election: Multiple nodes need to agree on who is the leader
  • Configuration Changes: All nodes must agree on a new configuration
  • Replicated State Machines: All nodes must apply operations in the same order
  • Distributed Transactions: All participants must agree to commit or abort

Without consensus, distributed systems can suffer from:

  • Split-brain scenarios (multiple leaders)
  • Inconsistent state across nodes
  • Data corruption from conflicting writes
  • Unavailable systems during network partitions
graph LR
    subgraph "Without Consensus"
        N1[Node A: value=1]
        N2[Node B: value=2]
        N3[Node C: value=3]
        N1 --- N2 --- N3
        Problem[Which value is correct?]
    end

    subgraph "With Consensus"
        A1[Node A: value=2]
        A2[Node B: value=2]
        A3[Node C: value=2]
        A1 --- A2 --- A3
        Solved[All nodes agree]
    end

Formal Definition

The consensus problem requires a system to satisfy these properties:

1. Agreement (Safety)

All correct nodes must agree on the same value.

If node A outputs v and node B outputs v', then v = v'

2. Validity

If all correct nodes propose the same value v, then all correct nodes decide v.

The decided value must have been proposed by some node

3. Termination (Liveness)

All correct nodes eventually decide on some value.

The algorithm must make progress, not run forever

4. Integrity

Each node decides at most once.

A node cannot change its decision after deciding


Safety vs Liveness

Understanding the trade-off between safety and liveness is crucial for distributed systems:

graph TB
    subgraph "Safety Properties"
        S1[Agreement]
        S2[Validity]
        S3[Integrity]
        style S1 fill:#90EE90
        style S2 fill:#90EE90
        style S3 fill:#90EE90
    end

    subgraph "Liveness Properties"
        L1[Termination]
        L2[Progress]
        style L1 fill:#FFB6C1
        style L2 fill:#FFB6C1
    end

    Safety["Nothing bad happens<br/>State is always consistent"]
    Liveness["Something good happens<br/>System makes progress"]

    S1 & S2 & S3 --> Safety
    L1 & L2 --> Liveness

    Safety --> Tradeoff["In networks,<br/>you can't guarantee both<br/>during partitions"]
    Liveness --> Tradeoff

| Safety | Liveness |
|---|---|
| "Nothing bad happens" | "Something good happens" |
| State is always valid | System makes progress |
| No corruption, no conflicts | Operations complete eventually |
| Can be maintained during partitions | May be sacrificed during partitions |

Example: During a network partition (CAP theorem), a CP system maintains safety (no inconsistent writes) but sacrifices liveness (writes may be rejected). An AP system maintains liveness (writes succeed) but may sacrifice safety (temporary inconsistencies).


Why Consensus is Hard

Challenge 1: No Global Clock

Nodes don't share a synchronized clock, making it hard to order events:

sequenceDiagram
    participant A as Node A (t=10:00:01)
    participant B as Node B (t=10:00:05)
    participant C as Node C (t=10:00:03)

    Note over A: A proposes value=1
    A->>B: send(value=1)
    Note over B: B receives at t=10:00:07

    Note over C: C proposes value=2
    C->>B: send(value=2)
    Note over B: B receives at t=10:00:08

    Note over B: Which value came first?

Challenge 2: Message Loss and Delays

Messages can be lost, delayed, or reordered:

stateDiagram-v2
    [*] --> Sent: Node sends message
    Sent --> Delivered: Message arrives
    Sent --> Lost: Message lost
    Sent --> Delayed: Network slow
    Delayed --> Delivered: Eventually arrives
    Lost --> Retry: Node resends
    Delivered --> [*]
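
The standard workaround for loss is to resend until acknowledged, which can in turn create duplicates, so receivers must deduplicate (for example, by message id). A hedged sketch, where the send callback stands in for any transport that reports acknowledgements:

// retry.ts: resend with backoff until acknowledged (sketch, not a real transport)
async function sendWithRetry(
    send: (msg: string) => Promise<boolean>,  // resolves true when the peer acks
    msg: string,
    maxAttempts = 5,
    backoffMs = 100
): Promise<boolean> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        if (await send(msg)) return true;  // acknowledged
        // Back off before retrying; note the peer may receive duplicates
        await new Promise(resolve => setTimeout(resolve, backoffMs * attempt));
    }
    return false;  // give up; the caller decides what "unreachable" means
}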

Challenge 3: Node Failures

Nodes can crash at any time, potentially while holding critical information:

graph TB
    subgraph "Cluster State"
        N1[Node 1: Alive]
        N2[Node 2: CRASHED<br/>Had uncommitted data]
        N3[Node 3: Alive]
        N4[Node 4: Alive]

        N1 --- N2
        N2 --- N3
        N3 --- N4
    end

    Q[What happens to<br/>Node 2's data?]

The FLP Impossibility Result

In 1985, Fischer, Lynch, and Paterson proved the FLP Impossibility Result:

In an asynchronous network, even with only one node that may crash, no deterministic consensus algorithm can guarantee both safety and termination (liveness).

What This Means

graph TB
    A[Asynchronous Network] --> B[No timing assumptions]
    B --> C[Messages can take arbitrarily long]
    C --> D[Cannot distinguish slow node from crashed node]
    D --> E[Cannot guarantee termination]
    E --> F[FLP: Consensus impossible<br/>in pure async systems]

How We Work Around It

Real systems handle FLP by relaxing some assumptions:

  1. Partial Synchrony: Assume networks are eventually synchronous
  2. Randomization: Use randomized algorithms (e.g., randomized election timeouts)
  3. Failure Detectors: Use unreliable failure detectors (see the sketch below)
  4. Timeouts: Assume messages arrive within some time bound
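
Workarounds 3 and 4 combine naturally. Here is a hedged sketch of a timeout-based failure detector; it is unreliable by design, because a slow node looks exactly like a crashed one, which is the ambiguity FLP formalizes:

// failure-detector.ts: timeout-based, unreliable failure detector (sketch)
class FailureDetector {
    private lastHeard = new Map<string, number>();

    constructor(private timeoutMs: number) {}

    // Record that we heard from a node (e.g., received its heartbeat)
    heartbeat(nodeId: string): void {
        this.lastHeard.set(nodeId, Date.now());
    }

    // A suspicion, not a fact: the node may merely be slow
    isSuspected(nodeId: string): boolean {
        const last = this.lastHeard.get(nodeId);
        return last === undefined || Date.now() - last > this.timeoutMs;
    }
}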

Key Insight: Raft works in "partially synchronous" systems—networks may behave asynchronously for a while, but eventually become synchronous.


Real-World Consensus Scenarios

Scenario 1: Distributed Configuration

All nodes must agree on cluster membership:

sequenceDiagram
    autonumber
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3
    participant N4 as New Node

    N4->>N1: Request to join
    N1->>N2: Propose add Node 4
    N1->>N3: Propose add Node 4

    N2->>N1: Vote YES
    N3->>N1: Vote YES

    N1->>N2: Commit: add Node 4
    N1->>N3: Commit: add Node 4
    N1->>N4: You're in!

    Note over N1,N4: All nodes now agree<br/>cluster has 4 members

Scenario 2: Replicated State Machine

All replicas must apply operations in the same order:

graph LR
    C[Client] --> L[Leader]

    subgraph "Replicated Log"
        L1[Leader: SET x=1]
        F1[Follower 1: SET x=1]
        F2[Follower 2: SET x=1]
        F3[Follower 3: SET x=1]
        L1 --- F1 --- F2 --- F3
    end

    subgraph "State Machine"
        S1[Leader: x=1]
        S2[Follower 1: x=1]
        S3[Follower 2: x=1]
        S4[Follower 3: x=1]
    end

    L --> L1
    F1 --> S2
    F2 --> S3
    F3 --> S4

Consensus Algorithms: Raft vs Paxos

Paxos (1998)

Paxos was the first practical consensus algorithm, but it's notoriously difficult to understand:

Phase 1a (Prepare):  Proposer chooses proposal number n, sends Prepare(n)
Phase 1b (Promise):  Acceptor promises not to accept proposals < n
Phase 2a (Accept):   Proposer sends Accept(n, value)
Phase 2b (Accepted): Acceptor accepts if no higher proposal seen

Pros:

  • Proven correct
  • Handles any number of failures
  • Minimal message complexity

Cons:

  • Extremely difficult to understand
  • Hard to implement correctly
  • Multi-Paxos adds complexity
  • No leader by default

Raft (2014)

Raft was designed specifically for understandability:

graph TB
    subgraph "Raft Components"
        LE[Leader Election]
        LR[Log Replication]
        SM[State Machine]
        Safety[Safety Properties]

        LE --> LR
        LR --> SM
        Safety --> LE
        Safety --> LR
    end

Pros:

  • Designed for understandability
  • Clear separation of concerns
  • Strong leader simplifies logic
  • Practical implementation guidance
  • Widely adopted

Cons:

  • Leader can be bottleneck
  • Not as optimized as Multi-Paxos variants

When Do You Need Consensus?

Use consensus when:

| Scenario | Need Consensus? | Reason |
|---|---|---|
| Single-node database | No | No distributed state |
| Multi-master replication | Yes | Must agree on write order |
| Leader election | Yes | Must agree on who is leader |
| Configuration management | Yes | All nodes need same config |
| Distributed lock service | Yes | Must agree on lock holder |
| Load balancer state | No | Stateless, can be rebuilt |
| Cache invalidation | Sometimes | Depends on consistency needs |

When You DON'T Need Consensus

  • Read-only systems: No state to agree on
  • Eventual consistency is enough: Last-write-wins suffices
  • Conflict-free replicated data types (CRDTs): Mathematically resolve conflicts (see the G-Counter sketch below)
  • Single source of truth: Centralized authority
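
To make the CRDT idea concrete, here is a hedged sketch of a grow-only counter (G-Counter). Replicas merge without any consensus because the merge (element-wise max of per-node counts) is commutative, associative, and idempotent:

// g-counter.ts: grow-only counter CRDT (sketch)
class GCounter {
    private counts = new Map<string, number>();

    constructor(private nodeId: string) {}

    // Each replica only ever increments its own slot
    increment(): void {
        this.counts.set(this.nodeId, (this.counts.get(this.nodeId) ?? 0) + 1);
    }

    // Merging takes the max per slot: safe to apply in any order, any number of times
    merge(other: GCounter): void {
        for (const [id, n] of other.counts) {
            this.counts.set(id, Math.max(this.counts.get(id) ?? 0, n));
        }
    }

    // The counter's value is the sum over all slots
    value(): number {
        let total = 0;
        for (const n of this.counts.values()) total += n;
        return total;
    }
}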

Simple Consensus Example

Let's look at a simplified consensus scenario: agreeing on a counter value.

TypeScript Example

// A simple consensus simulation
interface Proposal {
  value: number;
  proposerId: string;
}

class ConsensusNode {
  private proposals: Map<string, Proposal> = new Map();
  private decidedValue?: number;
  private nodeId: string;

  constructor(nodeId: string) {
    this.nodeId = nodeId;
  }

  // Propose a value
  propose(value: number): void {
    const proposal: Proposal = {
      value,
      proposerId: this.nodeId
    };
    this.proposals.set(this.nodeId, proposal);
    this.broadcastProposal(proposal);
  }

  // Receive a proposal from another node
  receiveProposal(proposal: Proposal): void {
    this.proposals.set(proposal.proposerId, proposal);
    this.checkConsensus();
  }

  // Check if we have consensus
  private checkConsensus(): void {
    if (this.decidedValue !== undefined) return;

    const values = Array.from(this.proposals.values()).map(p => p.value);
    const counts = new Map<number, number>();

    for (const value of values) {
      counts.set(value, (counts.get(value) || 0) + 1);
    }

    // Simple majority among the proposals seen so far
    // (a real system would count against the full cluster size)
    for (const [value, count] of counts.entries()) {
      if (count > Math.floor(this.proposals.size / 2)) {
        this.decidedValue = value;
        console.log(`Node ${this.nodeId} decided on value: ${value}`);
        return;
      }
    }
  }

  private broadcastProposal(proposal: Proposal): void {
    // In a real system, this would send to other nodes
    console.log(`Node ${this.nodeId} broadcasting proposal: ${proposal.value}`);
  }
}

// Example usage: there is no real network here, so deliver proposals by hand
const node1 = new ConsensusNode('node1');
const node2 = new ConsensusNode('node2');
const node3 = new ConsensusNode('node3');

node1.propose(42);
node2.propose(42);
node3.propose(99);  // Minority, should lose

node1.receiveProposal({ value: 42, proposerId: 'node2' });
node1.receiveProposal({ value: 99, proposerId: 'node3' });
node2.receiveProposal({ value: 42, proposerId: 'node1' });
node2.receiveProposal({ value: 99, proposerId: 'node3' });
node3.receiveProposal({ value: 42, proposerId: 'node1' });
node3.receiveProposal({ value: 42, proposerId: 'node2' });
// Each node now sees 42 twice (a majority of 3) and decides 42

Python Example

from dataclasses import dataclass
from typing import Optional, Dict

@dataclass
class Proposal:
    value: int
    proposer_id: str

class ConsensusNode:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.proposals: Dict[str, Proposal] = {}
        self.decided_value: Optional[int] = None

    def propose(self, value: int) -> None:
        """Propose a value to the group."""
        proposal = Proposal(value, self.node_id)
        self.proposals[self.node_id] = proposal
        self._broadcast_proposal(proposal)
        self._check_consensus()

    def receive_proposal(self, proposal: Proposal) -> None:
        """Receive a proposal from another node."""
        self.proposals[proposal.proposer_id] = proposal
        self._check_consensus()

    def _check_consensus(self) -> None:
        """Check if we have consensus on a value."""
        if self.decided_value is not None:
            return

        if not self.proposals:
            return

        # Count occurrences of each value
        counts = {}
        for proposal in self.proposals.values():
            counts[proposal.value] = counts.get(proposal.value, 0) + 1

        # Simple majority among the proposals seen so far
        # (a real system would count against the full cluster size)
        total_nodes = len(self.proposals)
        for value, count in counts.items():
            if count > total_nodes // 2:
                self.decided_value = value
                print(f"Node {self.node_id} decided on value: {value}")
                return

    def _broadcast_proposal(self, proposal: Proposal) -> None:
        """Broadcast proposal to other nodes."""
        print(f"Node {self.node_id} broadcasting proposal: {proposal.value}")

# Example usage: there is no real network here, so deliver proposals by hand
if __name__ == "__main__":
    node1 = ConsensusNode("node1")
    node2 = ConsensusNode("node2")
    node3 = ConsensusNode("node3")

    node1.propose(42)
    node2.propose(42)
    node3.propose(99)  # Minority, should lose

    node1.receive_proposal(Proposal(42, "node2"))
    node1.receive_proposal(Proposal(99, "node3"))
    node2.receive_proposal(Proposal(42, "node1"))
    node2.receive_proposal(Proposal(99, "node3"))
    node3.receive_proposal(Proposal(42, "node1"))
    node3.receive_proposal(Proposal(42, "node2"))
    # Each node now sees 42 twice (a majority of 3) and decides 42

Common Pitfalls

| Pitfall | Description | Solution |
|---|---|---|
| Split Brain | Multiple leaders think they're in charge | Use quorum-based voting |
| Stale Reads | Reading from nodes that haven't received updates | Read from leader or use quorum reads |
| Network Partition Handling | Nodes can't communicate but continue operating | Require quorum for operations |
| Partial Failures | Some nodes fail, others continue | Design for fault tolerance |
| Clock Skew | Different clocks cause ordering issues | Use logical clocks (Lamport timestamps; see the sketch below) |
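
For the clock-skew row, here is a hedged sketch of a Lamport logical clock: tick on every local event or send, and on receipt jump to the max of local and received time plus one. If event A causally precedes event B, A's timestamp is guaranteed to be smaller:

// lamport.ts: Lamport logical clock (sketch)
class LamportClock {
    private time = 0;

    // Called for a local event, or just before sending a message
    tick(): number {
        this.time += 1;
        return this.time;
    }

    // Called when a message stamped with remoteTime arrives
    onReceive(remoteTime: number): number {
        this.time = Math.max(this.time, remoteTime) + 1;
        return this.time;
    }
}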

Summary

Key Takeaways

  1. Consensus is the problem of getting multiple distributed nodes to agree on a single value
  2. Safety ensures nothing bad happens (agreement, validity, integrity)
  3. Liveness ensures something good happens (termination, progress)
  4. FLP Impossibility proves consensus is impossible in pure asynchronous systems
  5. Real systems work around FLP using partial synchrony and timeouts
  6. Raft was designed for understandability, unlike the complex Paxos algorithm

Next Session

In the next session, we'll dive into the Raft algorithm itself:

  • Raft's design philosophy
  • Node states (Follower, Candidate, Leader)
  • How leader election works
  • How log replication maintains consistency

Exercises

  1. Safety vs Liveness: Give an example of a system that prioritizes safety over liveness, and one that does the opposite.

  2. FLP Scenario: Describe a scenario where FLP would cause problems in a real distributed system.

  3. Consensus Need: For each of these systems, explain whether they need consensus and why:

    • A distributed key-value store
    • A CDN (content delivery network)
    • A distributed task queue
    • A blockchain system
  4. Simple Consensus: Extend the simple consensus example to handle node failures (a node stops responding).

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

The Raft Algorithm

Session 9, Part 1 - 25 minutes

Learning Objectives

  • Understand Raft's design philosophy
  • Learn the three states of a Raft node
  • Explore how Raft handles consensus through leader election and log replication
  • Understand the concept of terms in Raft
  • Learn Raft's safety properties

Raft Design Philosophy

Raft was designed by Diego Ongaro and John Ousterhout in 2014 with a specific goal: understandability. Unlike Paxos, which was notoriously difficult to understand and implement correctly, Raft separates the consensus problem into clear, manageable subproblems.

Core Design Principles

  1. Strong Leader: Raft uses a strong leader approach—all log entries flow through the leader
  2. Leader Completeness: Once a log entry is committed, it stays in the log of all future leaders
  3. Decomposition: Break consensus into three subproblems:
    • Leader election
    • Log replication
    • Safety

Why "Raft"?

The name is an analogy: just as a raft is built from logs lashed together, Raft lashes the nodes' replicated logs together so the whole cluster stays afloat and moves in the same direction.


Raft Overview

graph TB
    subgraph "Raft Consensus"
        Client[Client]

        subgraph "Cluster"
            L[Leader]
            F1[Follower 1]
            F2[Follower 2]
            F3[Follower 3]

            L --> F1
            L --> F2
            L --> F3
        end

        Client -->|Write Request| L
        L -->|AppendEntries| F1 & F2 & F3
        F1 & F2 & F3 -->|Ack| L
        L -->|Response| Client
    end

Key Concepts

| Concept | Description |
|---|---|
| Leader | The only node that handles client requests and appends entries to the log |
| Follower | Passive node that replicates the leader's log |
| Candidate | A node campaigning to become leader during an election |
| Term | A logical clock value; time is divided into terms of arbitrary length |
| Log | A sequence of entries containing commands to apply to the state machine |

Node States

Each Raft node can be in one of three states:

stateDiagram-v2
    [*] --> Follower: Node starts

    Follower --> Candidate: Election timeout expires<br/>no valid RPC received
    Candidate --> Leader: Receives votes from majority
    Candidate --> Follower: Discovers current leader<br/>or higher term
    Leader --> Follower: Discovers higher term

    Follower --> Follower: Receives valid AppendEntries/RPC<br/>from leader or candidate

    note right of Follower
        - Responds to RPCs
        - No outgoing RPCs
        - Election timeout running
    end note

    note right of Candidate
        - Requesting votes
        - Election timeout running
        - Can become leader or follower
    end note

    note right of Leader
        - Handles all client requests
        - Sends heartbeats to followers
        - No timeout (active)
    end note

State Descriptions

Follower

  • Default state for all nodes
  • Passively receives entries from the leader
  • Responds to RPCs (RequestVote, AppendEntries)
  • If no communication for election timeout, becomes candidate

Candidate

  • Campaigning to become leader
  • Increments current term
  • Votes for itself
  • Sends RequestVote RPCs to all other nodes
  • Becomes leader if it receives votes from majority
  • Returns to follower if it discovers current leader or higher term

Leader

  • Handles all client requests
  • Sends AppendEntries RPCs to all followers (heartbeats)
  • Commits entries once replicated to majority
  • Steps down if it discovers a higher term

Terms

A term is Raft's logical time mechanism:

timeline
    title Raft Terms
    Term 1 : Leader A elected
           : Normal operation
           : Leader A crashes

    Term 2 : Election begins
           : Split vote!
           : Timeout, new election

    Term 3 : Leader B elected
           : Normal operation

Term Properties

  1. Monotonically Increasing: Terms always go up, never down
  2. Current Term: Each node stores the current term number
  3. Term Transitions:
    • Nodes increment term when becoming candidate
    • Nodes update term when receiving higher-term message
    • When term changes, node becomes follower (see the sketch below)
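
A minimal sketch of that last rule, which every RPC handler applies before any other processing (the TermState shape here is just for illustration):

// term-rules.ts: the "observe a higher term" rule (sketch)
interface TermState {
    currentTerm: number;
    state: 'follower' | 'candidate' | 'leader';
    votedFor: string | null;
}

function observeTerm(node: TermState, rpcTerm: number): void {
    if (rpcTerm > node.currentTerm) {
        node.currentTerm = rpcTerm;  // adopt the higher term
        node.state = 'follower';     // leaders and candidates step down
        node.votedFor = null;        // the vote resets in the new term
    }
}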

Term in Messages

sequenceDiagram
    participant C as Candidate
    participant F1 as Follower (term=3)
    participant F2 as Follower (term=6)

    C->>F1: RequestVote(term=5)
    Note over F1: Sees higher term
    F1-->>C: Vote YES (updates to term=5)

    C->>F2: RequestVote(term=5)
    Note over F2: Already at higher term
    F2-->>C: Vote NO (my term is higher)

Raft's Two-Phase Approach

Raft achieves consensus through two main phases:

Phase 1: Leader Election

sequenceDiagram
    autonumber
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant F3 as Follower 3

    Note over F1,F3: Election timeout expires

    F1->>F1: Becomes Candidate (term=1)
    F1->>F2: RequestVote(term=1)
    F1->>F3: RequestVote(term=1)

    F2-->>F1: Grant vote (term=1)
    F3-->>F1: Grant vote (term=1)

    Note over F1: Won majority!
    F1->>F1: Becomes Leader
    F1->>F2: AppendEntries (heartbeat)
    F1->>F3: AppendEntries (heartbeat)

Phase 2: Log Replication

sequenceDiagram
    autonumber
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    C->>L: SET x=5

    L->>L: Append to log (index=10, term=1)
    L->>F1: AppendEntries(entry: SET x=5)
    L->>F2: AppendEntries(entry: SET x=5)

    F1-->>L: Success (replicated)
    F2-->>L: Success (replicated)

    Note over L: Majority replicated!<br/>Commit entry

    L->>L: Apply to state machine: x=5
    L-->>C: Response: OK

Safety Properties

Raft guarantees several important safety properties:

1. Election Safety

At most one leader can be elected per term.

How: Each node votes at most once per term, and a candidate needs majority of votes.

graph TB
    subgraph "Same Term - Only One Leader"
        T[Term 5]
        C1[Candidate A: 2 votes]
        C2[Candidate B: 1 vote]
        C1 -->|wins majority| L[Leader A]
        style L fill:#90EE90
    end

2. Leader Append-Only

A leader never overwrites or deletes entries in its log; it only appends.

How: Leaders always append new entries to the end of their log.

3. Log Matching

If two logs contain an entry with the same index and term, then all preceding entries are identical.

graph LR
    subgraph "Leader's Log"
        L1[index 1, term 1: SET a=1]
        L2[index 2, term 1: SET b=2]
        L3[index 3, term 2: SET c=3]
        L1 --> L2 --> L3
    end

    subgraph "Follower's Log"
        F1[index 1, term 1: SET a=1]
        F2[index 2, term 1: SET b=2]
        F3[index 3, term 2: SET c=3]
        F4[index 4, term 2: SET d=4]
        F1 --> F2 --> F3 --> F4
    end

    Match[Entries 1-3 match!<br/>Follower may have extra]

4. Leader Completeness

If a log entry is committed in a given term, it will be present in the logs of all leaders for higher terms.

How: A candidate must have all committed entries before it can win an election.

5. State Machine Safety

If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.


Raft RPCs

Raft uses two main RPC types:

RequestVote RPC

interface RequestVoteArgs {
  term: number;           // Candidate's term
  candidateId: string;    // Candidate requesting vote
  lastLogIndex: number;   // Index of candidate's last log entry
  lastLogTerm: number;    // Term of candidate's last log entry
}

interface RequestVoteReply {
  term: number;           // Current term (for candidate to update)
  voteGranted: boolean;   // True if candidate received vote
}

Voting Rules:

  1. If term < currentTerm: deny the vote
  2. Otherwise grant the vote only if votedFor is null (or already candidateId) and the candidate's log is at least as up-to-date as the receiver's

AppendEntries RPC

interface AppendEntriesArgs {
  term: number;           // Leader's term
  leaderId: string;       // So follower can redirect clients
  prevLogIndex: number;   // Index of log entry preceding new ones
  prevLogTerm: number;    // Term of prevLogIndex entry
  entries: LogEntry[];    // Log entries to store (empty for heartbeat)
  leaderCommit: number;   // Leader's commit index
}

interface AppendEntriesReply {
  term: number;           // Current term (for leader to update)
  success: boolean;       // True if follower had entry matching prevLogIndex
}

Used for both:

  • Log replication: Sending new entries
  • Heartbeats: Empty entries to maintain authority (see the sketch below)
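
As a hedged sketch of the heartbeat side, reusing the AppendEntriesArgs and LogEntry shapes above; the send parameter stands in for whatever RPC transport the cluster uses:

// heartbeat.ts: a leader's periodic empty AppendEntries (sketch)
function startHeartbeats(
    leader: { id: string; currentTerm: number; commitIndex: number; log: LogEntry[] },
    peers: string[],
    send: (peer: string, args: AppendEntriesArgs) => void,
    intervalMs = 50  // must be much shorter than the election timeout
): NodeJS.Timeout {
    return setInterval(() => {
        const lastIndex = leader.log.length - 1;
        for (const peer of peers) {
            send(peer, {
                term: leader.currentTerm,
                leaderId: leader.id,
                prevLogIndex: lastIndex,
                prevLogTerm: leader.log[lastIndex]?.term ?? 0,
                entries: [],  // empty entries = heartbeat
                leaderCommit: leader.commitIndex,
            });
        }
    }, intervalMs);
}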

Log Completeness Property

When voting, nodes compare log completeness:

graph TB
    subgraph "Comparing Logs"
        A[Candidate Log]
        B[Follower Log]

        A --> A1[Last index: 10, term: 5]
        B --> B1[Last index: 9, term: 5]

        Result[A's log is more up-to-date<br/>because index 10 > 9]
    end

    subgraph "Tie-Breaking Rule"
        C[Candidate: last term=5]
        D[Follower: last term=6]

        Result2[Follower is more up-to-date<br/>because term 6 > 5]
    end

Up-to-date comparison:

  1. Compare the term of the last entries
  2. If terms differ, the log with the higher term is more up-to-date
  3. If terms are same, the log with the longer length is more up-to-date

Randomized Election Timeouts

Raft uses randomized election timeouts to prevent split votes:

timeline
    title Randomized Timeouts Prevent Split Votes

    Node1 : 150ms timeout
    Node2 : 300ms timeout
    Node3 : 200ms timeout

    Node1 : Timeout! Becomes candidate
    Node1 : Wins election before Node2/3 timeout
    Node2 & Node3 : Receive heartbeat, reset timeouts

Without randomization: All followers timeout simultaneously → all become candidates → split vote → no leader elected.

With randomization: Only one follower times out first → becomes candidate → likely to win election.
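
A hedged sketch of picking and arming such a timeout (the 150-300ms defaults mirror the range used in the examples above):

// election-timeout.ts: randomized election timer (sketch)
function randomElectionTimeout(minMs = 150, maxMs = 300): number {
    return minMs + Math.random() * (maxMs - minMs);
}

function armElectionTimer(onTimeout: () => void): NodeJS.Timeout {
    // A fresh random value on every arm/re-arm prevents repeated simultaneous timeouts
    return setTimeout(onTimeout, randomElectionTimeout());
}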


TypeScript Implementation Structure

// Type definitions for Raft
type NodeState = 'follower' | 'candidate' | 'leader';

interface LogEntry {
  index: number;
  term: number;
  command: { key: string; value: any };
}

interface RaftNode {
  // Persistent state
  currentTerm: number;
  votedFor: string | null;
  log: LogEntry[];

  // Volatile state
  commitIndex: number;
  lastApplied: number;
  state: NodeState;

  // Leader-only volatile state
  nextIndex: number[];
  matchIndex: number[];
}

class RaftNodeImpl implements RaftNode {
  currentTerm: number = 0;
  votedFor: string | null = null;
  log: LogEntry[] = [];
  commitIndex: number = 0;
  lastApplied: number = 0;
  state: NodeState = 'follower';
  nextIndex: number[] = [];
  matchIndex: number[] = [];

  // Handle RequestVote RPC
  requestVote(args: RequestVoteArgs): RequestVoteReply {
    if (args.term > this.currentTerm) {
      this.currentTerm = args.term;
      this.state = 'follower';
      this.votedFor = null;
    }

    const logOk = this.isLogAtLeastAsUpToDate(args.lastLogIndex, args.lastLogTerm);
    const voteOk = (this.votedFor === null || this.votedFor === args.candidateId);

    if (args.term === this.currentTerm && voteOk && logOk) {
      this.votedFor = args.candidateId;
      return { term: this.currentTerm, voteGranted: true };
    }

    return { term: this.currentTerm, voteGranted: false };
  }

  // Handle AppendEntries RPC
  appendEntries(args: AppendEntriesArgs): AppendEntriesReply {
    if (args.term > this.currentTerm) {
      this.currentTerm = args.term;
      this.state = 'follower';
      this.votedFor = null;  // the vote resets with the new term
    }

    if (args.term !== this.currentTerm) {
      return { term: this.currentTerm, success: false };
    }

    // Check if log has entry at prevLogIndex with prevLogTerm
    if (this.log[args.prevLogIndex]?.term !== args.prevLogTerm) {
      return { term: this.currentTerm, success: false };
    }

    // Append new entries, truncating the log on conflict
    for (const entry of args.entries) {
      if (this.log[entry.index] && this.log[entry.index].term !== entry.term) {
        // Conflict: delete the existing entry and everything after it
        this.log = this.log.slice(0, entry.index);
      }
      if (this.log.length <= entry.index) {
        this.log.push(entry);
      }
    }

    // Update commit index
    if (args.leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(args.leaderCommit, this.log.length - 1);
    }

    return { term: this.currentTerm, success: true };
  }

  private isLogAtLeastAsUpToDate(lastLogIndex: number, lastLogTerm: number): boolean {
    const myLastEntry = this.log[this.log.length - 1];
    const myLastTerm = myLastEntry?.term ?? 0;
    const myLastIndex = this.log.length - 1;

    if (lastLogTerm !== myLastTerm) {
      return lastLogTerm > myLastTerm;
    }
    return lastLogIndex >= myLastIndex;
  }
}

Python Implementation Structure

from dataclasses import dataclass, field
from typing import Optional, List
from enum import Enum

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    index: int
    term: int
    command: dict

@dataclass
class RequestVoteArgs:
    term: int
    candidate_id: str
    last_log_index: int
    last_log_term: int

@dataclass
class RequestVoteReply:
    term: int
    vote_granted: bool

@dataclass
class AppendEntriesArgs:
    term: int
    leader_id: str
    prev_log_index: int
    prev_log_term: int
    entries: List[LogEntry]
    leader_commit: int

@dataclass
class AppendEntriesReply:
    term: int
    success: bool

class RaftNode:
    def __init__(self, node_id: str, peers: List[str]):
        # Persistent state
        self.current_term: int = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = []

        # Volatile state
        self.commit_index: int = 0
        self.last_applied: int = 0
        self.state: NodeState = NodeState.FOLLOWER

        # Leader-only state
        self.next_index: dict[str, int] = {}
        self.match_index: dict[str, int] = {}

        self.node_id = node_id
        self.peers = peers

    def request_vote(self, args: RequestVoteArgs) -> RequestVoteReply:
        """Handle RequestVote RPC."""
        if args.term > self.current_term:
            self.current_term = args.term
            self.state = NodeState.FOLLOWER
            self.voted_for = None

        log_ok = self._is_log_at_least_as_up_to_date(
            args.last_log_index, args.last_log_term
        )
        vote_ok = (self.voted_for is None or self.voted_for == args.candidate_id)

        if args.term == self.current_term and vote_ok and log_ok:
            self.voted_for = args.candidate_id
            return RequestVoteReply(self.current_term, True)

        return RequestVoteReply(self.current_term, False)

    def append_entries(self, args: AppendEntriesArgs) -> AppendEntriesReply:
        """Handle AppendEntries RPC."""
        if args.term > self.current_term:
            self.current_term = args.term
            self.state = NodeState.FOLLOWER
            self.voted_for = None  # the vote resets with the new term

        if args.term != self.current_term:
            return AppendEntriesReply(self.current_term, False)

        # Check if log has entry at prev_log_index with prev_log_term
        if len(self.log) <= args.prev_log_index:
            return AppendEntriesReply(self.current_term, False)

        if self.log[args.prev_log_index].term != args.prev_log_term:
            return AppendEntriesReply(self.current_term, False)

        # Append new entries
        for entry in args.entries:
            if len(self.log) > entry.index:
                if self.log[entry.index].term != entry.term:
                    # Conflict: delete from this point
                    self.log = self.log[:entry.index]
            if len(self.log) <= entry.index:
                self.log.append(entry)

        # Update commit index
        if args.leader_commit > self.commit_index:
            self.commit_index = min(args.leader_commit, len(self.log) - 1)

        return AppendEntriesReply(self.current_term, True)

    def _is_log_at_least_as_up_to_date(self, last_index: int, last_term: int) -> bool:
        """Check if candidate's log is at least as up-to-date as ours."""
        if not self.log:
            return True

        my_last_entry = self.log[-1]
        my_last_term = my_last_entry.term
        my_last_index = len(self.log) - 1

        if last_term != my_last_term:
            return last_term > my_last_term
        return last_index >= my_last_index

Summary

Key Takeaways

  1. Raft was designed for understandability, separating consensus into clear subproblems
  2. Three node states: Follower → Candidate → Leader
  3. Terms provide a logical clock and prevent stale leaders
  4. Two main RPCs: RequestVote (election) and AppendEntries (replication + heartbeat)
  5. Randomized timeouts prevent split votes during elections
  6. Five safety properties guarantee correctness: election safety, append-only, log matching, leader completeness, and state machine safety

Next Session

In the next session, we'll dive into Leader Election:

  • How elections are triggered
  • The election algorithm in detail
  • Handling split votes
  • Leader election examples

Exercises

  1. State Transitions: Draw the state transition diagram for a node that starts as follower, becomes candidate, wins election as leader, then discovers a higher term.

  2. Term Logic: If a node receives an AppendEntries with term=7 but its current term is 9, what should it do?

  3. Log Comparison: Compare these two logs and determine which is more up-to-date:

    • Log A: last index=15, last term=5
    • Log B: last index=12, last term=7
  4. Split Vote: Describe a scenario where a split vote could occur, and how Raft prevents it from persisting.

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

Raft Leader Election

Session 9, Part 2 - 45 minutes

Learning Objectives

  • Understand how Raft elects a leader democratically
  • Implement the RequestVote RPC
  • Handle election timeouts and randomized intervals
  • Prevent split votes with election safety
  • Build a working leader election system

Concept: Democratic Leader Election

In the previous chapter, we learned about Raft's design philosophy. Now let's dive into the leader election mechanism — the democratic process by which nodes agree on who should lead.

Why Do We Need a Leader?

Without a Leader:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node A  │     │ Node B  │     │ Node C  │
│         │     │         │     │         │
│ "I'm    │     │ "No,    │     │ "Both   │
│ leader!"│     │ I am!"  │     │ wrong!" │
└─────────┘     └─────────┘     └─────────┘
    Chaos!      Split brain!     Confusion!

With Raft Leader Election:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node A  │     │ Node B  │     │ Node C  │
│         │     │         │     │         │
│ "I      │     │ "I      │     │ "I vote │
│ vote    │────>│ vote    │────>│ for     │
│ for B"  │     │ for B"  │     │ B"      │
└────┬────┘     └────┬────┘     └────┬────┘
     │               │               │
     └───────────────┴───────────────┘
                     │
                     ▼
              ┌──────────┐
              │ Node B   │
              │ = LEADER │
              └──────────┘

Key Insight: Nodes vote for each other. The node with majority votes becomes leader.


State Transitions During Election

Raft nodes cycle through three states during leader election:

stateDiagram-v2
    [*] --> Follower: Start

    Follower --> Candidate: Election timeout
    Follower --> Follower: Receive valid AppendEntries
    Follower --> Follower: Discover higher term

    Candidate --> Leader: Receive majority votes
    Candidate --> Candidate: Split vote (timeout)
    Candidate --> Follower: Discover higher term
    Candidate --> Follower: Receive valid AppendEntries

    Leader --> Follower: Discover higher term

    note right of Follower
        - Votes for at most one candidate per term
        - Resets election timeout on heartbeat
    end note

    note right of Candidate
        - Increments current term
        - Votes for self
        - Sends RequestVote to all nodes
        - Randomized timeout prevents deadlock
    end note

    note right of Leader
        - Sends heartbeats (empty AppendEntries)
        - Handles client requests
        - Replicates log entries
    end note

The Election Algorithm Step by Step

Step 1: Follower Timeout

When a follower doesn't hear from a leader within the election timeout:

Time ────────────────────────────────────────────────────────>

Node A: [waiting...] [waiting...] ⏱️ TIMEOUT! → Become Candidate
Node B: [waiting...] [waiting...] [waiting...]
Node C: [waiting...] [waiting...] [waiting...]

Step 2: Become Candidate

The node transitions to candidate state:

sequenceDiagram
    participant C as Candidate (Node A)
    participant A as All Nodes

    C->>C: Increment term (e.g., term = 4)
    C->>C: Vote for self
    C->>A: Send RequestVote(term=4) to all

    Note over C: Wait for votes...

    par Each follower processes RequestVote
        A->>A: If term < currentTerm: reject
        A->>A: If votedFor != null: reject
        A->>A: If candidate log is up-to-date: grant vote
    end

    A-->>C: Send vote response

    alt Majority votes received
        C->>C: Become LEADER
    else Split vote
        C->>C: Wait for timeout, then retry
    end

Step 3: RequestVote RPC

The RequestVote RPC is the ballot paper in Raft's election:

graph LR
    subgraph RequestVote RPC
        C[term] --> D["Candidate's term"]
        E[candidateId] --> F["Node requesting vote"]
        G[lastLogIndex] --> H["Index of candidate's last log entry"]
        I[lastLogTerm] --> J["Term of candidate's last log entry"]
    end

    subgraph Response
        K[term] --> L["Current term (for candidate to update)"]
        M[voteGranted] --> N["true if follower voted"]
    end

Voting Rule: A follower grants its vote only if all of the following hold:

  1. The candidate's term is at least the follower's currentTerm (a higher term first updates the follower's term), AND
  2. The follower hasn't voted this term, or already voted for this same candidate, AND
  3. The candidate's log is at least as up-to-date as the follower's
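
In code, the whole check condenses to a small predicate. A minimal sketch (field names follow the RequestVoteArgs type defined later in this chapter; the caller is responsible for first bumping currentTerm and clearing votedFor when it sees a higher term):

function shouldGrantVote(
  args: { term: number; candidateId: string; lastLogIndex: number; lastLogTerm: number },
  currentTerm: number,
  votedFor: string | null,
  lastLogIndex: number,  // follower's last log index (0 if log is empty)
  lastLogTerm: number    // follower's last log term (0 if log is empty)
): boolean {
  if (args.term < currentTerm) return false;  // stale candidate
  if (votedFor !== null && votedFor !== args.candidateId) return false;  // one vote per term
  // "at least as up-to-date": higher last term wins; same term, longer log wins
  return (
    args.lastLogTerm > lastLogTerm ||
    (args.lastLogTerm === lastLogTerm && args.lastLogIndex >= lastLogIndex)
  );
}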

Randomized Election Timeouts

The Split Vote Problem

Without randomization, simultaneous elections cause deadlocks:

Bad: Fixed timeouts cause repeated split votes
Node A: timeout at T=100 → Candidate, gets 1 vote
Node B: timeout at T=100 → Candidate, gets 1 vote
Node C: timeout at T=100 → Candidate, gets 1 vote

Result: Nobody wins! Election timeout...
Same thing repeats forever!

Solution: Randomized Intervals

Each node picks a random timeout within a range:

gantt
    title Election Timeouts (Randomized: 150-300ms)
    dateFormat X
    axisFormat %L

    Node A :a1, 0, 180
    Node B :b1, 0, 220
    Node C :c1, 0, 160

    Node C becomes Candidate :milestone, m1, 160, 0s

Node C times out first (at 160ms) and starts an election. Nodes A and B reset their own timers when they grant Node C their votes, so Node A's 180ms timeout never fires and Node C gathers a majority uncontested.

Rough intuition: two candidacies collide only when two timeouts expire within about one network round-trip of each other, before either node's RequestVote can reach the others and reset their timers. The wider the randomization range relative to that round-trip time, the rarer split votes become; with a 150-300ms range on a low-latency network they are uncommon, and any collision is resolved by a retry with fresh random timeouts. The sketch below estimates the collision probability by simulation.
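
A minimal Monte Carlo sketch (an illustration, not part of the course code; it assumes a collision happens when the two smallest timeouts land within one round-trip of each other):

// split-vote-sim.ts (hypothetical file, illustration only)
function estimateSplitVoteProbability(
  nodes: number,
  minMs = 150,
  maxMs = 300,
  roundTripMs = 5,     // assumed network round-trip time
  trials = 100_000
): number {
  let collisions = 0;
  for (let t = 0; t < trials; t++) {
    const timeouts = Array.from(
      { length: nodes },
      () => minMs + Math.random() * (maxMs - minMs)
    ).sort((a, b) => a - b);
    // The first node to time out wins unless a second one fires before
    // its RequestVote can arrive and reset the other timers.
    if (timeouts[1] - timeouts[0] < roundTripMs) collisions++;
  }
  return collisions / trials;
}

console.log(estimateSplitVoteProbability(3)); // ≈ 0.065 with these numbers
console.log(estimateSplitVoteProbability(5)); // ≈ 0.13: more nodes, more near-ties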

TypeScript Implementation

Let's build a working Raft leader election system:

Core Types

// types/raft.ts

export type NodeState = 'follower' | 'candidate' | 'leader';

export interface LogEntry {
  index: number;
  term: number;
  command: unknown;
}

export interface RaftNodeConfig {
  id: string;
  peers: string[];  // List of peer node IDs
  electionTimeoutMin: number;  // Minimum timeout in ms
  electionTimeoutMax: number;  // Maximum timeout in ms
}

export interface RequestVoteArgs {
  term: number;
  candidateId: string;
  lastLogIndex: number;
  lastLogTerm: number;
}

export interface RequestVoteReply {
  term: number;
  voteGranted: boolean;
}

export interface AppendEntriesArgs {
  term: number;
  leaderId: string;
  prevLogIndex: number;
  prevLogTerm: number;
  entries: LogEntry[];
  leaderCommit: number;
}

export interface AppendEntriesReply {
  term: number;
  success: boolean;
}

Raft Node Implementation

// raft-node.ts

import { RaftNodeConfig, NodeState, LogEntry, RequestVoteArgs, RequestVoteReply } from './types/raft';

export class RaftNode {
  private state: NodeState = 'follower';
  private currentTerm: number = 0;
  private votedFor: string | null = null;
  private log: LogEntry[] = [];

  // Election timeout
  private electionTimer: NodeJS.Timeout | null = null;
  private lastHeartbeat: number = Date.now();

  // Leader-only state
  private leaderId: string | null = null;

  constructor(private config: RaftNodeConfig) {
    this.startElectionTimer();
  }

  // ========== Public API ==========

  getState(): NodeState {
    return this.state;
  }

  getCurrentTerm(): number {
    return this.currentTerm;
  }

  getLeader(): string | null {
    return this.leaderId;
  }

  // ========== RPC Handlers ==========

  /**
   * Invoked by candidates to gather votes
   */
  requestVote(args: RequestVoteArgs): RequestVoteReply {
    const reply: RequestVoteReply = {
      term: this.currentTerm,
      voteGranted: false
    };

    // Rule 1: If candidate's term is lower, reject
    if (args.term < this.currentTerm) {
      return reply;
    }

    // Rule 2: If candidate's term is higher, update and become follower
    if (args.term > this.currentTerm) {
      this.becomeFollower(args.term);
      reply.term = args.term;
    }

    // Rule 3: If we already voted for someone else this term, reject
    if (this.votedFor !== null && this.votedFor !== args.candidateId) {
      return reply;
    }

    // Rule 4: Check if candidate's log is at least as up-to-date as ours
    const lastEntry = this.log.length > 0 ? this.log[this.log.length - 1] : null;
    const lastLogIndex = lastEntry ? lastEntry.index : 0;
    const lastLogTerm = lastEntry ? lastEntry.term : 0;

    const logIsUpToDate =
      (args.lastLogTerm > lastLogTerm) ||
      (args.lastLogTerm === lastLogTerm && args.lastLogIndex >= lastLogIndex);

    if (!logIsUpToDate) {
      return reply;
    }

    // Grant vote
    this.votedFor = args.candidateId;
    reply.voteGranted = true;
    this.resetElectionTimer();

    console.log(`Node ${this.config.id} voted for ${args.candidateId} in term ${args.term}`);
    return reply;
  }

  /**
   * Invoked by leader to assert authority (heartbeat or log replication)
   */
  receiveHeartbeat(term: number, leaderId: string): void {
    if (term >= this.currentTerm) {
      if (term > this.currentTerm) {
        this.becomeFollower(term);
      } else if (this.state !== 'follower') {
        // A valid leader exists at our term: a candidate steps down
        this.state = 'follower';
      }
      this.leaderId = leaderId;
      this.resetElectionTimer();
    }
  }

  // ========== State Transitions ==========

  private becomeFollower(term: number): void {
    this.state = 'follower';
    this.currentTerm = term;
    this.votedFor = null;
    this.leaderId = null;
    this.resetElectionTimer();
    console.log(`Node ${this.config.id} became follower in term ${term}`);
  }

  private becomeCandidate(): void {
    this.state = 'candidate';
    this.currentTerm += 1;
    this.votedFor = this.config.id;
    this.leaderId = null;

    console.log(`Node ${this.config.id} became candidate in term ${this.currentTerm}`);

    // Start election
    this.startElection();
  }

  private becomeLeader(): void {
    this.state = 'leader';
    this.leaderId = this.config.id;

    console.log(`Node ${this.config.id} became LEADER in term ${this.currentTerm}`);

    // Start sending heartbeats
    this.startHeartbeats();
  }

  // ========== Election Logic ==========

  private startElectionTimer(): void {
    if (this.electionTimer) {
      clearTimeout(this.electionTimer);
    }

    const timeout = this.getRandomElectionTimeout();

    this.electionTimer = setTimeout(() => {
      // Followers start an election on timeout; candidates retry after a
      // split vote. Only an active leader ignores this timer.
      if (this.state !== 'leader') {
        console.log(`Node ${this.config.id} election timeout`);
        this.becomeCandidate();
        this.startElectionTimer(); // re-arm with a fresh random timeout
      }
    }, timeout);
  }

  private resetElectionTimer(): void {
    this.startElectionTimer();
  }

  private getRandomElectionTimeout(): number {
    const { electionTimeoutMin, electionTimeoutMax } = this.config;
    return Math.floor(
      Math.random() * (electionTimeoutMax - electionTimeoutMin + 1)
    ) + electionTimeoutMin;
  }

  private async startElection(): Promise<void> {
    const args: RequestVoteArgs = {
      term: this.currentTerm,
      candidateId: this.config.id,
      lastLogIndex: this.log.length > 0 ? this.log[this.log.length - 1].index : 0,
      lastLogTerm: this.log.length > 0 ? this.log[this.log.length - 1].term : 0
    };

    let votesReceived = 1; // Vote for self
    // Majority of the full cluster (peers + this node)
    const majority = Math.floor((this.config.peers.length + 1) / 2) + 1;

    // Send RequestVote to all peers
    const promises = this.config.peers.map(peerId =>
      this.sendRequestVote(peerId, args)
    );

    const responses = await Promise.allSettled(promises);

    // Count votes
    for (const result of responses) {
      if (result.status === 'fulfilled' && result.value.voteGranted) {
        votesReceived++;
      }
    }

    // Check if we won
    if (votesReceived >= majority && this.state === 'candidate') {
      this.becomeLeader();
    }
  }

  // ========== Network Simulation ==========

  private async sendRequestVote(
    peerId: string,
    args: RequestVoteArgs
  ): Promise<RequestVoteReply> {
    // In a real implementation, this would be an HTTP/gRPC call
    // For this example, we simulate by calling directly
    // In the full example below, we'll use actual HTTP

    return {
      term: 0,
      voteGranted: false
    };
  }

  private startHeartbeats(): void {
    // Leader sends periodic heartbeats
    // Implementation in full example
  }

  stop(): void {
    if (this.electionTimer) {
      clearTimeout(this.electionTimer);
    }
  }
}
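
The heartbeat loop itself is deferred to the full example. As a hedged sketch of its shape (assuming the axios HTTP client and the /raft/heartbeat route defined in server.ts below; the free-function form is for illustration, in the real node this logic lives inside startHeartbeats):

// heartbeat-sketch.ts (illustration only, not part of the course files)
import axios from 'axios';

export function startHeartbeatLoop(
  isLeader: () => boolean,
  getTerm: () => number,
  selfId: string,
  peers: string[],   // e.g. ["node2:3002", "node3:3003"]
  intervalMs = 50    // well below electionTimeoutMin (150ms)
): NodeJS.Timeout {
  const timer = setInterval(() => {
    if (!isLeader()) {
      clearInterval(timer);  // stepped down: stop heartbeating
      return;
    }
    for (const peer of peers) {
      axios
        .post(`http://${peer}/raft/heartbeat`, { term: getTerm(), leaderId: selfId })
        .catch(() => { /* unreachable peer; its own election timeout handles it */ });
    }
  }, intervalMs);
  return timer;
}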

HTTP Server with Raft

// server.ts

import express, { Request, Response } from 'express';
import { RaftNode } from './raft-node';
import { RequestVoteArgs, RequestVoteReply } from './types';

export class RaftServer {
  private app: express.Application;
  private node: RaftNode;
  private server: any;

  constructor(
    private nodeId: string,
    private port: number,
    peers: string[]
  ) {
    this.app = express();
    this.app.use(express.json());

    this.node = new RaftNode({
      id: nodeId,
      peers: peers,
      electionTimeoutMin: 150,
      electionTimeoutMax: 300
    });

    this.setupRoutes();
  }

  private setupRoutes(): void {
    // RequestVote RPC endpoint
    this.app.post('/raft/request-vote', (req: Request, res: Response) => {
      const args: RequestVoteArgs = req.body;
      const reply: RequestVoteReply = this.node.requestVote(args);
      res.json(reply);
    });

    // Heartbeat endpoint
    this.app.post('/raft/heartbeat', (req: Request, res: Response) => {
      const { term, leaderId } = req.body;
      this.node.receiveHeartbeat(term, leaderId);
      res.json({ success: true });
    });

    // Status endpoint
    this.app.get('/status', (req: Request, res: Response) => {
      res.json({
        id: this.nodeId,
        state: this.node.getState(),
        term: this.node.getCurrentTerm(),
        leader: this.node.getLeader()
      });
    });
  }

  async start(): Promise<void> {
    this.server = this.app.listen(this.port, () => {
      console.log(`Node ${this.nodeId} listening on port ${this.port}`);
    });
  }

  stop(): void {
    this.node.stop();
    if (this.server) {
      this.server.close();
    }
  }

  getNode(): RaftNode {
    return this.node;
  }
}

Python Implementation

The same logic in Python:

# raft_node.py

import asyncio
import random
from dataclasses import dataclass
from enum import Enum
from typing import Optional, List
from datetime import datetime, timedelta

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    index: int
    term: int
    command: dict

@dataclass
class RequestVoteArgs:
    term: int
    candidate_id: str
    last_log_index: int
    last_log_term: int

@dataclass
class RequestVoteReply:
    term: int
    vote_granted: bool

class RaftNode:
    def __init__(self, node_id: str, peers: List[str],
                 election_timeout_min: int = 150,
                 election_timeout_max: int = 300):
        self.node_id = node_id
        self.peers = peers

        # Persistent state
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = []

        # Volatile state
        self.state = NodeState.FOLLOWER
        self.leader_id: Optional[str] = None

        # Election timeout
        self.election_timeout_min = election_timeout_min
        self.election_timeout_max = election_timeout_max
        self.election_task: Optional[asyncio.Task] = None
        self.heartbeat_task: Optional[asyncio.Task] = None

        # Start election timer
        self.start_election_timer()

    async def request_vote(self, args: RequestVoteArgs) -> RequestVoteReply:
        """Handle RequestVote RPC from candidate"""
        reply = RequestVoteReply(
            term=self.current_term,
            vote_granted=False
        )

        # Rule 1: Reject if candidate's term is lower
        if args.term < self.current_term:
            return reply

        # Rule 2: Update to higher term and become follower
        if args.term > self.current_term:
            await self.become_follower(args.term)
            reply.term = args.term

        # Rule 3: Reject if already voted for another candidate
        if self.voted_for is not None and self.voted_for != args.candidate_id:
            return reply

        # Rule 4: Check if candidate's log is up-to-date
        last_entry = self.log[-1] if self.log else None
        last_log_index = last_entry.index if last_entry else 0
        last_log_term = last_entry.term if last_entry else 0

        log_is_up_to_date = (
            args.last_log_term > last_log_term or
            (args.last_log_term == last_log_term and
             args.last_log_index >= last_log_index)
        )

        if not log_is_up_to_date:
            return reply

        # Grant vote
        self.voted_for = args.candidate_id
        reply.vote_granted = True
        self.reset_election_timer()

        print(f"Node {self.node_id} voted for {args.candidate_id} in term {args.term}")
        return reply

    async def receive_heartbeat(self, term: int, leader_id: str):
        """Handle heartbeat from leader"""
        if term >= self.current_term:
            if term > self.current_term:
                await self.become_follower(term)
            elif self.state != NodeState.FOLLOWER:
                # A valid leader exists at our term: a candidate steps down
                self.state = NodeState.FOLLOWER
            self.leader_id = leader_id
            self.reset_election_timer()

    async def become_follower(self, term: int):
        """Transition to follower state"""
        self.state = NodeState.FOLLOWER
        self.current_term = term
        self.voted_for = None
        self.leader_id = None

        # Cancel heartbeat task if running
        if self.heartbeat_task:
            self.heartbeat_task.cancel()
            self.heartbeat_task = None

        self.reset_election_timer()
        print(f"Node {self.node_id} became follower in term {term}")

    async def become_candidate(self):
        """Transition to candidate state and start election"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        self.leader_id = None

        print(f"Node {self.node_id} became candidate in term {self.current_term}")
        await self.start_election()

    async def become_leader(self):
        """Transition to leader state"""
        self.state = NodeState.LEADER
        self.leader_id = self.node_id

        print(f"Node {self.node_id} became LEADER in term {self.current_term}")
        self.start_heartbeats()

    def start_election_timer(self):
        """Start or reset the election timeout timer.

        Note: asyncio.create_task requires a running event loop, so a
        RaftNode must be constructed from within async code.
        """
        if self.election_task:
            self.election_task.cancel()

        timeout = self.get_random_election_timeout()
        self.election_task = asyncio.create_task(self.election_timeout(timeout))

    def reset_election_timer(self):
        """Reset the election timeout timer"""
        self.start_election_timer()

    def get_random_election_timeout(self) -> int:
        """Get random timeout within configured range"""
        return random.randint(
            self.election_timeout_min,
            self.election_timeout_max
        )

    async def election_timeout(self, timeout_ms: int):
        """Wait for timeout, then start an election unless we are leader"""
        try:
            await asyncio.sleep(timeout_ms / 1000)
            # Followers start an election; candidates retry after a split
            # vote. Only an active leader ignores this timer.
            if self.state != NodeState.LEADER:
                print(f"Node {self.node_id} election timeout")
                await self.become_candidate()
                # Re-arm so a failed election is retried with a fresh timeout
                self.election_task = asyncio.create_task(
                    self.election_timeout(self.get_random_election_timeout()))
        except asyncio.CancelledError:
            pass  # Timer was reset

    async def start_election(self):
        """Start leader election by sending RequestVote to all peers"""
        args = RequestVoteArgs(
            term=self.current_term,
            candidate_id=self.node_id,
            last_log_index=self.log[-1].index if self.log else 0,
            last_log_term=self.log[-1].term if self.log else 0
        )

        votes_received = 1  # Vote for self
        # Majority of the full cluster (peers + self)
        majority = (len(self.peers) + 1) // 2 + 1

        # Send RequestVote to all peers concurrently
        tasks = [
            self.send_request_vote(peer_id, args)
            for peer_id in self.peers
        ]

        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Count votes
        for result in results:
            if isinstance(result, RequestVoteReply) and result.vote_granted:
                votes_received += 1

        # Check if we won the election
        if votes_received >= majority and self.state == NodeState.CANDIDATE:
            await self.become_leader()

    async def send_request_vote(self, peer_id: str, args: RequestVoteArgs) -> RequestVoteReply:
        """Send RequestVote RPC to peer (simulated)"""
        # In real implementation, use HTTP/aiohttp
        # For this example, return mock response
        return RequestVoteReply(term=0, vote_granted=False)

    def start_heartbeats(self):
        """Leader sends periodic heartbeats"""
        if self.heartbeat_task:
            self.heartbeat_task.cancel()

        self.heartbeat_task = asyncio.create_task(self.send_heartbeats())

    async def send_heartbeats(self):
        """Send empty AppendEntries (heartbeats) to all followers"""
        while self.state == NodeState.LEADER:
            for peer_id in self.peers:
                # In a real implementation, send an HTTP POST to each peer
                pass
            await asyncio.sleep(0.05)  # Heartbeat interval: 50ms per round, not per peer

    def stop(self):
        """Stop the node"""
        if self.election_task:
            self.election_task.cancel()
        if self.heartbeat_task:
            self.heartbeat_task.cancel()

Flask Server with Raft

# server.py

from flask import Flask, request, jsonify
from raft_node import RaftNode, RequestVoteArgs
import asyncio

app = Flask(__name__)

class RaftServer:
    def __init__(self, node_id: str, port: int, peers: list):
        self.node_id = node_id
        self.port = port
        self.node = RaftNode(node_id, peers)
        self.app = app
        self.setup_routes()

    def setup_routes(self):
        @self.app.route('/raft/request-vote', methods=['POST'])
        def request_vote():
            args = RequestVoteArgs(**request.json)
            # Simplification: asyncio.run creates a fresh event loop per
            # request; a production setup would use an async framework
            # (e.g. FastAPI or aiohttp) sharing one loop with the node's timers.
            reply = asyncio.run(self.node.request_vote(args))
            return jsonify({
                'term': reply.term,
                'voteGranted': reply.vote_granted
            })

        @self.app.route('/raft/heartbeat', methods=['POST'])
        def heartbeat():
            data = request.json
            asyncio.run(self.node.receive_heartbeat(
                data['term'], data['leaderId']
            ))
            return jsonify({'success': True})

        @self.app.route('/status', methods=['GET'])
        def status():
            return jsonify({
                'id': self.node_id,
                'state': self.node.state.value,
                'term': self.node.current_term,
                'leader': self.node.leader_id
            })

    def run(self):
        self.app.run(port=self.port, debug=False)

Docker Compose Setup

Let's deploy a 3-node Raft cluster:

# docker-compose.yml

version: '3.8'

services:
  node1:
    build:
      context: ./examples/04-consensus
      dockerfile: Dockerfile.typescript
    container_name: raft-node1
    environment:
      - NODE_ID=node1
      - PORT=3001
      - PEERS=node2:3002,node3:3003
    ports:
      - "3001:3001"
    networks:
      - raft-network

  node2:
    build:
      context: ./examples/04-consensus
      dockerfile: Dockerfile.typescript
    container_name: raft-node2
    environment:
      - NODE_ID=node2
      - PORT=3002
      - PEERS=node1:3001,node3:3003
    ports:
      - "3002:3002"
    networks:
      - raft-network

  node3:
    build:
      context: ./examples/04-consensus
      dockerfile: Dockerfile.typescript
    container_name: raft-node3
    environment:
      - NODE_ID=node3
      - PORT=3003
      - PEERS=node1:3001,node2:3002
    ports:
      - "3003:3003"
    networks:
      - raft-network

networks:
  raft-network:
    driver: bridge

Running the Example

TypeScript Version

# 1. Build and start the cluster
cd distributed-systems-course/examples/04-consensus
docker-compose up

# 2. Watch the election happen in the logs
# You'll see nodes transition from follower → candidate → leader

# 3. Check the status of each node
curl http://localhost:3001/status
curl http://localhost:3002/status
curl http://localhost:3003/status

# 4. Kill the leader and watch re-election
docker-compose stop node1  # If node1 was leader
# Watch the logs to see a new leader elected!

# 5. Clean up
docker-compose down
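
The /status responses follow the shape of the status handler shown earlier; a healthy cluster reports exactly one leader. For example (values vary per run):

{"id":"node2","state":"leader","term":1,"leader":"node2"}
{"id":"node1","state":"follower","term":1,"leader":"node2"}
{"id":"node3","state":"follower","term":1,"leader":"node2"}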

Python Version

# 1. Build and start the cluster
cd distributed-systems-course/examples/04-consensus
docker-compose -f docker-compose.python.yml up

# 2-5. Same as above, using ports 4001-4003 for Python nodes

Exercises

Exercise 1: Observe Election Safety

Run the cluster and answer these questions:

  1. How long does it take for a leader to be elected?
  2. What happens if you start nodes at different times?
  3. Can you observe a split vote scenario? (Hint: cause a network partition)

Exercise 2: Implement Pre-Vote

Pre-vote is an optimization that prevents disrupting a stable leader:

  1. Research the pre-vote mechanism
  2. Modify the RequestVote handler to check if leader is alive first
  3. Test that pre-vote prevents unnecessary elections

Exercise 3: Election Timeout Tuning

Experiment with different timeout ranges:

  1. Try 50-100ms: What happens? (Hint: too many elections)
  2. Try 500-1000ms: What happens? (Hint: slow leader failover)
  3. Find the optimal range for a 3-node cluster

Exercise 4: Network Partition Simulation

Simulate a network partition:

  1. Start the cluster and wait for leader election
  2. Isolate node1 from the network (using Docker network isolation; see the sketch below)
  3. Observe: Does node1 think it's still leader?
  4. Reconnect: Does the cluster recover correctly?
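
A hedged sketch of steps 2-4, assuming Docker Compose named the bridge network 04-consensus_raft-network (verify the actual name with docker network ls):

# Isolate node1 (the network name is an assumption; check `docker network ls`)
docker network disconnect 04-consensus_raft-network raft-node1

# node1's published port stops answering once detached, so watch its logs:
docker logs -f raft-node1   # does node1 still believe it is leader?

# Heal the partition
docker network connect 04-consensus_raft-network raft-node1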

Summary

In this chapter, you learned:

  • Why leader election matters: Prevents split-brain and confusion
  • Raft's democratic process: Nodes vote for each other
  • State transitions: Follower → Candidate → Leader
  • RequestVote RPC: The ballot paper of Raft elections
  • Randomized timeouts: Prevent split votes and deadlocks
  • Election safety: At most one leader per term

Next Chapter: Log Replication — Once we have a leader, how do we safely replicate data across the cluster?

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

Log Replication

Session 10, Part 1 - 30 minutes

Learning Objectives

  • Understand how Raft replicates logs across nodes
  • Learn the log matching property that ensures consistency
  • Implement the AppendEntries RPC
  • Handle log consistency conflicts
  • Understand commit index and state machine application

Concept: Keeping Everyone in Sync

Once a leader is elected, it needs to replicate client commands to all followers. This is the log replication phase of Raft.

The Challenge

Client sends "SET x = 5" to Leader

┌──────────┐         ┌──────────┐         ┌──────────┐
│  Leader  │         │ Follower │         │ Follower │
│          │         │    A     │         │    B     │
└────┬─────┘         └──────────┘         └──────────┘
     │
     │ How do we ensure ALL nodes
     │ have the SAME command log?
     │
     │ What if network fails?
     │ What if follower crashes?
     ▼

┌─────────────────────────────────────────┐
│         Log Replication Protocol        │
└─────────────────────────────────────────┘

Log Structure

Each node maintains a log of commands. A log entry contains:

interface LogEntry {
  index: number;      // Position in the log (starts at 1)
  term: number;       // Term when entry was received
  command: string;    // The actual command (e.g., "SET x = 5")
}

@dataclass
class LogEntry:
    index: int       # Position in the log (starts at 1)
    term: int        # Term when entry was received
    command: str     # The actual command (e.g., "SET x = 5")

Visual Log Representation

Node 1 (Leader)               Node 2 (Follower)             Node 3 (Follower)
┌───────┬──────┬────┐         ┌───────┬──────┬────┐         ┌───────┬──────┬────┐
│ Index │ Term │ Cmd│         │ Index │ Term │ Cmd│         │ Index │ Term │ Cmd│
├───────┼──────┼────┤         ├───────┼──────┼────┤         ├───────┼──────┼────┤
│   1   │  1   │SET │         │   1   │  1   │SET │         │   1   │  1   │SET │
│   2   │  2   │SET │         │   2   │  2   │SET │         │   2   │  2   │SET │
│   3   │  2   │SET │         │   3   │  2   │SET │         │       │      │    │
│   4   │  2   │SET │         │       │      │    │         │       │      │    │
└───────┴──────┴────┘         └───────┴──────┴────┘         └───────┴──────┴────┘

The Log Matching Property

This is Raft's key safety guarantee. If two logs contain an entry with the same index and term, then all preceding entries are identical and in the same order.

         Log Matching Property
┌────────────────────────────────────────────────────────┐
│                                                        │
│   If logA[i].term == logB[i].term                      │
│   (two logs, same index i, same term)                  │
│                                                        │
│   THEN:                                                │
│   logA[k] == logB[k] for all k <= i                    │
│                                                        │
└────────────────────────────────────────────────────────┘

Example:

Node A: [1,1] [2,1] [3,2] [4,2] [5,2]
                      │
Node B: [1,1] [2,1] [3,2] [4,2] [5,3] [6,3]
                      │
                      └─ Same index 3, term 2
                          Therefore entries 1-3 are IDENTICAL

This property allows Raft to efficiently detect and fix inconsistencies.
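
As a small illustration: to find where two logs diverge, it is enough to compare terms index by index, because the property guarantees everything before the first mismatch is identical. A sketch (LogEntry shape as defined in this chapter):

type Entry = { index: number; term: number; command: string };

// Returns the first 1-based log index at which the two logs disagree,
// or null if one log is simply a prefix of the other (nothing to repair).
function firstDivergence(a: Entry[], b: Entry[]): number | null {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i].term !== b[i].term) return i + 1;
  }
  return null;
}

// With the example above (terms A = [1,1,2,2,2], B = [1,1,2,2,3,3]):
// firstDivergence(...) === 5, so entries 1-4 are identical and repair starts at 5.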


AppendEntries RPC

The leader uses AppendEntries to replicate log entries and also as a heartbeat.

RPC Specification

interface AppendEntriesRequest {
  term: number;           // Leader's term
  leaderId: string;       // So follower can redirect clients
  prevLogIndex: number;   // Index of log entry immediately preceding new ones
  prevLogTerm: number;    // Term of prevLogIndex entry
  entries: LogEntry[];    // Log entries to store (empty for heartbeat)
  leaderCommit: number;   // Leader's commit index
}

interface AppendEntriesResponse {
  term: number;           // Current term, for leader to update itself
  success: boolean;       // True if follower contained entry matching prevLogIndex/term
}

@dataclass
class AppendEntriesRequest:
    term: int              # Leader's term
    leader_id: str         # So follower can redirect clients
    prev_log_index: int    # Index of log entry immediately preceding new ones
    prev_log_term: int     # Term of prevLogIndex entry
    entries: List[LogEntry]  # Log entries to store (empty for heartbeat)
    leader_commit: int     # Leader's commit index

@dataclass
class AppendEntriesResponse:
    term: int              # Current term, for leader to update itself
    success: bool          # True if follower contained entry matching prevLogIndex/term

Log Replication Flow

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant F3 as Follower 3

    C->>L: SET x = 5
    L->>L: Append to log (uncommitted)
    L->>F1: AppendEntries(entries=[SET x=5], prevLogIndex=2, prevLogTerm=1)
    L->>F2: AppendEntries(entries=[SET x=5], prevLogIndex=2, prevLogTerm=1)
    L->>F3: AppendEntries(entries=[SET x=5], prevLogIndex=2, prevLogTerm=1)

    F1->>F1: Append to log, reply success
    F2->>F2: Append to log, reply success
    F3->>F3: Append to log, reply success

    Note over L: Majority reached (needs 3 of 4 nodes, incl. leader)
    L->>L: Commit index = 3
    L->>L: Apply to state machine: x = 5
    L->>C: Return success (x = 5)

    L->>F1: AppendEntries(entries=[], leaderCommit=3)
    L->>F2: AppendEntries(entries=[], leaderCommit=3)
    L->>F3: AppendEntries(entries=[], leaderCommit=3)

    F1->>F1: Apply committed entries
    F2->>F2: Apply committed entries
    F3->>F3: Apply committed entries

Handling Consistency Conflicts

When a follower's log conflicts with the leader's, the leader resolves it:

graph TD
    A[Leader sends AppendEntries] --> B{Follower checks<br/>prevLogIndex/term}
    B -->|Match found| C[Append new entries<br/>Return success=true]
    B -->|No match| D[Return success=false]

    D --> E[Leader decrements<br/>nextIndex for follower]
    E --> F{Retry with<br/>earlier log position?}

    F -->|Yes| A
    F -->|No match at index 0| G[Append leader's<br/>entire log]

    C --> H[Follower updates<br/>commit index if needed]
    H --> I[Apply committed entries<br/>to state machine]

Conflict Example

Before Conflict Resolution:

Leader:  [1,1] [2,2] [3,2]
Follower:[1,1] [2,1] [3,1] [4,3]  ← Diverged at index 2!

Step 1: Leader sends AppendEntries(prevLogIndex=2, prevLogTerm=2)
        Follower: No match! (has term 1, not 2) → Return success=false

Step 2: Leader decrements nextIndex, sends AppendEntries(prevLogIndex=1, prevLogTerm=1)
        Follower: Match! → Return success=true

Step 3: Leader sends entries starting from index 2
        Follower overwrites [2,1] [3,1] [4,3] with [2,2] [3,2]

After Conflict Resolution:

Leader:  [1,1] [2,2] [3,2]
Follower:[1,1] [2,2] [3,2]  ← Now consistent!

Commit Index

The commit index tracks which log entries are committed (durable and safe to apply).

// Field and helper on RaftNode (leader side):
private commitIndex = 0;  // Index of highest committed entry

// Leader rule: an entry from the current term is committed
// once it is stored on a majority of servers
private updateCommitIndex(): void {
  const N = this.log.length;

  // Find the largest i <= N such that:
  // 1. A majority of nodes have log entries up to i
  // 2. log[i].term == currentTerm (safety rule!)
  for (let i = N; i > this.commitIndex; i--) {
    if (this.log[i - 1].term === this.currentTerm && this.isMajorityReplicated(i)) {
      this.commitIndex = i;
      break;
    }
  }
}

# Field and helper on RaftNode (leader side):
commit_index: int = 0  # Index of highest committed entry

# Leader rule: an entry from the current term is committed
# once it is stored on a majority of servers
def update_commit_index(self) -> None:
    N = len(self.log)

    # Find the largest i <= N such that:
    # 1. A majority of nodes have log entries up to i
    # 2. log[i].term == current_term (safety rule!)
    for i in range(N, self.commit_index, -1):
        if self.log[i - 1].term == self.current_term and self.is_majority_replicated(i):
            self.commit_index = i
            break

Safety Rule: Only Commit Current Term Entries

graph LR
    A[Entry from<br/>previous term] -->|Can be<br/>committed| B[When current<br/>term entry exists]
    C[Entry from<br/>current term] -->|Can be<br/>committed| D[When replicated<br/>to majority]

    B --> E[Applied to<br/>state machine]
    D --> E

    style B fill:#f99
    style D fill:#9f9

Why? An entry from a previous term can still be overwritten by a future leader even after it reaches a majority, so counting replicas alone is not proof of safety. Such entries become committed only indirectly, once an entry from the leader's current term commits after them.


TypeScript Implementation

Let's extend our Raft implementation with log replication:

// types.ts
export enum NodeState {
  Follower = 'follower',
  Candidate = 'candidate',
  Leader = 'leader'
}

export interface LogEntry {
  index: number;
  term: number;
  command: string;
}

export interface AppendEntriesRequest {
  term: number;
  leaderId: string;
  prevLogIndex: number;
  prevLogTerm: number;
  entries: LogEntry[];
  leaderCommit: number;
}

export interface AppendEntriesResponse {
  term: number;
  success: boolean;
}

// raft-node.ts
import { NodeState, LogEntry, AppendEntriesRequest, AppendEntriesResponse } from './types';

export class RaftNode {
  private log: LogEntry[] = [];
  private commitIndex = 0;
  private lastApplied = 0;

  // For each follower, track next log index to send
  private nextIndex: Map<string, number> = new Map();
  private matchIndex: Map<string, number> = new Map();

  // ... (previous code from leader election)

  /**
   * Handle AppendEntries RPC from leader
   */
  handleAppendEntries(req: AppendEntriesRequest): AppendEntriesResponse {
    // Reply false if term < currentTerm
    if (req.term < this.currentTerm) {
      return { term: this.currentTerm, success: false };
    }

    // Update current term if needed
    if (req.term > this.currentTerm) {
      this.currentTerm = req.term;
      this.state = NodeState.Follower;
      this.votedFor = null;
    }

    // Reset election timeout
    this.resetElectionTimeout();

    // Check log consistency
    if (req.prevLogIndex > 0) {
      if (this.log.length < req.prevLogIndex) {
        return { term: this.currentTerm, success: false };
      }

      const prevEntry = this.log[req.prevLogIndex - 1];
      if (prevEntry.term !== req.prevLogTerm) {
        return { term: this.currentTerm, success: false };
      }
    }

    // Append new entries
    if (req.entries.length > 0) {
      // Find first conflicting entry
      let insertIndex = req.prevLogIndex;
      for (const entry of req.entries) {
        if (insertIndex < this.log.length) {
          const existing = this.log[insertIndex];
          if (existing.index === entry.index && existing.term === entry.term) {
            // Already matches, skip
            insertIndex++;
            continue;
          }
          // Conflict! Delete from here and append
          this.log = this.log.slice(0, insertIndex);
        }
        this.log.push(entry);
        insertIndex++;
      }
    }

    // Update commit index
    if (req.leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(req.leaderCommit, this.log.length);
      this.applyCommittedEntries();
    }

    return { term: this.currentTerm, success: true };
  }

  /**
   * Apply committed entries to state machine
   */
  private applyCommittedEntries(): void {
    while (this.lastApplied < this.commitIndex) {
      this.lastApplied++;
      const entry = this.log[this.lastApplied - 1];
      this.stateMachine.apply(entry);
      console.log(`Node ${this.nodeId} applied: ${entry.command}`);
    }
  }

  /**
   * Leader: replicate log to followers
   */
  private replicateLog(): void {
    if (this.state !== NodeState.Leader) return;

    for (const followerId of this.clusterConfig.peerIds) {
      const nextIdx = this.nextIndex.get(followerId) || 1;

      const prevLogIndex = nextIdx - 1;
      const prevLogTerm = prevLogIndex > 0 ? this.log[prevLogIndex - 1].term : 0;
      const entries = this.log.slice(nextIdx - 1);

      const req: AppendEntriesRequest = {
        term: this.currentTerm,
        leaderId: this.nodeId,
        prevLogIndex,
        prevLogTerm,
        entries,
        leaderCommit: this.commitIndex,
      };

      this.sendAppendEntries(followerId, req);
    }
  }

  /**
   * Leader: handle AppendEntries response
   */
  private handleAppendEntriesResponse(
    followerId: string,
    resp: AppendEntriesResponse,
    req: AppendEntriesRequest
  ): void {
    if (this.state !== NodeState.Leader) return;

    if (resp.term > this.currentTerm) {
      // Follower has higher term, step down
      this.currentTerm = resp.term;
      this.state = NodeState.Follower;
      this.votedFor = null;
      return;
    }

    if (resp.success) {
      // Update match index and next index
      const lastIndex = req.prevLogIndex + req.entries.length;
      this.matchIndex.set(followerId, lastIndex);
      this.nextIndex.set(followerId, lastIndex + 1);

      // Try to commit more entries
      this.updateCommitIndex();
    } else {
      // Follower's log is inconsistent, backtrack
      const currentNext = this.nextIndex.get(followerId) || 1;
      this.nextIndex.set(followerId, Math.max(1, currentNext - 1));

      // Retry immediately
      setTimeout(() => this.replicateLog(), 50);
    }
  }

  /**
   * Leader: update commit index if majority has entry
   */
  private updateCommitIndex(): void {
    if (this.state !== NodeState.Leader) return;

    const N = this.log.length;

    // Find the largest N such that a majority have log entries up to N
    for (let i = N; i > this.commitIndex; i--) {
      if (this.log[i - 1].term !== this.currentTerm) {
        // Only commit entries from current term
        continue;
      }

      let count = 1; // Leader has it
      for (const matchIdx of this.matchIndex.values()) {
        if (matchIdx >= i) count++;
      }

      // Majority of the full cluster (peers + this leader)
      const majority = Math.floor((this.clusterConfig.peerIds.length + 1) / 2) + 1;
      if (count >= majority) {
        this.commitIndex = i;
        this.applyCommittedEntries();
        break;
      }
    }
  }

  /**
   * Client: submit command to cluster
   */
  async submitCommand(command: string): Promise<void> {
    if (this.state !== NodeState.Leader) {
      throw new Error('Not a leader. Redirect to actual leader.');
    }

    // Append to local log
    const entry: LogEntry = {
      index: this.log.length + 1,
      term: this.currentTerm,
      command,
    };
    this.log.push(entry);

    // Replicate to followers
    this.replicateLog();

    // Wait for commit
    await this.waitForCommit(entry.index);
  }

  private async waitForCommit(index: number): Promise<void> {
    return new Promise((resolve) => {
      const check = () => {
        if (this.commitIndex >= index) {
          resolve();
        } else {
          setTimeout(check, 50);
        }
      };
      check();
    });
  }
}

Python Implementation

# types.py
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LogEntry:
    index: int
    term: int
    command: str

@dataclass
class AppendEntriesRequest:
    term: int
    leader_id: str
    prev_log_index: int
    prev_log_term: int
    entries: List[LogEntry]
    leader_commit: int

@dataclass
class AppendEntriesResponse:
    term: int
    success: bool

# raft_node.py
import asyncio
from enum import Enum
from typing import List, Dict, Optional

from types import LogEntry, AppendEntriesRequest, AppendEntriesResponse  # local types.py above
from state_machine import StateMachine

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftNode:
    def __init__(self, node_id: str, peer_ids: List[str]):
        self.node_id = node_id
        self.peer_ids = peer_ids

        # Persistent state
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = []

        # Volatile state
        self.commit_index = 0
        self.last_applied = 0
        self.state = NodeState.FOLLOWER

        # Leader state
        self.next_index: Dict[str, int] = {}
        self.match_index: Dict[str, int] = {}

        # State machine
        self.state_machine = StateMachine()

        # Election timeout
        self.election_timeout: Optional[asyncio.Task] = None

    async def handle_append_entries(self, req: AppendEntriesRequest) -> AppendEntriesResponse:
        """Handle AppendEntries RPC from leader"""

        # Reply false if term < currentTerm
        if req.term < self.current_term:
            return AppendEntriesResponse(term=self.current_term, success=False)

        # Update current term if needed
        if req.term > self.current_term:
            self.current_term = req.term
            self.state = NodeState.FOLLOWER
            self.voted_for = None

        # Reset election timeout
        self.reset_election_timeout()

        # Check log consistency
        if req.prev_log_index > 0:
            if len(self.log) < req.prev_log_index:
                return AppendEntriesResponse(term=self.current_term, success=False)

            prev_entry = self.log[req.prev_log_index - 1]
            if prev_entry.term != req.prev_log_term:
                return AppendEntriesResponse(term=self.current_term, success=False)

        # Append new entries
        if req.entries:
            # Find first conflicting entry
            insert_index = req.prev_log_index
            for entry in req.entries:
                if insert_index < len(self.log):
                    existing = self.log[insert_index]
                    if existing.index == entry.index and existing.term == entry.term:
                        # Already matches, skip
                        insert_index += 1
                        continue
                    # Conflict! Delete from here and append
                    self.log = self.log[:insert_index]
                self.log.append(entry)
                insert_index += 1

        # Update commit index
        if req.leader_commit > self.commit_index:
            self.commit_index = min(req.leader_commit, len(self.log))
            await self.apply_committed_entries()

        return AppendEntriesResponse(term=self.current_term, success=True)

    async def apply_committed_entries(self):
        """Apply committed entries to state machine"""
        while self.last_applied < self.commit_index:
            self.last_applied += 1
            entry = self.log[self.last_applied - 1]
            self.state_machine.apply(entry)
            print(f"Node {self.node_id} applied: {entry.command}")

    async def replicate_log(self):
        """Leader: replicate log to followers"""
        if self.state != NodeState.LEADER:
            return

        for follower_id in self.peer_ids:
            next_idx = self.next_index.get(follower_id, 1)

            prev_log_index = next_idx - 1
            prev_log_term = self.log[prev_log_index - 1].term if prev_log_index > 0 else 0
            entries = self.log[next_idx - 1:]

            req = AppendEntriesRequest(
                term=self.current_term,
                leader_id=self.node_id,
                prev_log_index=prev_log_index,
                prev_log_term=prev_log_term,
                entries=entries,
                leader_commit=self.commit_index
            )

            await self.send_append_entries(follower_id, req)

    async def handle_append_entries_response(
        self,
        follower_id: str,
        resp: AppendEntriesResponse,
        req: AppendEntriesRequest
    ):
        """Leader: handle AppendEntries response"""
        if self.state != NodeState.LEADER:
            return

        if resp.term > self.current_term:
            # Follower has higher term, step down
            self.current_term = resp.term
            self.state = NodeState.FOLLOWER
            self.voted_for = None
            return

        if resp.success:
            # Update match index and next index
            last_index = req.prev_log_index + len(req.entries)
            self.match_index[follower_id] = last_index
            self.next_index[follower_id] = last_index + 1

            # Try to commit more entries
            await self.update_commit_index()
        else:
            # Follower's log is inconsistent, backtrack
            current_next = self.next_index.get(follower_id, 1)
            self.next_index[follower_id] = max(1, current_next - 1)

            # Retry immediately
            asyncio.create_task(self.replicate_log())

    async def update_commit_index(self):
        """Leader: update commit index if majority has entry"""
        if self.state != NodeState.LEADER:
            return

        N = len(self.log)

        # Find the largest N such that a majority have log entries up to N
        for i in range(N, self.commit_index, -1):
            if self.log[i - 1].term != self.current_term:
                # Only commit entries from current term
                continue

            count = 1  # Leader has it
            for match_idx in self.match_index.values():
                if match_idx >= i:
                    count += 1

            # Majority of the full cluster (peers + this leader)
            majority = (len(self.peer_ids) + 1) // 2 + 1
            if count >= majority:
                self.commit_index = i
                await self.apply_committed_entries()
                break

    async def submit_command(self, command: str) -> None:
        """Client: submit command to cluster"""
        if self.state != NodeState.LEADER:
            raise Exception("Not a leader. Redirect to actual leader.")

        # Append to local log
        entry = LogEntry(
            index=len(self.log) + 1,
            term=self.current_term,
            command=command
        )
        self.log.append(entry)

        # Replicate to followers
        await self.replicate_log()

        # Wait for commit
        await self._wait_for_commit(entry.index)

    async def _wait_for_commit(self, index: int):
        """Wait for an entry to be committed"""
        while self.commit_index < index:
            await asyncio.sleep(0.05)

# state_machine.py
from typing import Dict

from types import LogEntry  # the local types.py above

class StateMachine:
    """Simple key-value store state machine"""
    def __init__(self):
        self.data: Dict[str, str] = {}

    def apply(self, entry: LogEntry):
        """Apply a committed log entry to the state machine"""
        parts = entry.command.split()
        if parts[0] == "SET" and len(parts) == 3:
            key, value = parts[1], parts[2]
            self.data[key] = value
            print(f"Applied: {key} = {value}")
        elif parts[0] == "DELETE" and len(parts) == 2:
            key = parts[1]
            if key in self.data:
                del self.data[key]
                print(f"Deleted: {key}")

Testing Log Replication

TypeScript Test

// test-log-replication.ts
// Sketch: assumes a test-friendly RaftNode(nodeId, peerIds) constructor and
// public becomeLeader()/getLog()/nodeId helpers for driving the scenario.
async function testLogReplication() {
  const nodes = [
    new RaftNode('node1', ['node2', 'node3']),
    new RaftNode('node2', ['node1', 'node3']),
    new RaftNode('node3', ['node1', 'node2']),
  ];

  // Simulate leader election (node1 wins)
  await nodes[0].becomeLeader();

  // Submit command to leader
  await nodes[0].submitCommand('SET x = 5');

  // Verify all nodes have the entry
  for (const node of nodes) {
    const entry = node.getLog()[0];
    console.log(`${node.nodeId}: ${entry.command}`);
  }
}

Python Test

# test_log_replication.py
# Sketch: assumes a test-friendly RaftNode(node_id, peer_ids) constructor and
# public become_leader()/get_log() helpers for driving the scenario.
import asyncio

async def test_log_replication():
    nodes = [
        RaftNode('node1', ['node2', 'node3']),
        RaftNode('node2', ['node1', 'node3']),
        RaftNode('node3', ['node1', 'node2']),
    ]

    # Simulate leader election (node1 wins)
    await nodes[0].become_leader()

    # Submit command to leader
    await nodes[0].submit_command('SET x = 5')

    # Verify all nodes have the entry
    for node in nodes:
        entry = node.get_log()[0]
        print(f"{node.node_id}: {entry.command}")

asyncio.run(test_log_replication())

Exercises

Exercise 1: Basic Log Replication

  1. Start a 3-node cluster
  2. Elect a leader
  3. Submit SET x = 10 to the leader
  4. Verify the entry is on all nodes
  5. Check commit index advancement

Expected Result: The entry appears on all nodes after being committed.

Exercise 2: Conflict Resolution

  1. Start a 3-node cluster
  2. Create a log divergence (manually edit follower logs)
  3. Have the leader replicate new entries
  4. Observe how the follower's log is corrected

Expected Result: The follower's conflicting entries are overwritten.

Exercise 3: Commit Index Safety

  1. Start a 5-node cluster
  2. Partition the network (2 nodes isolated)
  3. Submit commands to the leader
  4. Verify entries are committed with majority (3 nodes)
  5. Heal the partition
  6. Verify isolated nodes catch up

Expected Result: Commands commit with 3 nodes, isolated nodes catch up after healing.

Exercise 4: State Machine Application

  1. Implement a key-value store state machine
  2. Submit multiple SET commands
  3. Verify the state machine applies them in order
  4. Kill and restart a node
  5. Verify the state machine is rebuilt from the log

Expected Result: State machine reflects all committed commands, even after restart.


Common Pitfalls

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Committing previous-term entries | Entries get lost | Only commit entries from the current term |
| Not applying entries in order | Inconsistent state | Apply from lastApplied+1 to commitIndex |
| Infinite conflict resolution loop | CPU spike | Ensure nextIndex doesn't go below 1 |
| Applying uncommitted entries | Data loss on leader crash | Never apply before commitIndex |

Key Takeaways

  1. Log replication ensures all nodes execute the same commands in the same order
  2. AppendEntries RPC handles both replication and heartbeats
  3. Log matching property enables efficient conflict resolution
  4. Commit index tracks which entries are safely replicated
  5. State machine applies committed entries deterministically

Next: Complete Consensus System Implementation →

🧠 Chapter Quiz

Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.

Consensus System Implementation

Session 10, Part 2 - 60 minutes

Learning Objectives

  • Build a complete Raft-based consensus system
  • Implement a state machine abstraction (key-value store)
  • Create client APIs for get/set operations
  • Deploy and test the full system with Docker Compose
  • Verify safety and liveness properties

Overview: Putting It All Together

In the previous chapters, we implemented Raft's two core components:

  1. Leader Election (Session 9) - Democratic voting to select a leader
  2. Log Replication (Session 10, Part 1) - Replicating commands across nodes

Now we combine them into a complete consensus system - a distributed key-value store that provides strong consistency guarantees.

┌────────────────────────────────────────────────────────────┐
│                   Raft Consensus System                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Client ──→ Leader ──→ Log Replication ──→ Followers      │
│     │          │                               │           │
│     │          ▼                               │           │
│     │    Leader Election (if needed)          │           │
│     │          │                               │           │
│     ▼          ▼                               ▼           │
│     State Machine (all nodes apply same commands)          │
│                                                            │
└────────────────────────────────────────────────────────────┘

System Architecture

graph TB
    subgraph "Client Layer"
        C1[Client 1]
        C2[Client 2]
    end

    subgraph "Raft Cluster"
        N1[Node 1: Leader]
        N2[Node 2: Follower]
        N3[Node 3: Follower]
    end

    subgraph "State Machine Layer"
        SM1[KV Store 1]
        SM2[KV Store 2]
        SM3[KV Store 3]
    end

    C1 -->|SET/GET| N1
    C2 -->|SET/GET| N1

    N1 <-->|AppendEntries RPC| N2
    N1 <-->|AppendEntries RPC| N3
    N2 <-->|RequestVote RPC| N3

    N1 --> SM1
    N2 --> SM2
    N3 --> SM3

    style N1 fill:#9f9
    style N2 fill:#fc9
    style N3 fill:#fc9

Complete TypeScript Implementation

Project Structure

typescript-raft/
├── package.json
├── tsconfig.json
├── src/
│   ├── types.ts              # Shared types
│   ├── state-machine.ts      # KV store state machine
│   ├── raft-node.ts          # Complete Raft implementation
│   ├── server.ts             # HTTP API server
│   └── index.ts              # Entry point
└── docker-compose.yml

package.json

{
  "name": "typescript-raft-kv-store",
  "version": "1.0.0",
  "description": "Distributed key-value store using Raft consensus",
  "main": "dist/index.js",
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "ts-node src/index.ts"
  },
  "dependencies": {
    "express": "^4.18.2",
    "axios": "^1.6.0"
  },
  "devDependencies": {
    "@types/express": "^4.17.21",
    "@types/node": "^20.10.0",
    "ts-node": "^10.9.2",
    "typescript": "^5.3.3"
  }
}

types.ts

// Node states
export enum NodeState {
  FOLLOWER = 'follower',
  CANDIDATE = 'candidate',
  LEADER = 'leader'
}

// Log entry
export interface LogEntry {
  index: number;
  term: number;
  command: string;
}

// RequestVote RPC
export interface RequestVoteRequest {
  term: number;
  candidateId: string;
  lastLogIndex: number;
  lastLogTerm: number;
}

export interface RequestVoteResponse {
  term: number;
  voteGranted: boolean;
}

// AppendEntries RPC
export interface AppendEntriesRequest {
  term: number;
  leaderId: string;
  prevLogIndex: number;
  prevLogTerm: number;
  entries: LogEntry[];
  leaderCommit: number;
}

export interface AppendEntriesResponse {
  term: number;
  success: boolean;
}

// Client commands
export interface SetCommand {
  type: 'SET';
  key: string;
  value: string;
}

export interface GetCommand {
  type: 'GET';
  key: string;
}

export interface DeleteCommand {
  type: 'DELETE';
  key: string;
}

export type Command = SetCommand | GetCommand | DeleteCommand;

state-machine.ts

import { LogEntry } from './types';

/**
 * Key-Value Store State Machine
 * Applies committed log entries to build consistent state
 */
export class KVStoreStateMachine {
  private data: Map<string, string> = new Map();

  /**
   * Apply a committed log entry to the state machine
   */
  apply(entry: LogEntry): void {
    try {
      const command = JSON.parse(entry.command);

      switch (command.type) {
        case 'SET':
          this.data.set(command.key, command.value);
          console.log(`[State Machine] SET ${command.key} = ${command.value}`);
          break;

        case 'DELETE':
          if (this.data.has(command.key)) {
            this.data.delete(command.key);
            console.log(`[State Machine] DELETE ${command.key}`);
          }
          break;

        case 'GET':
          // Read-only commands don't modify state
          break;

        default:
          console.warn(`[State Machine] Unknown command type: ${command.type}`);
      }
    } catch (error) {
      console.error(`[State Machine] Failed to apply entry:`, error);
    }
  }

  /**
   * Get a value from the state machine
   */
  get(key: string): string | undefined {
    return this.data.get(key);
  }

  /**
   * Get all key-value pairs
   */
  getAll(): Record<string, string> {
    return Object.fromEntries(this.data);
  }

  /**
   * Clear the state machine (for testing)
   */
  clear(): void {
    this.data.clear();
  }
}
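
A quick usage check (hypothetical snippet, not one of the project files; commands are JSON-encoded exactly as submitCommand does below):

import { KVStoreStateMachine } from './state-machine';

const sm = new KVStoreStateMachine();
sm.apply({ index: 1, term: 1, command: JSON.stringify({ type: 'SET', key: 'x', value: '5' }) });
sm.apply({ index: 2, term: 1, command: JSON.stringify({ type: 'SET', key: 'y', value: '7' }) });
sm.apply({ index: 3, term: 1, command: JSON.stringify({ type: 'DELETE', key: 'y' }) });

console.log(sm.get('x'));   // "5"
console.log(sm.getAll());   // { x: "5" }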

raft-node.ts

import {
  NodeState,
  LogEntry,
  RequestVoteRequest,
  RequestVoteResponse,
  AppendEntriesRequest,
  AppendEntriesResponse,
  Command
} from './types';
import { KVStoreStateMachine } from './state-machine';
import axios from 'axios';

interface ClusterConfig {
  nodeId: string;
  peerIds: string[];
  electionTimeoutMin: number;
  electionTimeoutMax: number;
  heartbeatInterval: number;
}

export class RaftNode {
  // Configuration
  private config: ClusterConfig;

  // Persistent state (survives restarts)
  private currentTerm: number = 0;
  private votedFor: string | null = null;
  private log: LogEntry[] = [];

  // Volatile state (reset on restart)
  private commitIndex: number = 0;
  private lastApplied: number = 0;
  private state: NodeState = NodeState.FOLLOWER;

  // Leader state (reset on election)
  private nextIndex: Map<string, number> = new Map();
  private matchIndex: Map<string, number> = new Map();

  // Components
  private stateMachine: KVStoreStateMachine;
  private leaderId: string | null = null;

  // Timers
  private electionTimer: NodeJS.Timeout | null = null;
  private heartbeatTimer: NodeJS.Timeout | null = null;

  constructor(config: ClusterConfig) {
    this.config = config;
    this.stateMachine = new KVStoreStateMachine();
    this.resetElectionTimeout();
  }

  // ========== Public API ==========

  /**
   * Client: Submit a command to the cluster
   */
  async submitCommand(command: Command): Promise<any> {
    // Redirect to leader if not leader
    if (this.state !== NodeState.LEADER) {
      if (this.leaderId) {
        throw new Error(`Not a leader. Please redirect to ${this.leaderId}`);
      }
      throw new Error('No leader known. Please retry.');
    }

    // Handle GET commands (read-only, served straight from the leader's
    // state machine; simplification: strictly linearizable reads would
    // need extra care, e.g. Raft's ReadIndex or leader leases)
    if (command.type === 'GET') {
      return this.stateMachine.get(command.key);
    }

    // Append to local log
    const entry: LogEntry = {
      index: this.log.length + 1,
      term: this.currentTerm,
      command: JSON.stringify(command)
    };
    this.log.push(entry);

    // Replicate to followers
    this.replicateLog();

    // Wait for commit
    await this.waitForCommit(entry.index);

    // Return result
    if (command.type === 'SET') {
      return { key: command.key, value: command.value };
    } else if (command.type === 'DELETE') {
      return { key: command.key, deleted: true };
    }
  }

  /**
   * Start the node (begin election timeout)
   */
  start(): void {
    this.resetElectionTimeout();
  }

  /**
   * Stop the node (clear timers)
   */
  stop(): void {
    if (this.electionTimer) clearTimeout(this.electionTimer);
    if (this.heartbeatTimer) clearTimeout(this.heartbeatTimer);
  }

  // ========== RPC Handlers ==========

  /**
   * Handle RequestVote RPC
   */
  handleRequestVote(req: RequestVoteRequest): RequestVoteResponse {
    // If term < currentTerm, reject
    if (req.term < this.currentTerm) {
      return { term: this.currentTerm, voteGranted: false };
    }

    // If term > currentTerm, update and become follower
    if (req.term > this.currentTerm) {
      this.currentTerm = req.term;
      this.state = NodeState.FOLLOWER;
      this.votedFor = null;
    }

    // Grant the vote if we haven't voted this term (or already voted
    // for this same candidate) AND the candidate's log is at least as
    // up-to-date as ours
    const logOk = req.lastLogTerm > this.getLastLogTerm() ||
      (req.lastLogTerm === this.getLastLogTerm() && req.lastLogIndex >= this.log.length);

    const canVote = this.votedFor === null || this.votedFor === req.candidateId;

    if (canVote && logOk) {
      this.votedFor = req.candidateId;
      this.resetElectionTimeout();
      return { term: this.currentTerm, voteGranted: true };
    }

    return { term: this.currentTerm, voteGranted: false };
  }

  /**
   * Handle AppendEntries RPC
   */
  handleAppendEntries(req: AppendEntriesRequest): AppendEntriesResponse {
    // If term < currentTerm, reject
    if (req.term < this.currentTerm) {
      return { term: this.currentTerm, success: false };
    }

    // Recognize leader
    this.leaderId = req.leaderId;

    // A valid AppendEntries means a leader exists for this term:
    // update our term if needed and step down (a candidate in the
    // same term must also return to follower)
    if (req.term > this.currentTerm) {
      this.currentTerm = req.term;
      this.votedFor = null;
    }
    this.state = NodeState.FOLLOWER;

    // Reset election timeout
    this.resetElectionTimeout();

    // Check log consistency
    if (req.prevLogIndex > 0) {
      if (this.log.length < req.prevLogIndex) {
        return { term: this.currentTerm, success: false };
      }

      const prevEntry = this.log[req.prevLogIndex - 1];
      if (prevEntry.term !== req.prevLogTerm) {
        return { term: this.currentTerm, success: false };
      }
    }

    // Append new entries
    if (req.entries.length > 0) {
      let insertIndex = req.prevLogIndex;
      for (const entry of req.entries) {
        if (insertIndex < this.log.length) {
          const existing = this.log[insertIndex];
          if (existing.index === entry.index && existing.term === entry.term) {
            insertIndex++;
            continue;
          }
          // Conflict! Delete from here
          this.log = this.log.slice(0, insertIndex);
        }
        this.log.push(entry);
        insertIndex++;
      }
    }

    // Update commit index
    if (req.leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(req.leaderCommit, this.log.length);
      this.applyCommittedEntries();
    }

    return { term: this.currentTerm, success: true };
  }

  // ========== Private Methods ==========

  /**
   * Start election (convert to candidate)
   */
  private startElection(): void {
    this.state = NodeState.CANDIDATE;
    this.currentTerm++;
    this.votedFor = this.config.nodeId;
    this.leaderId = null;

    console.log(`[Node ${this.config.nodeId}] Starting election for term ${this.currentTerm}`);

    // Request votes from peers
    const req: RequestVoteRequest = {
      term: this.currentTerm,
      candidateId: this.config.nodeId,
      lastLogIndex: this.log.length,
      lastLogTerm: this.getLastLogTerm()
    };

    let votesReceived = 1; // Vote for self
    // Majority of the full cluster (peers plus this node)
    const majority = Math.floor((this.config.peerIds.length + 1) / 2) + 1;

    for (const peerId of this.config.peerIds) {
      this.sendRequestVote(peerId, req).then(resp => {
        if (resp.voteGranted) {
          votesReceived++;
          if (votesReceived >= majority && this.state === NodeState.CANDIDATE) {
            this.becomeLeader();
          }
        } else if (resp.term > this.currentTerm) {
          this.currentTerm = resp.term;
          this.state = NodeState.FOLLOWER;
          this.votedFor = null;
        }
      }).catch(() => {
        // Peer unavailable, ignore
      });
    }

    // Reset election timeout for next round
    this.resetElectionTimeout();
  }

  /**
   * Become leader after winning election
   */
  private becomeLeader(): void {
    this.state = NodeState.LEADER;
    this.leaderId = this.config.nodeId;
    console.log(`[Node ${this.config.nodeId}] Became leader for term ${this.currentTerm}`);

    // Initialize leader state
    for (const peerId of this.config.peerIds) {
      this.nextIndex.set(peerId, this.log.length + 1);
      this.matchIndex.set(peerId, 0);
    }

    // Start sending heartbeats
    this.startHeartbeats();
  }

  /**
   * Send heartbeats to all followers
   */
  private startHeartbeats(): void {
    if (this.heartbeatTimer) clearInterval(this.heartbeatTimer);

    this.heartbeatTimer = setInterval(() => {
      if (this.state === NodeState.LEADER) {
        this.replicateLog();
      }
    }, this.config.heartbeatInterval);
  }

  /**
   * Replicate log to followers (also sends heartbeats)
   */
  private replicateLog(): void {
    if (this.state !== NodeState.LEADER) return;

    for (const followerId of this.config.peerIds) {
      const nextIdx = this.nextIndex.get(followerId) || 1;
      const prevLogIndex = nextIdx - 1;
      const prevLogTerm = prevLogIndex > 0 ? this.log[prevLogIndex - 1].term : 0;
      const entries = this.log.slice(nextIdx - 1);

      const req: AppendEntriesRequest = {
        term: this.currentTerm,
        leaderId: this.config.nodeId,
        prevLogIndex,
        prevLogTerm,
        entries,
        leaderCommit: this.commitIndex
      };

      this.sendAppendEntries(followerId, req).then(resp => {
        if (this.state !== NodeState.LEADER) return;

        if (resp.term > this.currentTerm) {
          this.currentTerm = resp.term;
          this.state = NodeState.FOLLOWER;
          this.votedFor = null;
          if (this.heartbeatTimer) clearInterval(this.heartbeatTimer);
          return;
        }

        if (resp.success) {
          const lastIndex = prevLogIndex + entries.length;
          this.matchIndex.set(followerId, lastIndex);
          this.nextIndex.set(followerId, lastIndex + 1);
          this.updateCommitIndex();
        } else {
          const currentNext = this.nextIndex.get(followerId) || 1;
          this.nextIndex.set(followerId, Math.max(1, currentNext - 1));
        }
      }).catch(() => {
        // Follower unavailable, will retry
      });
    }
  }

  /**
   * Update commit index if majority has entry
   */
  private updateCommitIndex(): void {
    if (this.state !== NodeState.LEADER) return;

    const N = this.log.length;
    const majority = Math.floor((this.config.peerIds.length + 1) / 2) + 1; // peers plus self

    for (let i = N; i > this.commitIndex; i--) {
      if (this.log[i - 1].term !== this.currentTerm) continue;

      let count = 1; // Leader has it
      for (const matchIdx of this.matchIndex.values()) {
        if (matchIdx >= i) count++;
      }

      if (count >= majority) {
        this.commitIndex = i;
        this.applyCommittedEntries();
        break;
      }
    }
  }

  /**
   * Apply committed entries to state machine
   */
  private applyCommittedEntries(): void {
    while (this.lastApplied < this.commitIndex) {
      this.lastApplied++;
      const entry = this.log[this.lastApplied - 1];
      this.stateMachine.apply(entry);
    }
  }

  /**
   * Wait for an entry to be committed
   */
  private async waitForCommit(index: number): Promise<void> {
    return new Promise((resolve) => {
      const check = () => {
        if (this.commitIndex >= index) {
          resolve();
        } else {
          setTimeout(check, 50);
        }
      };
      check();
    });
  }

  /**
   * Reset election timeout with random value
   */
  private resetElectionTimeout(): void {
    if (this.electionTimer) clearTimeout(this.electionTimer);

    const timeout = this.randomTimeout();
    this.electionTimer = setTimeout(() => {
      if (this.state !== NodeState.LEADER) {
        this.startElection();
      }
    }, timeout);
  }

  private randomTimeout(): number {
    const min = this.config.electionTimeoutMin;
    const max = this.config.electionTimeoutMax;
    return Math.floor(Math.random() * (max - min + 1)) + min;
  }

  private getLastLogTerm(): number {
    if (this.log.length === 0) return 0;
    return this.log[this.log.length - 1].term;
  }

  // ========== Network Layer (simplified) ==========

  private async sendRequestVote(peerId: string, req: RequestVoteRequest): Promise<RequestVoteResponse> {
    const url = `http://${peerId}/raft/request-vote`;
    // Time out quickly so a hung peer cannot stall vote counting
    const response = await axios.post(url, req, { timeout: 1000 });
    return response.data;
  }

  private async sendAppendEntries(peerId: string, req: AppendEntriesRequest): Promise<AppendEntriesResponse> {
    const url = `http://${peerId}/raft/append-entries`;
    const response = await axios.post(url, req, { timeout: 1000 });
    return response.data;
  }

  // ========== Debug Methods ==========

  getState() {
    return {
      nodeId: this.config.nodeId,
      state: this.state,
      term: this.currentTerm,
      leaderId: this.leaderId,
      logLength: this.log.length,
      commitIndex: this.commitIndex,
      stateMachine: this.stateMachine.getAll()
    };
  }
}
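
The commit rule in updateCommitIndex is the heart of Raft's safety, so here is a standalone sketch with made-up numbers: a 3-node cluster whose leader has 5 log entries, while node2 has replicated up to index 4 and node3 up to index 2. The highest index present on a majority is 4 (leader plus node2), so that becomes the commit index. (The real method additionally requires the entry's term to equal currentTerm.)

// Standalone illustration of the leader's commit rule (hypothetical values)
const matchIndex = new Map<string, number>([['node2', 4], ['node3', 2]]);
const logLength = 5;
const majority = Math.floor((matchIndex.size + 1) / 2) + 1; // 2 of 3

let commitIndex = 0;
for (let i = logLength; i > commitIndex; i--) {
  let count = 1; // the leader always has entry i
  for (const m of matchIndex.values()) {
    if (m >= i) count++;
  }
  if (count >= majority) {
    commitIndex = i; // highest index replicated on a majority
    break;
  }
}
console.log(commitIndex); // 4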

server.ts

import express from 'express';
import { RaftNode } from './raft-node';
import { Command } from './types';

export function createServer(node: RaftNode, port: number): express.Application {
  const app = express();
  app.use(express.json());

  // Raft RPC endpoints
  app.post('/raft/request-vote', (req, res) => {
    const response = node.handleRequestVote(req.body);
    res.json(response);
  });

  app.post('/raft/append-entries', (req, res) => {
    const response = node.handleAppendEntries(req.body);
    res.json(response);
  });

  // Client API endpoints
  app.get('/kv/:key', (req, res) => {
    const command: Command = { type: 'GET', key: req.params.key };
    node.submitCommand(command)
      .then(value => res.json({ key: req.params.key, value }))
      .catch(err => res.status(500).json({ error: err.message }));
  });

  app.post('/kv', (req, res) => {
    const command: Command = { type: 'SET', key: req.body.key, value: req.body.value };
    node.submitCommand(command)
      .then(result => res.json(result))
      .catch(err => res.status(500).json({ error: err.message }));
  });

  app.delete('/kv/:key', (req, res) => {
    const command: Command = { type: 'DELETE', key: req.params.key };
    node.submitCommand(command)
      .then(result => res.json(result))
      .catch(err => res.status(500).json({ error: err.message }));
  });

  // Debug endpoint
  app.get('/debug', (req, res) => {
    res.json(node.getState());
  });

  return app;
}

index.ts

import { RaftNode } from './raft-node';
import { createServer } from './server';

const NODE_ID = process.env.NODE_ID || 'node1';
const PEER_IDS = process.env.PEER_IDS?.split(',') || [];
const PORT = parseInt(process.env.PORT || '3000');

const node = new RaftNode({
  nodeId: NODE_ID,
  peerIds: PEER_IDS,
  electionTimeoutMin: 150,
  electionTimeoutMax: 300,
  heartbeatInterval: 50
});

node.start();

const app = createServer(node, PORT);
app.listen(PORT, () => {
  console.log(`Node ${NODE_ID} listening on port ${PORT}`);
  console.log(`Peers: ${PEER_IDS.join(', ')}`);
});

docker-compose.yml

version: '3.8'

services:
  node1:
    build: .
    container_name: raft-node1
    environment:
      - NODE_ID=node1
      - PEER_IDS=node2:3000,node3:3000
      - PORT=3000
    ports:
      - "3001:3000"

  node2:
    build: .
    container_name: raft-node2
    environment:
      - NODE_ID=node2
      - PEER_IDS=node1:3000,node3:3000
      - PORT=3000
    ports:
      - "3002:3000"

  node3:
    build: .
    container_name: raft-node3
    environment:
      - NODE_ID=node3
      - PEER_IDS=node1:3000,node2:3000
      - PORT=3000
    ports:
      - "3003:3000"

Dockerfile

FROM node:20-alpine

WORKDIR /app

COPY package*.json ./
# Install all dependencies: dev dependencies are needed for the TypeScript build
RUN npm ci

COPY . .
RUN npm run build

EXPOSE 3000

CMD ["npm", "start"]

Complete Python Implementation

Project Structure

python-raft/
├── requirements.txt
├── src/
│   ├── types.py              # Shared types
│   ├── state_machine.py      # KV store state machine
│   ├── raft_node.py          # Complete Raft implementation
│   ├── server.py             # Flask API server
│   └── __init__.py
├── app.py                    # Entry point
└── docker-compose.yml

requirements.txt

flask==3.0.0
requests==2.31.0
gunicorn==21.2.0

types.py

from enum import Enum
from dataclasses import dataclass
from typing import List, Union

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    index: int
    term: int
    command: str

@dataclass
class RequestVoteRequest:
    term: int
    candidate_id: str
    last_log_index: int
    last_log_term: int

@dataclass
class RequestVoteResponse:
    term: int
    vote_granted: bool

@dataclass
class AppendEntriesRequest:
    term: int
    leader_id: str
    prev_log_index: int
    prev_log_term: int
    entries: List[LogEntry]
    leader_commit: int

@dataclass
class AppendEntriesResponse:
    term: int
    success: bool

@dataclass
class SetCommand:
    type: str = 'SET'
    key: str = ''
    value: str = ''

@dataclass
class GetCommand:
    type: str = 'GET'
    key: str = ''

@dataclass
class DeleteCommand:
    type: str = 'DELETE'
    key: str = ''

Command = Union[SetCommand, GetCommand, DeleteCommand]

state_machine.py

from typing import Dict, Optional
import json
from .types import LogEntry

class KVStoreStateMachine:
    """Key-Value Store State Machine"""

    def __init__(self):
        self.data: Dict[str, str] = {}

    def apply(self, entry: LogEntry) -> None:
        """Apply a committed log entry to the state machine"""
        try:
            command = json.loads(entry.command)

            if command['type'] == 'SET':
                self.data[command['key']] = command['value']
                print(f"[State Machine] SET {command['key']} = {command['value']}")

            elif command['type'] == 'DELETE':
                if command['key'] in self.data:
                    del self.data[command['key']]
                    print(f"[State Machine] DELETE {command['key']}")

            elif command['type'] == 'GET':
                # Read-only, no state change
                pass

        except Exception as e:
            print(f"[State Machine] Failed to apply entry: {e}")

    def get(self, key: str) -> Optional[str]:
        """Get a value from the state machine"""
        return self.data.get(key)

    def get_all(self) -> Dict[str, str]:
        """Get all key-value pairs"""
        return dict(self.data)

    def clear(self) -> None:
        """Clear the state machine (for testing)"""
        self.data.clear()

raft_node.py

import asyncio
import random
import json
import time
import threading
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
from .types import (
    NodeState, LogEntry, RequestVoteRequest, RequestVoteResponse,
    AppendEntriesRequest, AppendEntriesResponse, Command
)
from .state_machine import KVStoreStateMachine
import requests

@dataclass
class ClusterConfig:
    node_id: str
    peer_ids: List[str]
    election_timeout_min: int = 150
    election_timeout_max: int = 300
    heartbeat_interval: int = 50

class RaftNode:
    def __init__(self, config: ClusterConfig):
        self.config = config
        self.state_machine = KVStoreStateMachine()

        # Persistent state (must survive restarts; kept in memory here for simplicity)
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: List[LogEntry] = []

        # Volatile state
        self.commit_index = 0
        self.last_applied = 0
        self.state = NodeState.FOLLOWER
        self.leader_id: Optional[str] = None

        # Leader state
        self.next_index: Dict[str, int] = {}
        self.match_index: Dict[str, int] = {}

        # Timers (the election loop handle is set in start())
        self.election_task = None
        self.heartbeat_task = None
        self.last_heartbeat = time.monotonic()

    # ========== Public API ==========

    async def submit_command(self, command: Command) -> Any:
        """Client: Submit a command to the cluster"""

        # Redirect to leader if not leader
        if self.state != NodeState.LEADER:
            if self.leader_id:
                raise Exception(f"Not a leader. Please redirect to {self.leader_id}")
            raise Exception("No leader known. Please retry.")

        # Handle GET commands (read-only)
        if command.type == 'GET':
            return self.state_machine.get(command.key)

        # Append to local log
        entry = LogEntry(
            index=len(self.log) + 1,
            term=self.current_term,
            command=json.dumps(command.__dict__)
        )
        self.log.append(entry)

        # Replicate to followers
        await self.replicate_log()

        # Wait for commit
        await self._wait_for_commit(entry.index)

        # Return result
        if command.type == 'SET':
            return {"key": command.key, "value": command.value}
        elif command.type == 'DELETE':
            return {"key": command.key, "deleted": True}

    def start(self):
        """Start the node's election loop.

        asyncio.create_task() requires a running event loop and Flask does
        not provide one, so run a dedicated loop in a daemon thread.
        """
        loop = asyncio.new_event_loop()
        threading.Thread(target=loop.run_forever, daemon=True).start()
        self.election_task = asyncio.run_coroutine_threadsafe(self._election_loop(), loop)

    def stop(self):
        """Stop the node"""
        if self.election_task:
            self.election_task.cancel()
        if self.heartbeat_task:
            self.heartbeat_task.cancel()

    # ========== RPC Handlers ==========

    def handle_request_vote(self, req: RequestVoteRequest) -> RequestVoteResponse:
        """Handle RequestVote RPC"""

        if req.term < self.current_term:
            return RequestVoteResponse(term=self.current_term, vote_granted=False)

        if req.term > self.current_term:
            self.current_term = req.term
            self.state = NodeState.FOLLOWER
            self.voted_for = None

        log_ok = (req.last_log_term > self._get_last_log_term() or
                  (req.last_log_term == self._get_last_log_term() and
                   req.last_log_index >= len(self.log)))

        can_vote = self.voted_for is None or self.voted_for == req.candidate_id

        if can_vote and log_ok:
            self.voted_for = req.candidate_id
            # Granting a vote also resets the election timeout
            self.last_heartbeat = time.monotonic()
            return RequestVoteResponse(term=self.current_term, vote_granted=True)

        return RequestVoteResponse(term=self.current_term, vote_granted=False)

    def handle_append_entries(self, req: AppendEntriesRequest) -> AppendEntriesResponse:
        """Handle AppendEntries RPC"""

        if req.term < self.current_term:
            return AppendEntriesResponse(term=self.current_term, success=False)

        # Recognize leader and reset the election timeout
        self.leader_id = req.leader_id
        self.last_heartbeat = time.monotonic()

        # A valid AppendEntries means a leader exists for this term:
        # update the term if needed and step down (a candidate in the
        # same term must also return to follower)
        if req.term > self.current_term:
            self.current_term = req.term
            self.voted_for = None
        self.state = NodeState.FOLLOWER

        # Check log consistency
        if req.prev_log_index > 0:
            if len(self.log) < req.prev_log_index:
                return AppendEntriesResponse(term=self.current_term, success=False)

            prev_entry = self.log[req.prev_log_index - 1]
            if prev_entry.term != req.prev_log_term:
                return AppendEntriesResponse(term=self.current_term, success=False)

        # Append new entries
        if req.entries:
            insert_index = req.prev_log_index
            for entry in req.entries:
                if insert_index < len(self.log):
                    existing = self.log[insert_index]
                    if existing.index == entry.index and existing.term == entry.term:
                        insert_index += 1
                        continue
                    self.log = self.log[:insert_index]
                self.log.append(entry)
                insert_index += 1

        # Update commit index
        if req.leader_commit > self.commit_index:
            self.commit_index = min(req.leader_commit, len(self.log))
            self._apply_committed_entries()

        return AppendEntriesResponse(term=self.current_term, success=True)

    # ========== Private Methods ==========

    async def _election_loop(self):
        """Election timeout loop: only start an election if no heartbeat
        (or granted vote) arrived within the randomized timeout"""
        while True:
            timeout = self._random_timeout()
            await asyncio.sleep(timeout / 1000)

            elapsed_ms = (time.monotonic() - self.last_heartbeat) * 1000
            if self.state != NodeState.LEADER and elapsed_ms >= timeout:
                await self._start_election()

    async def _start_election(self):
        """Start election (convert to candidate)"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.config.node_id
        self.leader_id = None

        print(f"[Node {self.config.node_id}] Starting election for term {self.current_term}")

        req = RequestVoteRequest(
            term=self.current_term,
            candidate_id=self.config.node_id,
            last_log_index=len(self.log),
            last_log_term=self._get_last_log_term()
        )

        votes_received = 1  # Vote for self
        majority = (len(self.config.peer_ids) + 1) // 2 + 1  # majority of peers plus self

        tasks = []
        for peer_id in self.config.peer_ids:
            tasks.append(self._send_request_vote(peer_id, req))

        results = await asyncio.gather(*tasks, return_exceptions=True)

        for result in results:
            if isinstance(result, RequestVoteResponse):
                if result.vote_granted:
                    votes_received += 1
                    if votes_received >= majority and self.state == NodeState.CANDIDATE:
                        self._become_leader()
                elif result.term > self.current_term:
                    self.current_term = result.term
                    self.state = NodeState.FOLLOWER
                    self.voted_for = None

    def _become_leader(self):
        """Become leader after winning election"""
        self.state = NodeState.LEADER
        self.leader_id = self.config.node_id
        print(f"[Node {self.config.node_id}] Became leader for term {self.current_term}")

        # Initialize leader state
        for peer_id in self.config.peer_ids:
            self.next_index[peer_id] = len(self.log) + 1
            self.match_index[peer_id] = 0

        # Start heartbeats
        asyncio.create_task(self._heartbeat_loop())

    async def _heartbeat_loop(self):
        """Send heartbeats to followers"""
        while self.state == NodeState.LEADER:
            await self.replicate_log()
            await asyncio.sleep(self.config.heartbeat_interval / 1000)

    async def replicate_log(self):
        """Replicate log to followers (also serves as the heartbeat)"""
        if self.state != NodeState.LEADER:
            return

        tasks = []
        sent = []  # (follower_id, prev_log_index, entry_count) for each request
        for follower_id in self.config.peer_ids:
            next_idx = self.next_index.get(follower_id, 1)
            prev_log_index = next_idx - 1
            prev_log_term = self.log[prev_log_index - 1].term if prev_log_index > 0 else 0
            entries = self.log[next_idx - 1:]

            req = AppendEntriesRequest(
                term=self.current_term,
                leader_id=self.config.node_id,
                prev_log_index=prev_log_index,
                prev_log_term=prev_log_term,
                entries=entries,
                leader_commit=self.commit_index
            )

            sent.append((follower_id, prev_log_index, len(entries)))
            tasks.append(self._send_append_entries(follower_id, req))

        results = await asyncio.gather(*tasks, return_exceptions=True)

        for (follower_id, prev_log_index, entry_count), result in zip(sent, results):
            if isinstance(result, AppendEntriesResponse):
                if result.term > self.current_term:
                    self.current_term = result.term
                    self.state = NodeState.FOLLOWER
                    self.voted_for = None
                    return

                if result.success:
                    # Advance to the last index actually sent to this follower,
                    # not the leader's current log length (the log may have grown)
                    last_index = prev_log_index + entry_count
                    self.match_index[follower_id] = last_index
                    self.next_index[follower_id] = last_index + 1
                    await self._update_commit_index()
                else:
                    # Log inconsistency: back up and retry on the next heartbeat
                    current_next = self.next_index.get(follower_id, 1)
                    self.next_index[follower_id] = max(1, current_next - 1)

    async def _update_commit_index(self):
        """Update commit index if majority has entry"""
        if self.state != NodeState.LEADER:
            return

        N = len(self.log)
        majority = (len(self.config.peer_ids) + 1) // 2 + 1  # majority of peers plus self

        for i in range(N, self.commit_index, -1):
            if self.log[i - 1].term != self.current_term:
                continue

            count = 1  # Leader has it
            for match_idx in self.match_index.values():
                if match_idx >= i:
                    count += 1

            if count >= majority:
                self.commit_index = i
                self._apply_committed_entries()
                break

    def _apply_committed_entries(self):
        """Apply committed entries to state machine"""
        while self.last_applied < self.commit_index:
            self.last_applied += 1
            entry = self.log[self.last_applied - 1]
            self.state_machine.apply(entry)

    async def _wait_for_commit(self, index: int):
        """Wait for an entry to be committed"""
        while self.commit_index < index:
            await asyncio.sleep(0.05)

    def _random_timeout(self) -> int:
        """Generate random election timeout"""
        return random.randint(
            self.config.election_timeout_min,
            self.config.election_timeout_max
        )

    def _get_last_log_term(self) -> int:
        """Get the term of the last log entry"""
        if not self.log:
            return 0
        return self.log[-1].term

    # ========== Network Layer ==========

    async def _send_request_vote(self, peer_id: str, req: RequestVoteRequest) -> RequestVoteResponse:
        """Send RequestVote RPC to peer.

        Note: requests is blocking; this is a course simplification (an
        async HTTP client such as aiohttp would be the production choice).
        Network errors propagate and are collected by asyncio.gather.
        """
        url = f"http://{peer_id}/raft/request-vote"
        response = requests.post(url, json=req.__dict__, timeout=1)
        return RequestVoteResponse(**response.json())

    async def _send_append_entries(self, peer_id: str, req: AppendEntriesRequest) -> AppendEntriesResponse:
        """Send AppendEntries RPC to peer (same blocking simplification)"""
        url = f"http://{peer_id}/raft/append-entries"
        data = {
            'term': req.term,
            'leaderId': req.leader_id,
            'prevLogIndex': req.prev_log_index,
            'prevLogTerm': req.prev_log_term,
            'entries': [e.__dict__ for e in req.entries],
            'leaderCommit': req.leader_commit
        }
        response = requests.post(url, json=data, timeout=1)
        return AppendEntriesResponse(**response.json())

    # ========== Debug Methods ==========

    def get_state(self) -> dict:
        """Get node state for debugging"""
        return {
            'nodeId': self.config.node_id,
            'state': self.state.value,
            'term': self.current_term,
            'leaderId': self.leader_id,
            'logLength': len(self.log),
            'commitIndex': self.commit_index,
            'stateMachine': self.state_machine.get_all()
        }

server.py

import asyncio
from flask import Flask, request, jsonify
from .raft_node import RaftNode
from .types import (
    RequestVoteRequest, AppendEntriesRequest, LogEntry,
    GetCommand, SetCommand, DeleteCommand
)

def create_server(node: RaftNode):
    app = Flask(__name__)

    # Raft RPC endpoints
    @app.route('/raft/request-vote', methods=['POST'])
    def request_vote():
        response = node.handle_request_vote(
            RequestVoteRequest(**request.json)
        )
        return jsonify(response.__dict__)

    @app.route('/raft/append-entries', methods=['POST'])
    def append_entries():
        # Convert request to proper format
        data = request.json
        entries = [LogEntry(**e) for e in data.get('entries', [])]
        req = AppendEntriesRequest(
            term=data['term'],
            leader_id=data['leaderId'],
            prev_log_index=data['prevLogIndex'],
            prev_log_term=data['prevLogTerm'],
            entries=entries,
            leader_commit=data['leaderCommit']
        )
        response = node.handle_append_entries(req)
        return jsonify(response.__dict__)

    # Client API endpoints
    @app.route('/kv/<key>', methods=['GET'])
    def get_key(key):
        command = GetCommand(key=key)
        try:
            value = asyncio.run(node.submit_command(command))
            return jsonify({'key': key, 'value': value})
        except Exception as e:
            return jsonify({'error': str(e)}), 500

    @app.route('/kv', methods=['POST'])
    def set_key():
        command = SetCommand(key=request.json['key'], value=request.json['value'])
        try:
            result = asyncio.run(node.submit_command(command))
            return jsonify(result)
        except Exception as e:
            return jsonify({'error': str(e)}), 500

    @app.route('/kv/<key>', methods=['DELETE'])
    def delete_key(key):
        command = DeleteCommand(key=key)
        try:
            result = asyncio.run(node.submit_command(command))
            return jsonify(result)
        except Exception as e:
            return jsonify({'error': str(e)}), 500

    # Debug endpoint
    @app.route('/debug', methods=['GET'])
    def debug():
        return jsonify(node.get_state())

    return app

app.py

import os
from src.raft_node import RaftNode, ClusterConfig
from src.server import create_server

NODE_ID = os.getenv('NODE_ID', 'node1')
PEER_IDS = os.getenv('PEER_IDS', '').split(',') if os.getenv('PEER_IDS') else []
PORT = int(os.getenv('PORT', '5000'))

config = ClusterConfig(
    node_id=NODE_ID,
    peer_ids=PEER_IDS
)

node = RaftNode(config)
node.start()

app = create_server(node)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=PORT)

docker-compose.yml (Python)

version: '3.8'

services:
  node1:
    build: .
    container_name: python-raft-node1
    environment:
      - NODE_ID=node1
      - PEER_IDS=node2:5000,node3:5000
      - PORT=5000
    ports:
      - "5001:5000"

  node2:
    build: .
    container_name: python-raft-node2
    environment:
      - NODE_ID=node2
      - PEER_IDS=node1:5000,node3:5000
      - PORT=5000
    ports:
      - "5002:5000"

  node3:
    build: .
    container_name: python-raft-node3
    environment:
      - NODE_ID=node3
      - PEER_IDS=node1:5000,node2:5000
      - PORT=5000
    ports:
      - "5003:5000"

Dockerfile (Python)

FROM python:3.11-alpine

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 5000

CMD ["gunicorn", "-b", "0.0.0.0:5000", "app:app"]

Running the System

TypeScript

# Build
npm run build

# Run with Docker Compose
docker-compose up

# Test the cluster
curl -X POST http://localhost:3001/kv -H "Content-Type: application/json" -d '{"key":"foo","value":"bar"}'
curl http://localhost:3001/kv/foo
curl http://localhost:3002/debug  # Check node state
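
Writes must go to the leader. A follower rejects them with an error naming the current leader (see submitCommand above), so retry against the port of the node named in the error. For example (exact output varies by which node won the election):

curl -X POST http://localhost:3002/kv -H "Content-Type: application/json" -d '{"key":"x","value":"1"}'
# => {"error":"Not the leader. Please redirect to node1:3000"}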

Python

# Run with Docker Compose
docker-compose up

# Test the cluster
curl -X POST http://localhost:5001/kv -H "Content-Type: application/json" -d '{"key":"foo","value":"bar"}'
curl http://localhost:5001/kv/foo
curl http://localhost:5002/debug  # Check node state

Exercises

Exercise 1: Basic Operations

  1. Start the 3-node cluster
  2. Wait for leader election
  3. SET key=value on the leader
  4. GET the key from all nodes
  5. Verify all nodes return the same value

Expected Result: All nodes return the committed value.

Exercise 2: Leader Failover

  1. Start the cluster and write some data
  2. Kill the leader container
  3. Observe a new leader being elected
  4. Continue writing data
  5. Restart the old leader
  6. Verify it catches up

Expected Result: System continues operating with new leader, old leader rejoins as follower.
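
One possible walkthrough, using the container names from the TypeScript compose file above (which node is the leader varies per run, so check /debug first):

# 1. Find the current leader
curl http://localhost:3001/debug

# 2. Suppose node1 is the leader: kill it
docker stop raft-node1

# 3. After an election timeout a new leader emerges; write through it
curl -X POST http://localhost:3002/kv -H "Content-Type: application/json" -d '{"key":"k","value":"v"}'

# 4. Restart the old leader; it rejoins as a follower and catches up
docker start raft-node1
curl http://localhost:3001/debug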

Exercise 3: Network Partition

  1. Start a 5-node cluster
  2. Isolate 2 nodes (simulate partition)
  3. Verify majority (3 nodes) can still commit
  4. Heal the partition
  5. Verify isolated nodes catch up

Expected Result: Majority side continues, minority cannot commit, rejoin works.
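
One way to simulate the partition with Docker (a sketch: it assumes you have extended docker-compose.yml with node4 and node5 following the same pattern, and that the default compose network is named raft_default; the actual name depends on your project directory, so check docker network ls):

# Cut two nodes off from the cluster network
docker network disconnect raft_default raft-node4
docker network disconnect raft_default raft-node5

# The 3-node majority can still commit
curl -X POST http://localhost:3001/kv -H "Content-Type: application/json" -d '{"key":"p","value":"1"}'

# Heal the partition
docker network connect raft_default raft-node4
docker network connect raft_default raft-node5

# The rejoined nodes catch up via AppendEntries
curl http://localhost:3004/debug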

Exercise 4: Persistence Test

  1. Write data to the cluster
  2. Stop all nodes
  3. Restart all nodes
  4. Verify data is recovered

Expected Result: All data survives restart. Note: the sample implementation keeps currentTerm, votedFor, and the log in memory only, so you must add persistence first (see the sketch below).
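
A minimal sketch of what that persistence could look like in the TypeScript version (hypothetical helpers, not part of the sample code): serialize currentTerm, votedFor, and the log to a JSON file whenever they change, and reload the file in the constructor. A production implementation must also fsync before answering RPCs.

import * as fs from 'fs';

// Hypothetical shape of Raft's durable state
interface PersistentState {
  currentTerm: number;
  votedFor: string | null;
  log: { index: number; term: number; command: string }[];
}

function saveState(path: string, state: PersistentState): void {
  // Write-then-rename keeps the previous file intact if the write fails
  fs.writeFileSync(`${path}.tmp`, JSON.stringify(state));
  fs.renameSync(`${path}.tmp`, path);
}

function loadState(path: string): PersistentState | null {
  if (!fs.existsSync(path)) return null;
  return JSON.parse(fs.readFileSync(path, 'utf-8'));
}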


Common Pitfalls

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Reading from followers | Stale reads | Always read from the leader or implement lease reads |
| No heartbeats | Unnecessary elections | Ensure the heartbeat timer runs continuously |
| Client timeout | Failed writes | Wait for commit; don't return immediately |
| Split brain | Multiple leaders | Randomized timeouts plus voting rules prevent this |

Key Takeaways

  1. Complete Raft combines leader election + log replication for consensus
  2. State machine applies committed commands deterministically
  3. Client API provides transparent access to the distributed system
  4. Failover is automatic - new leader elected when old one fails
  5. Safety guarantees ensure no conflicting commits

Congratulations! You've completed the Consensus System. You now understand one of the hardest concepts in distributed systems!

Next: Reference Materials →


Docker Setup

This guide covers installing Docker and Docker Compose for running the course examples.

Installing Docker

Linux

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER

macOS

Download Docker Desktop from docker.com

Windows

Download Docker Desktop from docker.com

Verify Installation

docker --version
docker-compose --version

Running Course Examples

Each chapter includes a Docker Compose file:

cd examples/01-queue
docker-compose up

Common Commands

# Start services
docker-compose up

# Start in background
docker-compose up -d

# View logs
docker-compose logs

# Stop services
docker-compose down

# Rebuild after code changes
docker-compose up --build

Troubleshooting

See Troubleshooting for common issues.

Troubleshooting

Common issues and solutions when working with the course examples.

Docker Issues

Port Already in Use

Error: bind: address already in use

Solution: Change the port in docker-compose.yml or stop the conflicting service.

Permission Denied

Error: permission denied while trying to connect to the Docker daemon

Solution: Add your user to the docker group:

sudo usermod -aG docker $USER
newgrp docker

Build Issues

TypeScript: Module Not Found

Solution: Install dependencies:

npm install

Python: Module Not Found

Solution: Install dependencies:

pip install -r requirements.txt

Runtime Issues

Connection Refused

Solution: Check that all services are running:

docker-compose ps

Node Can't Connect to Peers

Solution: Verify network configuration in docker-compose.yml. Ensure all nodes are on the same network.

Getting Help

If you encounter issues not covered here:

  1. Check the Docker logs: docker-compose logs
  2. Verify your Docker installation: docker --version
  3. See Further Reading for additional resources

Further Reading

Resources for deepening your understanding of distributed systems.

Books

| Title | Author | Focus |
|-------|--------|-------|
| Designing Data-Intensive Applications | Martin Kleppmann | Modern database and distributed system design |
| Distributed Systems: Principles and Paradigms | Tanenbaum & van Steen | Academic foundations |
| Introduction to Reliable Distributed Programming | Cachin, Guerraoui, Rodrigues | Formal foundations |

Papers

Foundational

  • Brewer, E. A. (2000). "Towards robust distributed systems"
  • Gilbert, S. & Lynch, N. (2002). "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services"
  • Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). "Impossibility of distributed consensus with one faulty process"

Consensus

  • Ongaro, D. & Ousterhout, J. (2014). "In Search of an Understandable Consensus Algorithm (Raft)"
  • Lamport, L. (2001). "Paxos Made Simple"

Online Resources

Video Lectures

  • MIT 6.824: Distributed Systems
  • Stanford CS244B: Distributed Systems

Practice

  • Build your own distributed system from scratch
  • Contribute to open-source distributed databases
  • Participate in distributed systems hackathons