Distributed Systems Course
Welcome to the Distributed Systems Course! This course will take you from foundational concepts to building a working consensus-based system.
Why Learn Distributed Systems?
Distributed systems are everywhere. Every time you use a modern web service, you're interacting with a distributed system:
- Social media platforms handling billions of users
- E-commerce sites processing millions of transactions
- Streaming services delivering content globally
- Cloud databases storing and replicating data across continents
Understanding distributed systems is essential for building scalable, reliable applications.
Course Overview
This course teaches distributed systems concepts through hands-on implementation. Over 10 sessions, you will build four progressively complex distributed applications:
| Application | Sessions | Concepts |
|---|---|---|
| Queue/Work System | 1-2 | Producer-consumer, message passing, fault tolerance |
| Store with Replication | 3-5 | Partitioning, CAP theorem, leader election, consistency |
| Chat System | 6-7 | WebSockets, pub/sub, message ordering |
| Consensus System | 8-10 | Raft algorithm, log replication, state machine |
What You'll Learn
By the end of this course, you will be able to:
- Explain distributed systems concepts including CAP theorem, consistency models, and consensus
- Build a working message queue system with producer-consumer pattern
- Implement a replicated key-value store with leader election
- Create a real-time chat system with pub/sub messaging
- Develop a consensus-based system using the Raft algorithm
- Deploy all systems using Docker Compose on your local machine
Target Audience
This course is designed for developers who:
- Have basic programming experience (functions, classes, basic OOP)
- Are new to distributed systems
- Want to understand how modern distributed applications work
- Prefer learning by doing over pure theory
Prerequisites
- Programming: Comfortable with either TypeScript or Python
- Command Line: Basic familiarity with terminal commands
- Docker: We'll cover Docker setup in the Docker Setup section
No prior distributed systems experience is required!
Course Progression
graph TB
subgraph "Part I: Fundamentals"
A1[What is a DS?] --> A2[Message Passing]
A2 --> A3[Queue System]
end
subgraph "Part II: Data Store"
B1[Partitioning] --> B2[CAP Theorem]
B2 --> B3[Replication]
B3 --> B4[Consistency]
end
subgraph "Part III: Real-Time"
C1[WebSockets] --> C2[Pub/Sub]
C2 --> C3[Chat System]
end
subgraph "Part IV: Consensus"
D1[What is Consensus?] --> D2[Raft Algorithm]
D2 --> D3[Leader Election]
D3 --> D4[Log Replication]
D4 --> D5[Consensus System]
end
A3 --> B1
B4 --> C1
C3 --> D1
Course Format
Each 1.5-hour session follows this structure:
graph LR
A[Review<br/>5 min] --> B[Concept<br/>20 min]
B --> C[Diagram<br/>10 min]
C --> D[Demo<br/>15 min]
D --> E[Exercise<br/>25 min]
E --> F[Test<br/>10 min]
F --> G[Summary<br/>5 min]
Session Components
- Concept Explanation: Clear, beginner-friendly explanations of core concepts
- Visual Diagrams: Mermaid diagrams showing architecture and data flow
- Live Demo: Step-by-step code walkthrough
- Hands-on Exercise: Practical exercises to reinforce learning
- Run & Test: Verify your implementation works correctly
Code Examples
Every concept includes implementations in both TypeScript and Python:
// TypeScript example
interface Message {
id: string;
content: string;
}
# Python example
@dataclass
class Message:
id: str
content: str
Choose the language you're most comfortable with, or learn both!
Before You Begin
1. Set Up Your Environment
Follow the Docker Setup Guide to install:
- Docker and Docker Compose
- Your preferred programming language (TypeScript or Python)
2. Verify Your Installation
docker --version
docker-compose --version
3. Choose Your Language
Decide whether you'll work with TypeScript or Python throughout the course. Both languages have complete examples for every concept.
Learning Tips
- Don't rush: Each concept builds on the previous ones
- Run the code: Follow along with the examples in your terminal
- Experiment: Modify the code and see what happens
- Ask questions: Use the troubleshooting guide when stuck
- Build in public: Share your progress and learn from others
What You'll Build
By the end of this course, you'll have four working distributed systems:
- Queue System - A fault-tolerant task processing system
- Replicated Store - A key-value store with leader election
- Chat System - A real-time messaging system with presence
- Consensus System - A Raft-based distributed database
All systems run locally using Docker Compose—no cloud infrastructure required!
Let's Get Started!
Ready to dive in? Continue to Chapter 1: What is a Distributed System?
What is a Distributed System?
Session 1, Part 1 - 20 minutes
Learning Objectives
- Define what a distributed system is
- Identify key characteristics of distributed systems
- Understand why distributed systems matter
- Recognize distributed systems in everyday life
Definition
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
graph TB
subgraph "Users See"
Single["Single System"]
end
subgraph "Reality"
N1["Node 1"]
N2["Node 2"]
N3["Node 3"]
N4["Node N"]
N1 <--> N2
N2 <--> N3
N3 <--> N4
N4 <--> N1
end
Single -->|"appears as"| N1
Single -->|"appears as"| N2
Single -->|"appears as"| N3
Key Insight
The defining characteristic is the illusion of unity—users interact with what seems like one system, while behind the scenes, multiple machines work together.
Three Key Characteristics
According to Leslie Lamport, a distributed system is:
"One in which the failure of a computer you didn't even know existed can render your own computer unusable."
Lamport's quip is really about independent failure; more broadly, distributed systems share three fundamental characteristics:
1. Concurrency (Multiple Things Happen At Once)
Multiple components execute simultaneously, leading to complex interactions.
sequenceDiagram
participant U as User
participant A as Server A
participant B as Server B
participant C as Server C
U->>A: Request
A->>B: Query
A->>C: Update
B-->>A: Response
C-->>A: Ack
A-->>U: Result
2. No Global Clock
Each node has its own clock. There's no single "now" across the system.
graph LR
A[Clock A: 10:00:01.123]
B[Clock B: 10:00:02.456]
C[Clock C: 09:59:59.789]
A -.->|network latency| B
B -.->|network latency| C
C -.->|network latency| A
Implication: You can't rely on timestamps to order events across nodes. You need logical clocks (more on this in later sessions!).
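As a preview, here is a minimal Lamport logical clock sketch in TypeScript (the `LamportClock` class and its method names are our own illustration, not course code): each node counts its own events and, on receiving a message, jumps past the sender's counter, so causally related events get increasing timestamps without any shared wall clock.

```typescript
// Minimal Lamport logical clock (illustrative sketch).
class LamportClock {
  private time = 0;

  // Local event or send: advance our own counter.
  tick(): number {
    return ++this.time;
  }

  // Receive: jump past the sender's timestamp, then count this event.
  receive(senderTime: number): number {
    this.time = Math.max(this.time, senderTime) + 1;
    return this.time;
  }
}

// Usage: B's receive is ordered after A's send, whatever the wall clocks say.
const nodeA = new LamportClock();
const nodeB = new LamportClock();
const sent = nodeA.tick();         // A stamps an outgoing message
console.log(nodeB.receive(sent));  // B's clock is now greater than `sent`
```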
3. Independent Failure
Components can fail independently. When one part fails, the rest may continue—or may become unusable.
stateDiagram-v2
[*] --> AllHealthy: System Start
AllHealthy --> PartialFailure: One Node Fails
AllHealthy --> CompleteFailure: Critical Nodes Fail
PartialFailure --> AllHealthy: Recovery
PartialFailure --> CompleteFailure: Cascading Failure
CompleteFailure --> [*]
Why Distributed Systems?
Scalability
Vertical Scaling (Scale Up):
- Add more resources to a single machine
- Eventually hits hardware/cost limits
Horizontal Scaling (Scale Out):
- Add more machines to the system
- Virtually unlimited scaling potential
graph TB
subgraph "Vertical Scaling"
Big[Big Expensive Server<br/>$100,000]
end
subgraph "Horizontal Scaling"
S1[Commodity Server<br/>$1,000]
S2[Commodity Server<br/>$1,000]
S3[Commodity Server<br/>$1,000]
S4[...]
end
Big <--> S1
Big <--> S2
Big <--> S3
Reliability & Availability
A single point of failure is unacceptable for critical services:
graph TB
subgraph "Single System"
S[Single Server]
S -.-> X[❌ Failure = No Service]
end
subgraph "Distributed System"
N1[Node 1]
N2[Node 2]
N3[Node 3]
N1 <--> N2
N2 <--> N3
N3 <--> N1
N1 -.-> X2[❌ One Fails]
X2 --> OK[✓ Others Continue]
end
Latency (Geographic Distribution)
Placing data closer to users improves experience:
graph TB
User[User in NYC]
subgraph "Global Distribution"
NYC[NYC Datacenter<br/>10ms latency]
LON[London Datacenter<br/>70ms latency]
TKY[Tokyo Datacenter<br/>150ms latency]
end
User --> NYC
User -.-> LON
User -.-> TKY
NYC <--> LON
LON <--> TKY
TKY <--> NYC
Examples of Distributed Systems
Everyday Examples
| System | Description | Benefit |
|---|---|---|
| Web Search | Query servers, index servers, cache servers | Fast responses, always available |
| Streaming Video | Content delivery networks (CDNs) | Low latency, high quality |
| Online Shopping | Product catalog, cart, payment, inventory | Handles traffic spikes |
| Social Media | Posts, comments, likes, notifications | Real-time updates |
Technical Examples
Database Replication:
graph LR
W[Write to Primary] --> P[(Primary DB)]
P --> R1[(Replica 1)]
P --> R2[(Replica 2)]
P --> R3[(Replica 3)]
R1 --> Read1[Read from Replica]
R2 --> Read2[Read from Replica]
R3 --> Read3[Read from Replica]
Load Balancing:
graph TB
Users[Users]
LB[Load Balancer]
Users --> LB
LB --> S1[Server 1]
LB --> S2[Server 2]
LB --> S3[Server 3]
LB --> S4[Server N]
Trade-offs
Distributed systems introduce complexity:
| Challenge | Description |
|---|---|
| Network Issues | Unreliable, variable latency, partitions |
| Concurrency | Race conditions, deadlocks, coordination |
| Partial Failures | Some components work, others don't |
| Consistency | Keeping data in sync across nodes |
The Fundamental Dilemma:
"Is the benefits of distribution worth the added complexity?"
For most modern applications, the answer is yes—which is why we're learning this!
Summary
Key Takeaways
- Distributed systems = multiple computers acting as one
- Three characteristics: concurrency, no global clock, independent failure
- Benefits: scalability, reliability, lower latency
- Costs: complexity, network issues, consistency challenges
Check Your Understanding
- Can you explain why there's no global clock in a distributed system?
- Give an example of a distributed system you use daily
- Why does independent failure make distributed systems harder to build?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What's Next
Now that we understand what distributed systems are, let's explore how they communicate: Message Passing
Message Passing
Session 1, Part 2 - 25 minutes
Learning Objectives
- Understand message passing as a fundamental pattern in distributed systems
- Distinguish between synchronous and asynchronous messaging
- Learn different message delivery guarantees
- Implement basic message passing in TypeScript and Python
What is Message Passing?
In distributed systems, message passing is how nodes communicate. Instead of shared memory or direct function calls, components send messages to each other over the network.
graph LR
A[Node A]
B[Node B]
M[Message]
A -->|send| M
M -->|network| B
B -->|process| M
Key Insight
"In distributed systems, communication is not a function call—it's a request sent over an unreliable network."
This simple fact has profound implications for everything we build.
Synchronous vs Asynchronous
Synchronous Messaging (Request-Response)
The sender waits for a response before continuing.
sequenceDiagram
participant C as Client
participant S as Server
C->>S: Request
Note over C: Waiting...
S-->>C: Response
Note over C: Continue
Characteristics:
- Simple to understand and implement
- Caller is blocked during the call
- Easier error handling (immediate feedback)
- Can lead to poor performance and cascading failures
Asynchronous Messaging (Fire-and-Forget)
The sender continues without waiting for a response.
sequenceDiagram
participant P as Producer
participant Q as Queue
participant W as Worker
P->>Q: Send Message
Note over P: Continue immediately
Q->>W: Process Later
Note over W: Working...
W-->>P: Result (optional)
Characteristics:
- Non-blocking, better throughput
- More complex error handling
- Requires correlation IDs to track requests
- Enables loose coupling between components
Message Delivery Guarantees
Three Delivery Semantics
graph TB
subgraph "At Most Once"
A1[Send] --> A2[May be lost]
A2 --> A3[Never duplicated]
end
subgraph "At Least Once"
B1[Send] --> B2[Retries until ack]
B2 --> B3[May be duplicated]
end
subgraph "Exactly Once"
C1[Send] --> C2[Deduplication]
C2 --> C3[Perfect delivery]
end
Comparison
| Guarantee | Description | Cost | Use Case |
|---|---|---|---|
| At Most Once | Message may be lost, never duplicated | Lowest | Logging, metrics, non-critical data |
| At Least Once | Message guaranteed to arrive, may duplicate | Medium | Notifications, job queues |
| Exactly Once | Perfect delivery, no duplicates | Highest | Financial transactions, payments |
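In practice, "exactly once" is usually built as at-least-once delivery plus deduplication on the consumer. A minimal TypeScript sketch (the message shape and in-memory `seen` set are our assumptions; a real system would persist and bound the set):

```typescript
// Sketch: at-least-once delivery + consumer-side dedup by message ID.
interface Msg {
  id: string;   // assumed: a unique, stable ID attached by the producer
  body: string;
}

const seen = new Set<string>();

function handle(msg: Msg): void {
  if (seen.has(msg.id)) return; // duplicate from a retry: ignore it
  seen.add(msg.id);             // must be durable and bounded in real systems
  console.log('processing', msg.body);
}

// A redelivered message is processed only once.
handle({ id: 'm1', body: 'charge card' });
handle({ id: 'm1', body: 'charge card' }); // no-op
```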
The Two Generals Problem
A classic proof that perfect communication is impossible in unreliable networks:
graph LR
A[General A<br/>City 1]
B[General B<br/>City 2]
A -->|"Attack at 8pm?"| B
B -->|"Ack: received"| A
A -->|"Ack: received your ack"| B
B -->|"Ack: received your ack of ack"| A
Note[A: infinite messages needed]
Implication: You can never be 100% certain a message was received without infinite acknowledgments.
In practice, we accept uncertainty and design systems that tolerate it.
Architecture Patterns
Direct Communication
graph LR
A[Service A] --> B[Service B]
A --> C[Service C]
B --> D[Service D]
C --> D
- Simple, straightforward
- Tight coupling
- Difficult to scale independently
Message Queue (Indirect Communication)
graph TB
P[Producer 1] --> Q[Message Queue]
P2[Producer 2] --> Q
P3[Producer N] --> Q
Q --> W1[Worker 1]
Q --> W2[Worker 2]
Q --> W3[Worker N]
- Loose coupling
- Easy to scale
- Buffers requests during traffic spikes
- Enables retry and error handling
Implementation Examples
TypeScript: HTTP (Synchronous)
// server.ts
import http from 'http';
const server = http.createServer((req, res) => {
if (req.method === 'POST' && req.url === '/message') {
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
const message = JSON.parse(body);
console.log('Received:', message);
// Send response back (synchronous)
res.writeHead(200);
res.end(JSON.stringify({ status: 'processed', id: message.id }));
});
}
});
server.listen(3000, () => console.log('Server on :3000'));
// client.ts
import http from 'http';
function sendMessage(data: any): Promise<any> {
return new Promise((resolve, reject) => {
const postData = JSON.stringify(data);
const options = {
hostname: 'localhost',
port: 3000,
method: 'POST',
path: '/message',
headers: { 'Content-Type': 'application/json' }
};
const req = http.request(options, (res) => {
let body = '';
res.on('data', chunk => body += chunk);
res.on('end', () => resolve(JSON.parse(body)));
});
req.on('error', reject);
req.write(postData);
req.end();
});
}
// Usage: waits for response
sendMessage({ id: '1', content: 'Hello' })
.then(response => console.log('Got:', response));
Python: HTTP (Synchronous)
# server.py
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
class MessageHandler(BaseHTTPRequestHandler):
def do_POST(self):
if self.path == '/message':
content_length = int(self.headers['Content-Length'])
post_data = self.rfile.read(content_length)
message = json.loads(post_data.decode())
print(f"Received: {message}")
# Send response back (synchronous)
response = json.dumps({'status': 'processed', 'id': message['id']})
self.send_response(200)
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(response.encode())
server = HTTPServer(('localhost', 3000), MessageHandler)
print("Server on :3000")
server.serve_forever()
# client.py
import requests
import json
def send_message(data):
# Synchronous: waits for response
response = requests.post(
'http://localhost:3000/message',
json=data
)
return response.json()
# Usage
result = send_message({'id': '1', 'content': 'Hello'})
print(f"Got: {result}")
TypeScript: Simple Queue (Asynchronous)
// queue.ts
interface Message {
id: string;
data: any;
timestamp: number;
}
class MessageQueue {
private messages: Message[] = [];
private handlers: Map<string, (msg: Message) => void> = new Map();
publish(topic: string, data: any): string {
const message: Message = {
id: `${Date.now()}-${Math.random()}`,
data,
timestamp: Date.now()
};
this.messages.push(message);
console.log(`Published to ${topic}:`, message.id);
// Fire and forget - don't wait for processing
setImmediate(() => this.process(topic, message));
return message.id;
}
subscribe(topic: string, handler: (msg: Message) => void) {
this.handlers.set(topic, handler);
}
private process(topic: string, message: Message) {
const handler = this.handlers.get(topic);
if (handler) {
// Handle asynchronously - caller doesn't wait
handler(message);
}
}
}
// Usage
const queue = new MessageQueue();
queue.subscribe('tasks', (msg) => {
console.log(`Processing task ${msg.id}:`, msg.data);
// Simulate async work
setTimeout(() => console.log(`Task ${msg.id} complete`), 1000);
});
// Publish returns immediately - doesn't wait for processing
const taskId = queue.publish('tasks', { type: 'email', to: 'user@example.com' });
console.log(`Task ${taskId} queued (not yet processed)`);
Python: Simple Queue (Asynchronous)
# queue.py
import time
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Any
import uuid
@dataclass
class Message:
id: str
data: Any
timestamp: float
class MessageQueue:
def __init__(self):
self.messages = []
self.handlers: Dict[str, Callable[[Message], None]] = {}
self.lock = threading.Lock()
def publish(self, topic: str, data: Any) -> str:
message = Message(
id=f"{int(time.time()*1000)}-{uuid.uuid4().hex[:8]}",
data=data,
timestamp=time.time()
)
with self.lock:
self.messages.append(message)
print(f"Published to {topic}: {message.id}")
# Fire and forget - don't wait for processing
threading.Thread(
target=self._process,
args=(topic, message),
daemon=True
).start()
return message.id
def subscribe(self, topic: str, handler: Callable[[Message], None]):
self.handlers[topic] = handler
def _process(self, topic: str, message: Message):
handler = self.handlers.get(topic)
if handler:
# Handle asynchronously - caller doesn't wait
handler(message)
# Usage
queue = MessageQueue()
def handle_task(msg: Message):
print(f"Processing task {msg.id}: {msg.data}")
# Simulate async work
time.sleep(1)
print(f"Task {msg.id} complete")
queue.subscribe('tasks', handle_task)
# Publish returns immediately - doesn't wait for processing
task_id = queue.publish('tasks', {'type': 'email', 'to': 'user@example.com'})
print(f"Task {task_id} queued (not yet processed)")
# Keep main thread alive to see processing
time.sleep(2)
Common Message Patterns
Request-Response
// Call and wait for answer
const answer = await ask(question);
Fire-and-Forget
// Send and continue
notify(user);
Publish-Subscribe
// Many receivers, one sender
broker.publish('events', data);
Request-Reply (Correlation)
// Send request, get reply later
const replyTo = createReplyQueue();
broker.send(request, { replyTo });
// ... later
const reply = await replyTo.receive();
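A sketch of how correlation IDs work under the hood (the `pending` map and `deliverReply` hook are hypothetical; a real broker client wires this up for you): each outgoing request registers a resolver keyed by its ID, and the matching reply settles the right promise.

```typescript
// Sketch: correlating async replies to pending requests (illustrative names).
const pending = new Map<string, (reply: unknown) => void>();

function request(payload: unknown): Promise<unknown> {
  const correlationId = `req-${Date.now()}-${Math.random()}`;
  const reply = new Promise(resolve => pending.set(correlationId, resolve));
  // ...send { correlationId, payload } over your transport here...
  return reply;
}

// Called by the transport when a reply message arrives.
function deliverReply(correlationId: string, reply: unknown): void {
  const resolve = pending.get(correlationId);
  if (resolve) {
    pending.delete(correlationId);
    resolve(reply);
  }
}
```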
Error Handling
Message passing over networks is unreliable. Common issues:
| Error | Cause | Handling Strategy |
|---|---|---|
| Timeout | No response, network slow | Retry with backoff |
| Connection Refused | Service down | Circuit breaker, queue for later |
| Message Lost | Network failure | Acknowledgments, retries |
| Duplicate | Retry after slow ack | Idempotent operations |
Retry Pattern
async function sendMessageWithRetry(
message: any,
maxRetries = 3
): Promise<any> {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await sendMessage(message);
} catch (error) {
if (attempt === maxRetries) throw error;
// Exponential backoff: 100ms, 200ms, 400ms
const delay = 100 * Math.pow(2, attempt - 1);
await new Promise(r => setTimeout(r, delay));
console.log(`Retry ${attempt}/${maxRetries}`);
}
}
}
Summary
Key Takeaways
- Message passing is how distributed systems communicate
- Synchronous = wait for response; Asynchronous = fire and forget
- Delivery guarantees: at-most-once, at-least-once, exactly-once
- Network is unreliable - design for failures and retries
- Choose the right pattern for your use case
Check Your Understanding
- When would you use synchronous vs asynchronous messaging?
- What's the difference between at-least-once and exactly-once?
- Why is perfect communication impossible in distributed systems?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What's Next
Now let's apply message passing to build our first distributed system: Queue System Implementation
Queue System Implementation
Session 2 - Full session (90 minutes)
Learning Objectives
- Understand the producer-consumer pattern
- Build a working queue system with concurrent workers
- Implement fault tolerance with retry logic
- Deploy and test the system using Docker Compose
The Producer-Consumer Pattern
The producer-consumer pattern is a fundamental distributed systems pattern where:
- Producers create and send tasks to a queue
- Queue buffers tasks between producers and consumers
- Workers (consumers) process tasks from the queue
graph TB
subgraph "Producers"
P1[Producer 1<br/>API Server]
P2[Producer 2<br/>Scheduler]
P3[Producer N<br/>Webhook]
end
subgraph "Queue"
Q[Message Queue<br/>Task Buffer]
end
subgraph "Workers"
W1[Worker 1<br/>Process]
W2[Worker 2<br/>Process]
W3[Worker 3<br/>Process]
end
P1 --> Q
P2 --> Q
P3 --> Q
Q --> W1
Q --> W2
Q --> W3
style Q fill:#f9f,stroke:#333,stroke-width:4px
Key Benefits
| Benefit | Explanation |
|---|---|
| Decoupling | Producers don't need to know about workers |
| Buffering | Queue handles traffic spikes |
| Scalability | Add/remove workers independently |
| Reliability | Tasks persist if workers fail |
| Retry | Failed tasks can be requeued |
System Architecture
Full System View
sequenceDiagram
participant C as Client
participant P as Producer
participant Q as Queue
participant W as Worker
participant DB as Result Store
C->>P: HTTP POST /task
P->>Q: Enqueue Task
Q-->>P: Task ID
P-->>C: 202 Accepted
Note over Q,W: Async Processing
Q->>W: Fetch Task
W->>W: Process Task
W->>DB: Save Result
W->>Q: Ack (Success)
Q->>Q: Remove Task
Task Lifecycle
stateDiagram-v2
[*] --> Pending: Producer creates
Pending --> Processing: Worker fetches
Processing --> Completed: Success
Processing --> Failed: Error
Processing --> Pending: Retry
Failed --> Pending: Max retries not reached
Failed --> DeadLetter: Max retries reached
Completed --> [*]
DeadLetter --> [*]
Implementation
Data Models
Task Definition:
interface Task {
id: string;
type: string; // 'email', 'image', 'report', etc.
payload: any;
status: 'pending' | 'processing' | 'completed' | 'failed';
createdAt: number;
retries: number;
maxRetries: number;
result?: any;
error?: string;
}
import time  # needed for the created_at default below
from dataclasses import dataclass, field
from typing import Any, Optional
@dataclass
class Task:
id: str
type: str # 'email', 'image', 'report', etc.
payload: Any
status: str = 'pending' # pending, processing, completed, failed
created_at: float = field(default_factory=time.time)
retries: int = 0
max_retries: int = 3
result: Optional[Any] = None
error: Optional[str] = None
TypeScript Implementation
Project Structure
queue-system-ts/
├── package.json
├── docker-compose.yml
├── src/
│ ├── queue.ts # Queue implementation
│ ├── producer.ts # Producer API
│ ├── worker.ts # Worker implementation
│ └── types.ts # Type definitions
└── Dockerfile
Complete TypeScript Code
queue-system-ts/src/types.ts
export interface Task {
id: string;
type: string;
payload: any;
status: 'pending' | 'processing' | 'completed' | 'failed';
createdAt: number;
retries: number;
maxRetries: number;
result?: any;
error?: string;
}
export interface QueueMessage {
task: Task;
timestamp: number;
}
queue-system-ts/src/queue.ts
import { Task, QueueMessage } from './types';
export class Queue {
private pending: Task[] = [];
private processing: Map<string, Task> = new Map();
private completed: Task[] = [];
private failed: Task[] = [];
// Enqueue a new task
enqueue(type: string, payload: any): string {
const task: Task = {
id: this.generateId(),
type,
payload,
status: 'pending',
createdAt: Date.now(),
retries: 0,
maxRetries: 3
};
this.pending.push(task);
console.log(`[Queue] Enqueued task ${task.id} (${type})`);
return task.id;
}
// Get next pending task (for workers)
dequeue(): Task | null {
if (this.pending.length === 0) return null;
const task = this.pending.shift()!;
task.status = 'processing';
this.processing.set(task.id, task);
console.log(`[Queue] Dequeued task ${task.id}`);
return task;
}
// Mark task as completed
complete(taskId: string, result?: any): void {
const task = this.processing.get(taskId);
if (!task) return;
task.status = 'completed';
task.result = result;
this.processing.delete(taskId);
this.completed.push(task);
console.log(`[Queue] Completed task ${taskId}`);
}
// Mark task as failed (will retry if possible)
fail(taskId: string, error: string): void {
const task = this.processing.get(taskId);
if (!task) return;
task.retries++;
task.error = error;
if (task.retries >= task.maxRetries) {
task.status = 'failed';
this.processing.delete(taskId);
this.failed.push(task);
console.log(`[Queue] Task ${taskId} failed permanently after ${task.retries} retries`);
} else {
task.status = 'pending';
this.processing.delete(taskId);
this.pending.push(task);
console.log(`[Queue] Task ${taskId} failed, retrying (${task.retries}/${task.maxRetries})`);
}
}
// Get queue statistics
getStats() {
return {
pending: this.pending.length,
processing: this.processing.size,
completed: this.completed.length,
failed: this.failed.length
};
}
private generateId(): string {
return `task-${Date.now()}-${Math.random().toString(36).slice(2, 11)}`;
}
}
queue-system-ts/src/producer.ts
import http from 'http';
import { Queue } from './queue';
const queue = new Queue();
const server = http.createServer((req, res) => {
if (req.method === 'POST' && req.url === '/task') {
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const { type, payload } = JSON.parse(body);
if (!type || !payload) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'type and payload required' }));
return;
}
const taskId = queue.enqueue(type, payload);
res.writeHead(202); // Accepted
res.end(JSON.stringify({
taskId,
message: 'Task enqueued',
stats: queue.getStats()
}));
} catch (error) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'Invalid JSON' }));
}
});
  } else if (req.method === 'GET' && req.url === '/dequeue') {
    // Worker-facing endpoint: hand out the next pending task
    const task = queue.dequeue();
    if (!task) {
      res.writeHead(204); // No Content: queue is empty
      res.end();
    } else {
      res.writeHead(200);
      res.end(JSON.stringify(task));
    }
  } else if (req.method === 'POST' && req.url && req.url.startsWith('/complete/')) {
    // Worker-facing endpoint: mark a task as completed
    const taskId = req.url.split('/')[2];
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      const { result } = body ? JSON.parse(body) : { result: undefined };
      queue.complete(taskId, result);
      res.writeHead(200);
      res.end(JSON.stringify({ status: 'completed' }));
    });
  } else if (req.method === 'POST' && req.url && req.url.startsWith('/fail/')) {
    // Worker-facing endpoint: record a failed attempt (the queue decides on retry)
    const taskId = req.url.split('/')[2];
    let body = '';
    req.on('data', chunk => body += chunk);
    req.on('end', () => {
      const { error } = body ? JSON.parse(body) : { error: 'unknown' };
      queue.fail(taskId, error || 'unknown');
      res.writeHead(200);
      res.end(JSON.stringify({ status: 'failed' }));
    });
  } else if (req.method === 'GET' && req.url === '/stats') {
    res.writeHead(200);
    res.end(JSON.stringify(queue.getStats()));
  } else {
    res.writeHead(404);
    res.end(JSON.stringify({ error: 'Not found' }));
  }
});
const PORT = process.env.PORT || 3000;
server.listen(PORT, () => {
console.log(`Producer API listening on port ${PORT}`);
});
export { queue };
queue-system-ts/src/worker.ts
import http from 'http';
import { Task } from './types';
// Simulate task processing
async function processTask(task: Task): Promise<any> {
console.log(`[Worker] Processing task ${task.id} (${task.type})`);
// Simulate work
await new Promise(resolve => setTimeout(resolve, 1000 + Math.random() * 2000));
// Simulate occasional failures (20% chance)
if (Math.random() < 0.2) {
throw new Error('Random processing error');
}
// Process based on task type
switch (task.type) {
case 'email':
return { sent: true, to: task.payload.to };
case 'image':
return { processed: true, url: task.payload.url };
case 'report':
return { generated: true, format: 'pdf' };
default:
return { result: 'processed' };
}
}
class Worker {
private id: string;
private queueUrl: string;
private running: boolean = false;
constructor(id: string, queueUrl: string) {
this.id = id;
this.queueUrl = queueUrl;
}
async start(): Promise<void> {
this.running = true;
console.log(`[Worker ${this.id}] Started`);
while (this.running) {
try {
await this.processNextTask();
} catch (error) {
console.error(`[Worker ${this.id}] Error:`, error);
await this.sleep(1000); // Wait before retrying
}
}
}
private async processNextTask(): Promise<void> {
// Fetch task from queue
const task = await this.fetchTask();
if (!task) {
await this.sleep(1000); // No task, wait
return;
}
try {
// Process the task
const result = await processTask(task);
// Mark as complete
await this.completeTask(task.id, result);
} catch (error: any) {
// Mark as failed
await this.failTask(task.id, error.message);
}
}
private async fetchTask(): Promise<Task | null> {
return new Promise((resolve, reject) => {
http.get(`${this.queueUrl}/dequeue`, (res) => {
let body = '';
res.on('data', chunk => body += chunk);
res.on('end', () => {
if (res.statusCode === 204) {
resolve(null); // No tasks available
} else if (res.statusCode === 200) {
resolve(JSON.parse(body));
} else {
reject(new Error(`Unexpected status: ${res.statusCode}`));
}
});
}).on('error', reject);
});
}
private async completeTask(taskId: string, result: any): Promise<void> {
return new Promise((resolve, reject) => {
const data = JSON.stringify({ result });
      http.request(`${this.queueUrl}/complete/${taskId}`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(data) }
      }, (res) => {
if (res.statusCode === 200) {
resolve();
} else {
reject(new Error(`Failed to complete task: ${res.statusCode}`));
}
}).on('error', reject).end(data);
});
}
private async failTask(taskId: string, error: string): Promise<void> {
return new Promise((resolve, reject) => {
const data = JSON.stringify({ error });
      http.request(`${this.queueUrl}/fail/${taskId}`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(data) }
      }, (res) => {
if (res.statusCode === 200) {
resolve();
} else {
reject(new Error(`Failed to fail task: ${res.statusCode}`));
}
}).on('error', reject).end(data);
});
}
private sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
stop(): void {
this.running = false;
}
}
// Start worker
const workerId = process.env.WORKER_ID || 'worker-1';
const queueUrl = process.env.QUEUE_URL || 'http://localhost:3000';
const worker = new Worker(workerId, queueUrl);
worker.start();
Python Implementation
Project Structure
queue-system-py/
├── requirements.txt
├── docker-compose.yml
├── src/
│ ├── queue.py # Queue implementation
│ ├── producer.py # Producer API
│ └── worker.py # Worker implementation
└── Dockerfile
Complete Python Code
queue-system-py/src/queue.py
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional, List, Dict
from enum import Enum
class TaskStatus(Enum):
PENDING = 'pending'
PROCESSING = 'processing'
COMPLETED = 'completed'
FAILED = 'failed'
@dataclass
class Task:
id: str
type: str
payload: Any
status: str = TaskStatus.PENDING.value
created_at: float = field(default_factory=time.time)
retries: int = 0
max_retries: int = 3
result: Optional[Any] = None
error: Optional[str] = None
class Queue:
def __init__(self):
self.pending: List[Task] = []
self.processing: Dict[str, Task] = {}
self.completed: List[Task] = []
self.failed: List[Task] = []
def enqueue(self, task_type: str, payload: Any) -> str:
"""Enqueue a new task."""
task = Task(
id=f"task-{int(time.time()*1000)}-{uuid.uuid4().hex[:8]}",
type=task_type,
payload=payload
)
self.pending.append(task)
print(f"[Queue] Enqueued task {task.id} ({task_type})")
return task.id
def dequeue(self) -> Optional[Task]:
"""Get next pending task."""
if not self.pending:
return None
task = self.pending.pop(0)
task.status = TaskStatus.PROCESSING.value
self.processing[task.id] = task
print(f"[Queue] Dequeued task {task.id}")
return task
def complete(self, task_id: str, result: Any = None) -> None:
"""Mark task as completed."""
task = self.processing.pop(task_id, None)
if not task:
return
task.status = TaskStatus.COMPLETED.value
task.result = result
self.completed.append(task)
print(f"[Queue] Completed task {task_id}")
def fail(self, task_id: str, error: str) -> None:
"""Mark task as failed (will retry if possible)."""
task = self.processing.pop(task_id, None)
if not task:
return
task.retries += 1
task.error = error
if task.retries >= task.max_retries:
task.status = TaskStatus.FAILED.value
self.failed.append(task)
print(f"[Queue] Task {task_id} failed permanently after {task.retries} retries")
else:
task.status = TaskStatus.PENDING.value
self.pending.append(task)
print(f"[Queue] Task {task_id} failed, retrying ({task.retries}/{task.max_retries})")
def get_stats(self) -> Dict[str, int]:
"""Get queue statistics."""
return {
'pending': len(self.pending),
'processing': len(self.processing),
'completed': len(self.completed),
'failed': len(self.failed)
}
queue-system-py/src/producer.py
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
from queue import Queue  # local src/queue.py (note: shadows the stdlib "queue" module)
queue = Queue()
class ProducerHandler(BaseHTTPRequestHandler):
def do_POST(self):
if self.path == '/task':
content_length = int(self.headers['Content-Length'])
post_data = self.rfile.read(content_length)
try:
data = json.loads(post_data.decode())
task_type = data.get('type')
payload = data.get('payload')
if not task_type or not payload:
self.send_error(400, 'type and payload required')
return
task_id = queue.enqueue(task_type, payload)
response = json.dumps({
'taskId': task_id,
'message': 'Task enqueued',
'stats': queue.get_stats()
})
self.send_response(202) # Accepted
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(response.encode())
            except json.JSONDecodeError:
                self.send_error(400, 'Invalid JSON')
        elif self.path.startswith('/complete/'):
            # Worker-facing endpoint: mark a task as completed
            task_id = self.path.split('/')[2]
            length = int(self.headers.get('Content-Length', 0))
            data = json.loads(self.rfile.read(length).decode()) if length else {}
            queue.complete(task_id, data.get('result'))
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(b'{"status": "completed"}')
        elif self.path.startswith('/fail/'):
            # Worker-facing endpoint: record a failed attempt (the queue decides on retry)
            task_id = self.path.split('/')[2]
            length = int(self.headers.get('Content-Length', 0))
            data = json.loads(self.rfile.read(length).decode()) if length else {}
            queue.fail(task_id, data.get('error', 'unknown'))
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(b'{"status": "failed"}')
    def do_GET(self):
        if self.path == '/stats':
            response = json.dumps(queue.get_stats())
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(response.encode())
        elif self.path == '/dequeue':
            # Worker-facing endpoint: hand out the next pending task
            task = queue.dequeue()
            if task is None:
                self.send_response(204)  # No Content: queue is empty
                self.end_headers()
            else:
                response = json.dumps(task.__dict__)
                self.send_response(200)
                self.send_header('Content-Type', 'application/json')
                self.end_headers()
                self.wfile.write(response.encode())
def log_message(self, format, *args):
pass # Suppress default logging
if __name__ == '__main__':
import os
port = int(os.environ.get('PORT', 3000))
server = HTTPServer(('0.0.0.0', port), ProducerHandler)
print(f"Producer API listening on port {port}")
server.serve_forever()
queue-system-py/src/worker.py
import os
import time
import random
import requests
from typing import Optional, Dict, Any
# Simulate task processing
def process_task(task: Dict) -> Any:
    # Tasks arrive as plain JSON dicts from the queue's HTTP API
    print(f"[Worker] Processing task {task['id']} ({task['type']})")
    # Simulate work
    time.sleep(1 + random.random() * 2)
    # Simulate occasional failures (20% chance)
    if random.random() < 0.2:
        raise Exception('Random processing error')
    # Process based on task type
    payload = task.get('payload') or {}
    if task['type'] == 'email':
        return {'sent': True, 'to': payload.get('to')}
    elif task['type'] == 'image':
        return {'processed': True, 'url': payload.get('url')}
    elif task['type'] == 'report':
        return {'generated': True, 'format': 'pdf'}
    else:
        return {'result': 'processed'}
class Worker:
def __init__(self, worker_id: str, queue_url: str):
self.id = worker_id
self.queue_url = queue_url
self.running = False
def start(self):
"""Start the worker loop."""
self.running = True
print(f"[Worker {self.id}] Started")
while self.running:
try:
self.process_next_task()
except Exception as e:
print(f"[Worker {self.id}] Error: {e}")
time.sleep(1)
def process_next_task(self):
"""Fetch and process the next task."""
task = self.fetch_task()
if not task:
time.sleep(1) # No task, wait
return
try:
result = process_task(task)
self.complete_task(task['id'], result)
except Exception as e:
self.fail_task(task['id'], str(e))
def fetch_task(self) -> Optional[Dict]:
"""Fetch next task from queue."""
try:
response = requests.get(f"{self.queue_url}/dequeue", timeout=5)
if response.status_code == 204:
return None # No tasks
return response.json()
except requests.RequestException:
return None
def complete_task(self, task_id: str, result: Any):
"""Mark task as complete."""
requests.post(
f"{self.queue_url}/complete/{task_id}",
json={'result': result},
timeout=5
)
def fail_task(self, task_id: str, error: str):
"""Mark task as failed."""
requests.post(
f"{self.queue_url}/fail/{task_id}",
json={'error': error},
timeout=5
)
def stop(self):
"""Stop the worker."""
self.running = False
if __name__ == '__main__':
worker_id = os.environ.get('WORKER_ID', 'worker-1')
queue_url = os.environ.get('QUEUE_URL', 'http://localhost:3000')
worker = Worker(worker_id, queue_url)
worker.start()
Docker Compose Setup
TypeScript Version (docker-compose.yml)
version: '3.8'
services:
producer:
    build: .
ports:
- "3000:3000"
environment:
- PORT=3000
volumes:
- ./src:/app/src
command: npm run start:producer
worker-1:
    build: .
    environment:
      - WORKER_ID=worker-1
      - QUEUE_URL=http://producer:3000
depends_on:
- producer
volumes:
- ./src:/app/src
command: npm run start:worker
worker-2:
    build: .
    environment:
      - WORKER_ID=worker-2
      - QUEUE_URL=http://producer:3000
depends_on:
- producer
volumes:
- ./src:/app/src
command: npm run start:worker
worker-3:
    build: .
    environment:
      - WORKER_ID=worker-3
      - QUEUE_URL=http://producer:3000
depends_on:
- producer
volumes:
- ./src:/app/src
command: npm run start:worker
TypeScript Dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
CMD ["npm", "run", "start:producer"]
Python Version (docker-compose.yml)
version: '3.8'
services:
producer:
    build: .
ports:
- "3000:3000"
environment:
- PORT=3000
volumes:
- ./src:/app/src
command: python src/producer.py
worker-1:
    build: .
environment:
- WORKER_ID=worker-1
- QUEUE_URL=http://producer:3000
depends_on:
- producer
volumes:
- ./src:/app/src
command: python src/worker.py
worker-2:
    build: .
environment:
- WORKER_ID=worker-2
- QUEUE_URL=http://producer:3000
depends_on:
- producer
volumes:
- ./src:/app/src
command: python src/worker.py
worker-3:
    build: .
environment:
- WORKER_ID=worker-3
- QUEUE_URL=http://producer:3000
depends_on:
- producer
volumes:
- ./src:/app/src
command: python src/worker.py
Python Dockerfile
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "src/producer.py"]
Running the Example
Step 1: Start the System
cd examples/01-queue
docker-compose up --build
You should see output like:
producer | Producer API listening on port 3000
worker-1 | [Worker worker-1] Started
worker-2 | [Worker worker-2] Started
worker-3 | [Worker worker-3] Started
Step 2: Submit Tasks
Open a new terminal and submit some tasks:
# Submit an email task
curl -X POST http://localhost:3000/task \
-H "Content-Type: application/json" \
-d '{"type": "email", "payload": {"to": "user@example.com", "subject": "Hello"}}'
# Submit an image processing task
curl -X POST http://localhost:3000/task \
-H "Content-Type: application/json" \
-d '{"type": "image", "payload": {"url": "https://example.com/image.jpg"}}'
# Submit multiple tasks
for i in {1..10}; do
curl -X POST http://localhost:3000/task \
-H "Content-Type: application/json" \
-d "{\"type\": \"report\", \"payload\": {\"id\": $i}}"
done
Step 3: Watch Processing
In the Docker logs, you'll see:
producer | [Queue] Dequeued task task-1234567890-abc123
worker-2 | [Worker] Processing task task-1234567890-abc123 (report)
producer | [Queue] Completed task task-1234567890-abc123
Step 4: Check Statistics
curl http://localhost:3000/stats
Response:
{
"pending": 5,
"processing": 3,
"completed": 12,
"failed": 0
}
Step 5: Test Fault Tolerance
Stop one worker:
docker-compose stop worker-1
Tasks continue processing with the remaining workers. Because workers pull from the queue, no explicit redistribution is needed: the remaining workers simply pick up the pending tasks.
Exercises
Exercise 1: Add Priority Support
Modify the queue to support high/normal/low priority tasks:
- Add a `priority` field to the Task model
- Modify `enqueue()` to sort pending tasks by priority
- Test with mixed priority tasks
Exercise 2: Implement Dead Letter Queue
Create a separate queue for permanently failed tasks:
- Add a `dead_letter` queue to store permanently failed tasks
- Add an API endpoint to inspect/retry dead letter tasks
- Log failed tasks to a file for manual inspection
Exercise 3: Add Task Scheduling
Implement delayed task execution:
- Add an `executeAt` timestamp to tasks
- Modify workers to skip tasks scheduled for the future
- Use a timer/scheduler to move scheduled tasks to the pending queue
Summary
Key Takeaways
- Producer-consumer pattern decouples task creation from processing
- Queues buffer tasks and handle traffic spikes
- Workers scale independently of producers
- Retry logic provides fault tolerance
- Docker Compose enables easy local deployment
Check Your Understanding
- How does the queue handle worker failures?
- What happens when a task fails and max retries is reached?
- Why is the queue useful for handling traffic spikes?
- How would you add a new worker type (e.g., a worker that only processes emails)?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What's Next
Now that we've built a queue system, let's explore how to partition data across multiple nodes: Data Partitioning
Data Partitioning
Session 3, Part 1 - 25 minutes
Learning Objectives
- Understand what data partitioning (sharding) is
- Compare hash-based vs range-based partitioning
- Learn how partitioning affects query performance
- Recognize the trade-offs of different partitioning strategies
What is Partitioning?
Data partitioning (also called sharding) is the process of splitting your data across multiple nodes based on a partitioning key. Each node holds a subset of the total data.
graph TB
subgraph "Application View"
App["Your Application"]
Data[("All Data")]
App --> Data
end
subgraph "Reality: Partitioned Storage"
Node1["Node 1<br/>Keys: user_1<br/>user_4<br/>user_7"]
Node2["Node 2<br/>Keys: user_2<br/>user_5<br/>user_8"]
Node3["Node 3<br/>Keys: user_3<br/>user_6<br/>user_9"]
end
App -->|"read/write"| Node1
App -->|"read/write"| Node2
App -->|"read/write"| Node3
style Node1 fill:#e1f5fe
style Node2 fill:#e1f5fe
style Node3 fill:#e1f5fe
Why Partition Data?
| Benefit | Description |
|---|---|
| Scalability | Store more data than fits on one machine |
| Performance | Distribute load across multiple nodes |
| Availability | One partition failure doesn't affect others |
The Partitioning Challenge
The key question is: How do we decide which data goes on which node?
graph LR
Key["user:12345"] --> Router{Partitioning<br/>Function}
Router -->|"hash(key) % N"| N1[Node 1]
Router --> N2[Node 2]
Router --> N3[Node 3]
style Router fill:#ff9,stroke:#333,stroke-width:3px
Partitioning Strategies
1. Hash-Based Partitioning
Apply a hash function to the key, then modulo the number of nodes:
node = hash(key) % number_of_nodes
graph TB
subgraph "Hash-Based Partitioning (3 nodes)"
Key1["user:alice"] --> H1["hash() % 3"]
Key2["user:bob"] --> H2["hash() % 3"]
Key3["user:carol"] --> H3["hash() % 3"]
H1 -->|"= 1"| N1[Node 1]
H2 -->|"= 2"| N2[Node 2]
H3 -->|"= 0"| N0[Node 0]
style N1 fill:#c8e6c9
style N2 fill:#c8e6c9
style N0 fill:#c8e6c9
end
TypeScript Example:
function getNode(key: string, totalNodes: number): number {
// Simple hash function
let hash = 0;
for (let i = 0; i < key.length; i++) {
hash = ((hash << 5) - hash) + key.charCodeAt(i);
hash = hash & hash; // Convert to 32bit integer
}
return Math.abs(hash) % totalNodes;
}
// Examples (the exact node index depends on the hash function)
console.log(getNode('user:alice', 3)); // e.g. 1
console.log(getNode('user:bob', 3)); // e.g. 2
console.log(getNode('user:carol', 3)); // e.g. 0
Python Example:
import hashlib

def get_node(key: str, total_nodes: int) -> int:
    """Determine which node should store this key."""
    # Use a stable hash: Python's built-in hash() is randomized per
    # process for strings, so it would route keys differently each run.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % total_nodes

# Examples (the exact node index depends on the hash function)
print(get_node('user:alice', 3))
print(get_node('user:bob', 3))
print(get_node('user:carol', 3))
Advantages:
- ✅ Even data distribution
- ✅ Simple to implement
- ✅ No hotspots (assuming good hash function)
Disadvantages:
- ❌ Cannot do efficient range queries
- ❌ Rebalancing is expensive when adding/removing nodes
2. Range-Based Partitioning
Assign key ranges to each node:
graph TB
subgraph "Range-Based Partitioning (3 nodes)"
R1["Node 1<br/>a-m"]
R2["Node 2<br/>n-s"]
R3["Node 3<br/>t-z"]
Key1["alice"] --> R1
Key2["bob"] --> R1
Key3["nancy"] --> R2
Key4["steve"] --> R2
Key5["tom"] --> R3
Key6["zoe"] --> R3
style R1 fill:#c8e6c9
style R2 fill:#c8e6c9
style R3 fill:#c8e6c9
end
TypeScript Example:
interface Range {
start: string;
end: string;
node: number;
}
const ranges: Range[] = [
  { start: 'a', end: 'n', node: 1 }, // keys in [a, n)
  { start: 'n', end: 't', node: 2 }, // keys in [n, t)
  { start: 't', end: '\uffff', node: 3 } // keys in [t, end of keyspace)
];

function getNodeByRange(key: string): number {
  for (const range of ranges) {
    // Half-open ranges avoid gaps: with inclusive ends, a key like
    // 'miller' (which sorts after 'm' but before 'n') would match nothing.
    if (key >= range.start && key < range.end) {
      return range.node;
    }
  }
  throw new Error(`No range found for key: ${key}`);
}
// Examples
console.log(getNodeByRange('alice')); // => 1
console.log(getNodeByRange('nancy')); // => 2
console.log(getNodeByRange('tom')); // => 3
Python Example:
from typing import List, Tuple
ranges: List[Tuple[str, str, int]] = [
    ('a', 'n', 1),  # keys in [a, n)
    ('n', 't', 2),  # keys in [n, t)
    ('t', '\uffff', 3)  # keys in [t, end of keyspace)
]

def get_node_by_range(key: str) -> int:
    """Determine which node based on half-open key ranges [start, end)."""
    for start, end, node in ranges:
        # Inclusive ends would leave gaps (e.g. 'miller' sorts after 'm')
        if start <= key < end:
            return node
    raise ValueError(f"No range found for key: {key}")
# Examples
print(get_node_by_range('alice')) # => 1
print(get_node_by_range('nancy')) # => 2
print(get_node_by_range('tom')) # => 3
Advantages:
- ✅ Efficient range queries
- ✅ Can optimize for data access patterns
Disadvantages:
- ❌ Uneven distribution (hotspots)
- ❌ Complex to load balance
The Rebalancing Problem
What happens when you add or remove nodes?
stateDiagram-v2
[*] --> Stable: 3 Nodes
Stable --> Rebalancing: Add Node 4
Rebalancing --> Stable: Move 25% of data
Stable --> Rebalancing: Remove Node 2
Rebalancing --> Stable: Redistribute data
Simple Modulo Hashing Problem
With hash(key) % N, changing N from 3 to 4 means most keys move to different nodes:
| Key | hash % 3 | hash % 4 | Moved? |
|---|---|---|---|
| user:1 | 1 | 1 | ❌ |
| user:2 | 2 | 2 | ❌ |
| user:3 | 0 | 3 | ✅ |
| user:4 | 1 | 0 | ✅ |
| user:5 | 2 | 1 | ✅ |
| user:6 | 0 | 2 | ✅ |
Here, 4 of the 6 keys moved (67%). In general, about 75% of keys move when going from 3 to 4 nodes: a key stays put only when hash % 3 equals hash % 4.
Consistent Hashing (Advanced)
A technique to minimize data movement when nodes change:
graph TB
subgraph "Hash Ring"
Ring["Virtual Ring (0 - 2^32)"]
N1["Node 1<br/>position: 100"]
N2["Node 2<br/>position: 500"]
N3["Node 3<br/>position: 900"]
K1["Key A<br/>hash: 150"]
K2["Key B<br/>hash: 600"]
K3["Key C<br/>hash: 950"]
end
Ring --> N1
Ring --> N2
Ring --> N3
K1 -->|"clockwise"| N2
K2 -->|"clockwise"| N3
K3 -->|"clockwise"| N1
style Ring fill:#f9f,stroke:#333,stroke-width:2px
Key Idea: Each key is assigned to the first node clockwise from its hash position.
When adding/removing a node, only keys in that node's range move.
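A minimal hash-ring sketch in TypeScript (the `hash32` function and ring layout are illustrative; production rings also place many virtual nodes per physical node to even out the distribution):

```typescript
// Sketch of a consistent-hash ring (assumes at least one node is added).
function hash32(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0;
  return h;
}

class HashRing {
  // Ring entries sorted by position; each node owns keys up to its position.
  private ring: { pos: number; node: string }[] = [];

  add(node: string): void {
    this.ring.push({ pos: hash32(node), node });
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  // First node clockwise from the key's position (wrapping to the start).
  lookup(key: string): string {
    const pos = hash32(key);
    const owner = this.ring.find(e => e.pos >= pos) ?? this.ring[0];
    return owner.node;
  }
}

const ring = new HashRing();
['node-1', 'node-2', 'node-3'].forEach(n => ring.add(n));
console.log(ring.lookup('user:alice')); // adding node-4 only moves nearby keys
```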
Query Patterns and Partitioning
Your query patterns should influence your partitioning strategy:
Common Query Patterns
| Query Type | Best Partitioning | Example |
|---|---|---|
| Key-value lookups | Hash-based | Get user by ID |
| Range scans | Range-based | Users registered last week |
| Multi-key access | Composite hash | Orders by customer |
| Geographic queries | Location-based | Nearby restaurants |
Example: User Data Partitioning
graph TB
subgraph "Application: Social Network"
Query1["Get User Profile<br/>SELECT * FROM users WHERE id = ?"]
Query2["List Friends<br/>SELECT * FROM friends WHERE user_id = ?"]
Query3["Timeline Posts<br/>SELECT * FROM posts WHERE created_at > ?"]
end
subgraph "Partitioning Decision"
Query1 -->|"hash(user_id)"| Hash[Hash-Based]
Query2 -->|"hash(user_id)"| Hash
Query3 -->|"range(created_at)"| Range[Range-Based]
end
subgraph "Result"
Hash --> H["User data & friends<br/>partitioned by user_id"]
Range --> R["Posts partitioned<br/>by date range"]
end
Trade-offs Summary
| Strategy | Distribution | Range Queries | Rebalancing | Complexity |
|---|---|---|---|---|
| Hash-based | Even | Poor | Expensive | Low |
| Range-based | Potentially uneven | Excellent | Moderate | Medium |
| Consistent hashing | Even | Poor | Minimal | High |
Real-World Examples
| System | Partitioning Strategy | Notes |
|---|---|---|
| Redis Cluster | Hash slots (16384 slots) | Fixed slot map, not consistent hashing |
| Cassandra | Token-aware (hash ring) | Configurable partitioner |
| MongoDB | Shard key ranges | Range-based on shard key |
| DynamoDB | Hash + range (composite) | Supports composite keys |
| PostgreSQL | Not native | Use extensions like Citus |
Summary
Key Takeaways
- Partitioning splits data across multiple nodes for scalability
- Hash-based gives even distribution but poor range queries
- Range-based enables range scans but can create hotspots
- Rebalancing is a key challenge when nodes change
- Query patterns should drive your partitioning strategy
Check Your Understanding
- Why is hash-based partitioning better for even distribution?
- When would you choose range-based over hash-based?
- What happens to data placement when you add a new node with simple modulo hashing?
- How does consistent hashing minimize data movement?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What's Next
Now that we understand how to partition data, let's explore the fundamental trade-offs in distributed data systems: CAP Theorem
CAP Theorem
Session 3, Part 2 - 30 minutes
Learning Objectives
- Understand the CAP theorem and its three components
- Explore the trade-offs between Consistency, Availability, and Partition tolerance
- Identify real-world systems and their CAP choices
- Learn how to apply CAP thinking to system design
What is the CAP Theorem?
The CAP theorem states that a distributed data store can only provide two of the following three guarantees:
graph TB
subgraph "CAP Triangle - Pick Two"
C["Consistency<br/>Every read receives<br/>the most recent write"]
A["Availability<br/>Every request receives<br/>a response"]
P["Partition Tolerance<br/>System operates<br/>despite network failures"]
end
C <--> A
A <--> P
P <--> C
style C fill:#ffcdd2
style A fill:#c8e6c9
style P fill:#bbdefb
The Three Components
1. Consistency (C)
Every read receives the most recent write or an error.
All nodes see the same data at the same time. If you write a value and immediately read it, you get the value you just wrote.
sequenceDiagram
participant C as Client
participant N1 as Node 1
participant N2 as Node 2
participant N3 as Node 3
C->>N1: Write X = 10
N1->>N2: Replicate X
N1->>N3: Replicate X
N2-->>N1: Ack
N3-->>N1: Ack
N1-->>C: Write confirmed
Note over C,N3: Before reading...
C->>N2: Read X
N2-->>C: X = 10 (latest)
Note over C,N3: All nodes agree!
Example: A bank system where your balance must be accurate across all branches.
2. Availability (A)
Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
The system remains operational even when some nodes fail. You can always read and write, even if the data might be stale.
sequenceDiagram
participant C as Client
participant N1 as Node 1 (alive)
participant N2 as Node 2 (dead)
C->>N1: Write X = 10
N1-->>C: Write confirmed
Note over C,N2: N2 is down but N1 responds...
C->>N1: Read X
N1-->>C: X = 10
Note over C,N2: System stays available!
Example: A social media feed where showing slightly old content is acceptable.
3. Partition Tolerance (P)
The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.
Network partitions are inevitable in distributed systems. The system must handle them gracefully.
graph TB
subgraph "Network Partition"
N1["Node 1<br/>Can't reach N2, N3"]
N2["Node 2<br/>Can't reach N1"]
N3["Node 3<br/>Can't reach N1"]
end
N1 -.->|"🔴 Network Partition"| N2
N1 -.->|"🔴 Network Partition"| N3
N2 <--> N3
style N1 fill:#ffcdd2
style N2 fill:#c8e6c9
style N3 fill:#c8e6c9
Key Insight: In distributed systems, P is not optional—network partitions WILL happen.
The Trade-offs
Since partitions are inevitable in distributed systems, the real choice is between C and A during a partition:
stateDiagram-v2
[*] --> Normal
Normal --> Partitioned: Network Split
Partitioned --> CP: Choose Consistency
Partitioned --> AP: Choose Availability
CP --> Normal: Partition heals
AP --> Normal: Partition heals
note right of CP
Reject writes/reads
until data syncs
end note
note right of AP
Accept writes/reads
data may be stale
end note
CP: Consistency + Partition Tolerance
Sacrifice Availability
During a partition, the system returns errors or blocks until consistency can be guaranteed.
sequenceDiagram
participant C as Client
participant N1 as Node 1 (primary)
participant N2 as Node 2 (isolated)
Note over N1,N2: 🔴 Network Partition
C->>N1: Write X = 10
N1-->>C: ❌ Error: Cannot replicate
C->>N2: Read X
N2-->>C: ❌ Error: Data unavailable
Note over C,N2: System blocks rather<br/>than return stale data
Examples:
- MongoDB (with majority write concern)
- HBase
- Redis (with proper configuration)
- Traditional RDBMS with synchronous replication
Use when: Data accuracy is critical (financial systems, inventory)
AP: Availability + Partition Tolerance
Sacrifice Consistency
During a partition, the system accepts reads and writes, possibly returning stale data.
sequenceDiagram
participant C as Client
participant N1 as Node 1 (accepts writes)
participant N2 as Node 2 (has old data)
Note over N1,N2: 🔴 Network Partition
C->>N1: Write X = 10
N1-->>C: ✅ OK (written to N1 only)
C->>N2: Read X
N2-->>C: ✅ X = 5 (stale!)
Note over C,N2: System accepts requests<br/>but data is inconsistent
Examples:
- Cassandra
- DynamoDB
- CouchDB
- Riak
Use when: Always responding is more important than immediate consistency (social media, caching, analytics)
CA: Consistency + Availability
Only possible in single-node systems
Without network partitions (single node or perfectly reliable network), you can have both C and A.
graph TB
Single["Single Node Database"]
Client["Client"]
Client <--> Single
Note1[No network = No partitions]
Note1 --> Single
style Single fill:#fff9c4
Examples:
- Single-node PostgreSQL
- Single-node MongoDB
- Traditional RDBMS on one server
Reality: In distributed systems, CA is not achievable because networks are not perfectly reliable.
Real-World CAP Examples
| System | CAP Choice | Notes |
|---|---|---|
| Google Spanner | CP | External consistency, always consistent |
| Amazon DynamoDB | AP | Configurable consistency |
| Cassandra | AP | Always writable, tunable consistency |
| MongoDB | CP (default) | Configurable to AP |
| Redis Cluster | AP | Async replication |
| PostgreSQL | CA | Single-node mode |
| CockroachDB | CP | Serializability, handles partitions |
| Couchbase | AP | Cross Data Center Replication |
Consistency Models
The CAP theorem's "Consistency" is actually linearizability (strong consistency). There are many consistency models:
graph TB
subgraph "Consistency Spectrum"
Strong["Strong Consistency<br/>Linearizability"]
Weak["Weak Consistency<br/>Eventual Consistency"]
Strong --> S1["Sequential<br/>Consistency"]
S1 --> S2["Causal<br/>Consistency"]
S2 --> S3["Session<br/>Consistency"]
S3 --> S4["Read Your<br/>Writes"]
S4 --> Weak
end
Strong Consistency Models
| Model | Description | Example |
|---|---|---|
| Linearizable | Every read sees the most recent write | Bank transfers |
| Sequential | Operations appear in some order | Version control |
| Causal | Causally related operations ordered | Chat applications |
Weak Consistency Models
| Model | Description | Example |
|---|---|---|
| Read Your Writes | User sees their own writes | Social media profile |
| Session Consistency | Consistency within a session | Shopping cart |
| Eventual Consistency | System converges over time | DNS, CDN |
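To make one of these concrete: read-your-writes can be layered on async replication by pinning a user's reads to the primary for a short window after they write. A sketch (the 5-second window is an arbitrary stand-in for replication lag):

```typescript
// Sketch: read-your-writes via "pin reads to the primary after a recent write".
const lastWriteAt = new Map<string, number>(); // userId -> last write timestamp
const PIN_WINDOW_MS = 5_000; // assumption: replicas catch up within ~5s

function recordWrite(userId: string): void {
  lastWriteAt.set(userId, Date.now());
}

function chooseReplica(userId: string): 'primary' | 'replica' {
  const t = lastWriteAt.get(userId);
  return t !== undefined && Date.now() - t < PIN_WINDOW_MS ? 'primary' : 'replica';
}

recordWrite('u1');
console.log(chooseReplica('u1')); // 'primary' right after a write
console.log(chooseReplica('u2')); // 'replica' for everyone else
```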
Practical Example: Shopping Cart
Let's see how different CAP choices affect a shopping cart system:
CP Approach (Block on Partition)
sequenceDiagram
participant U as User
participant S as Service
Note over U,S: 🔴 Network partition detected
U->>S: Add item to cart
S-->>U: ❌ Error: Service unavailable
Note over U,S: User frustrated,<br/>but cart is always accurate
Trade-off: Lost sales, accurate cart
AP Approach (Accept Writes)
sequenceDiagram
participant U as User
participant S as Service
Note over U,S: 🔴 Network partition detected
U->>S: Add item to cart
S-->>U: ✅ OK (written locally)
Note over U,S: User happy,<br/>but cart might conflict
Trade-off: Happy users, possible merge conflicts later
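When the partition heals, the diverged carts have to be reconciled. Amazon's Dynamo shopping cart famously resolved this by merging the two versions; a minimal sketch, assuming a cart is a map from item to quantity:

```typescript
// Sketch: merge two divergent cart replicas by union, keeping the max quantity.
// Caveat: plain union can resurrect deleted items; real systems use vector
// clocks or CRDTs to merge removals correctly.
type Cart = Map<string, number>;

function mergeCarts(a: Cart, b: Cart): Cart {
  const merged: Cart = new Map(a);
  for (const [item, qty] of b) {
    merged.set(item, Math.max(merged.get(item) ?? 0, qty));
  }
  return merged;
}

const side1: Cart = new Map([['book', 1]]);
const side2: Cart = new Map([['book', 1], ['pen', 2]]); // written during the partition
console.log([...mergeCarts(side1, side2)]); // [ ['book', 1], ['pen', 2] ]
```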
The "2 of 3" Simplification
The CAP theorem is often misunderstood. The reality is more nuanced:
graph TB
subgraph "CAP Reality"
CAP["CAP Theorem"]
CAP --> Misconception["You must choose<br/>exactly 2"]
CAP --> Reality["You can have all 3<br/>in normal operation"]
CAP --> Truth["During partition,<br/>choose C or A"]
end
Key Insights:
- P is mandatory in distributed systems
- During normal operation, you can have C + A + P
- During a partition, you choose between C and A
- Many systems are configurable (e.g., DynamoDB)
Design Guidelines
Choose CP When:
- ✅ Financial transactions
- ✅ Inventory management
- ✅ Authentication/authorization
- ✅ Any system where stale data is unacceptable
Choose AP When:
- ✅ Social media feeds
- ✅ Product recommendations
- ✅ Analytics and logging
- ✅ Any system where availability is critical
Techniques to Balance C and A:
| Technique | Description | Example |
|---|---|---|
| Quorum reads/writes | Require majority acknowledgment | DynamoDB |
| Tunable consistency | Let client choose per operation | Cassandra |
| Graceful degradation | Switch modes during partition | Many systems |
| Conflict resolution | Merge divergent data later | CRDTs (sketched below) |
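To make the last row concrete, here is a minimal sketch of CRDT-style conflict resolution using a grow-only counter (G-counter). The class and names are illustrative, not from a specific library: each replica increments only its own slot, and merging divergent copies takes the per-slot maximum, so writes accepted on both sides of a partition converge once it heals.

```typescript
// Grow-only counter (G-counter) sketch - illustrative, not a production CRDT.
// Each replica increments only its own slot; merge takes the per-slot maximum,
// which is commutative, associative, and idempotent, so replicas converge.
class GCounter {
  private counts = new Map<string, number>();

  constructor(private nodeId: string) {}

  increment(by = 1): void {
    this.counts.set(this.nodeId, (this.counts.get(this.nodeId) ?? 0) + by);
  }

  value(): number {
    let total = 0;
    for (const n of this.counts.values()) total += n;
    return total;
  }

  // Merge state from another replica, e.g. after a partition heals.
  merge(other: GCounter): void {
    for (const [id, n] of other.counts) {
      this.counts.set(id, Math.max(this.counts.get(id) ?? 0, n));
    }
  }
}

// Both replicas accept writes during a partition, then reconcile:
const a = new GCounter('node-a');
const b = new GCounter('node-b');
a.increment(2);
b.increment(3);
a.merge(b);
console.log(a.value()); // 5 - no write lost, no coordination needed
```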
Summary
Key Takeaways
- CAP theorem: You can't have all three in a partition
- Partition tolerance is mandatory in distributed systems
- Real choice: Consistency vs Availability during partition
- Many systems offer tunable consistency levels
- Your use case determines the right trade-off
Check Your Understanding
- Why is partition tolerance not optional in distributed systems?
- Give an example where you would choose CP over AP
- What happens to an AP system during a network partition?
- How can quorum reads/writes help balance C and A?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What's Next
Now that we understand CAP trade-offs, let's build a simple key-value store: Store Basics
Store System Basics
Session 3, Part 3 - 35 minutes (coding demo + hands-on)
Learning Objectives
- Understand the key-value store data model
- Build a single-node key-value store in TypeScript
- Build the same store in Python
- Deploy and test the store using Docker Compose
- Perform basic read/write operations via HTTP
What is a Key-Value Store?
A key-value store is the simplest type of database:
graph LR
subgraph "Key-Value Store"
KV[("Data Store")]
K1["name"] --> V1[""Alice""]
K2["age"] --> V2["30"]
K3["city"] --> V3[""NYC""]
K4["active"] --> V4["true"]
K1 --> KV
K2 --> KV
K3 --> KV
K4 --> KV
end
Key Characteristics:
- Simple data model: key → value
- Fast lookups by key
- No complex queries
- Schema-less
Basic Operations
| Operation | Description | Example |
|---|---|---|
| SET | Store a value for a key | SET user:1 Alice |
| GET | Retrieve a value by key | GET user:1 → "Alice" |
| DELETE | Remove a key | DELETE user:1 |
stateDiagram-v2
[*] --> NotExists
NotExists --> Exists: SET key
Exists --> Exists: SET key (update)
Exists --> NotExists: DELETE key
Exists --> Exists: GET key (read)
NotExists --> [*]: GET key (null)
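The state diagram above maps directly onto a plain map. Here is a minimal sketch of the core data model, before we wrap it in HTTP:

```typescript
// The whole data model is a single in-memory map; each call below matches
// one transition in the state diagram.
const data = new Map<string, unknown>();

data.set('user:1', 'Alice');       // NotExists -> Exists
data.set('user:1', 'Alicia');      // Exists -> Exists (update)
console.log(data.get('user:1'));   // GET on existing key: "Alicia"
data.delete('user:1');             // Exists -> NotExists
console.log(data.get('user:1'));   // GET on missing key: undefined (-> 404)
```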
Implementation
We'll build a simple HTTP-based key-value store with REST API endpoints.
API Design
GET /key/{key} - Get value by key
PUT /key/{key} - Set value for key
DELETE /key/{key} - Delete key
GET /keys - List all keys
TypeScript Implementation
Project Structure
store-basics-ts/
├── package.json
├── tsconfig.json
├── Dockerfile
└── src/
└── store.ts # Complete store implementation
Complete TypeScript Code
store-basics-ts/src/store.ts
import http from 'http';
/**
* Simple in-memory key-value store
*/
class KeyValueStore {
private data: Map<string, any> = new Map();
/**
* Set a key-value pair
*/
set(key: string, value: any): void {
this.data.set(key, value);
console.log(`[Store] SET ${key} = ${JSON.stringify(value)}`);
}
/**
* Get a value by key
*/
get(key: string): any {
const value = this.data.get(key);
console.log(`[Store] GET ${key} => ${value !== undefined ? JSON.stringify(value) : 'null'}`);
return value;
}
/**
* Delete a key
*/
delete(key: string): boolean {
const existed = this.data.delete(key);
console.log(`[Store] DELETE ${key} => ${existed ? 'success' : 'not found'}`);
return existed;
}
/**
* Get all keys
*/
keys(): string[] {
return Array.from(this.data.keys());
}
/**
* Get store statistics
*/
stats() {
return {
totalKeys: this.data.size,
keys: this.keys()
};
}
}
// Create the store instance
const store = new KeyValueStore();
/**
* HTTP Server with key-value API
*/
const server = http.createServer((req, res) => {
// Enable CORS
res.setHeader('Access-Control-Allow-Origin', '*');
res.setHeader('Access-Control-Allow-Methods', 'GET, PUT, DELETE, OPTIONS');
res.setHeader('Access-Control-Allow-Headers', 'Content-Type');
if (req.method === 'OPTIONS') {
res.writeHead(200);
res.end();
return;
}
// Parse URL
const url = new URL(req.url || '', `http://${req.headers.host}`);
// Route: GET /keys - List all keys
if (req.method === 'GET' && url.pathname === '/keys') {
res.writeHead(200, { 'Content-Type': 'application/json' });
res.end(JSON.stringify(store.stats()));
return;
}
// Route: GET /key/{key} - Get value
if (req.method === 'GET' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5); // Remove '/key/'
const value = store.get(key);
if (value !== undefined) {
res.writeHead(200, { 'Content-Type': 'application/json' });
res.end(JSON.stringify({ key, value }));
} else {
res.writeHead(404, { 'Content-Type': 'application/json' });
res.end(JSON.stringify({ error: 'Key not found', key }));
}
return;
}
// Route: PUT /key/{key} - Set value
if (req.method === 'PUT' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5); // Remove '/key/'
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const value = JSON.parse(body);
store.set(key, value);
res.writeHead(200, { 'Content-Type': 'application/json' });
res.end(JSON.stringify({ success: true, key, value }));
} catch (error) {
res.writeHead(400, { 'Content-Type': 'application/json' });
res.end(JSON.stringify({ error: 'Invalid JSON' }));
}
});
return;
}
// Route: DELETE /key/{key} - Delete key
if (req.method === 'DELETE' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5); // Remove '/key/'
const existed = store.delete(key);
if (existed) {
res.writeHead(200, { 'Content-Type': 'application/json' });
res.end(JSON.stringify({ success: true, key }));
} else {
res.writeHead(404, { 'Content-Type': 'application/json' });
res.end(JSON.stringify({ error: 'Key not found', key }));
}
return;
}
// 404 - Not found
res.writeHead(404, { 'Content-Type': 'application/json' });
res.end(JSON.stringify({ error: 'Not found' }));
});
const PORT = process.env.PORT || 4000;
server.listen(PORT, () => {
console.log(`Key-Value Store listening on port ${PORT}`);
console.log(`\nAvailable endpoints:`);
console.log(` GET /key/{key} - Get value by key`);
console.log(` PUT /key/{key} - Set value for key`);
console.log(` DELETE /key/{key} - Delete key`);
console.log(` GET /keys - List all keys`);
});
store-basics-ts/package.json
{
"name": "store-basics-ts",
"version": "1.0.0",
"description": "Simple key-value store in TypeScript",
"main": "dist/store.js",
"scripts": {
"build": "tsc",
"start": "node dist/store.js",
"dev": "ts-node src/store.ts"
},
"dependencies": {},
"devDependencies": {
"@types/node": "^20.0.0",
"typescript": "^5.0.0",
"ts-node": "^10.9.0"
}
}
store-basics-ts/tsconfig.json
{
"compilerOptions": {
"target": "ES2020",
"module": "commonjs",
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true
},
"include": ["src/**/*"]
}
store-basics-ts/Dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 4000
CMD ["npm", "start"]
Python Implementation
Project Structure
store-basics-py/
├── requirements.txt
├── Dockerfile
└── src/
└── store.py # Complete store implementation
Complete Python Code
store-basics-py/src/store.py
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
from typing import Any, Dict
from urllib.parse import urlparse
class KeyValueStore:
"""Simple in-memory key-value store."""
def __init__(self):
self.data: Dict[str, Any] = {}
def set(self, key: str, value: Any) -> None:
"""Store a key-value pair."""
self.data[key] = value
print(f"[Store] SET {key} = {json.dumps(value)}")
def get(self, key: str) -> Any:
"""Get value by key."""
value = self.data.get(key)
print(f"[Store] GET {key} => {json.dumps(value) if value is not None else 'null'}")
return value
def delete(self, key: str) -> bool:
"""Delete a key."""
existed = key in self.data
if existed:
del self.data[key]
print(f"[Store] DELETE {key} => {'success' if existed else 'not found'}")
return existed
def keys(self) -> list:
"""Get all keys."""
return list(self.data.keys())
def stats(self) -> dict:
"""Get store statistics."""
return {
'totalKeys': len(self.data),
'keys': self.keys()
}
# Create the store instance
store = KeyValueStore()
class StoreHandler(BaseHTTPRequestHandler):
"""HTTP request handler for key-value store."""
def send_json_response(self, status: int, data: dict):
"""Send a JSON response."""
self.send_response(status)
self.send_header('Content-Type', 'application/json')
self.send_header('Access-Control-Allow-Origin', '*')
self.end_headers()
self.wfile.write(json.dumps(data).encode())
def do_OPTIONS(self):
"""Handle CORS preflight requests."""
self.send_response(200)
self.send_header('Access-Control-Allow-Origin', '*')
self.send_header('Access-Control-Allow-Methods', 'GET, PUT, DELETE, OPTIONS')
self.send_header('Access-Control-Allow-Headers', 'Content-Type')
self.end_headers()
def do_GET(self):
"""Handle GET requests."""
parsed = urlparse(self.path)
# GET /keys - List all keys
if parsed.path == '/keys':
self.send_json_response(200, store.stats())
return
# GET /key/{key} - Get value
if parsed.path.startswith('/key/'):
key = parsed.path[5:] # Remove '/key/'
value = store.get(key)
if value is not None:
self.send_json_response(200, {'key': key, 'value': value})
else:
self.send_json_response(404, {'error': 'Key not found', 'key': key})
return
# 404
self.send_json_response(404, {'error': 'Not found'})
def do_PUT(self):
"""Handle PUT requests (set value)."""
parsed = urlparse(self.path)
# PUT /key/{key} - Set value
if parsed.path.startswith('/key/'):
key = parsed.path[5:] # Remove '/key/'
content_length = int(self.headers.get('Content-Length', 0))
body = self.rfile.read(content_length).decode('utf-8')
try:
value = json.loads(body)
store.set(key, value)
self.send_json_response(200, {'success': True, 'key': key, 'value': value})
except json.JSONDecodeError:
self.send_json_response(400, {'error': 'Invalid JSON'})
return
# 404
self.send_json_response(404, {'error': 'Not found'})
def do_DELETE(self):
"""Handle DELETE requests."""
parsed = urlparse(self.path)
# DELETE /key/{key} - Delete key
if parsed.path.startswith('/key/'):
key = parsed.path[5:] # Remove '/key/'
existed = store.delete(key)
if existed:
self.send_json_response(200, {'success': True, 'key': key})
else:
self.send_json_response(404, {'error': 'Key not found', 'key': key})
return
# 404
self.send_json_response(404, {'error': 'Not found'})
def log_message(self, format, *args):
"""Suppress default logging."""
pass
def run_server(port: int = 4000):
"""Start the HTTP server."""
server_address = ('', port)
httpd = HTTPServer(server_address, StoreHandler)
print(f"Key-Value Store listening on port {port}")
print(f"\nAvailable endpoints:")
print(f" GET /key/{{key}} - Get value by key")
print(f" PUT /key/{{key}} - Set value for key")
print(f" DELETE /key/{{key}} - Delete key")
print(f" GET /keys - List all keys")
httpd.serve_forever()
if __name__ == '__main__':
import os
port = int(os.environ.get('PORT', 4000))
run_server(port)
store-basics-py/requirements.txt
# No external dependencies required - uses standard library only
store-basics-py/Dockerfile
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 4000
CMD ["python", "src/store.py"]
Docker Compose Setup
TypeScript Version
examples/02-store/ts/docker-compose.yml
version: '3.8'
services:
store:
build: .
ports:
- "4000:4000"
environment:
- PORT=4000
volumes:
- ./src:/app/src
Python Version
examples/02-store/py/docker-compose.yml
version: '3.8'
services:
store:
build: .
ports:
- "4000:4000"
environment:
- PORT=4000
volumes:
- ./src:/app/src
Running the Example
Step 1: Start the Store
TypeScript:
cd examples/02-store/ts
docker-compose up --build
Python:
cd examples/02-store/py
docker-compose up --build
You should see:
store | Key-Value Store listening on port 4000
store |
store | Available endpoints:
store | GET /key/{key} - Get value by key
store | PUT /key/{key} - Set value for key
store | DELETE /key/{key} - Delete key
store | GET /keys - List all keys
Step 2: Store Some Values
# Store a string
curl -X PUT http://localhost:4000/key/name \
-H "Content-Type: application/json" \
-d '"Alice"'
# Store a number
curl -X PUT http://localhost:4000/key/age \
-H "Content-Type: application/json" \
-d '30'
# Store an object
curl -X PUT http://localhost:4000/key/user:1 \
-H "Content-Type: application/json" \
-d '{"name": "Alice", "age": 30, "city": "NYC"}'
# Store a list
curl -X PUT http://localhost:4000/key/tags \
-H "Content-Type: application/json" \
-d '["distributed", "systems", "course"]'
Step 3: Retrieve Values
# Get a string
curl http://localhost:4000/key/name
# Response: {"key":"name","value":"Alice"}
# Get a number
curl http://localhost:4000/key/age
# Response: {"key":"age","value":30}
# Get an object
curl http://localhost:4000/key/user:1
# Response: {"key":"user:1","value":{"name":"Alice","age":30,"city":"NYC"}}
# Get a list
curl http://localhost:4000/key/tags
# Response: {"key":"tags","value":["distributed","systems","course"]}
# Try to get non-existent key
curl http://localhost:4000/key/nonexistent
# Response: {"error":"Key not found","key":"nonexistent"}
Step 4: List All Keys
curl http://localhost:4000/keys
# Response: {"totalKeys":4,"keys":["name","age","user:1","tags"]}
Step 5: Delete a Key
# Delete a key
curl -X DELETE http://localhost:4000/key/age
# Response: {"success":true,"key":"age"}
# Verify it's gone
curl http://localhost:4000/key/age
# Response: {"error":"Key not found","key":"age"}
# Check remaining keys
curl http://localhost:4000/keys
# Response: {"totalKeys":3,"keys":["name","user:1","tags"]}
System Architecture
graph TB
subgraph "Single-Node Key-Value Store"
Client["Client Applications"]
API["HTTP API"]
Store[("In-Memory<br/>Data Store")]
Client -->|"GET/PUT/DELETE"| API
API --> Store
end
style Store fill:#f9f,stroke:#333,stroke-width:3px
Exercises
Exercise 1: Add TTL (Time-To-Live) Support
Modify the store to automatically expire keys after a specified time:
- Add an optional `ttl` parameter to the SET operation
- Track when each key should expire
- Return null for expired keys
- Implement a cleanup mechanism
Hint: Store metadata alongside values, or use a separate expiration map; one possible shape is sketched below.
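A sketch of the expiration-map approach (one possible starting point, not the only solution; `setWithTtl` and `getWithTtl` are illustrative names):

```typescript
// Sketch: separate expiration map, checked lazily on every read.
const data = new Map<string, unknown>();
const expiresAt = new Map<string, number>(); // key -> expiry time (epoch ms)

function setWithTtl(key: string, value: unknown, ttlMs?: number): void {
  data.set(key, value);
  if (ttlMs !== undefined) expiresAt.set(key, Date.now() + ttlMs);
  else expiresAt.delete(key); // a plain SET clears any previous TTL
}

function getWithTtl(key: string): unknown {
  const deadline = expiresAt.get(key);
  if (deadline !== undefined && Date.now() >= deadline) {
    data.delete(key); // lazy cleanup: expired keys vanish on first read
    expiresAt.delete(key);
    return undefined; // the HTTP layer would report this as not found
  }
  return data.get(key);
}
```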
Exercise 2: Add Key Patterns
Add wildcard support for key lookups:
- Implement `GET /keys?pattern=user:*` to list matching keys
- Support simple `*` wildcard matching
- Test with patterns like `user:*`, `*:admin`, etc. (a minimal matcher is sketched below)
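One minimal way to do this is compiling the `*` pattern into a regular expression (a sketch; the `matchesPattern` helper is illustrative):

```typescript
// Sketch: compile a '*' pattern into a RegExp.
function matchesPattern(key: string, pattern: string): boolean {
  // Escape regex metacharacters except '*', then turn '*' into '.*'.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  const regex = new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
  return regex.test(key);
}

console.log(matchesPattern('user:1', 'user:*'));      // true
console.log(matchesPattern('root:admin', '*:admin')); // true
console.log(matchesPattern('user:1', '*:admin'));     // false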
Exercise 3: Add Data Persistence
Currently data is lost when the server restarts. Add persistence:
- Save data to a JSON file on every write (a naive sketch follows below)
- Load data from file on startup
- Handle concurrent writes safely
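A naive but workable sketch, assuming a writable path such as the illustrative `/app/data/store.json`: rewrite the whole file on every write and reload it on startup. It does not yet handle concurrent writes; that part is left to you.

```typescript
import fs from 'fs';

// Naive persistence sketch: rewrite the whole file on each write.
// DB_FILE is an illustrative path; pick one your container can write to.
const DB_FILE = '/app/data/store.json';

function saveToDisk(data: Map<string, unknown>): void {
  // Write to a temp file first, then rename, so a crash mid-write
  // never leaves a half-written store.json behind.
  const tmp = DB_FILE + '.tmp';
  fs.writeFileSync(tmp, JSON.stringify(Object.fromEntries(data)));
  fs.renameSync(tmp, DB_FILE);
}

function loadFromDisk(): Map<string, unknown> {
  if (!fs.existsSync(DB_FILE)) return new Map();
  const obj = JSON.parse(fs.readFileSync(DB_FILE, 'utf-8'));
  return new Map(Object.entries(obj));
}
```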
Summary
Key Takeaways
- Key-value stores are simple but powerful data storage systems
- Basic operations: SET, GET, DELETE
- HTTP API provides a simple interface for remote access
- Single-node stores are CA (Consistent + Available) from a CAP perspective
- Next steps: Add replication for fault tolerance (Session 4)
Check Your Understanding
- What are the four basic operations we implemented?
- How does our store handle requests for non-existent keys?
- What happens to the data when the Docker container stops?
- Why is this single-node store "CA" in CAP terms?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What's Next
Our simple store works, but what happens when the node fails? Let's add replication: Replication (Session 4)
Replication and Leader Election
Session 4 - Full session
Learning Objectives
- Understand why we replicate data
- Learn single-leader vs multi-leader replication
- Implement leader-based replication
- Build a simple leader election mechanism
- Deploy a 3-node replicated store
Why Replicate Data?
In our single-node store from Session 3, what happens when the node fails?
Answer: All data is lost and the system becomes unavailable.
graph LR
subgraph "Single Node - No Fault Tolerance"
C[Clients] --> N1["Node 1<br/>❌ FAILED"]
style N1 fill:#f66,stroke:#333,stroke-width:3px
end
Replication solves this by keeping copies of data on multiple nodes:
graph TB
subgraph "Replicated Store - Fault Tolerant"
C[Clients]
L[Leader<br/>Node 1]
F1[Follower<br/>Node 2]
F2[Follower<br/>Node 3]
C --> L
L -->|"replicate"| F1
L -->|"replicate"| F2
end
style L fill:#6f6,stroke:#333,stroke-width:3px
Benefits of Replication:
- Fault tolerance: If one node fails, others have the data
- Read scaling: Clients can read from any replica
- Low latency: Place replicas closer to users
- High availability: System continues during node failures
Replication Strategies
Single-Leader Replication
Also called: primary-replica, master-slave, active-passive
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
Note over C,F2: Write Operation
C->>L: PUT /key/name "Alice"
L->>L: Write to local storage
L->>F1: Replicate: SET name = "Alice"
L->>F2: Replicate: SET name = "Alice"
F1->>L: ACK
F2->>L: ACK
L->>C: Response: Success
Note over C,F2: Read Operation
C->>L: GET /key/name
L->>C: Response: "Alice"
Note over C,F2: Or read from follower
C->>F1: GET /key/name
F1->>C: Response: "Alice"
Characteristics:
- Leader handles all writes
- Followers replicate from leader
- Reads can go to leader or followers
- Simple consistency model
Multi-Leader Replication
Also called: multi-master, active-active
graph TB
subgraph "Multi-Leader Replication"
C1[Client 1]
C2[Client 2]
L1[Leader 1<br/>Datacenter A]
L2[Leader 2<br/>Datacenter B]
F1[Follower 1]
F2[Follower 2]
C1 --> L1
C2 --> L2
L1 <-->|"resolve conflicts"| L2
L1 --> F1
L2 --> F2
end
style L1 fill:#6f6,stroke:#333,stroke-width:3px
style L2 fill:#6f6,stroke:#333,stroke-width:3px
Characteristics:
- Multiple nodes accept writes
- More complex conflict resolution
- Better for geo-distributed setups
- We won't implement this (advanced topic)
Synchronous vs Asynchronous Replication
sequenceDiagram
participant C as Client
participant L as Leader
participant F as Follower
par Synchronous Replication
L->>F: Replicate write
F->>L: ACK (must wait)
L->>C: Success (after replicas confirm)
and Asynchronous Replication
L->>C: Success (immediately)
L--xF: Replicate in background
end
| Strategy | Pros | Cons |
|---|---|---|
| Synchronous | Strong consistency, no data loss | Slower writes, blocking |
| Asynchronous | Fast writes, non-blocking | Data loss on leader failure, stale reads |
For this course, we'll use asynchronous replication for simplicity.
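The difference is easy to see in code. In this illustrative sketch, `replicateTo` stands in for the HTTP call to one follower: the synchronous path awaits every acknowledgment before reporting success, while the asynchronous path reports success immediately and replicates in the background.

```typescript
// Illustrative sketch; replicateTo() stands in for the HTTP call to a follower.
async function replicateTo(peer: string, key: string, value: unknown): Promise<void> {
  await new Promise(res => setTimeout(res, 50)); // pretend network round-trip
  console.log(`replicated ${key} to ${peer}`);
}

// Synchronous: wait for every follower's ACK before reporting success.
// No acknowledged write can be lost, but the client waits for the slowest node.
async function setSync(peers: string[], key: string, value: unknown): Promise<void> {
  await Promise.all(peers.map(p => replicateTo(p, key, value)));
}

// Asynchronous: report success immediately, replicate in the background.
// Fast, but a leader crash before replication finishes loses the write.
function setAsync(peers: string[], key: string, value: unknown): void {
  for (const p of peers) {
    replicateTo(p, key, value).catch(() => { /* retry or log (sketch) */ });
  }
}
```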
Leader Election
When the leader fails, followers must elect a new leader:
stateDiagram-v2
[*] --> Follower: Node starts
Follower --> Candidate: No heartbeat from leader
Candidate --> Leader: Wins election (majority votes)
Candidate --> Follower: Loses election
Leader --> Follower: Detects higher term/node
Follower --> [*]: Node stops
The Bully Algorithm
A simple leader election algorithm:
- Detect leader failure: No heartbeat received within the timeout period
- Start election: The node that notices sends an election message to every node with a higher ID
- Take over: Any higher-ID node that responds runs its own election; the highest-ID node still alive wins
- Announce: The winner declares itself leader to all other nodes
sequenceDiagram
participant N1 as Node 1<br/>(Leader)
participant N2 as Node 2
participant N3 as Node 3
Note over N1,N3: Normal Operation
N1->>N2: Heartbeat
N1->>N3: Heartbeat
Note over N1,N3: Leader Fails
N1--xN2: Heartbeat timeout!
N1--xN3: Heartbeat timeout!
Note over N2,N3: Election Starts
N2->>N3: Election (any higher ID alive?)
N3->>N2: Alive (3 > 2), I take over
Note over N2,N3: N3 Becomes Leader
N3->>N2: I am the leader
N3->>N2: Heartbeat
For this course, we'll use a much simpler approach:
- Lowest node ID becomes leader
- If leader fails, next lowest becomes leader
- No voting, just order-based selection
Implementation
TypeScript Implementation
Project Structure:
replicated-store-ts/
├── package.json
├── tsconfig.json
├── Dockerfile
├── docker-compose.yml
└── src/
└── node.ts # Replicated node with leader election
replicated-store-ts/src/node.ts
import http from 'http';
/**
* Node configuration
*/
const config = {
nodeId: process.env.NODE_ID || 'node-1',
port: parseInt(process.env.PORT || '4000'),
peers: (process.env.PEERS || '').split(',').filter(Boolean),
heartbeatInterval: 2000, // ms
electionTimeout: 6000, // ms
};
type NodeRole = 'leader' | 'follower' | 'candidate';
/**
* Replicated Store Node
*/
class StoreNode {
public nodeId: string;
public role: NodeRole;
public term: number;
public data: Map<string, any>;
public peers: string[];
private leaderId: string | null;
private lastHeartbeat: number;
private heartbeatTimer?: NodeJS.Timeout;
private electionTimer?: NodeJS.Timeout;
constructor(nodeId: string, peers: string[]) {
this.nodeId = nodeId;
this.role = 'follower';
this.term = 0;
this.data = new Map();
this.peers = peers;
this.leaderId = null;
this.lastHeartbeat = Date.now();
this.startElectionTimer();
this.startHeartbeat();
}
/**
* Start election timeout timer
*/
private startElectionTimer() {
this.electionTimer = setTimeout(() => {
const timeSinceHeartbeat = Date.now() - this.lastHeartbeat;
if (timeSinceHeartbeat > config.electionTimeout && this.role !== 'leader') {
console.log(`[${this.nodeId}] Election timeout! Starting election...`);
this.startElection();
}
this.startElectionTimer();
}, config.electionTimeout);
}
/**
* Start leader election (simplified: lowest ID wins)
*/
private startElection() {
this.term++;
this.role = 'candidate';
// Simple strategy: lowest node ID becomes leader
const allNodes = [this.nodeId, ...this.peers].sort();
const lowestNode = allNodes[0];
if (this.nodeId === lowestNode) {
this.becomeLeader();
} else {
this.role = 'follower';
this.leaderId = lowestNode;
console.log(`[${this.nodeId}] Waiting for ${lowestNode} to become leader`);
}
}
/**
* Become the leader
*/
private becomeLeader() {
this.role = 'leader';
this.leaderId = this.nodeId;
console.log(`[${this.nodeId}] 👑 Became LEADER for term ${this.term}`);
// Immediately replicate to followers
this.replicateToFollowers();
}
/**
* Start heartbeat to followers
*/
private startHeartbeat() {
this.heartbeatTimer = setInterval(() => {
if (this.role === 'leader') {
this.sendHeartbeat();
}
}, config.heartbeatInterval);
}
/**
* Send heartbeat to all followers
*/
private sendHeartbeat() {
const heartbeat = {
type: 'heartbeat',
leaderId: this.nodeId,
term: this.term,
timestamp: Date.now(),
};
this.peers.forEach(peerUrl => {
this.sendToPeer(peerUrl, '/internal/heartbeat', heartbeat)
.catch(err => console.log(`[${this.nodeId}] Failed to send heartbeat to ${peerUrl}:`, err.message));
});
}
/**
* Replicate data to all followers
*/
private replicateToFollowers() {
// Convert Map to object for replication
const dataObj = Object.fromEntries(this.data);
this.peers.forEach(peerUrl => {
this.sendToPeer(peerUrl, '/internal/replicate', {
type: 'replicate',
leaderId: this.nodeId,
term: this.term,
data: dataObj,
}).catch(err => console.log(`[${this.nodeId}] Replication failed to ${peerUrl}:`, err.message));
});
}
/**
* Handle heartbeat from leader
*/
handleHeartbeat(heartbeat: any) {
if (heartbeat.term >= this.term) {
this.term = heartbeat.term;
this.lastHeartbeat = Date.now();
this.leaderId = heartbeat.leaderId;
if (this.role !== 'follower') {
console.log(`[${this.nodeId}] Stepping down to follower, term ${this.term}`);
}
this.role = 'follower';
}
}
/**
* Handle replication from leader
*/
handleReplication(message: any) {
if (message.term >= this.term) {
this.term = message.term;
this.leaderId = message.leaderId;
this.role = 'follower';
this.lastHeartbeat = Date.now();
// Merge replicated data
Object.entries(message.data).forEach(([key, value]) => {
this.data.set(key, value);
});
console.log(`[${this.nodeId}] Replicated ${Object.keys(message.data).length} keys from leader`);
}
}
/**
* Send data to peer node
*/
private async sendToPeer(peerUrl: string, path: string, data: any): Promise<void> {
return new Promise((resolve, reject) => {
const url = new URL(path, peerUrl);
const options = {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
};
const req = http.request(url, options, (res) => {
if (res.statusCode === 200) {
resolve();
} else {
reject(new Error(`Status ${res.statusCode}`));
}
});
req.on('error', reject);
req.write(JSON.stringify(data));
req.end();
});
}
/**
* Set a key-value pair (only on leader)
*/
set(key: string, value: any): boolean {
if (this.role !== 'leader') {
return false;
}
this.data.set(key, value);
console.log(`[${this.nodeId}] SET ${key} = ${JSON.stringify(value)}`);
// Replicate to followers
this.replicateToFollowers();
return true;
}
/**
* Get a value by key
*/
get(key: string): any {
const value = this.data.get(key);
console.log(`[${this.nodeId}] GET ${key} => ${value !== undefined ? JSON.stringify(value) : 'null'}`);
return value;
}
/**
* Delete a key
*/
delete(key: string): boolean {
if (this.role !== 'leader') {
return false;
}
const existed = this.data.delete(key);
console.log(`[${this.nodeId}] DELETE ${key} => ${existed ? 'success' : 'not found'}`);
// Replicate to followers
this.replicateToFollowers();
return existed;
}
/**
* Get node status
*/
getStatus() {
return {
nodeId: this.nodeId,
role: this.role,
term: this.term,
leaderId: this.leaderId,
totalKeys: this.data.size,
keys: Array.from(this.data.keys()),
};
}
}
// Create the node
const node = new StoreNode(config.nodeId, config.peers);
/**
* HTTP Server
*/
const server = http.createServer((req, res) => {
res.setHeader('Content-Type', 'application/json');
res.setHeader('Access-Control-Allow-Origin', '*');
res.setHeader('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE, OPTIONS');
res.setHeader('Access-Control-Allow-Headers', 'Content-Type');
if (req.method === 'OPTIONS') {
res.writeHead(200);
res.end();
return;
}
const url = new URL(req.url || '', `http://${req.headers.host}`);
// Route: POST /internal/heartbeat - Leader heartbeat
if (req.method === 'POST' && url.pathname === '/internal/heartbeat') {
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const heartbeat = JSON.parse(body);
node.handleHeartbeat(heartbeat);
res.writeHead(200);
res.end(JSON.stringify({ success: true }));
} catch (error) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'Invalid request' }));
}
});
return;
}
// Route: POST /internal/replicate - Replication from leader
if (req.method === 'POST' && url.pathname === '/internal/replicate') {
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const message = JSON.parse(body);
node.handleReplication(message);
res.writeHead(200);
res.end(JSON.stringify({ success: true }));
} catch (error) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'Invalid request' }));
}
});
return;
}
// Route: GET /status - Node status
if (req.method === 'GET' && url.pathname === '/status') {
res.writeHead(200);
res.end(JSON.stringify(node.getStatus()));
return;
}
// Route: GET /key/{key} - Get value
if (req.method === 'GET' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5);
const value = node.get(key);
if (value !== undefined) {
res.writeHead(200);
res.end(JSON.stringify({ key, value, nodeRole: node.role }));
} else {
res.writeHead(404);
res.end(JSON.stringify({ error: 'Key not found', key }));
}
return;
}
// Route: PUT /key/{key} - Set value (leader only)
if (req.method === 'PUT' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5);
if (node.role !== 'leader') {
res.writeHead(503);
res.end(JSON.stringify({
error: 'Not the leader',
currentRole: node.role,
leaderId: node.leaderId || 'Unknown',
}));
return;
}
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const value = JSON.parse(body);
node.set(key, value);
res.writeHead(200);
res.end(JSON.stringify({ success: true, key, value, leaderId: node.nodeId }));
} catch (error) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'Invalid JSON' }));
}
});
return;
}
// Route: DELETE /key/{key} - Delete key (leader only)
if (req.method === 'DELETE' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5);
if (node.role !== 'leader') {
res.writeHead(503);
res.end(JSON.stringify({
error: 'Not the leader',
currentRole: node.role,
leaderId: node.leaderId || 'Unknown',
}));
return;
}
const existed = node.delete(key);
if (existed) {
res.writeHead(200);
res.end(JSON.stringify({ success: true, key, leaderId: node.nodeId }));
} else {
res.writeHead(404);
res.end(JSON.stringify({ error: 'Key not found', key }));
}
return;
}
// 404
res.writeHead(404);
res.end(JSON.stringify({ error: 'Not found' }));
});
server.listen(config.port, () => {
console.log(`[${config.nodeId}] Store Node listening on port ${config.port}`);
console.log(`[${config.nodeId}] Peers: ${config.peers.join(', ') || 'none'}`);
console.log(`[${config.nodeId}] Available endpoints:`);
console.log(` GET /status - Node status and role`);
console.log(` GET /key/{key} - Get value`);
console.log(` PUT /key/{key} - Set value (leader only)`);
console.log(`  DELETE /key/{key} - Delete key (leader only)`);
});
replicated-store-ts/package.json
{
"name": "replicated-store-ts",
"version": "1.0.0",
"description": "Replicated key-value store with leader election in TypeScript",
"main": "dist/node.js",
"scripts": {
"build": "tsc",
"start": "node dist/node.js",
"dev": "ts-node src/node.ts"
},
"dependencies": {},
"devDependencies": {
"@types/node": "^20.0.0",
"typescript": "^5.0.0",
"ts-node": "^10.9.0"
}
}
replicated-store-ts/tsconfig.json
{
"compilerOptions": {
"target": "ES2020",
"module": "commonjs",
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true
},
"include": ["src/**/*"]
}
replicated-store-ts/Dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 4000
CMD ["npm", "start"]
Python Implementation
replicated-store-py/src/node.py
import os
import json
import time
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler
from typing import Any, Dict, List, Optional
from urllib.parse import urlparse
from urllib.request import Request, urlopen
class StoreNode:
"""Replicated store node with leader election."""
def __init__(self, node_id: str, peers: List[str]):
self.node_id = node_id
self.role: str = 'follower' # leader, follower, candidate
self.term = 0
self.data: Dict[str, Any] = {}
self.peers = peers
self.leader_id: Optional[str] = None
self.last_heartbeat = time.time()
# Configuration
self.heartbeat_interval = 2.0 # seconds
self.election_timeout = 6.0 # seconds
# Start election timer
self.start_election_timer()
# Start heartbeat thread
self.start_heartbeat_thread()
def start_election_timer(self):
"""Start election timeout timer."""
def election_timer():
while True:
time.sleep(1)
time_since = time.time() - self.last_heartbeat
if time_since > self.election_timeout and self.role != 'leader':
print(f"[{self.node_id}] Election timeout! Starting election...")
self.start_election()
thread = threading.Thread(target=election_timer, daemon=True)
thread.start()
def start_election(self):
"""Start leader election (simplest: lowest ID wins)."""
self.term += 1
self.role = 'candidate'
# Simple strategy: lowest node ID becomes leader
all_nodes = sorted([self.node_id] + self.peers)
lowest_node = all_nodes[0]
if self.node_id == lowest_node:
self.become_leader()
else:
self.role = 'follower'
self.leader_id = lowest_node
print(f"[{self.node_id}] Waiting for {lowest_node} to become leader")
def become_leader(self):
"""Become the leader."""
self.role = 'leader'
self.leader_id = self.node_id
print(f"[{self.node_id}] 👑 Became LEADER for term {self.term}")
# Immediately replicate to followers
self.replicate_to_followers()
def start_heartbeat_thread(self):
"""Start heartbeat to followers."""
def heartbeat_loop():
while True:
time.sleep(self.heartbeat_interval)
if self.role == 'leader':
self.send_heartbeat()
thread = threading.Thread(target=heartbeat_loop, daemon=True)
thread.start()
def send_heartbeat(self):
"""Send heartbeat to all followers."""
heartbeat = {
'type': 'heartbeat',
'leader_id': self.node_id,
'term': self.term,
'timestamp': int(time.time() * 1000),
}
for peer in self.peers:
try:
self.send_to_peer(peer, '/internal/heartbeat', heartbeat)
except Exception as e:
print(f"[{self.node_id}] Failed to send heartbeat to {peer}: {e}")
def replicate_to_followers(self):
"""Replicate data to all followers."""
message = {
'type': 'replicate',
'leader_id': self.node_id,
'term': self.term,
'data': self.data,
}
for peer in self.peers:
try:
self.send_to_peer(peer, '/internal/replicate', message)
except Exception as e:
print(f"[{self.node_id}] Replication failed to {peer}: {e}")
def handle_heartbeat(self, heartbeat: dict):
"""Handle heartbeat from leader."""
if heartbeat['term'] >= self.term:
self.term = heartbeat['term']
self.last_heartbeat = time.time()
self.leader_id = heartbeat['leader_id']
if self.role != 'follower':
print(f"[{self.node_id}] Stepping down to follower, term {self.term}")
self.role = 'follower'
def handle_replication(self, message: dict):
"""Handle replication from leader."""
if message['term'] >= self.term:
self.term = message['term']
self.leader_id = message['leader_id']
self.role = 'follower'
self.last_heartbeat = time.time()
# Merge replicated data
self.data.update(message['data'])
print(f"[{self.node_id}] Replicated {len(message['data'])} keys from leader")
def send_to_peer(self, peer_url: str, path: str, data: dict) -> None:
"""Send data to peer node."""
url = f"{peer_url}{path}"
body = json.dumps(data).encode('utf-8')
req = Request(url, data=body, headers={'Content-Type': 'application/json'}, method='POST')
with urlopen(req, timeout=1) as response:
if response.status != 200:
raise Exception(f"Status {response.status}")
def set(self, key: str, value: Any) -> bool:
"""Set a key-value pair (only on leader)."""
if self.role != 'leader':
return False
self.data[key] = value
print(f"[{self.node_id}] SET {key} = {json.dumps(value)}")
# Replicate to followers
self.replicate_to_followers()
return True
def get(self, key: str) -> Any:
"""Get a value by key."""
value = self.data.get(key)
print(f"[{self.node_id}] GET {key} => {json.dumps(value) if value is not None else 'null'}")
return value
def delete(self, key: str) -> bool:
"""Delete a key (only on leader)."""
if self.role != 'leader':
return False
existed = key in self.data
if existed:
del self.data[key]
print(f"[{self.node_id}] DELETE {key} => {'success' if existed else 'not found'}")
# Replicate to followers
self.replicate_to_followers()
return existed
def get_status(self) -> dict:
"""Get node status."""
return {
'node_id': self.node_id,
'role': self.role,
'term': self.term,
'leader_id': self.leader_id,
'total_keys': len(self.data),
'keys': list(self.data.keys()),
}
# Create the node
config = {
'node_id': os.environ.get('NODE_ID', 'node-1'),
'port': int(os.environ.get('PORT', '4000')),
'peers': [p for p in os.environ.get('PEERS', '').split(',') if p],
}
node = StoreNode(config['node_id'], config['peers'])
class NodeHandler(BaseHTTPRequestHandler):
"""HTTP request handler for store node."""
def send_json_response(self, status: int, data: dict):
"""Send a JSON response."""
self.send_response(status)
self.send_header('Content-Type', 'application/json')
self.send_header('Access-Control-Allow-Origin', '*')
self.end_headers()
self.wfile.write(json.dumps(data).encode())
def do_OPTIONS(self):
"""Handle CORS preflight."""
self.send_response(200)
self.send_header('Access-Control-Allow-Origin', '*')
self.send_header('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE, OPTIONS')
self.send_header('Access-Control-Allow-Headers', 'Content-Type')
self.end_headers()
def do_POST(self):
"""Handle POST requests."""
parsed = urlparse(self.path)
# POST /internal/heartbeat
if parsed.path == '/internal/heartbeat':
content_length = int(self.headers.get('Content-Length', 0))
body = self.rfile.read(content_length).decode('utf-8')
try:
heartbeat = json.loads(body)
node.handle_heartbeat(heartbeat)
self.send_json_response(200, {'success': True})
except (json.JSONDecodeError, KeyError):
self.send_json_response(400, {'error': 'Invalid request'})
return
# POST /internal/replicate
if parsed.path == '/internal/replicate':
content_length = int(self.headers.get('Content-Length', 0))
body = self.rfile.read(content_length).decode('utf-8')
try:
message = json.loads(body)
node.handle_replication(message)
self.send_json_response(200, {'success': True})
except (json.JSONDecodeError, KeyError):
self.send_json_response(400, {'error': 'Invalid request'})
return
self.send_json_response(404, {'error': 'Not found'})
def do_GET(self):
"""Handle GET requests."""
parsed = urlparse(self.path)
# GET /status
if parsed.path == '/status':
self.send_json_response(200, node.get_status())
return
# GET /key/{key}
if parsed.path.startswith('/key/'):
key = parsed.path[5:] # Remove '/key/'
value = node.get(key)
if value is not None:
self.send_json_response(200, {'key': key, 'value': value, 'node_role': node.role})
else:
self.send_json_response(404, {'error': 'Key not found', 'key': key})
return
self.send_json_response(404, {'error': 'Not found'})
def do_PUT(self):
"""Handle PUT requests (set value)."""
parsed = urlparse(self.path)
# PUT /key/{key}
if parsed.path.startswith('/key/'):
key = parsed.path[5:]
if node.role != 'leader':
self.send_json_response(503, {
'error': 'Not the leader',
'current_role': node.role,
'leader_id': node.leader_id or 'Unknown',
})
return
content_length = int(self.headers.get('Content-Length', 0))
body = self.rfile.read(content_length).decode('utf-8')
try:
value = json.loads(body)
node.set(key, value)
self.send_json_response(200, {'success': True, 'key': key, 'value': value, 'leader_id': node.node_id})
except json.JSONDecodeError:
self.send_json_response(400, {'error': 'Invalid JSON'})
return
self.send_json_response(404, {'error': 'Not found'})
def do_DELETE(self):
"""Handle DELETE requests."""
parsed = urlparse(self.path)
# DELETE /key/{key}
if parsed.path.startswith('/key/'):
key = parsed.path[5:]
if node.role != 'leader':
self.send_json_response(503, {
'error': 'Not the leader',
'current_role': node.role,
'leader_id': node.leader_id or 'Unknown',
})
return
existed = node.delete(key)
if existed:
self.send_json_response(200, {'success': True, 'key': key, 'leader_id': node.node_id})
else:
self.send_json_response(404, {'error': 'Key not found', 'key': key})
return
self.send_json_response(404, {'error': 'Not found'})
def log_message(self, format, *args):
"""Suppress default logging."""
pass
def run_server(port: int):
"""Start the HTTP server."""
server_address = ('', port)
httpd = HTTPServer(server_address, NodeHandler)
print(f"[{config['node_id']}] Store Node listening on port {port}")
print(f"[{config['node_id']}] Peers: {', '.join(config['peers']) or 'none'}")
print(f"[{config['node_id']}] Available endpoints:")
print(f" GET /status - Node status and role")
print(f" GET /key/{{key}} - Get value")
print(f" PUT /key/{{key}} - Set value (leader only)")
print(f"  DELETE /key/{{key}} - Delete key (leader only)")
httpd.serve_forever()
if __name__ == '__main__':
run_server(config['port'])
replicated-store-py/requirements.txt
# No external dependencies - uses standard library only
replicated-store-py/Dockerfile
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 4000
CMD ["python", "src/node.py"]
Docker Compose Setup
TypeScript Version
examples/02-store/ts/docker-compose.yml
version: '3.8'
services:
node1:
build: .
container_name: store-ts-node1
ports:
- "4001:4000"
environment:
- NODE_ID=node-1
- PORT=4000
- PEERS=http://node2:4000,http://node3:4000
networks:
- store-network
node2:
build: .
container_name: store-ts-node2
ports:
- "4002:4000"
environment:
- NODE_ID=node-2
- PORT=4000
- PEERS=http://node1:4000,http://node3:4000
networks:
- store-network
node3:
build: .
container_name: store-ts-node3
ports:
- "4003:4000"
environment:
- NODE_ID=node-3
- PORT=4000
- PEERS=http://node1:4000,http://node2:4000
networks:
- store-network
networks:
store-network:
driver: bridge
Python Version
examples/02-store/py/docker-compose.yml
version: '3.8'
services:
node1:
build: .
container_name: store-py-node1
ports:
- "4001:4000"
environment:
- NODE_ID=node-1
- PORT=4000
- PEERS=http://node2:4000,http://node3:4000
networks:
- store-network
node2:
build: .
container_name: store-py-node2
ports:
- "4002:4000"
environment:
- NODE_ID=node-2
- PORT=4000
- PEERS=http://node1:4000,http://node3:4000
networks:
- store-network
node3:
build: .
container_name: store-py-node3
ports:
- "4003:4000"
environment:
- NODE_ID=node-3
- PORT=4000
- PEERS=http://node1:4000,http://node2:4000
networks:
- store-network
networks:
store-network:
driver: bridge
Running the Example
Step 1: Start the 3-Node Cluster
TypeScript:
cd distributed-systems-course/examples/02-store/ts
docker-compose up --build
Python:
cd distributed-systems-course/examples/02-store/py
docker-compose up --build
You should see leader election happen automatically:
store-ts-node1 | [node-1] Store Node listening on port 4000
store-ts-node2 | [node-2] Store Node listening on port 4000
store-ts-node3 | [node-3] Store Node listening on port 4000
store-ts-node1 | [node-1] 👑 Became LEADER for term 1
store-ts-node2 | [node-2] Waiting for node-1 to become leader
store-ts-node3 | [node-3] Waiting for node-1 to become leader
Step 2: Check Node Status
# Check all nodes
curl http://localhost:4001/status
curl http://localhost:4002/status
curl http://localhost:4003/status
Response from node-1 (leader):
{
"nodeId": "node-1",
"role": "leader",
"term": 1,
"leaderId": "node-1",
"totalKeys": 0,
"keys": []
}
Response from node-2 (follower):
{
"nodeId": "node-2",
"role": "follower",
"term": 1,
"leaderId": "node-1",
"totalKeys": 0,
"keys": []
}
Step 3: Write to Leader
# Write to leader (node-1)
curl -X PUT http://localhost:4001/key/name \
-H "Content-Type: application/json" \
-d '"Alice"'
curl -X PUT http://localhost:4001/key/age \
-H "Content-Type: application/json" \
-d '30'
curl -X PUT http://localhost:4001/key/city \
-H "Content-Type: application/json" \
-d '"NYC"'
Response:
{
"success": true,
"key": "name",
"value": "Alice",
"leaderId": "node-1"
}
Step 4: Read from Followers
Data should be replicated to all followers:
curl http://localhost:4002/key/name
curl http://localhost:4003/key/city
Response:
{
"key": "name",
"value": "Alice",
"nodeRole": "follower"
}
Step 5: Try Writing to Follower (Should Fail)
curl -X PUT http://localhost:4002/key/test \
-H "Content-Type: application/json" \
-d '"should fail"'
Response:
{
"error": "Not the leader",
"currentRole": "follower",
"leaderId": "node-1"
}
Step 6: Simulate Leader Failure
# In a separate terminal, stop the leader
docker-compose stop node1
# Check node-2 status - should become new leader
curl http://localhost:4002/status
After a few seconds:
store-ts-node2 | [node-2] Election timeout! Starting election...
store-ts-node2 | [node-2] 👑 Became LEADER for term 2
store-ts-node3 | [node-3] Waiting for node-2 to become leader
Step 7: Write to New Leader
# Now node-2 is the leader
curl -X PUT http://localhost:4002/key/newleader \
-H "Content-Type: application/json" \
-d '"node-2"'
Step 8: Restart Old Leader
# Restart node-1
docker-compose start node1
# Check status - should become follower
curl http://localhost:4001/status
Response:
{
"nodeId": "node-1",
"role": "follower",
"term": 2,
"leaderId": "node-2",
...
}
System Architecture
graph TB
subgraph "3-Node Replicated Store"
Clients["Clients"]
N1["Node 1<br/>👑 Leader"]
N2["Node 2<br/>Follower"]
N3["Node 3<br/>Follower"]
Clients -->|"Write"| N1
Clients -->|"Read"| N1
Clients -->|"Read"| N2
Clients -->|"Read"| N3
N1 <-->|"Heartbeat<br/>Replication"| N2
N1 <-->|"Heartbeat<br/>Replication"| N3
end
style N1 fill:#6f6,stroke:#333,stroke-width:3px
Exercises
Exercise 1: Test Fault Tolerance
- Start the cluster and write some data
- Stop different nodes one at a time
- Verify the system continues operating
- What happens when you stop 2 out of 3 nodes?
Exercise 2: Observe Replication Lag
- Add a small delay (e.g., 100ms) to replication
- Write data to leader
- Immediately read from follower
- What do you see? This demonstrates eventual consistency.
Exercise 3: Improve Leader Election
The current election is very simple. Try improving it:
- Add random election timeouts (like Raft; see the sketch below)
- Implement actual voting (not just lowest ID)
- Add pre-vote to prevent disrupting current leader
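As a starting point for the first improvement, here is a sketch of Raft-style randomized timeouts (the helper is illustrative; the commented line shows where it would plug into our `startElectionTimer`):

```typescript
// Raft-style randomized election timeout (illustrative helper).
// Each cycle picks a fresh timeout in [minMs, maxMs), so two followers
// rarely time out at the same instant and split the vote.
function randomElectionTimeout(minMs = 4000, maxMs = 8000): number {
  return minMs + Math.random() * (maxMs - minMs);
}

// In startElectionTimer(), instead of the fixed config.electionTimeout:
// this.electionTimer = setTimeout(() => { ... }, randomElectionTimeout());
```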
Summary
Key Takeaways
- Replication copies data across multiple nodes for fault tolerance
- Single-leader replication is simple but all writes go through leader
- Leader election ensures a new leader is chosen when current leader fails
- Asynchronous replication is fast but can lose data on leader failure
- Read-your-writes consistency is NOT guaranteed when reading from followers
Trade-offs
| Approach | Pros | Cons |
|---|---|---|
| Single-leader | Simple, strong consistency | Leader is bottleneck, single point of failure |
| Multi-leader | No bottleneck, writes anywhere | Complex conflict resolution |
| Sync replication | No data loss | Slow writes, blocking |
| Async replication | Fast writes | Data loss possible, stale reads |
Check Your Understanding
- Why do we replicate data?
- What's the difference between leader and follower?
- What happens when a client tries to write to a follower?
- How does leader election work in our implementation?
- What's the trade-off between sync and async replication?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What's Next
We have replication working, but our consistency model is basic. Let's explore consistency levels: Consistency Models (Session 5)
Consistency Models
Session 5 - Full session
Learning Objectives
- Understand different consistency models in distributed systems
- Learn the trade-offs between strong and eventual consistency
- Implement configurable consistency levels in a replicated store
- Experience the effects of consistency levels through hands-on exercises
What is Consistency?
In a replicated store, consistency defines what guarantees you have about the data you read. When data is copied across multiple nodes, you might not always see the latest write immediately.
graph TB
subgraph "Write Happens"
C[Client]
L[Leader]
C -->|"SET name = Alice"| L
end
subgraph "But What Do You Read?"
F1[Follower 1<br/>name = Alice]
F2[Follower 2<br/>name = ???]
F3[Follower 3<br/>name = ???]
C -->|Read| F1
C -->|Read| F2
C -->|Read| F3
end
The question: If you read from a follower, will you see "Alice" or the old value?
The answer depends on your consistency model.
Consistency Spectrum
Consistency models exist on a spectrum from strongest to weakest:
graph LR
A[Strong<br/>Consistency]
B[Read Your Writes]
C[Monotonic Reads]
D[Causal Consistency]
E[Eventual<br/>Consistency]
A ==> B ==> C ==> D ==> E
style A fill:#6f6
style B fill:#9f6
style C fill:#cf6
style D fill:#ff6
style E fill:#f96
Strong Consistency
Definition: Every read receives the most recent write or an error.
sequenceDiagram
participant C as Client
participant L as Leader
participant F as Follower
Note over C,F: Time flows downward
C->>L: SET name = "Alice"
L->>L: Write confirmed
Note over C,F: Strong consistency requires:
Note over C,F: Waiting for replication...
L->>F: Replicate: name = "Alice"
F->>L: ACK
L->>C: Response: Success
C->>F: GET name
F->>C: "Alice" (always latest!)
Characteristics:
- Readers always see the latest data
- No stale reads possible
- Slower performance (must wait for replication)
- Simple mental model
When to use: Financial transactions, inventory management, critical operations
Eventual Consistency
Definition: If no new updates are made, eventually all accesses will return the last updated value.
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
Note over C,F2: Time flows downward
C->>L: SET name = "Alice"
L->>C: Response: Success (immediate!)
Note over C,F2: Leader hasn't replicated yet...
C->>F1: GET name
F1->>C: "Alice" (replicated!)
C->>F2: GET name
F2->>C: "Bob" (stale!)
Note over C,F2: A moment later...
L->>F2: Replicate: name = "Alice"
C->>F2: GET name
F2->>C: "Alice" (updated!)
Characteristics:
- Reads are fast (no waiting for replication)
- You might see stale data
- Eventually, all nodes converge
- More complex mental model
When to use: Social media feeds, product recommendations, analytics
Read-Your-Writes Consistency
A middle ground: you always see your own writes, but might not see others' writes immediately.
sequenceDiagram
participant C1 as Client 1
participant C2 as Client 2
participant L as Leader
participant F as Follower
C1->>L: SET name = "Alice"
L->>C1: Success
C1->>F: GET name
Note over C1,F: Read-your-writes:<br/>C1 sees "Alice"
F->>C1: "Alice"
C2->>F: GET name
Note over C2,F: Might see stale data
F->>C2: "Bob" (stale!)
The CAP Theorem Revisited
You learned about CAP in Session 3. Let's connect it to consistency:
| Combination | Consistency Model | Example Systems |
|---|---|---|
| CP | Strong consistency | ZooKeeper, etcd, MongoDB (with w:majority) |
| AP | Eventual consistency | Cassandra, DynamoDB, CouchDB |
| CA (impossible at scale) | Strong consistency | Single-node databases (RDBMS) |
Quorum-Based Consistency
A practical way to control consistency is using quorums. A quorum is a majority of nodes.
graph TB
subgraph "3-Node Cluster"
N1[Node 1]
N2[Node 2]
N3[Node 3]
Q[Quorum = 2<br/>⌈3/2⌉ = 2]
end
N1 -.-> Q
N2 -.-> Q
N3 -.-> Q
style Q fill:#6f6,stroke:#333,stroke-width:3px
Write Quorum (W)
Number of nodes that must acknowledge a write:
W > N/2 → Majority write (durable; pairs with R + W > N for strong reads)
W = 1 → Fast but weak consistency
W = N → Strongest but slowest
Read Quorum (R)
Number of nodes to query and compare for a read:
R + W > N → Strong consistency guaranteed
R + W ≤ N → Eventual consistency
Consistency Levels
| R + W | Consistency | Performance | Use Case |
|---|---|---|---|
| R = N, W = N | Strongest | Slow | Critical data |
| R + W > N | Strong | Medium | Banking, inventory |
| R + W ≤ N | Eventual | Fast | Social media, cache |
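These rules are pure arithmetic, so a small helper can classify any configuration. The function below is an illustrative sketch using the R, W, N notation from above:

```typescript
// Classify a quorum configuration (sketch, using the R/W/N notation above).
function consistencyOf(n: number, w: number, r: number): string {
  if (w < 1 || r < 1 || w > n || r > n) return 'invalid configuration';
  // Strong iff every read set of R nodes overlaps every write set of W nodes.
  return r + w > n
    ? 'strong (read and write sets always overlap)'
    : 'eventual (a read may miss the latest write)';
}

console.log(consistencyOf(3, 2, 2)); // strong:   2 + 2 > 3
console.log(consistencyOf(3, 1, 1)); // eventual: 1 + 1 <= 3
console.log(consistencyOf(3, 3, 1)); // strong:   3 + 1 > 3 (W = N)
```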
Implementation
We'll extend our replicated store from Session 4 to support configurable consistency levels.
TypeScript Implementation
Project Structure:
consistent-store-ts/
├── package.json
├── tsconfig.json
├── Dockerfile
├── docker-compose.yml
└── src/
└── node.ts # Node with configurable consistency
consistent-store-ts/src/node.ts
import http from 'http';
/**
* Node configuration
*/
const config = {
nodeId: process.env.NODE_ID || 'node-1',
port: parseInt(process.env.PORT || '4000'),
peers: (process.env.PEERS || '').split(',').filter(Boolean),
heartbeatInterval: 2000,
electionTimeout: 6000,
// Consistency settings
writeQuorum: parseInt(process.env.WRITE_QUORUM || '2'), // W
readQuorum: parseInt(process.env.READ_QUORUM || '1'), // R
};
type NodeRole = 'leader' | 'follower' | 'candidate';
type ConsistencyLevel = 'strong' | 'eventual' | 'read_your_writes';
/**
* Replicated Store Node with Configurable Consistency
*/
class StoreNode {
public nodeId: string;
public role: NodeRole;
public term: number;
public data: Map<string, any>;
public peers: string[];
private leaderId: string | null;
private lastHeartbeat: number;
private heartbeatTimer?: NodeJS.Timeout;
private electionTimer?: NodeJS.Timeout;
private pendingWrites: Map<string, any[]>; // For read-your-writes
constructor(nodeId: string, peers: string[]) {
this.nodeId = nodeId;
this.role = 'follower';
this.term = 0;
this.data = new Map();
this.peers = peers;
this.leaderId = null;
this.lastHeartbeat = Date.now();
this.pendingWrites = new Map();
this.startElectionTimer();
this.startHeartbeat();
}
/**
* Start election timeout timer
*/
private startElectionTimer() {
this.electionTimer = setTimeout(() => {
const timeSinceHeartbeat = Date.now() - this.lastHeartbeat;
if (timeSinceHeartbeat > config.electionTimeout && this.role !== 'leader') {
console.log(`[${this.nodeId}] Election timeout! Starting election...`);
this.startElection();
}
this.startElectionTimer();
}, config.electionTimeout);
}
/**
* Start leader election
*/
private startElection() {
this.term++;
this.role = 'candidate';
const allNodes = [this.nodeId, ...this.peers].sort();
const lowestNode = allNodes[0];
if (this.nodeId === lowestNode) {
this.becomeLeader();
} else {
this.role = 'follower';
this.leaderId = lowestNode;
console.log(`[${this.nodeId}] Waiting for ${lowestNode} to become leader`);
}
}
/**
* Become the leader
*/
private becomeLeader() {
this.role = 'leader';
this.leaderId = this.nodeId;
console.log(`[${this.nodeId}] 👑 Became LEADER for term ${this.term}`);
this.replicateToFollowers();
}
/**
* Start heartbeat to followers
*/
private startHeartbeat() {
this.heartbeatTimer = setInterval(() => {
if (this.role === 'leader') {
this.sendHeartbeat();
}
}, config.heartbeatInterval);
}
/**
* Send heartbeat to all followers
*/
private sendHeartbeat() {
const heartbeat = {
type: 'heartbeat',
leaderId: this.nodeId,
term: this.term,
timestamp: Date.now(),
};
this.peers.forEach(peerUrl => {
this.sendToPeer(peerUrl, '/internal/heartbeat', heartbeat)
.catch(err => console.log(`[${this.nodeId}] Failed heartbeat to ${peerUrl}`));
});
}
/**
* Replicate data to followers with quorum acknowledgment
*/
private async replicateToFollowers(): Promise<boolean> {
const dataObj = Object.fromEntries(this.data);
// Send to all followers in parallel
const promises = this.peers.map(peerUrl =>
this.sendToPeer(peerUrl, '/internal/replicate', {
type: 'replicate',
leaderId: this.nodeId,
term: this.term,
data: dataObj,
}).catch(err => {
console.log(`[${this.nodeId}] Replication failed to ${peerUrl}`);
return false;
})
);
// Wait for all to complete
const results = await Promise.all(promises);
// Count successes (this node counts as 1)
const successes = results.filter(r => r !== false).length + 1;
// Check if we achieved write quorum
const achievedQuorum = successes >= config.writeQuorum;
console.log(`[${this.nodeId}] Replication: ${successes}/${this.peers.length + 1} nodes (W=${config.writeQuorum})`);
return achievedQuorum;
}
/**
* Handle heartbeat from leader
*/
handleHeartbeat(heartbeat: any) {
if (heartbeat.term >= this.term) {
this.term = heartbeat.term;
this.lastHeartbeat = Date.now();
this.leaderId = heartbeat.leaderId;
if (this.role !== 'follower') {
this.role = 'follower';
}
}
}
/**
* Handle replication from leader
*/
handleReplication(message: any) {
if (message.term >= this.term) {
this.term = message.term;
this.leaderId = message.leaderId;
this.role = 'follower';
this.lastHeartbeat = Date.now();
Object.entries(message.data).forEach(([key, value]) => {
this.data.set(key, value);
});
}
}
/**
* Send data to peer node
*/
private async sendToPeer(peerUrl: string, path: string, data: any): Promise<void> {
return new Promise((resolve, reject) => {
const url = new URL(path, peerUrl);
const options = {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
};
const req = http.request(url, options, (res) => {
if (res.statusCode === 200) {
resolve();
} else {
reject(new Error(`Status ${res.statusCode}`));
}
});
req.on('error', reject);
req.write(JSON.stringify(data));
req.end();
});
}
/**
* Set a key-value pair with quorum acknowledgment
*/
async set(key: string, value: any): Promise<{ success: boolean; achievedQuorum: boolean }> {
if (this.role !== 'leader') {
return { success: false, achievedQuorum: false };
}
this.data.set(key, value);
console.log(`[${this.nodeId}] SET ${key} = ${JSON.stringify(value)}`);
// Replicate to followers
const achievedQuorum = await this.replicateToFollowers();
return { success: true, achievedQuorum };
}
/**
* Get a value with configurable consistency
*/
async get(key: string, consistency: ConsistencyLevel = 'eventual'): Promise<any> {
const localValue = this.data.get(key);
// For eventual consistency, return local value immediately
if (consistency === 'eventual') {
console.log(`[${this.nodeId}] GET ${key} => ${JSON.stringify(localValue)} (eventual)`);
return localValue;
}
// For strong consistency, query quorum of nodes
if (consistency === 'strong') {
const values = await this.getFromQuorum(key);
console.log(`[${this.nodeId}] GET ${key} => ${JSON.stringify(values.latest)} (strong from ${values.responses} nodes)`);
return values.latest;
}
// For read-your-writes, check pending writes
if (consistency === 'read_your_writes') {
const pending = this.pendingWrites.get(key);
const valueToReturn = pending && pending.length > 0 ? pending[pending.length - 1] : localValue;
console.log(`[${this.nodeId}] GET ${key} => ${JSON.stringify(valueToReturn)} (read-your-writes)`);
return valueToReturn;
}
return localValue;
}
/**
* Query quorum of nodes and return most recent value
*/
private async getFromQuorum(key: string): Promise<{ latest: any; responses: number }> {
// Query all peers
const promises = this.peers.map(peerUrl =>
this.queryPeer(peerUrl, '/internal/get', { key })
.then(result => ({ success: true, value: result.value, version: result.version || 0 }))
.catch(err => {
console.log(`[${this.nodeId}] Query failed to ${peerUrl}`);
return { success: false, value: null, version: 0 };
})
);
const results = await Promise.all(promises);
// Add local value
results.push({
success: true,
value: this.data.get(key),
version: this.data.has(key) ? 1 : 0,
});
// Count successful responses
const successful = results.filter(r => r.success);
// Return if we have read quorum
if (successful.length >= config.readQuorum) {
// Return most recent value (simple version: first non-null)
const latest = successful.find(r => r.value !== undefined)?.value;
return { latest, responses: successful.length };
}
// Fallback to local value
return { latest: this.data.get(key), responses: successful.length };
}
/**
* Query a peer for a key
*/
private async queryPeer(peerUrl: string, path: string, data: any): Promise<any> {
return new Promise((resolve, reject) => {
const url = new URL(path, peerUrl);
const options = {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
};
const req = http.request(url, options, (res) => {
let body = '';
res.on('data', chunk => body += chunk);
res.on('end', () => {
if (res.statusCode === 200) {
resolve(JSON.parse(body));
} else {
reject(new Error(`Status ${res.statusCode}`));
}
});
});
req.on('error', reject);
req.write(JSON.stringify(data));
req.end();
});
}
/**
* Delete a key
*/
async delete(key: string): Promise<{ success: boolean; achievedQuorum: boolean }> {
if (this.role !== 'leader') {
return { success: false, achievedQuorum: false };
}
const existed = this.data.delete(key);
console.log(`[${this.nodeId}] DELETE ${key}`);
    const achievedQuorum = await this.replicateToFollowers();
    return { success: existed, achievedQuorum };
}
/**
* Get node status
*/
getStatus() {
return {
nodeId: this.nodeId,
role: this.role,
term: this.term,
leaderId: this.leaderId,
totalKeys: this.data.size,
keys: Array.from(this.data.keys()),
config: {
writeQuorum: config.writeQuorum,
readQuorum: config.readQuorum,
totalNodes: this.peers.length + 1,
},
};
}
}
// Create the node
const node = new StoreNode(config.nodeId, config.peers);
/**
* HTTP Server
*/
const server = http.createServer((req, res) => {
res.setHeader('Content-Type', 'application/json');
res.setHeader('Access-Control-Allow-Origin', '*');
res.setHeader('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE, OPTIONS');
res.setHeader('Access-Control-Allow-Headers', 'Content-Type');
if (req.method === 'OPTIONS') {
res.writeHead(200);
res.end();
return;
}
const url = new URL(req.url || '', `http://${req.headers.host}`);
// Route: POST /internal/heartbeat
if (req.method === 'POST' && url.pathname === '/internal/heartbeat') {
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const heartbeat = JSON.parse(body);
node.handleHeartbeat(heartbeat);
res.writeHead(200);
res.end(JSON.stringify({ success: true }));
} catch (error) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'Invalid request' }));
}
});
return;
}
// Route: POST /internal/replicate
if (req.method === 'POST' && url.pathname === '/internal/replicate') {
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const message = JSON.parse(body);
node.handleReplication(message);
res.writeHead(200);
res.end(JSON.stringify({ success: true }));
} catch (error) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'Invalid request' }));
}
});
return;
}
// Route: POST /internal/get - Internal query for quorum reads
if (req.method === 'POST' && url.pathname === '/internal/get') {
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const { key } = JSON.parse(body);
const value = node.data.get(key);
res.writeHead(200);
res.end(JSON.stringify({ value, version: value !== undefined ? 1 : 0 }));
} catch (error) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'Invalid request' }));
}
});
return;
}
// Route: GET /status
if (req.method === 'GET' && url.pathname === '/status') {
res.writeHead(200);
res.end(JSON.stringify(node.getStatus()));
return;
}
// Route: GET /key/{key}?consistency=strong|eventual|read_your_writes
if (req.method === 'GET' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5);
const consistency = (url.searchParams.get('consistency') || 'eventual') as ConsistencyLevel;
    node.get(key, consistency).then(value => {
      if (value !== undefined) {
        res.writeHead(200);
        res.end(JSON.stringify({ key, value, nodeRole: node.role, consistency }));
      } else {
        res.writeHead(404);
        res.end(JSON.stringify({ error: 'Key not found', key }));
      }
    }).catch(() => {
      // A failed quorum read should not leave the request hanging
      res.writeHead(500);
      res.end(JSON.stringify({ error: 'Read failed', key }));
    });
return;
}
// Route: PUT /key/{key}
if (req.method === 'PUT' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5);
if (node.role !== 'leader') {
res.writeHead(503);
res.end(JSON.stringify({
error: 'Not the leader',
currentRole: node.role,
leaderId: node.leaderId || 'Unknown',
}));
return;
}
let body = '';
req.on('data', chunk => body += chunk);
req.on('end', () => {
try {
const value = JSON.parse(body);
node.set(key, value).then(result => {
res.writeHead(200);
res.end(JSON.stringify({
success: result.success,
key,
value,
leaderId: node.nodeId,
achievedQuorum: result.achievedQuorum,
writeQuorum: config.writeQuorum,
}));
});
} catch (error) {
res.writeHead(400);
res.end(JSON.stringify({ error: 'Invalid JSON' }));
}
});
return;
}
// Route: DELETE /key/{key}
if (req.method === 'DELETE' && url.pathname.startsWith('/key/')) {
const key = url.pathname.slice(5);
if (node.role !== 'leader') {
res.writeHead(503);
res.end(JSON.stringify({
error: 'Not the leader',
currentRole: node.role,
leaderId: node.leaderId || 'Unknown',
}));
return;
}
node.delete(key).then(result => {
if (result.success) {
res.writeHead(200);
res.end(JSON.stringify({ success: true, key, leaderId: node.nodeId }));
} else {
res.writeHead(404);
res.end(JSON.stringify({ error: 'Key not found', key }));
}
});
return;
}
// 404
res.writeHead(404);
res.end(JSON.stringify({ error: 'Not found' }));
});
server.listen(config.port, () => {
console.log(`[${config.nodeId}] Consistent Store listening on port ${config.port}`);
console.log(`[${config.nodeId}] Write Quorum (W): ${config.writeQuorum}, Read Quorum (R): ${config.readQuorum}`);
console.log(`[${config.nodeId}] Peers: ${config.peers.join(', ') || 'none'}`);
console.log(`[${config.nodeId}] Available endpoints:`);
console.log(` GET /status - Node status`);
console.log(` GET /key/{key}?consistency=level - Get with consistency level`);
console.log(` PUT /key/{key} - Set value (leader only)`);
  console.log(`  DELETE /key/{key} - Delete key (leader only)`);
});
consistent-store-ts/package.json
{
"name": "consistent-store-ts",
"version": "1.0.0",
"description": "Replicated key-value store with configurable consistency",
"main": "dist/node.js",
"scripts": {
"build": "tsc",
"start": "node dist/node.js",
"dev": "ts-node src/node.ts"
},
"dependencies": {},
"devDependencies": {
"@types/node": "^20.0.0",
"typescript": "^5.0.0",
"ts-node": "^10.9.0"
}
}
consistent-store-ts/tsconfig.json
{
"compilerOptions": {
"target": "ES2020",
"module": "commonjs",
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true
},
"include": ["src/**/*"]
}
consistent-store-ts/Dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 4000
CMD ["npm", "start"]
Python Implementation
consistent-store-py/src/node.py
import os
import json
import time
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler
from typing import Any, Dict, List, Optional, Literal
from urllib.parse import urlparse, parse_qs
from urllib.request import Request, urlopen
ConsistencyLevel = Literal['strong', 'eventual', 'read_your_writes']
class StoreNode:
"""Replicated store node with configurable consistency."""
def __init__(self, node_id: str, peers: List[str]):
self.node_id = node_id
self.role: str = 'follower'
self.term = 0
self.data: Dict[str, Any] = {}
self.peers = peers
self.leader_id: Optional[str] = None
self.last_heartbeat = time.time()
self.pending_writes: Dict[str, List[Any]] = {}
# Configuration
self.heartbeat_interval = 2.0
self.election_timeout = 6.0
self.write_quorum = int(os.environ.get('WRITE_QUORUM', '2'))
self.read_quorum = int(os.environ.get('READ_QUORUM', '1'))
# Start timers
self.start_election_timer()
self.start_heartbeat_thread()
def start_election_timer(self):
"""Start election timeout timer."""
def election_timer():
while True:
time.sleep(1)
time_since = time.time() - self.last_heartbeat
if time_since > self.election_timeout and self.role != 'leader':
print(f"[{self.node_id}] Election timeout! Starting election...")
self.start_election()
thread = threading.Thread(target=election_timer, daemon=True)
thread.start()
def start_election(self):
"""Start leader election."""
self.term += 1
self.role = 'candidate'
all_nodes = sorted([self.node_id] + self.peers)
lowest_node = all_nodes[0]
if self.node_id == lowest_node:
self.become_leader()
else:
self.role = 'follower'
self.leader_id = lowest_node
print(f"[{self.node_id}] Waiting for {lowest_node} to become leader")
def become_leader(self):
"""Become the leader."""
self.role = 'leader'
self.leader_id = self.node_id
print(f"[{self.node_id}] 👑 Became LEADER for term {self.term}")
self.replicate_to_followers()
def start_heartbeat_thread(self):
"""Start heartbeat to followers."""
def heartbeat_loop():
while True:
time.sleep(self.heartbeat_interval)
if self.role == 'leader':
self.send_heartbeat()
thread = threading.Thread(target=heartbeat_loop, daemon=True)
thread.start()
def send_heartbeat(self):
"""Send heartbeat to all followers."""
heartbeat = {
'type': 'heartbeat',
'leader_id': self.node_id,
'term': self.term,
'timestamp': int(time.time() * 1000),
}
for peer in self.peers:
try:
self.send_to_peer(peer, '/internal/heartbeat', heartbeat)
except Exception as e:
print(f"[{self.node_id}] Failed heartbeat to {peer}: {e}")
def replicate_to_followers(self) -> bool:
"""Replicate data to followers and check quorum."""
message = {
'type': 'replicate',
'leader_id': self.node_id,
'term': self.term,
'data': self.data,
}
successes = 1 # This node counts
for peer in self.peers:
try:
self.send_to_peer(peer, '/internal/replicate', message)
successes += 1
except Exception as e:
print(f"[{self.node_id}] Replication failed to {peer}: {e}")
achieved_quorum = successes >= self.write_quorum
print(f"[{self.node_id}] Replication: {successes}/{len(self.peers) + 1} nodes (W={self.write_quorum})")
return achieved_quorum
def handle_heartbeat(self, heartbeat: dict):
"""Handle heartbeat from leader."""
if heartbeat['term'] >= self.term:
self.term = heartbeat['term']
self.last_heartbeat = time.time()
self.leader_id = heartbeat['leader_id']
if self.role != 'follower':
self.role = 'follower'
def handle_replication(self, message: dict):
"""Handle replication from leader."""
if message['term'] >= self.term:
self.term = message['term']
self.leader_id = message['leader_id']
self.role = 'follower'
self.last_heartbeat = time.time()
self.data.update(message['data'])
def send_to_peer(self, peer_url: str, path: str, data: dict) -> None:
"""Send data to peer node."""
url = f"{peer_url}{path}"
body = json.dumps(data).encode('utf-8')
req = Request(url, data=body, headers={'Content-Type': 'application/json'}, method='POST')
with urlopen(req, timeout=1) as response:
if response.status != 200:
raise Exception(f"Status {response.status}")
def set(self, key: str, value: Any) -> Dict[str, Any]:
"""Set a key-value pair with quorum acknowledgment."""
if self.role != 'leader':
return {'success': False, 'achieved_quorum': False}
self.data[key] = value
print(f"[{self.node_id}] SET {key} = {json.dumps(value)}")
achieved_quorum = self.replicate_to_followers()
return {'success': True, 'achieved_quorum': achieved_quorum}
def get(self, key: str, consistency: ConsistencyLevel = 'eventual') -> Any:
"""Get a value with configurable consistency."""
local_value = self.data.get(key)
if consistency == 'eventual':
print(f"[{self.node_id}] GET {key} => {json.dumps(local_value)} (eventual)")
return local_value
if consistency == 'strong':
latest, responses = self.get_from_quorum(key)
print(f"[{self.node_id}] GET {key} => {json.dumps(latest)} (strong from {responses} nodes)")
return latest
if consistency == 'read_your_writes':
pending = self.pending_writes.get(key, [])
value_to_return = pending[-1] if pending else local_value
print(f"[{self.node_id}] GET {key} => {json.dumps(value_to_return)} (read-your-writes)")
return value_to_return
return local_value
def get_from_quorum(self, key: str) -> tuple:
"""Query quorum of nodes and return most recent value."""
results = []
# Query all peers
for peer in self.peers:
try:
result = self.query_peer(peer, '/internal/get', {'key': key})
results.append({
'success': True,
'value': result.get('value'),
'version': result.get('version', 0),
})
except Exception as e:
print(f"[{self.node_id}] Query failed to {peer}: {e}")
results.append({'success': False, 'value': None, 'version': 0})
# Add local value
results.append({
'success': True,
'value': self.data.get(key),
'version': 1 if key in self.data else 0,
})
# Filter successful responses
successful = [r for r in results if r['success']]
        if len(successful) >= self.read_quorum:
            # Return the first non-null value
            for r in successful:
                if r['value'] is not None:
                    return r['value'], len(successful)
        # Fallback to the local value (quorum not met, or every value was None)
        return self.data.get(key), len(successful)
def query_peer(self, peer_url: str, path: str, data: dict) -> dict:
"""Query a peer for a key."""
url = f"{peer_url}{path}"
body = json.dumps(data).encode('utf-8')
req = Request(url, data=body, headers={'Content-Type': 'application/json'}, method='POST')
with urlopen(req, timeout=1) as response:
if response.status == 200:
return json.loads(response.read().decode('utf-8'))
raise Exception(f"Status {response.status}")
def delete(self, key: str) -> Dict[str, Any]:
"""Delete a key."""
if self.role != 'leader':
return {'success': False, 'achieved_quorum': False}
existed = key in self.data
if existed:
del self.data[key]
print(f"[{self.node_id}] DELETE {key}")
        achieved_quorum = self.replicate_to_followers()
        return {'success': existed, 'achieved_quorum': achieved_quorum}
def get_status(self) -> dict:
"""Get node status."""
return {
'node_id': self.node_id,
'role': self.role,
'term': self.term,
'leader_id': self.leader_id,
'total_keys': len(self.data),
'keys': list(self.data.keys()),
'config': {
'write_quorum': self.write_quorum,
'read_quorum': self.read_quorum,
'total_nodes': len(self.peers) + 1,
},
}
# Create the node
config = {
'node_id': os.environ.get('NODE_ID', 'node-1'),
'port': int(os.environ.get('PORT', '4000')),
'peers': [p for p in os.environ.get('PEERS', '').split(',') if p],
}
node = StoreNode(config['node_id'], config['peers'])
class NodeHandler(BaseHTTPRequestHandler):
"""HTTP request handler for store node."""
def send_json_response(self, status: int, data: dict):
"""Send a JSON response."""
self.send_response(status)
self.send_header('Content-Type', 'application/json')
self.send_header('Access-Control-Allow-Origin', '*')
self.end_headers()
self.wfile.write(json.dumps(data).encode())
def do_OPTIONS(self):
"""Handle CORS preflight."""
self.send_response(200)
self.send_header('Access-Control-Allow-Origin', '*')
self.send_header('Access-Control-Allow-Methods', 'GET, POST, PUT, DELETE, OPTIONS')
self.send_header('Access-Control-Allow-Headers', 'Content-Type')
self.end_headers()
def do_POST(self):
"""Handle POST requests."""
parsed = urlparse(self.path)
if parsed.path == '/internal/heartbeat':
content_length = int(self.headers.get('Content-Length', 0))
body = self.rfile.read(content_length).decode('utf-8')
try:
heartbeat = json.loads(body)
node.handle_heartbeat(heartbeat)
self.send_json_response(200, {'success': True})
except (json.JSONDecodeError, KeyError):
self.send_json_response(400, {'error': 'Invalid request'})
return
if parsed.path == '/internal/replicate':
content_length = int(self.headers.get('Content-Length', 0))
body = self.rfile.read(content_length).decode('utf-8')
try:
message = json.loads(body)
node.handle_replication(message)
self.send_json_response(200, {'success': True})
except (json.JSONDecodeError, KeyError):
self.send_json_response(400, {'error': 'Invalid request'})
return
if parsed.path == '/internal/get':
content_length = int(self.headers.get('Content-Length', 0))
body = self.rfile.read(content_length).decode('utf-8')
try:
data = json.loads(body)
key = data.get('key')
value = node.data.get(key)
self.send_json_response(200, {'value': value, 'version': 1 if value is not None else 0})
except (json.JSONDecodeError, KeyError):
self.send_json_response(400, {'error': 'Invalid request'})
return
self.send_json_response(404, {'error': 'Not found'})
def do_GET(self):
"""Handle GET requests."""
parsed = urlparse(self.path)
if parsed.path == '/status':
self.send_json_response(200, node.get_status())
return
if parsed.path.startswith('/key/'):
key = parsed.path[5:]
            params = parse_qs(parsed.query)
            consistency = params.get('consistency', ['eventual'])[0]
            if consistency not in ('strong', 'eventual', 'read_your_writes'):
                consistency = 'eventual'
value = node.get(key, consistency)
if value is not None:
self.send_json_response(200, {
'key': key,
'value': value,
'node_role': node.role,
'consistency': consistency,
})
else:
self.send_json_response(404, {'error': 'Key not found', 'key': key})
return
self.send_json_response(404, {'error': 'Not found'})
def do_PUT(self):
"""Handle PUT requests."""
parsed = urlparse(self.path)
if parsed.path.startswith('/key/'):
key = parsed.path[5:]
if node.role != 'leader':
self.send_json_response(503, {
'error': 'Not the leader',
'current_role': node.role,
'leader_id': node.leader_id or 'Unknown',
})
return
content_length = int(self.headers.get('Content-Length', 0))
body = self.rfile.read(content_length).decode('utf-8')
try:
value = json.loads(body)
result = node.set(key, value)
self.send_json_response(200, {
'success': result['success'],
'key': key,
'value': value,
'leader_id': node.node_id,
'achieved_quorum': result['achieved_quorum'],
'write_quorum': node.write_quorum,
})
except json.JSONDecodeError:
self.send_json_response(400, {'error': 'Invalid JSON'})
return
self.send_json_response(404, {'error': 'Not found'})
def do_DELETE(self):
"""Handle DELETE requests."""
parsed = urlparse(self.path)
if parsed.path.startswith('/key/'):
key = parsed.path[5:]
if node.role != 'leader':
self.send_json_response(503, {
'error': 'Not the leader',
'current_role': node.role,
'leader_id': node.leader_id or 'Unknown',
})
return
result = node.delete(key)
if result['success']:
self.send_json_response(200, {'success': True, 'key': key, 'leader_id': node.node_id})
else:
self.send_json_response(404, {'error': 'Key not found', 'key': key})
return
self.send_json_response(404, {'error': 'Not found'})
def log_message(self, format, *args):
"""Suppress default logging."""
pass
def run_server(port: int):
"""Start the HTTP server."""
server_address = ('', port)
httpd = HTTPServer(server_address, NodeHandler)
print(f"[{config['node_id']}] Consistent Store listening on port {port}")
print(f"[{config['node_id']}] Write Quorum (W): {node.write_quorum}, Read Quorum (R): {node.read_quorum}")
print(f"[{config['node_id']}] Peers: {', '.join(config['peers']) or 'none'}")
print(f"[{config['node_id']}] Available endpoints:")
print(f" GET /status - Node status")
print(f" GET /key/{{key}}?consistency=level - Get with consistency level")
print(f" PUT /key/{{key}} - Set value (leader only)")
print(f" DEL /key/{{key}} - Delete key (leader only)")
httpd.serve_forever()
if __name__ == '__main__':
run_server(config['port'])
consistent-store-py/requirements.txt
# No external dependencies - uses standard library only
consistent-store-py/Dockerfile
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 4000
CMD ["python", "src/node.py"]
Docker Compose Setup
TypeScript Version
examples/03-consistent-store/ts/docker-compose.yml
version: '3.8'
services:
node1:
build: .
container_name: consistent-ts-node1
ports:
- "4001:4000"
environment:
- NODE_ID=node-1
- PORT=4000
- PEERS=http://node2:4000,http://node3:4000
- WRITE_QUORUM=2
- READ_QUORUM=1
networks:
- consistent-network
node2:
build: .
container_name: consistent-ts-node2
ports:
- "4002:4000"
environment:
- NODE_ID=node-2
- PORT=4000
- PEERS=http://node1:4000,http://node3:4000
- WRITE_QUORUM=2
- READ_QUORUM=1
networks:
- consistent-network
node3:
build: .
container_name: consistent-ts-node3
ports:
- "4003:4000"
environment:
- NODE_ID=node-3
- PORT=4000
- PEERS=http://node1:4000,http://node2:4000
- WRITE_QUORUM=2
- READ_QUORUM=1
networks:
- consistent-network
networks:
consistent-network:
driver: bridge
Python Version
examples/03-consistent-store/py/docker-compose.yml
version: '3.8'
services:
node1:
build: .
container_name: consistent-py-node1
ports:
- "4001:4000"
environment:
- NODE_ID=node-1
- PORT=4000
- PEERS=http://node2:4000,http://node3:4000
- WRITE_QUORUM=2
- READ_QUORUM=1
networks:
- consistent-network
node2:
build: .
container_name: consistent-py-node2
ports:
- "4002:4000"
environment:
- NODE_ID=node-2
- PORT=4000
- PEERS=http://node1:4000,http://node3:4000
- WRITE_QUORUM=2
- READ_QUORUM=1
networks:
- consistent-network
node3:
build: .
container_name: consistent-py-node3
ports:
- "4003:4000"
environment:
- NODE_ID=node-3
- PORT=4000
- PEERS=http://node1:4000,http://node2:4000
- WRITE_QUORUM=2
- READ_QUORUM=1
networks:
- consistent-network
networks:
consistent-network:
driver: bridge
Running the Example
Step 1: Start the Cluster
TypeScript:
cd distributed-systems-course/examples/03-consistent-store/ts
docker-compose up --build
Python:
cd distributed-systems-course/examples/03-consistent-store/py
docker-compose up --build
You should see:
consistent-ts-node1 | [node-1] 👑 Became LEADER for term 1
consistent-ts-node1 | [node-1] Write Quorum (W): 2, Read Quorum (R): 1
consistent-ts-node2 | [node-2] Waiting for node-1 to become leader
consistent-ts-node3 | [node-3] Waiting for node-1 to become leader
Step 2: Test Eventual Consistency (Default)
# Write to leader
curl -X PUT http://localhost:4001/key/name \
-H "Content-Type: application/json" \
-d '"Alice"'
# Immediately read from follower (eventual consistency)
curl http://localhost:4002/key/name
You might see:
- Immediately after the write: `null` (the follower hasn't received the replication yet)
- A moment later: `"Alice"` (the follower has caught up)
Step 3: Test Strong Consistency
# Read with strong consistency (waits for quorum)
curl "http://localhost:4002/key/name?consistency=strong"
This queries multiple nodes and returns the latest confirmed value.
Step 4: Observe Quorum Behavior
Check the status to see your quorum settings:
curl http://localhost:4001/status
Response:
{
"nodeId": "node-1",
"role": "leader",
"config": {
"writeQuorum": 2,
"readQuorum": 1,
"totalNodes": 3
}
}
Step 5: Test Different Quorum Settings
Stop the cluster (docker-compose down) and modify the environment variables in docker-compose.yml:
Try W=3 (Strongest):
environment:
- WRITE_QUORUM=3
- READ_QUORUM=1
Try W=1 (Weakest):
environment:
- WRITE_QUORUM=1
- READ_QUORUM=1
Observe how the system behaves differently with each setting.
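If you prefer scripting the experiment, here is a minimal observation sketch (not part of the course repo; it assumes the cluster above is running and Node.js 18+ for the built-in fetch):
```typescript
// Sketch: write through the leader, then read the key from a follower at both
// consistency levels. Ports 4001/4002 match the docker-compose.yml above.
async function observeQuorum(): Promise<void> {
  const writeRes = await fetch('http://localhost:4001/key/demo', {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify('v1'),
  });
  const write = await writeRes.json();
  console.log('achievedQuorum:', write.achievedQuorum, '(W =', write.writeQuorum + ')');

  for (const level of ['eventual', 'strong'] as const) {
    const res = await fetch(`http://localhost:4002/key/demo?consistency=${level}`);
    console.log(level, '=>', res.status === 200 ? await res.json() : 'key not found yet');
  }
}

observeQuorum().catch(console.error);
```
Run it once with all three nodes up, then stop one node (docker-compose stop node3) and run it again: with WRITE_QUORUM=3 the write should report achievedQuorum: false.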
Consistency Comparison
graph TB
subgraph "Same Data, Different Consistency Levels"
W[Write: name = Alice]
subgraph "Strong Consistency<br/>Slow but Accurate"
S1[Node 1: Alice]
S2[Node 2: Alice]
S3[Node 3: Alice]
R1[Read → Alice]
end
subgraph "Eventual Consistency<br/>Fast but Maybe Stale"
E1[Node 1: Alice]
E2[Node 2: Bob]
E3[Node 3: ???]
R2[Read → Bob or ???]
end
end
W --> S1
W --> S2
W --> S3
W --> E1
W -.->|delayed| E2
W -.->|delayed| E3
style R1 fill:#6f6
style R2 fill:#f96
Exercises
Exercise 1: Experience Eventual Consistency
- Start the cluster
- Write a value to the leader
- Immediately read from a follower (within 100ms)
- What do you see? Is it the new value or old?
Exercise 2: Compare Consistency Levels
Write a script that:
- Sets a key to a new value
- Immediately reads it with `consistency=eventual`
- Immediately reads it with `consistency=strong`
- Compares the results
Exercise 3: Adjust Quorum for Different Use Cases
For each scenario, what quorum settings would you choose?
| Scenario | W (Write) | R (Read) | R + W | Consistency | Why? |
|---|---|---|---|---|---|
| Bank balance transfer | ? | ? | ? | ? | |
| Social media like | ? | ? | ? | ? | |
| Shopping cart | ? | ? | ? | ? | |
| User profile view | ? | ? | ? | ? | |
Exercise 4: Implement Read Repair
When a stale read is detected, update the stale node with the latest value. Hint: In the strong consistency read, if you find a newer value on one node, send it to nodes with older values.
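One possible shape for the solution, as a sketch (the QuorumResponse type and findRepairs helper are illustrative, not part of the store; the store above only tracks a 0/1 version, so a real solution would first replace that with a counter or timestamp):
```typescript
// Sketch: pick the newest response and treat everything older as a repair target.
interface QuorumResponse {
  peerUrl: string | null; // null marks the local node
  value: unknown;
  version: number;        // assumed: a real per-key version counter
}

function findRepairs(responses: QuorumResponse[]): { latest: QuorumResponse; stale: QuorumResponse[] } {
  const latest = responses.reduce((a, b) => (b.version > a.version ? b : a));
  const stale = responses.filter(r => r.version < latest.version);
  return { latest, stale };
}

// In getFromQuorum, after collecting responses:
//   const { latest, stale } = findRepairs(responses);
//   stale.forEach(r => { if (r.peerUrl) sendToPeer(r.peerUrl, '/internal/replicate', ...); });
```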
Summary
Key Takeaways
- Consistency is a spectrum from strong to eventual
- Strong consistency = always see latest data, but slower
- Eventual consistency = fast reads, but might see stale data
- Quorum configuration (W + R) controls the consistency level:
  - `R + W > N` → strong consistency
  - `R + W ≤ N` → eventual consistency
  - Example: the defaults here (N=3, W=2, R=1) give R + W = 3, which is not greater than N, so reads may be stale
- Trade-off: You can't have both strong consistency AND high availability (CAP theorem)
Consistency Decision Tree
Need to read latest data immediately?
├─ Yes → Use strong consistency (R + W > N)
│ └─ Accept slower performance
└─ No → Use eventual consistency (R + W ≤ N)
└─ Get faster reads, accept some staleness
Real-World Examples
| System | Default Consistency | Configurable? |
|---|---|---|
| DynamoDB | Eventually consistent | Yes (ConsistentRead parameter) |
| Cassandra | Eventually consistent | Yes (CONSISTENCY level) |
| MongoDB | Strong (w:majority) | Yes (writeConcern, readConcern) |
| CouchDB | Eventually consistent | Yes (r, w parameters) |
| etcd | Strong | No (always strong) |
Check Your Understanding
- What's the difference between strong and eventual consistency?
- How does quorum configuration (R, W) affect consistency?
- When would you choose eventual consistency over strong?
- What does `R + W > N` guarantee?
- Why can't we have both strong consistency and high availability during partitions?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What's Next
We've built a replicated store with configurable consistency. Now let's add real-time communication: WebSockets (Session 6)
WebSockets
Session 6, Part 1 - 20 minutes
Learning Objectives
- Understand the WebSocket protocol and its advantages over HTTP
- Learn the WebSocket connection lifecycle
- Implement WebSocket servers and clients in TypeScript and Python
- Handle connection management and error scenarios
Introduction
In previous sessions, we built systems using HTTP—a request-response protocol. The client asks, the server answers. But what if we need real-time, bidirectional communication?
Enter WebSockets: a protocol that enables full-duplex communication over a single TCP connection.
sequenceDiagram
participant Client
participant Server
Note over Client,Server: HTTP Request-Response (Traditional)
Client->>Server: GET /data
Server-->>Client: Response
Client->>Server: GET /data
Server-->>Client: Response
Note over Client,Server: WebSocket (Real-Time)
Client->>Server: HTTP Upgrade Request
Server-->>Client: 101 Switching Protocols
Client->>Server: Message 1
Server-->>Client: Message 2
Client->>Server: Message 3
Server-->>Client: Message 4
Client->>Server: Message 5
WebSocket vs HTTP
| Aspect | HTTP | WebSocket |
|---|---|---|
| Communication | Half-duplex (request-response) | Full-duplex (bidirectional) |
| Connection | New connection per request | Persistent connection |
| Latency | Higher (headers per request) | Lower (lightweight frames) |
| State | Stateless | Stateful connection |
| Server Push | Requires polling/SSE | Native push support |
When to Use WebSockets
Great for:
- Chat applications
- Real-time collaboration (editing, gaming)
- Live dashboards and monitoring
- Multiplayer games
Not ideal for:
- Simple CRUD operations (use REST)
- One-time data fetching
- Stateless resource access
The WebSocket Protocol
Connection Handshake
WebSockets start as HTTP, then upgrade to the WebSocket protocol:
stateDiagram-v2
[*] --> HTTP: Client sends HTTP request
HTTP --> Handshake: Server receives
Handshake --> WebSocket: 101 Switching Protocols
WebSocket --> Connected: Full-duplex established
Connected --> Messaging: Send/receive frames
Messaging --> Closing: Close frame sent
Closing --> [*]: Connection terminated
HTTP Request (Upgrade):
GET /chat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
HTTP Response (Accept):
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Frame Structure
WebSocket messages are sent as frames, not HTTP packets:
+--------+--------+--------+--------+ +--------+
| FIN | RSV1-3 | Opcode | Mask | ... | Payload|
| 1 bit | 3 bits | 4 bits | 1 bit | | |
+--------+--------+--------+--------+ +--------+
Common Opcodes:
- 0x1: Text frame
- 0x2: Binary frame
- 0x8: Close connection
- 0x9: Ping
- 0xA: Pong
WebSocket Lifecycle
stateDiagram-v2
[*] --> Connecting: ws://localhost:8080
Connecting --> Open: Handshake complete (101)
Open --> Message: Send/Receive data
Message --> Open: Continue
Open --> Closing: Normal close or error
Closing --> Closed: TCP connection ends
Closed --> [*]
note right of Connecting
Client sends HTTP Upgrade
Server responds with 101
end note
note right of Message
Full-duplex messaging
No overhead per message
end note
note right of Closing
Close frame exchange
Graceful shutdown
end note
Implementation: TypeScript
We'll use the ws library—the de facto standard for WebSockets in Node.js.
Server Implementation
// examples/03-chat/ts/ws-server.ts
import { WebSocketServer, WebSocket } from 'ws';
interface ChatMessage {
type: 'message' | 'join' | 'leave';
username: string;
content: string;
timestamp: number;
}
const wss = new WebSocketServer({ port: 8080 });
const clients = new Map<WebSocket, string>();
console.log('WebSocket server running on ws://localhost:8080');
wss.on('connection', (ws: WebSocket) => {
  console.log('New client connected');
  // Track liveness for the heartbeat below; a pong resets the flag
  (ws as any).isAlive = true;
  ws.on('pong', () => { (ws as any).isAlive = true; });
// Welcome message
ws.send(JSON.stringify({
type: 'message',
username: 'System',
content: 'Welcome! Please identify yourself.',
timestamp: Date.now()
} as ChatMessage));
// Handle incoming messages
ws.on('message', (data: Buffer) => {
try {
const message: ChatMessage = JSON.parse(data.toString());
if (message.type === 'join') {
// Register username
clients.set(ws, message.username);
console.log(`${message.username} joined`);
// Broadcast to all clients
broadcast({
type: 'message',
username: 'System',
content: `${message.username} has joined the chat`,
timestamp: Date.now()
});
} else if (message.type === 'message') {
const username = clients.get(ws) || 'Anonymous';
console.log(`${username}: ${message.content}`);
// Broadcast the message
broadcast({
type: 'message',
username,
content: message.content,
timestamp: Date.now()
});
}
} catch (error) {
console.error('Invalid message:', error);
}
});
// Handle disconnection
ws.on('close', () => {
const username = clients.get(ws);
if (username) {
console.log(`${username} disconnected`);
clients.delete(ws);
broadcast({
type: 'message',
username: 'System',
content: `${username} has left the chat`,
timestamp: Date.now()
});
}
});
// Handle errors
ws.on('error', (error) => {
console.error('WebSocket error:', error);
});
});
function broadcast(message: ChatMessage): void {
const data = JSON.stringify(message);
wss.clients.forEach((client) => {
if (client.readyState === WebSocket.OPEN) {
client.send(data);
}
});
}
// Heartbeat to detect stale connections
const interval = setInterval(() => {
  wss.clients.forEach((ws) => {
    // isAlive was set on connection and is refreshed by the pong handler above
    const client = ws as WebSocket & { isAlive?: boolean };
    if (client.isAlive === false) {
      return client.terminate();
    }
    client.isAlive = false;
    client.ping();
  });
}, 30000);
wss.on('close', () => {
clearInterval(interval);
});
Client Implementation
// examples/03-chat/ts/ws-client.ts
import { WebSocket } from 'ws';
interface ChatMessage {
type: 'message' | 'join' | 'leave';
username: string;
content: string;
timestamp: number;
}
class ChatClient {
private ws: WebSocket;
private username: string;
private reconnectAttempts = 0;
private readonly maxReconnectAttempts = 5;
constructor(url: string, username: string) {
this.username = username;
this.ws = this.connect(url);
}
private connect(url: string): WebSocket {
const ws = new WebSocket(url);
ws.on('open', () => {
console.log('Connected to chat server');
this.reconnectAttempts = 0;
// Identify ourselves
this.send({
type: 'join',
username: this.username,
content: '',
timestamp: Date.now()
});
});
ws.on('message', (data: Buffer) => {
const message: ChatMessage = JSON.parse(data.toString());
this.displayMessage(message);
});
ws.on('close', () => {
console.log('Disconnected from server');
// Attempt reconnection
if (this.reconnectAttempts < this.maxReconnectAttempts) {
this.reconnectAttempts++;
const delay = Math.min(1000 * Math.pow(2, this.reconnectAttempts), 30000);
console.log(`Reconnecting in ${delay}ms... (attempt ${this.reconnectAttempts})`);
setTimeout(() => {
this.ws = this.connect(url);
}, delay);
}
});
ws.on('error', (error) => {
console.error('WebSocket error:', error.message);
});
// Respond to pings
ws.on('ping', () => {
ws.pong();
});
return ws;
}
public send(message: ChatMessage): void {
if (this.ws.readyState === WebSocket.OPEN) {
this.ws.send(JSON.stringify(message));
} else {
console.error('Cannot send message: connection not open');
}
}
public sendMessage(content: string): void {
this.send({
type: 'message',
username: this.username,
content,
timestamp: Date.now()
});
}
private displayMessage(message: ChatMessage): void {
const time = new Date(message.timestamp).toLocaleTimeString();
console.log(`[${time}] ${message.username}: ${message.content}`);
}
public close(): void {
this.ws.close();
}
}
// CLI interface
const username = process.argv[2] || `User${Math.floor(Math.random() * 1000)}`;
const client = new ChatClient('ws://localhost:8080', username);
console.log(`You are logged in as: ${username}`);
console.log('Type a message and press Enter to send. Press Ctrl+C to exit.');
// Read from stdin
process.stdin.setEncoding('utf8');
process.stdin.on('data', (chunk: string) => {
  const text = chunk.trim();
if (text) {
client.sendMessage(text);
}
});
// Handle graceful shutdown
process.on('SIGINT', () => {
console.log('\nShutting down...');
client.close();
process.exit(0);
});
Package Configuration
// examples/03-chat/ts/package.json
{
"name": "chat-websocket-example",
"version": "1.0.0",
"type": "module",
"scripts": {
"server": "node --loader ts-node/esm ws-server.ts",
"client": "node --loader ts-node/esm ws-client.ts"
},
"dependencies": {
"ws": "^8.18.0"
},
"devDependencies": {
"@types/ws": "^8.5.12",
"ts-node": "^10.9.2",
"typescript": "^5.6.3"
}
}
Implementation: Python
We'll use the websockets library—a fully compliant WebSocket implementation.
Server Implementation
# examples/03-chat/py/ws_server.py
import asyncio
import json
import logging
from datetime import datetime
from typing import Set
import websockets
from websockets.server import WebSocketServerProtocol
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Track connected clients
clients: Set[WebSocketServerProtocol] = set()
usernames: dict[WebSocketServerProtocol, str] = {}
async def broadcast(message: dict) -> None:
"""Send a message to all connected clients."""
if clients:
await asyncio.gather(
*[client.send(json.dumps(message)) for client in clients if client.open],
return_exceptions=True
)
async def handle_client(websocket: WebSocketServerProtocol) -> None:
"""Handle a client connection."""
clients.add(websocket)
logger.info(f"New client connected. Total clients: {len(clients)}")
try:
# Send welcome message
welcome_msg = {
"type": "message",
"username": "System",
"content": "Welcome! Please identify yourself.",
"timestamp": datetime.now().timestamp()
}
await websocket.send(json.dumps(welcome_msg))
# Handle messages
async for message in websocket:
try:
data = json.loads(message)
if data.get("type") == "join":
# Register username
username = data.get("username", "Anonymous")
usernames[websocket] = username
logger.info(f"{username} joined")
# Broadcast join notification
await broadcast({
"type": "message",
"username": "System",
"content": f"{username} has joined the chat",
"timestamp": datetime.now().timestamp()
})
elif data.get("type") == "message":
# Broadcast the message
username = usernames.get(websocket, "Anonymous")
content = data.get("content", "")
logger.info(f"{username}: {content}")
await broadcast({
"type": "message",
"username": username,
"content": content,
"timestamp": datetime.now().timestamp()
})
except json.JSONDecodeError:
logger.error("Invalid JSON received")
except Exception as e:
logger.error(f"Error handling message: {e}")
except websockets.exceptions.ConnectionClosed:
logger.info("Client disconnected unexpectedly")
finally:
# Cleanup
username = usernames.get(websocket)
if username:
del usernames[websocket]
await broadcast({
"type": "message",
"username": "System",
"content": f"{username} has left the chat",
"timestamp": datetime.now().timestamp()
})
clients.discard(websocket)
logger.info(f"Client removed. Total clients: {len(clients)}")
async def main():
"""Start the WebSocket server."""
async with websockets.serve(handle_client, "localhost", 8080):
logger.info("WebSocket server running on ws://localhost:8080")
await asyncio.Future() # Run forever
if __name__ == "__main__":
try:
asyncio.run(main())
except KeyboardInterrupt:
logger.info("Server stopped")
Client Implementation
# examples/03-chat/py/ws_client.py
import asyncio
import json
import sys
from datetime import datetime
import websockets
from websockets.client import WebSocketClientProtocol
class ChatClient:
def __init__(self, url: str, username: str):
self.url = url
self.username = username
self.websocket: WebSocketClientProtocol | None = None
self.reconnect_attempts = 0
self.max_reconnect_attempts = 5
async def connect(self) -> None:
"""Connect to the WebSocket server."""
backoff = 1
while self.reconnect_attempts < self.max_reconnect_attempts:
try:
async with websockets.connect(self.url) as ws:
self.websocket = ws
self.reconnect_attempts = 0
print(f"Connected to {self.url}")
# Identify ourselves
await self.send({
"type": "join",
"username": self.username,
"content": "",
"timestamp": datetime.now().timestamp()
})
# Start receiving messages
receive_task = asyncio.create_task(self.receive_messages())
# Wait for connection to close
await ws.wait_closed()
# Cancel receive task
receive_task.cancel()
try:
await receive_task
except asyncio.CancelledError:
pass
print("Disconnected from server")
except (ConnectionRefusedError, OSError) as e:
self.reconnect_attempts += 1
print(f"Connection failed: {e}")
print(f"Reconnecting in {backoff}s... (attempt {self.reconnect_attempts})")
await asyncio.sleep(backoff)
backoff = min(backoff * 2, 30)
print("Max reconnection attempts reached. Giving up.")
async def receive_messages(self) -> None:
"""Receive and display messages from the server."""
if not self.websocket:
return
try:
async for message in self.websocket:
data = json.loads(message)
self.display_message(data)
except asyncio.CancelledError:
pass
except Exception as e:
print(f"Error receiving message: {e}")
async def send(self, message: dict) -> None:
"""Send a message to the server."""
if self.websocket and not self.websocket.closed:
await self.websocket.send(json.dumps(message))
else:
print("Cannot send message: connection not open")
def display_message(self, message: dict) -> None:
"""Display a received message."""
timestamp = datetime.fromtimestamp(message["timestamp"]).strftime("%H:%M:%S")
print(f"[{timestamp}] {message['username']}: {message['content']}")
async def stdin_reader(client: ChatClient):
"""Read from stdin and send messages."""
loop = asyncio.get_event_loop()
while True:
line = await loop.run_in_executor(None, sys.stdin.readline)
text = line.strip()
if text:
await client.send({
"type": "message",
"username": client.username,
"content": text,
"timestamp": datetime.now().timestamp()
})
async def main():
"""Run the chat client."""
username = sys.argv[1] if len(sys.argv) > 1 else f"User{asyncio.get_event_loop().time() % 1000:.0f}"
client = ChatClient("ws://localhost:8080", username)
print(f"You are logged in as: {username}")
print("Type a message and press Enter to send. Press Ctrl+C to exit.")
# Run connection and stdin reader concurrently
connect_task = asyncio.create_task(client.connect())
# Give connection time to establish
await asyncio.sleep(0.5)
stdin_task = asyncio.create_task(stdin_reader(client))
try:
await asyncio.gather(connect_task, stdin_task)
except KeyboardInterrupt:
print("\nShutting down...")
finally:
connect_task.cancel()
stdin_task.cancel()
if __name__ == "__main__":
try:
asyncio.run(main())
except KeyboardInterrupt:
pass
Requirements
# examples/03-chat/py/requirements.txt
websockets==13.1
Docker Compose Setup
TypeScript Version
# examples/03-chat/ts/docker-compose.yml
version: '3.8'
services:
server:
build:
context: .
dockerfile: Dockerfile
ports:
- "8080:8080"
environment:
- NODE_ENV=production
restart: unless-stopped
# examples/03-chat/ts/Dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./
# Install dev dependencies too; tsc is needed for the build step below
RUN npm ci
COPY . .
RUN npx tsc
EXPOSE 8080
CMD ["node", "dist/ws-server.js"]
Python Version
# examples/03-chat/py/docker-compose.yml
version: '3.8'
services:
server:
build:
context: .
dockerfile: Dockerfile
ports:
- "8080:8080"
restart: unless-stopped
# examples/03-chat/py/Dockerfile
FROM python:3.12-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["python", "ws_server.py"]
Running the Examples
TypeScript
# Install dependencies
cd examples/03-chat/ts
npm install
# Start the server
npm run server
# In another terminal, start a client
npm run client Alice
# In another terminal, start another client
npm run client Bob
Python
# Install dependencies
cd examples/03-chat/py
pip install -r requirements.txt
# Start the server
python ws_server.py
# In another terminal, start a client
python ws_client.py Alice
# In another terminal, start another client
python ws_client.py Bob
With Docker
# Start the server
docker-compose up -d
# Check logs
docker-compose logs -f
# Connect with a client (run from host)
npm run client Alice # or python ws_client.py Alice
Connection Management Best Practices
1. Heartbeat/Ping-Pong
Detect stale connections before they cause issues:
// Server sends ping every 30 seconds
setInterval(() => {
wss.clients.forEach((ws) => {
if (ws.isAlive === false) return ws.terminate();
ws.isAlive = false;
ws.ping();
});
}, 30000);
// Server marks the connection alive again when the pong arrives
ws.on('pong', () => { ws.isAlive = true; });

// Client responds to pings automatically (the ws library pongs for you),
// but an explicit handler also works:
ws.on('ping', () => ws.pong());
2. Exponential Backoff Reconnection
Don't hammer the server when it's down:
function reconnect(attempts: number) {
const delay = Math.min(1000 * Math.pow(2, attempts), 30000);
setTimeout(() => connect(), delay);
}
3. Graceful Shutdown
// Send close frame before terminating
ws.close(1000, 'Normal closure');
// Wait for close frame acknowledgement
ws.on('close', () => {
console.log('Connection closed cleanly');
});
4. Message Serialization
Always validate incoming messages:
function safeParse(data: string): Message | null {
try {
const msg = JSON.parse(data);
if (msg.type && msg.username) {
return msg;
}
} catch {}
return null;
}
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Not handling reconnection | Client stops working on network blip | Implement exponential backoff reconnection |
| Ignoring the close event | Memory leaks from stale clients | Always clean up on disconnect |
| Blocking the event loop | Messages delayed | Use async/await properly, avoid CPU-heavy work |
| Missing heartbeat | Stale connections remain | Implement ping/pong |
| Not validating messages | Crashes on malformed data | Always try/catch JSON parsing |
Testing Your WebSocket Implementation
# Using websocat (like curl for WebSockets)
# Install: cargo install websocat
# Connect and send/receive messages
echo '{"type":"join","username":"TestUser","content":"","timestamp":123456}' | \
websocat ws://localhost:8080
# Interactive mode
websocat ws://localhost:8080
Summary
WebSockets enable real-time, bidirectional communication between clients and servers:
- Protocol: HTTP upgrade handshake → persistent TCP connection
- Communication: Full-duplex messaging with minimal overhead
- Lifecycle: Connecting → Open → Messaging → Closing → Closed
- Best practices: Heartbeats, graceful shutdown, reconnection handling
In the next section, we'll build on this foundation to implement pub/sub messaging for multi-room chat systems.
Exercises
Exercise 1: Add Private Messaging
Extend the chat system to support private messages between users:
// Message format for private messages
{
type: 'private',
from: 'Alice',
to: 'Bob',
content: 'Hey Bob, are you there?',
timestamp: 1234567890
}
Requirements:
- Add a `private` message type
- Route private messages only to the intended recipient
- Show a "private message" indicator in the UI
Exercise 2: Typing Indicators
Show when a user is typing:
// Typing indicator message
{
type: 'typing',
username: 'Alice',
isTyping: true,
timestamp: 1234567890
}
Requirements:
- Send `typing.start` when the user starts typing
- Send `typing.stop` after 2 seconds of inactivity
- Display "Alice is typing..." to other users
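The 2-second cutoff is a client-side debounce; a sketch (onKeystroke and send are assumed names):
```typescript
// Sketch: emit typing.start on the first keypress, typing.stop after 2s of quiet.
let typingTimer: ReturnType<typeof setTimeout> | undefined;

function onKeystroke(send: (type: 'typing.start' | 'typing.stop') => void): void {
  if (typingTimer === undefined) send('typing.start'); // first keypress of a burst
  clearTimeout(typingTimer);
  typingTimer = setTimeout(() => {
    send('typing.stop'); // no keystrokes for 2 seconds
    typingTimer = undefined;
  }, 2000);
}
```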
Exercise 3: Connection Status
Display real-time connection status to the user:
Requirements:
- Show: Connecting → Connected → Disconnected → Reconnecting
- Use visual indicators (green dot, red dot, spinner)
- Display ping/pong latency in milliseconds
Exercise 4: Message History with Reconnection
When a client reconnects, send them messages they missed:
Requirements:
- Store last 100 messages on the server
- When client reconnects, send messages since their last timestamp
- Deduplicate messages the client already has
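A sketch of the server-side buffer (ChatMessage is the interface from ws-server.ts; the replay endpoint and deduplication key are up to you):
```typescript
// Sketch: ring buffer of the last 100 messages plus a replay lookup (Exercise 4).
const history: ChatMessage[] = [];

function remember(msg: ChatMessage): void {
  history.push(msg);
  if (history.length > 100) history.shift(); // keep only the last 100
}

function missedSince(since: number): ChatMessage[] {
  // On reconnect, the client sends its last seen timestamp and
  // deduplicates anything it already displayed.
  return history.filter(m => m.timestamp > since);
}
```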
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
Pub/Sub Messaging and Message Ordering
Session 7, Part 1 - 45 minutes
Learning Objectives
- Understand the publish-subscribe messaging pattern
- Learn about topic-based and content-based routing
- Implement presence tracking and subscriptions
- Understand message ordering challenges in distributed systems
- Implement sequence numbers for causal ordering
What is Pub/Sub?
The publish-subscribe pattern is a messaging pattern where senders (publishers) send messages to an intermediate system, and the system routes messages to interested receivers (subscribers). Publishers and subscribers are decoupled—they don't know about each other.
Key Benefits
- Decoupling: Publishers don't need to know who subscribes
- Scalability: Add subscribers without changing publishers
- Flexibility: Dynamic subscription management
- Asynchrony: Publishers send and continue; subscribers process when ready
Pub/Sub vs Direct Messaging
graph TB
subgraph "Direct Messaging"
P1[Producer] -->|Direct| C1[Consumer 1]
P1 -->|Direct| C2[Consumer 2]
P1 -->|Direct| C3[Consumer 3]
end
subgraph "Pub/Sub Messaging"
P2[Publisher] -->|Publish| B[Broker]
S1[Subscriber 1] -->|Subscribe| B
S2[Subscriber 2] -->|Subscribe| B
S3[Subscriber 3] -->|Subscribe| B
end
| Aspect | Direct Messaging | Pub/Sub |
|---|---|---|
| Coupling | Tight (producer knows consumers) | Loose (producer doesn't know consumers) |
| Flexibility | Low (changes affect producer) | High (dynamic subscriptions) |
| Complexity | Simple | Moderate (requires broker) |
| Use Case | Point-to-point, request-response | Broadcast, events, notifications |
Pub/Sub Patterns
1. Topic-Based Routing
Subscribers express interest in topics (channels). Messages are routed based on the topic they're published to.
sequenceDiagram
participant S1 as Subscriber 1
participant S2 as Subscriber 2
participant S3 as Subscriber 3
participant B as Broker
participant P as Publisher
Note over S1,S3: Subscription Phase
S1->>B: subscribe("sports")
S2->>B: subscribe("sports")
S3->>B: subscribe("news")
Note over S1,S3: Publishing Phase
P->>B: publish("sports", "Game starts!")
B->>S1: deliver("Game starts!")
B->>S2: deliver("Game starts!")
P->>B: publish("news", "Breaking story!")
B->>S3: deliver("Breaking story!")
Use cases: Chat rooms, notification categories, event streams
2. Content-Based Routing
Subscribers specify filter criteria. Messages are routed based on their content.
graph LR
P[Publisher] -->|"type: order, value > 100"| B[Content Router]
B -->|Matches filter| S1[High-Value Handler]
B -->|Matches filter| S2[Order Logger]
B -.->|No match| S3[Low-Value Handler]
Use cases: Event filtering, complex routing rules, IoT sensor data
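The chat system in this session uses topic-based routing, but a content-based router is small enough to sketch (everything below is illustrative, not part of the chat system):
```typescript
// Sketch: subscribers register predicates over the payload instead of topic names.
type Predicate<T> = (msg: T) => boolean;

class ContentRouter<T> {
  private subs: Array<{ filter: Predicate<T>; deliver: (msg: T) => void }> = [];

  subscribe(filter: Predicate<T>, deliver: (msg: T) => void): void {
    this.subs.push({ filter, deliver });
  }

  publish(msg: T): void {
    for (const s of this.subs) {
      if (s.filter(msg)) s.deliver(msg); // only matching subscribers receive it
    }
  }
}

// Matches the diagram above: only high-value orders reach this handler.
const router = new ContentRouter<{ type: string; value: number }>();
router.subscribe(m => m.type === 'order' && m.value > 100, m => console.log('high-value:', m));
router.publish({ type: 'order', value: 250 }); // delivered
router.publish({ type: 'order', value: 10 });  // filtered out
```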
3. Presence Tracking
In real-time systems, knowing who is online (presence) is essential for:
- Showing online/offline status
- Delivering messages only to active users
- Managing connections and reconnections
- Handling user disconnections gracefully
stateDiagram-v2
[*] --> Offline: User created
Offline --> Connecting: Connect request
Connecting --> Online: Auth success
Connecting --> Offline: Auth fail
Online --> Away: No activity
Online --> Offline: Disconnect
Away --> Online: Activity detected
Online --> [*]: User deleted
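A presence tracker can start as a simple in-memory map; the sketch below (names and thresholds are illustrative) mirrors the state diagram, including the online-to-away transition:
```typescript
// Sketch: in-memory presence map with an idle sweep for the Away state.
type Presence = 'offline' | 'connecting' | 'online' | 'away';

const presence = new Map<string, { status: Presence; lastSeen: number }>();

function touch(userId: string, status: Presence): void {
  presence.set(userId, { status, lastSeen: Date.now() });
}

// Every 10s, demote users with no activity for 60s from online to away.
setInterval(() => {
  const now = Date.now();
  for (const [userId, p] of presence) {
    if (p.status === 'online' && now - p.lastSeen > 60_000) {
      presence.set(userId, { ...p, status: 'away' });
    }
  }
}, 10_000);
```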
Message Ordering
The Ordering Problem
In distributed systems, messages may arrive out of order due to:
- Network latency variations
- Multiple servers processing messages
- Message retries and retransmissions
- Concurrent publishers
Types of Ordering
| Ordering Type | Description | Difficulty |
|---|---|---|
| FIFO | Messages from same sender arrive in order sent | Easy |
| Causal | Causally related messages are ordered | Moderate |
| Total | All messages ordered globally | Hard |
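FIFO ordering, the easiest of the three, needs only a counter per sender; a sketch (the chat system below uses a single per-room sequence instead):
```typescript
// Sketch: deliver message n+1 from a sender only after message n from that sender.
const nextFromSender = new Map<string, number>();

function deliverFifo(sender: string, seq: number, deliver: () => void): boolean {
  const expected = nextFromSender.get(sender) ?? 1;
  if (seq !== expected) return false; // out of order: buffer (or drop) for now
  deliver();
  nextFromSender.set(sender, expected + 1);
  return true;
}
```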
Why Ordering Matters
Consider a chat application:
sequenceDiagram
participant A as Alice
participant S as Server
participant B as Bob
Note over A,B: Without ordering - confusion!
A->>S: "Let's meet at 5pm"
A->>S: "Never mind, 6pm instead"
S--xB: "Never mind, 6pm instead"
S--xB: "Let's meet at 5pm"
Note over B: Bob sees messages out of order!
With proper ordering using sequence numbers:
sequenceDiagram
participant A as Alice
participant S as Server
participant B as Bob
Note over A,B: With sequence numbers - correct!
A->>S: [msg#1] "Let's meet at 5pm"
A->>S: [msg#2] "Never mind, 6pm instead"
S--xB: [msg#1] "Let's meet at 5pm"
S--xB: [msg#2] "Never mind, 6pm instead"
Note over B: Bob delivers in order by sequence number
Implementation: Pub/Sub Chat with Ordering
Let's build a pub/sub chat system with:
- Topic-based routing (chat rooms)
- Presence tracking
- Message ordering with sequence numbers
TypeScript Implementation
pubsub-server.ts - Pub/Sub server with ordering:
// src: examples/03-chat/ts/pubsub-server.ts
import { WebSocketServer, WebSocket } from 'ws';

interface Message {
id: string;
room: string;
sender: string;
content: string;
sequence: number;
timestamp: number;
}
interface Subscriber {
id: string;
userId: string;
rooms: Set<string>;
ws: WebSocket;
}
class PubSubServer {
private subscribers: Map<string, Subscriber> = new Map();
private roomSequences: Map<string, number> = new Map();
private messageHistory: Map<string, Message[]> = new Map();
private server: WebSocketServer;
constructor(port: number = 8080) {
this.server = new WebSocketServer({ port });
this.setupHandlers();
console.log(`Pub/Sub server running on port ${port}`);
}
private setupHandlers() {
this.server.on('connection', (ws: WebSocket) => {
const subscriberId = this.generateId();
      ws.on('message', (data: Buffer) => {
try {
const msg = JSON.parse(data.toString());
this.handleMessage(subscriberId, msg, ws);
} catch (err) {
ws.send(JSON.stringify({ error: 'Invalid message format' }));
}
});
ws.on('close', () => {
this.handleDisconnect(subscriberId);
});
});
}
private handleMessage(subscriberId: string, msg: any, ws: WebSocket) {
switch (msg.type) {
case 'subscribe':
this.handleSubscribe(subscriberId, msg.room, msg.userId, ws);
break;
case 'unsubscribe':
this.handleUnsubscribe(subscriberId, msg.room);
break;
case 'publish':
this.handlePublish(msg);
break;
case 'get_history':
this.handleGetHistory(msg.room, ws);
break;
}
}
private handleSubscribe(
subscriberId: string,
room: string,
userId: string,
ws: WebSocket
) {
if (!this.subscribers.has(subscriberId)) {
this.subscribers.set(subscriberId, {
id: subscriberId,
userId,
rooms: new Set(),
ws,
});
}
const subscriber = this.subscribers.get(subscriberId)!;
subscriber.rooms.add(room);
// Initialize room state if needed
if (!this.roomSequences.has(room)) {
this.roomSequences.set(room, 0);
this.messageHistory.set(room, []);
}
// Send presence notification
this.broadcast(room, {
type: 'presence',
userId,
action: 'join',
timestamp: Date.now(),
});
// Send current sequence number
ws.send(JSON.stringify({
type: 'subscribed',
room,
sequence: this.roomSequences.get(room),
}));
console.log(`${userId} subscribed to ${room}`);
}
private handleUnsubscribe(subscriberId: string, room: string) {
const subscriber = this.subscribers.get(subscriberId);
if (subscriber) {
subscriber.rooms.delete(room);
// Send presence notification
this.broadcast(room, {
type: 'presence',
userId: subscriber.userId,
action: 'leave',
timestamp: Date.now(),
});
}
}
private handlePublish(msg: any) {
const { room, sender, content } = msg;
const sequence = (this.roomSequences.get(room) || 0) + 1;
this.roomSequences.set(room, sequence);
const message: Message = {
id: this.generateId(),
room,
sender,
content,
sequence,
timestamp: Date.now(),
};
// Store in history
const history = this.messageHistory.get(room) || [];
history.push(message);
this.messageHistory.set(room, history.slice(-100)); // Keep last 100
// Broadcast to all subscribers
this.broadcast(room, {
type: 'message',
...message,
});
}
private handleGetHistory(room: string, ws: WebSocket) {
const history = this.messageHistory.get(room) || [];
ws.send(JSON.stringify({
type: 'history',
room,
messages: history,
}));
}
private broadcast(room: string, payload: any) {
const payloadStr = JSON.stringify(payload);
for (const [_, subscriber] of this.subscribers) {
if (subscriber.rooms.has(room) && subscriber.ws.readyState === WebSocket.OPEN) {
subscriber.ws.send(payloadStr);
}
}
}
private handleDisconnect(subscriberId: string) {
const subscriber = this.subscribers.get(subscriberId);
if (subscriber) {
// Notify all rooms the user was in
for (const room of subscriber.rooms) {
this.broadcast(room, {
type: 'presence',
userId: subscriber.userId,
action: 'leave',
timestamp: Date.now(),
});
}
this.subscribers.delete(subscriberId);
}
}
private generateId(): string {
return Math.random().toString(36).substring(2, 15);
}
}
const PORT = parseInt(process.env.PORT || '8080');
new PubSubServer(PORT);
pubsub-client.ts - Client with ordering buffer:
// src: examples/03-chat/ts/pubsub-client.ts
import { WebSocket } from 'ws';
import * as readline from 'node:readline';

interface ClientMessage {
type: string;
sequence?: number;
[key: string]: any;
}
class PubSubClient {
private ws: WebSocket | null = null;
private userId: string;
private messageBuffer: Map<string, Map<number, ClientMessage>> = new Map();
private expectedSequence: Map<string, number> = new Map();
private reconnectAttempts = 0;
private maxReconnectAttempts = 5;
constructor(
private url: string,
userId?: string
) {
this.userId = userId || `user-${Math.random().toString(36).substring(7)}`;
}
connect() {
this.ws = new WebSocket(this.url);
this.ws.on('open', () => {
console.log(`Connected as ${this.userId}`);
this.reconnectAttempts = 0;
});
    this.ws.on('message', (data: Buffer) => {
const msg: ClientMessage = JSON.parse(data.toString());
this.handleMessage(msg);
});
this.ws.on('close', () => {
console.log('Disconnected. Attempting to reconnect...');
this.reconnect();
});
this.ws.on('error', (err) => {
console.error('WebSocket error:', err);
});
}
private handleMessage(msg: ClientMessage) {
switch (msg.type) {
case 'subscribed':
this.expectedSequence.set(msg.room, (msg.sequence || 0) + 1);
console.log(`Subscribed to ${msg.room} at sequence ${msg.sequence}`);
break;
case 'message':
this.handleOrderedMessage(msg.room, msg);
break;
      case 'presence':
        console.log(`${msg.userId} ${msg.action === 'join' ? 'joined' : 'left'}`);
        break;
case 'history':
console.log(`Received ${msg.messages.length} historical messages`);
msg.messages.forEach((m: ClientMessage) => this.displayMessage(m));
break;
}
}
private handleOrderedMessage(room: string, msg: ClientMessage) {
const seq = msg.sequence!;
// Initialize buffer if needed
if (!this.messageBuffer.has(room)) {
this.messageBuffer.set(room, new Map());
}
const buffer = this.messageBuffer.get(room)!;
const expected = this.expectedSequence.get(room) || 1;
if (seq === expected) {
// Expected message - deliver immediately
this.displayMessage(msg);
this.expectedSequence.set(room, seq + 1);
// Check buffer for next messages
this.deliverBufferedMessages(room);
} else if (seq > expected) {
// Future message - buffer it
buffer.set(seq, msg);
console.log(`Buffered message ${seq} (expecting ${expected})`);
}
// seq < expected: old message, ignore
}
private deliverBufferedMessages(room: string) {
const buffer = this.messageBuffer.get(room);
if (!buffer) return;
const expected = this.expectedSequence.get(room) || 1;
while (buffer.has(expected)) {
const msg = buffer.get(expected)!;
this.displayMessage(msg);
buffer.delete(expected);
this.expectedSequence.set(room, expected + 1);
}
}
private displayMessage(msg: ClientMessage) {
console.log(`[${msg.sequence}] ${msg.sender}: ${msg.content}`);
}
subscribe(room: string) {
this.send({ type: 'subscribe', room, userId: this.userId });
}
unsubscribe(room: string) {
this.send({ type: 'unsubscribe', room });
}
publish(room: string, content: string) {
this.send({
type: 'publish',
room,
sender: this.userId,
content,
});
}
getHistory(room: string) {
this.send({ type: 'get_history', room });
}
private send(payload: any) {
if (this.ws?.readyState === WebSocket.OPEN) {
this.ws.send(JSON.stringify(payload));
} else {
console.error('WebSocket not connected');
}
}
private reconnect() {
if (this.reconnectAttempts < this.maxReconnectAttempts) {
this.reconnectAttempts++;
const delay = Math.min(1000 * Math.pow(2, this.reconnectAttempts), 30000);
setTimeout(() => this.connect(), delay);
} else {
console.error('Max reconnection attempts reached');
}
}
}
// CLI usage
const args = process.argv.slice(2);
const url = args[0] || 'ws://localhost:8080';
const client = new PubSubClient(url);
client.connect();
// Simple readline interface
const readline = require('readline');
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
console.log('Commands: /join <room>, /leave <room>, /history <room>, /quit');
console.log('Any other input will be sent to the current room');
let currentRoom = '';
const showPrompt = () => {
if (currentRoom) {
rl.question(`[${currentRoom}]> `, (input) => {
      if (input === '/quit') {
        rl.close();
        process.exit(0);
} else if (input.startsWith('/join ')) {
currentRoom = input.substring(6);
client.subscribe(currentRoom);
} else if (input.startsWith('/leave ')) {
const room = input.substring(7);
client.unsubscribe(room);
if (room === currentRoom) currentRoom = '';
} else if (input.startsWith('/history ')) {
const room = input.substring(9);
client.getHistory(room);
} else if (input && currentRoom) {
client.publish(currentRoom, input);
}
showPrompt();
});
} else {
rl.question('(no room)> ', (input) => {
if (input.startsWith('/join ')) {
currentRoom = input.substring(6);
client.subscribe(currentRoom);
}
showPrompt();
});
}
};
showPrompt();
Python Implementation
pubsub_server.py - Pub/Sub server with ordering:
# src: examples/03-chat/py/pubsub_server.py
import asyncio
import json
import time
from typing import Dict, Set, List
from dataclasses import dataclass, asdict
import websockets
from websockets.server import WebSocketServerProtocol
@dataclass
class Message:
id: str
room: str
sender: str
content: str
sequence: int
timestamp: int
class PubSubServer:
def __init__(self, port: int = 8080):
self.port = port
self.subscribers: Dict[str, dict] = {}
self.room_sequences: Dict[str, int] = {}
self.message_history: Dict[str, List[Message]] = {}
async def handle_connection(self, ws: WebSocketServerProtocol):
subscriber_id = self._generate_id()
try:
async for message in ws:
try:
data = json.loads(message)
await self.handle_message(subscriber_id, data, ws)
except json.JSONDecodeError:
await ws.send(json.dumps({"error": "Invalid message format"}))
finally:
await self.handle_disconnect(subscriber_id)
async def handle_message(self, subscriber_id: str, msg: dict, ws: WebSocketServerProtocol):
msg_type = msg.get("type")
if msg_type == "subscribe":
await self.handle_subscribe(subscriber_id, msg["room"], msg["userId"], ws)
elif msg_type == "unsubscribe":
await self.handle_unsubscribe(subscriber_id, msg["room"])
elif msg_type == "publish":
await self.handle_publish(msg)
elif msg_type == "get_history":
await self.handle_get_history(msg["room"], ws)
async def handle_subscribe(
self, subscriber_id: str, room: str, user_id: str, ws: WebSocketServerProtocol
):
if subscriber_id not in self.subscribers:
self.subscribers[subscriber_id] = {
"id": subscriber_id,
"userId": user_id,
"rooms": set(),
"ws": ws,
}
subscriber = self.subscribers[subscriber_id]
subscriber["rooms"].add(room)
# Initialize room state
if room not in self.room_sequences:
self.room_sequences[room] = 0
self.message_history[room] = []
# Send presence notification
await self.broadcast(room, {
"type": "presence",
"userId": user_id,
"action": "join",
"timestamp": int(time.time() * 1000),
})
# Send current sequence number
await ws.send(json.dumps({
"type": "subscribed",
"room": room,
"sequence": self.room_sequences[room],
}))
print(f"{user_id} subscribed to {room}")
async def handle_unsubscribe(self, subscriber_id: str, room: str):
subscriber = self.subscribers.get(subscriber_id)
if subscriber:
subscriber["rooms"].discard(room)
await self.broadcast(room, {
"type": "presence",
"userId": subscriber["userId"],
"action": "leave",
"timestamp": int(time.time() * 1000),
})
async def handle_publish(self, msg: dict):
room = msg["room"]
sender = msg["sender"]
content = msg["content"]
sequence = self.room_sequences.get(room, 0) + 1
self.room_sequences[room] = sequence
message = Message(
id=self._generate_id(),
room=room,
sender=sender,
content=content,
sequence=sequence,
timestamp=int(time.time() * 1000),
)
# Store in history
history = self.message_history[room]
history.append(message)
self.message_history[room] = history[-100:] # Keep last 100
# Broadcast
await self.broadcast(room, {
"type": "message",
**asdict(message),
})
async def handle_get_history(self, room: str, ws: WebSocketServerProtocol):
history = self.message_history.get(room, [])
await ws.send(json.dumps({
"type": "history",
"room": room,
"messages": [asdict(m) for m in history],
}))
async def broadcast(self, room: str, payload: dict):
payload_str = json.dumps(payload)
tasks = []
for subscriber in self.subscribers.values():
if room in subscriber["rooms"]:
ws = subscriber["ws"]
if not ws.closed:
tasks.append(ws.send(payload_str))
if tasks:
await asyncio.gather(*tasks, return_exceptions=True)
async def handle_disconnect(self, subscriber_id: str):
subscriber = self.subscribers.get(subscriber_id)
if subscriber:
# Notify all rooms
for room in list(subscriber["rooms"]):
await self.broadcast(room, {
"type": "presence",
"userId": subscriber["userId"],
"action": "leave",
"timestamp": int(time.time() * 1000),
})
del self.subscribers[subscriber_id]
def _generate_id(self) -> str:
import random
import string
return ''.join(random.choices(string.ascii_lowercase + string.digits, k=12))
async def start(self):
print(f"Pub/Sub server running on port {self.port}")
async with websockets.serve(self.handle_connection, "", self.port):
await asyncio.Future() # Run forever
if __name__ == "__main__":
import os
port = int(os.environ.get("PORT", "8080"))
server = PubSubServer(port)
asyncio.run(server.start())
pubsub_client.py - Client with ordering buffer:
# src: examples/03-chat/py/pubsub_client.py
import asyncio
import json
import time
from typing import Dict, Optional
import websockets
from websockets.client import WebSocketClientProtocol
class PubSubClient:
def __init__(self, url: str, user_id: Optional[str] = None):
self.url = url
self.user_id = user_id or f"user-{int(time.time())}"
self.ws: Optional[WebSocketClientProtocol] = None
self.message_buffer: Dict[str, Dict[int, dict]] = {}
self.expected_sequence: Dict[str, int] = {}
self.reconnect_attempts = 0
self.max_reconnect_attempts = 5
async def connect(self):
try:
self.ws = await websockets.connect(self.url)
print(f"Connected as {self.user_id}")
self.reconnect_attempts = 0
asyncio.create_task(self.listen())
except Exception as e:
print(f"Connection failed: {e}")
await self.reconnect()
async def listen(self):
if not self.ws:
return
try:
async for message in self.ws:
data = json.loads(message)
await self.handle_message(data)
except websockets.exceptions.ConnectionClosed:
print("Disconnected. Attempting to reconnect...")
await self.reconnect()
async def handle_message(self, msg: dict):
msg_type = msg.get("type")
if msg_type == "subscribed":
room = msg["room"]
self.expected_sequence[room] = msg.get("sequence", 0) + 1
print(f"Subscribed to {room} at sequence {msg.get('sequence', 0)}")
elif msg_type == "message":
await self.handle_ordered_message(msg["room"], msg)
elif msg_type == "presence":
print(f"{msg['userId']} {msg['action']}ed")
elif msg_type == "history":
print(f"Received {len(msg['messages'])} historical messages")
for m in msg["messages"]:
self.display_message(m)
async def handle_ordered_message(self, room: str, msg: dict):
seq = msg["sequence"]
if room not in self.message_buffer:
self.message_buffer[room] = {}
buffer = self.message_buffer[room]
expected = self.expected_sequence.get(room, 1)
if seq == expected:
# Expected message - deliver immediately
self.display_message(msg)
self.expected_sequence[room] = seq + 1
# Check buffer for next messages
await self.deliver_buffered_messages(room)
elif seq > expected:
# Future message - buffer it
buffer[seq] = msg
print(f"Buffered message {seq} (expecting {expected})")
async def deliver_buffered_messages(self, room: str):
buffer = self.message_buffer.get(room, {})
expected = self.expected_sequence.get(room, 1)
while expected in buffer:
msg = buffer[expected]
self.display_message(msg)
del buffer[expected]
self.expected_sequence[room] = expected + 1
expected += 1
def display_message(self, msg: dict):
print(f"[{msg['sequence']}] {msg['sender']}: {msg['content']}")
async def subscribe(self, room: str):
await self.send({"type": "subscribe", "room": room, "userId": self.user_id})
async def unsubscribe(self, room: str):
await self.send({"type": "unsubscribe", "room": room})
async def publish(self, room: str, content: str):
await self.send({
"type": "publish",
"room": room,
"sender": self.user_id,
"content": content,
})
async def get_history(self, room: str):
await self.send({"type": "get_history", "room": room})
async def send(self, payload: dict):
if self.ws and not self.ws.closed:
await self.ws.send(json.dumps(payload))
else:
print("WebSocket not connected")
async def reconnect(self):
if self.reconnect_attempts < self.max_reconnect_attempts:
self.reconnect_attempts += 1
delay = min(1000 * (2 ** self.reconnect_attempts), 30000) / 1000
await asyncio.sleep(delay)
await self.connect()
else:
print("Max reconnection attempts reached")
async def main():
import sys
url = sys.argv[1] if len(sys.argv) > 1 else "ws://localhost:8080"
client = PubSubClient(url)
await client.connect()
# Simple CLI
current_room = ""
print('Commands: /join <room>, /leave <room>, /history <room>, /quit')
while True:
try:
prompt = f"[{current_room}]> " if current_room else "(no room)> "
            line = await asyncio.get_running_loop().run_in_executor(None, input, prompt)
if line == "/quit":
break
elif line.startswith("/join "):
current_room = line[6:]
await client.subscribe(current_room)
elif line.startswith("/leave "):
room = line[7:]
await client.unsubscribe(room)
if room == current_room:
current_room = ""
elif line.startswith("/history "):
room = line[9:]
await client.get_history(room)
elif line and current_room:
await client.publish(current_room, line)
except EOFError:
break
if client.ws:
await client.ws.close()
if __name__ == "__main__":
asyncio.run(main())
Running the Examples
TypeScript Version
cd distributed-systems-course/examples/03-chat/ts
# Install dependencies
npm install
# Start the server
PORT=8080 npx ts-node pubsub-server.ts
# In another terminal, start a client
npx ts-node pubsub-client.ts
Python Version
cd distributed-systems-course/examples/03-chat/py
# Install dependencies
pip install -r requirements.txt
# Start the server
PORT=8080 python pubsub_server.py
# In another terminal, start a client
python pubsub_client.py
Docker Compose
docker-compose.yml (TypeScript):
services:
pubsub-server:
build: .
ports:
- "8080:8080"
environment:
- PORT=8080
docker-compose up
Testing the Pub/Sub System
Test 1: Basic Pub/Sub
- Start three clients in separate terminals
- Client 1: /join general
- Client 2: /join general
- Client 1: Hello everyone!
- Client 2 should receive the message
- Client 3: /join general
- Client 3: /history general (should see the previous messages)
Test 2: Multiple Rooms
- Client 1: /join sports
- Client 2: /join news
- Client 1: Game starting! (delivered only in sports)
- Client 2: Breaking news! (delivered only in news)
- Client 3: /join sports and /join news (receives both)
Test 3: Message Ordering
- Start a client and join a room
- Send messages rapidly: msg1, msg2, msg3
- Observe the sequence numbers: [1], [2], [3]
- Note that the order is preserved
Test 4: Presence Tracking
- Start two clients
- Both join the same room
- Observe presence notifications (user joined/left)
- Disconnect one client (Ctrl+C)
- Other client receives leave notification
Exercises
Exercise 1: Implement Last-Message Cache
The server already keeps the last 100 messages per room; make this retention policy configurable.
Tasks:
- Make the history size configurable via an environment variable (see the sketch below)
- Add a /clear_history command for admins
- Add a TTL (time-to-live) for old messages
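A minimal sketch of the first task, using a hypothetical HISTORY_LIMIT environment variable:
// Hypothetical env var name; defaults to the current hard-coded value of 100.
const HISTORY_LIMIT = parseInt(process.env.HISTORY_LIMIT || '100', 10);
// In handlePublish, the hard-coded slice then becomes:
// this.messageHistory.set(room, history.slice(-HISTORY_LIMIT));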
Exercise 2: Implement Private Messages
Extend the system to support direct messages between users.
Requirements:
- Private messages should only be delivered to the recipient
- Use a special topic format: @username
- Include sender authentication
Hint: You'll need to modify the handlePublish method to check for @ prefix.
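A hedged sketch of that branch. It assumes the subscriber bookkeeping shown in the server above (a userId field on each subscriber):
// Inside handlePublish, before the room lookup: route @username directly.
if (room.startsWith('@')) {
  const recipient = room.substring(1);
  for (const [, sub] of this.subscribers) {
    if (sub.userId === recipient && sub.ws.readyState === WebSocket.OPEN) {
      sub.ws.send(JSON.stringify({ type: 'private', sender, content }));
    }
  }
  return; // private messages skip room sequencing and history
}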
Exercise 3: Add Message Acknowledgments
Implement acknowledgments to guarantee message delivery.
Requirements:
- Clients must ACK received messages
- Server tracks unacknowledged messages
- On reconnect, server resends unacknowledged messages
Hint: Add an ack message type and track pending messages per subscriber.
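One possible shape for that bookkeeping, written as additions to PubSubServer (the pending field and the ack payload shape are assumptions, not part of the code above):
// Per-subscriber queue of messages awaiting acknowledgment.
private pending: Map<string, Message[]> = new Map();
// Called when a client sends { type: 'ack', messageId }.
private handleAck(subscriberId: string, messageId: string) {
  const queue = this.pending.get(subscriberId) || [];
  this.pending.set(subscriberId, queue.filter(m => m.id !== messageId));
}
On reconnect, the server would replay whatever remains in that subscriber's pending queue.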
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Sequence number desync | Messages not displayed | Re-subscribe to reset sequence |
| Memory leak from history | Growing memory usage | Implement history size limits |
| Missing presence updates | Stale online status | Add heartbeat/ping messages (sketch below) |
| Race conditions | Messages lost during reconnect | Buffer messages during disconnection |
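For the heartbeat fix, a small sketch using the ws library's built-in ping frames (the 30-second interval is an assumption):
import WebSocket from 'ws';
// Ping every tracked socket periodically; sockets that stop answering with
// pong frames can be marked offline by a companion timeout check.
function startHeartbeat(sockets: Iterable<WebSocket>, intervalMs = 30000) {
  return setInterval(() => {
    for (const ws of sockets) {
      if (ws.readyState === WebSocket.OPEN) ws.ping();
    }
  }, intervalMs);
}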
Real-World Examples
| System | Pub/Sub Implementation | Ordering Strategy |
|---|---|---|
| Redis Pub/Sub | Topic-based channels | No ordering guarantees |
| Apache Kafka | Partitioned topics | Per-partition ordering |
| Google Cloud Pub/Sub | Topic-based with subscriptions | Optional ordering via ordering keys |
| AWS SNS | Topic-based fanout | Best-effort ordering |
| RabbitMQ | Exchange/queue binding | FIFO within queue |
Summary
- Pub/Sub decouples publishers from subscribers through an intermediary broker
- Topic-based routing is the simplest and most common pattern
- Presence tracking enables online/offline status in real-time systems
- Message ordering requires sequence numbers and buffering
- Causal ordering is achievable with modest complexity
- Total ordering is expensive and often unnecessary
Next: Chat System Implementation →
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
Chat System Implementation
Session 7 - Full session (90 minutes)
Learning Objectives
- Build a complete real-time chat system with WebSockets
- Implement message ordering with sequence numbers
- Handle presence management (online/offline users)
- Add message persistence for history
- Deploy multiple chat rooms using Docker Compose
System Architecture
Our chat system brings together all the concepts from Sessions 6-7:
graph TB
subgraph "Clients"
C1[User 1 Browser]
C2[User 2 Browser]
C3[User 3 Browser]
end
subgraph "Chat Server"
WS[WebSocket Handler]
PS[Pub/Sub Engine]
SM[Sequence Manager]
PM[Presence Manager]
MS[Message Store]
WS --> PS
WS --> SM
WS --> PM
PS --> SM
SM --> MS
end
C1 -->|WebSocket| WS
C2 -->|WebSocket| WS
C3 -->|WebSocket| WS
subgraph "Persistence"
DB[(Messages DB)]
end
MS --> DB
style WS fill:#e3f2fd
style PS fill:#fff3e0
style SM fill:#f3e5f5
Key Components
| Component | Responsibility |
|---|---|
| WebSocket Handler | Manages client connections, sends/receives messages |
| Pub/Sub Engine | Routes messages to rooms, handles subscriptions |
| Sequence Manager | Assigns sequence numbers, ensures ordering |
| Presence Manager | Tracks online/offline status, heartbeat |
| Message Store | Persists messages for history and replay |
Message Flow
sequenceDiagram
participant U1 as User 1
participant WS as WebSocket Handler
participant PS as Pub/Sub
participant SM as Sequencer
participant DB as Message Store
participant U2 as User 2
U1->>WS: CONNECT("general")
WS->>PS: subscribe("general", U1)
WS->>PM: mark_online(U1)
PS->>U2: BROADCAST("User 1 joined")
Note over U1,U2: Sending a message
U1->>WS: SEND("general", "Hello!")
WS->>PS: publish("general", msg)
PS->>SM: get_sequence(msg)
SM->>DB: save(msg, seq=1)
SM->>PS: return seq=1
PS->>U1: BROADCAST(msg, seq=1)
PS->>U2: BROADCAST(msg, seq=1)
Note over U1,U2: User 2 reconnects
U2->>WS: CONNECT("general", last_seq=0)
WS->>DB: get_messages(since=0)
DB->>U2: REPLAY([msg1, msg2, ...])
TypeScript Implementation
Project Structure
chat-system/
├── package.json
├── tsconfig.json
├── src/
│ ├── types.ts # Type definitions
│ ├── pub-sub.ts # Pub/Sub engine
│ ├── sequencer.ts # Sequence number manager
│ ├── presence.ts # Presence management
│ ├── store.ts # Message persistence
│ ├── server.ts # WebSocket server
│ └── index.ts # Entry point
├── public/
│ └── client.html # Demo client
├── Dockerfile
└── docker-compose.yml
1. Type Definitions
// src/types.ts
export interface Message {
id: string;
room: string;
user: string;
content: string;
sequence: number;
timestamp: number;
}
export interface Client {
id: string;
user: string;
rooms: Set<string>;
ws: WebSocket;
lastSeen: number;
}
export interface Presence {
user: string;
status: 'online' | 'offline' | 'away';
lastSeen: number;
}
export type MessageHandler = (client: Client, message: Message) => void;
2. Pub/Sub Engine
// src/pub-sub.ts
import { Message, Client, MessageHandler } from './types';
export class PubSub {
private subscriptions: Map<string, Set<Client>> = new Map();
private handlers: Map<string, MessageHandler[]> = new Map();
subscribe(room: string, client: Client): void {
if (!this.subscriptions.has(room)) {
this.subscriptions.set(room, new Set());
}
this.subscriptions.get(room)!.add(client);
client.rooms.add(room);
}
unsubscribe(room: string, client: Client): void {
const subs = this.subscriptions.get(room);
if (subs) {
subs.delete(client);
if (subs.size === 0) {
this.subscriptions.delete(room);
}
}
client.rooms.delete(room);
}
publish(room: string, message: Message): void {
const subs = this.subscriptions.get(room);
if (subs) {
for (const client of subs) {
this.sendToClient(client, message);
}
}
this.emit('message', message);
}
on(event: string, handler: MessageHandler): void {
if (!this.handlers.has(event)) {
this.handlers.set(event, []);
}
this.handlers.get(event)!.push(handler);
}
private emit(event: string, message: Message): void {
const handlers = this.handlers.get(event) || [];
handlers.forEach(h => h(null!, message));
}
private sendToClient(client: Client, message: Message): void {
if (client.ws.readyState === client.ws.OPEN) {
client.ws.send(JSON.stringify({
type: 'message',
data: message
}));
}
}
getSubscribers(room: string): Client[] {
return Array.from(this.subscriptions.get(room) || []);
}
getRooms(): string[] {
return Array.from(this.subscriptions.keys());
}
}
3. Sequence Manager
// src/sequencer.ts
import { Message } from './types';
export class Sequencer {
private sequences: Map<string, number> = new Map();
getNext(room: string): number {
const current = this.sequences.get(room) || 0;
const next = current + 1;
this.sequences.set(room, next);
return next;
}
setCurrent(room: string, sequence: number): void {
this.sequences.set(room, sequence);
}
getCurrent(room: string): number {
return this.sequences.get(room) || 0;
}
sequenceMessage(message: Message): Message {
const seq = this.getNext(message.room);
return { ...message, sequence: seq };
}
}
4. Presence Manager
// src/presence.ts
import { Client, Presence } from './types';
const HEARTBEAT_INTERVAL = 30000; // 30 seconds
const OFFLINE_TIMEOUT = 60000; // 60 seconds
export class PresenceManager {
private users: Map<string, Presence> = new Map();
private clients: Map<string, Client> = new Map();
private intervals: Map<string, NodeJS.Timeout> = new Map();
register(client: Client): void {
this.clients.set(client.id, client);
this.updatePresence(client.user, 'online');
this.startHeartbeat(client);
}
unregister(client: Client): void {
this.stopHeartbeat(client);
this.clients.delete(client.id);
this.updatePresence(client.user, 'offline');
}
updatePresence(user: string, status: 'online' | 'offline' | 'away'): void {
this.users.set(user, {
user,
status,
lastSeen: Date.now()
});
}
getPresence(user: string): Presence | undefined {
return this.users.get(user);
}
getOnlineUsers(): string[] {
const now = Date.now();
return Array.from(this.users.values())
.filter(p => p.status === 'online' && (now - p.lastSeen) < OFFLINE_TIMEOUT)
.map(p => p.user);
}
getPresenceInRoom(room: string): Presence[] {
const now = Date.now();
const usersInRoom = new Set<string>();
for (const client of this.clients.values()) {
if (client.rooms.has(room)) {
usersInRoom.add(client.user);
}
}
return Array.from(usersInRoom)
.map(user => this.users.get(user)!)
.filter(p => p && (now - p.lastSeen) < OFFLINE_TIMEOUT);
}
private startHeartbeat(client: Client): void {
const interval = setInterval(() => {
if (client.ws.readyState === client.ws.OPEN) {
client.ws.send(JSON.stringify({ type: 'heartbeat' }));
this.updatePresence(client.user, 'online');
}
}, HEARTBEAT_INTERVAL);
this.intervals.set(client.id, interval);
}
private stopHeartbeat(client: Client): void {
const interval = this.intervals.get(client.id);
if (interval) {
clearInterval(interval);
this.intervals.delete(client.id);
}
}
cleanup(): void {
for (const interval of this.intervals.values()) {
clearInterval(interval);
}
this.intervals.clear();
}
}
5. Message Store
// src/store.ts
import { Message } from './types';
import fs from 'fs/promises';
import path from 'path';
export class MessageStore {
private basePath: string;
constructor(basePath: string = './data/messages') {
this.basePath = basePath;
}
async save(message: Message): Promise<void> {
const roomPath = path.join(this.basePath, message.room);
await fs.mkdir(roomPath, { recursive: true });
const filename = path.join(roomPath, `${message.sequence}.json`);
await fs.writeFile(filename, JSON.stringify(message, null, 2));
}
async getMessages(room: string, since: number = 0, limit: number = 100): Promise<Message[]> {
const roomPath = path.join(this.basePath, room);
const messages: Message[] = [];
try {
const files = await fs.readdir(roomPath);
const jsonFiles = files
.filter(f => f.endsWith('.json'))
.map(f => parseInt(f.replace('.json', '')))
.filter(seq => seq > since)
.sort((a, b) => a - b)
.slice(0, limit);
for (const seq of jsonFiles) {
const content = await fs.readFile(path.join(roomPath, `${seq}.json`), 'utf-8');
messages.push(JSON.parse(content));
}
} catch (err) {
// Room doesn't exist yet
}
return messages;
}
async getLastSequence(room: string): Promise<number> {
const roomPath = path.join(this.basePath, room);
try {
const files = await fs.readdir(roomPath);
const sequences = files
.filter(f => f.endsWith('.json'))
.map(f => parseInt(f.replace('.json', '')));
return sequences.length > 0 ? Math.max(...sequences) : 0;
} catch {
return 0;
}
}
}
6. WebSocket Server
// src/server.ts
import { WebSocketServer, WebSocket } from 'ws';
import { createServer } from 'http';
import { v4 as uuidv4 } from 'uuid';
import { PubSub } from './pub-sub';
import { Sequencer } from './sequencer';
import { PresenceManager } from './presence';
import { MessageStore } from './store';
import { Client, Message } from './types';
const PORT = process.env.PORT || 8080;
export class ChatServer {
  private server = createServer();
  private wss: WebSocketServer;
  private pubSub: PubSub;
  private sequencer: Sequencer;
  private presence: PresenceManager;
  private store: MessageStore;
  constructor() {
    this.wss = new WebSocketServer({ server: this.server });
this.pubSub = new PubSub();
this.sequencer = new Sequencer();
this.presence = new PresenceManager();
this.store = new MessageStore();
this.setupHandlers();
}
private setupHandlers(): void {
this.wss.on('connection', (ws: WebSocket) => {
const clientId = uuidv4();
const client: Client = {
id: clientId,
user: `user_${clientId.slice(0, 8)}`,
rooms: new Set(),
ws,
lastSeen: Date.now()
};
console.log(`Client connected: ${client.id}`);
ws.on('message', async (data: string) => {
try {
const msg = JSON.parse(data);
await this.handleMessage(client, msg);
} catch (err) {
console.error('Error handling message:', err);
}
});
ws.on('close', () => {
console.log(`Client disconnected: ${client.id}`);
for (const room of client.rooms) {
this.pubSub.publish(room, {
id: uuidv4(),
room,
user: 'system',
content: `${client.user} left the room`,
sequence: this.sequencer.getCurrent(room),
timestamp: Date.now()
});
this.pubSub.unsubscribe(room, client);
}
this.presence.unregister(client);
});
// Send welcome message
this.sendToClient(client, {
type: 'connected',
data: { clientId: client.id, user: client.user }
});
this.presence.register(client);
});
}
private async handleMessage(client: Client, msg: any): Promise<void> {
switch (msg.type) {
case 'join':
await this.handleJoin(client, msg.room);
break;
case 'leave':
this.handleLeave(client, msg.room);
break;
case 'message':
await this.handleChatMessage(client, msg.data);
break;
case 'presence':
this.handlePresenceRequest(client, msg.room);
break;
case 'history':
await this.handleHistoryRequest(client, msg.room, msg.since);
break;
default:
console.log('Unknown message type:', msg.type);
}
}
private async handleJoin(client: Client, room: string): Promise<void> {
console.log(`${client.user} joining room: ${room}`);
// Subscribe to room
this.pubSub.subscribe(room, client);
// Send current presence
const presence = this.presence.getPresenceInRoom(room);
this.sendToClient(client, {
type: 'presence',
data: { room, users: presence }
});
// Announce join
this.pubSub.publish(room, {
id: uuidv4(),
room,
user: 'system',
content: `${client.user} joined the room`,
sequence: this.sequencer.getCurrent(room),
timestamp: Date.now()
});
// Send recent messages
const history = await this.store.getMessages(room, 0, 50);
if (history.length > 0) {
this.sendToClient(client, {
type: 'history',
data: { room, messages: history }
});
}
}
private handleLeave(client: Client, room: string): void {
console.log(`${client.user} leaving room: ${room}`);
this.pubSub.unsubscribe(room, client);
this.pubSub.publish(room, {
id: uuidv4(),
room,
user: 'system',
content: `${client.user} left the room`,
sequence: this.sequencer.getCurrent(room),
timestamp: Date.now()
});
}
private async handleChatMessage(client: Client, data: any): Promise<void> {
const { room, content } = data;
if (!client.rooms.has(room)) {
this.sendError(client, 'Not subscribed to room');
return;
}
const message: Message = {
id: uuidv4(),
room,
user: client.user,
content,
sequence: 0, // Will be assigned
timestamp: Date.now()
};
// Assign sequence number
const sequenced = this.sequencer.sequenceMessage(message);
// Save to store
await this.store.save(sequenced);
// Publish to all subscribers
this.pubSub.publish(room, sequenced);
console.log(`[${room}] ${client.user}: ${content} (seq: ${sequenced.sequence})`);
}
private handlePresenceRequest(client: Client, room: string): void {
const presence = this.presence.getPresenceInRoom(room);
this.sendToClient(client, {
type: 'presence',
data: { room, users: presence }
});
}
private async handleHistoryRequest(client: Client, room: string, since: number = 0): Promise<void> {
const messages = await this.store.getMessages(room, since);
this.sendToClient(client, {
type: 'history',
data: { room, messages }
});
}
private sendToClient(client: Client, data: any): void {
if (client.ws.readyState === client.ws.OPEN) {
client.ws.send(JSON.stringify(data));
}
}
private sendError(client: Client, message: string): void {
this.sendToClient(client, {
type: 'error',
data: { message }
});
}
  listen(): void {
    this.server.listen(PORT, () => {
      console.log(`Chat server listening on port ${PORT}`);
    });
  }
}
7. Entry Point
// src/index.ts
import { ChatServer } from './server';
const server = new ChatServer();
server.listen();
8. Package.json
{
"name": "chat-system",
"version": "1.0.0",
"description": "Real-time chat system with WebSockets",
"main": "dist/index.js",
"scripts": {
"build": "tsc",
"start": "node dist/index.js",
"dev": "ts-node src/index.ts"
},
"dependencies": {
"ws": "^8.18.0",
"uuid": "^11.0.3"
},
"devDependencies": {
"@types/node": "^22.10.2",
"@types/ws": "^8.5.13",
"@types/uuid": "^10.0.0",
"ts-node": "^10.9.2",
"typescript": "^5.7.2"
}
}
9. Dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
RUN npm prune --omit=dev
EXPOSE 8080
CMD ["npm", "start"]
10. Docker Compose
version: '3.8'
services:
chat:
build: .
ports:
- "8080:8080"
volumes:
- ./data:/app/data
environment:
- PORT=8080
restart: unless-stopped
Python Implementation
Project Structure
chat-system/
├── requirements.txt
├── src/
│ ├── __init__.py
│ ├── types.py
│ ├── pub_sub.py
│ ├── sequencer.py
│ ├── presence.py
│ ├── store.py
│ ├── server.py
│ └── main.py
├── public/
│ └── client.html
├── Dockerfile
└── docker-compose.yml
1. Type Definitions
# src/types.py
from dataclasses import dataclass, field
from typing import Set
import websockets.server
import datetime
@dataclass
class Message:
id: str
room: str
user: str
content: str
sequence: int
timestamp: float
@dataclass
class Client:
id: str
user: str
rooms: Set[str] = field(default_factory=set)
websocket: websockets.server.WebSocketServerProtocol = None
last_seen: float = field(default_factory=lambda: datetime.datetime.now().timestamp())
@dataclass
class Presence:
user: str
status: str # 'online', 'offline', 'away'
last_seen: float
2. Pub/Sub Engine
# src/pub_sub.py
from typing import Dict, Set, List, Callable, Any
from .types import Message, Client
class PubSub:
def __init__(self):
self.subscriptions: Dict[str, Set[Client]] = {}
self.handlers: Dict[str, List[Callable]] = {}
def subscribe(self, room: str, client: Client) -> None:
if room not in self.subscriptions:
self.subscriptions[room] = set()
self.subscriptions[room].add(client)
client.rooms.add(room)
def unsubscribe(self, room: str, client: Client) -> None:
if room in self.subscriptions:
self.subscriptions[room].discard(client)
if not self.subscriptions[room]:
del self.subscriptions[room]
client.rooms.discard(room)
async def publish(self, room: str, message: Message) -> None:
if room in self.subscriptions:
for client in self.subscriptions[room]:
await self._send_to_client(client, message)
await self._emit('message', message)
async def _send_to_client(self, client: Client, message: Message) -> None:
if client.websocket and not client.websocket.closed:
import json
await client.websocket.send(json.dumps({
'type': 'message',
'data': message.__dict__
}))
async def _emit(self, event: str, message: Message) -> None:
handlers = self.handlers.get(event, [])
for handler in handlers:
await handler(None, message)
def get_subscribers(self, room: str) -> List[Client]:
return list(self.subscriptions.get(room, set()))
def get_rooms(self) -> List[str]:
return list(self.subscriptions.keys())
3. Sequence Manager
# src/sequencer.py
from typing import Dict
from .types import Message
class Sequencer:
def __init__(self):
self.sequences: Dict[str, int] = {}
def get_next(self, room: str) -> int:
current = self.sequences.get(room, 0)
next_seq = current + 1
self.sequences[room] = next_seq
return next_seq
def set_current(self, room: str, sequence: int) -> None:
self.sequences[room] = sequence
def get_current(self, room: str) -> int:
return self.sequences.get(room, 0)
def sequence_message(self, message: Message) -> Message:
seq = self.get_next(message.room)
message.sequence = seq
return message
4. Presence Manager
# src/presence.py
import asyncio
import datetime
from typing import Dict, List, Set
from .types import Client, Presence
HEARTBEAT_INTERVAL = 30 # seconds
OFFLINE_TIMEOUT = 60 # seconds
class PresenceManager:
def __init__(self):
self.users: Dict[str, Presence] = {}
self.clients: Dict[str, Client] = {}
self.tasks: Dict[str, asyncio.Task] = {}
def register(self, client: Client) -> None:
self.clients[client.id] = client
self.update_presence(client.user, 'online')
self.tasks[client.id] = asyncio.create_task(self._heartbeat(client))
def unregister(self, client: Client) -> None:
if client.id in self.tasks:
self.tasks[client.id].cancel()
del self.tasks[client.id]
if client.id in self.clients:
del self.clients[client.id]
self.update_presence(client.user, 'offline')
def update_presence(self, user: str, status: str) -> None:
self.users[user] = Presence(
user=user,
status=status,
last_seen=datetime.datetime.now().timestamp()
)
def get_presence(self, user: str) -> Presence | None:
return self.users.get(user)
def get_online_users(self) -> List[str]:
now = datetime.datetime.now().timestamp()
return [
p.user for p in self.users.values()
if p.status == 'online' and (now - p.last_seen) < OFFLINE_TIMEOUT
]
def get_presence_in_room(self, room: str) -> List[Presence]:
now = datetime.datetime.now().timestamp()
users_in_room = set()
for client in self.clients.values():
if room in client.rooms:
users_in_room.add(client.user)
return [
self.users.get(user)
for user in users_in_room
if user in self.users and (now - self.users[user].last_seen) < OFFLINE_TIMEOUT
]
async def _heartbeat(self, client: Client) -> None:
import json
while True:
try:
if client.websocket and not client.websocket.closed:
await client.websocket.send(json.dumps({'type': 'heartbeat'}))
self.update_presence(client.user, 'online')
except asyncio.CancelledError:
break
except Exception:
pass
await asyncio.sleep(HEARTBEAT_INTERVAL)
def cleanup(self) -> None:
for task in self.tasks.values():
task.cancel()
self.tasks.clear()
5. Message Store
# src/store.py
import os
import json
import asyncio
from pathlib import Path
from typing import List
from .types import Message
class MessageStore:
def __init__(self, base_path: str = './data/messages'):
self.base_path = Path(base_path)
async def save(self, message: Message) -> None:
room_path = self.base_path / message.room
room_path.mkdir(parents=True, exist_ok=True)
filename = room_path / f'{message.sequence}.json'
with open(filename, 'w') as f:
json.dump(message.__dict__, f, indent=2)
async def get_messages(self, room: str, since: int = 0, limit: int = 100) -> List[Message]:
room_path = self.base_path / room
messages = []
if not room_path.exists():
return messages
try:
files = [f for f in os.listdir(room_path) if f.endswith('.json')]
sequences = sorted([
int(f.replace('.json', ''))
for f in files
if int(f.replace('.json', '')) > since
])[:limit]
for seq in sequences:
with open(room_path / f'{seq}.json', 'r') as f:
data = json.load(f)
messages.append(Message(**data))
except FileNotFoundError:
pass
return messages
async def get_last_sequence(self, room: str) -> int:
room_path = self.base_path / room
if not room_path.exists():
return 0
try:
files = [f for f in os.listdir(room_path) if f.endswith('.json')]
sequences = [int(f.replace('.json', '')) for f in files]
return max(sequences) if sequences else 0
except FileNotFoundError:
return 0
6. WebSocket Server
# src/server.py
import asyncio
import json
import os
import uuid
import websockets
from typing import Any
from .pub_sub import PubSub
from .sequencer import Sequencer
from .presence import PresenceManager
from .store import MessageStore
from .types import Client, Message
PORT = int(os.getenv('PORT', '8080'))
class ChatServer:
def __init__(self):
self.pub_sub = PubSub()
self.sequencer = Sequencer()
self.presence = PresenceManager()
self.store = MessageStore()
    async def handle_client(self, websocket):
client_id = str(uuid.uuid4())
client = Client(
id=client_id,
user=f"user_{client_id[:8]}",
websocket=websocket,
rooms=set()
)
print(f"Client connected: {client.id}")
await self._send_to_client(client, {
'type': 'connected',
'data': {'clientId': client.id, 'user': client.user}
})
self.presence.register(client)
try:
async for message in websocket:
msg = json.loads(message)
await self.handle_message(client, msg)
except websockets.exceptions.ConnectionClosed:
print(f"Client disconnected: {client.id}")
finally:
for room in list(client.rooms):
await self.pub_sub.publish(room, Message(
id=str(uuid.uuid4()),
room=room,
user='system',
content=f"{client.user} left the room",
sequence=self.sequencer.get_current(room),
timestamp=asyncio.get_event_loop().time()
))
self.pub_sub.unsubscribe(room, client)
self.presence.unregister(client)
async def handle_message(self, client: Client, msg: Any) -> None:
handlers = {
'join': self.handle_join,
'leave': self.handle_leave,
'message': self.handle_chat_message,
'presence': self.handle_presence_request,
'history': self.handle_history_request
}
handler = handlers.get(msg.get('type'))
if handler:
await handler(client, msg)
else:
print(f"Unknown message type: {msg.get('type')}")
async def handle_join(self, client: Client, msg: Any) -> None:
room = msg.get('room')
print(f"{client.user} joining room: {room}")
self.pub_sub.subscribe(room, client)
presence = self.presence.get_presence_in_room(room)
await self._send_to_client(client, {
'type': 'presence',
'data': {'room': room, 'users': [p.__dict__ for p in presence]}
})
await self.pub_sub.publish(room, Message(
id=str(uuid.uuid4()),
room=room,
user='system',
content=f"{client.user} joined the room",
sequence=self.sequencer.get_current(room),
timestamp=asyncio.get_event_loop().time()
))
history = await self.store.get_messages(room, 0, 50)
if history:
await self._send_to_client(client, {
'type': 'history',
'data': {'room': room, 'messages': [m.__dict__ for m in history]}
})
    async def handle_leave(self, client: Client, msg: Any) -> None:
room = msg.get('room')
print(f"{client.user} leaving room: {room}")
self.pub_sub.unsubscribe(room, client)
async def handle_chat_message(self, client: Client, msg: Any) -> None:
data = msg.get('data', {})
room = data.get('room')
if room not in client.rooms:
await self._send_error(client, 'Not subscribed to room')
return
message = Message(
id=str(uuid.uuid4()),
room=room,
user=client.user,
content=data.get('content', ''),
sequence=0,
timestamp=asyncio.get_event_loop().time()
)
sequenced = self.sequencer.sequence_message(message)
await self.store.save(sequenced)
await self.pub_sub.publish(room, sequenced)
print(f"[{room}] {client.user}: {sequenced.content} (seq: {sequenced.sequence})")
async def handle_presence_request(self, client: Client, msg: Any) -> None:
room = msg.get('room')
presence = self.presence.get_presence_in_room(room)
await self._send_to_client(client, {
'type': 'presence',
'data': {'room': room, 'users': [p.__dict__ for p in presence]}
})
async def handle_history_request(self, client: Client, msg: Any) -> None:
room = msg.get('room')
since = msg.get('since', 0)
messages = await self.store.get_messages(room, since)
await self._send_to_client(client, {
'type': 'history',
'data': {'room': room, 'messages': [m.__dict__ for m in messages]}
})
async def _send_to_client(self, client: Client, data: Any) -> None:
if client.websocket and not client.websocket.closed:
await client.websocket.send(json.dumps(data))
async def _send_error(self, client: Client, message: str) -> None:
await self._send_to_client(client, {
'type': 'error',
'data': {'message': message}
})
async def start(self):
print(f"Chat server listening on port {PORT}")
async with websockets.serve(self.handle_client, "", PORT):
await asyncio.Future() # Run forever
7. Entry Point
# src/main.py
import asyncio
from .server import ChatServer
async def main():
server = ChatServer()
await server.start()
if __name__ == '__main__':
asyncio.run(main())
8. Requirements
websockets==13.1
aiofiles==24.1.0
9. Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["python", "src/main.py"]
10. Docker Compose
version: '3.8'
services:
chat:
build: .
ports:
- "8080:8080"
volumes:
- ./data:/app/data
environment:
- PORT=8080
restart: unless-stopped
Running the Chat System
TypeScript
# Install dependencies
npm install
# Build
npm run build
# Start server
npm start
# With Docker Compose
docker-compose up
Python
# Install dependencies
pip install -r requirements.txt
# Start server (run as a module from the project root so relative imports resolve)
python -m src.main
# With Docker Compose
docker-compose up
Exercises
Exercise 1: Basic Chat Operations
- Start the chat server
- Connect two WebSocket clients
- Join the same room
- Send messages and verify both clients receive them
- Leave the room and verify the broadcast
Exercise 2: Message Ordering
- Send multiple messages rapidly from different clients
- Verify all messages have unique, sequential sequence numbers
- Disconnect and reconnect a client
- Request message history and verify ordering is preserved
Exercise 3: Presence Management
- Connect multiple clients to different rooms
- Join a room and verify presence broadcasts
- Simulate a network failure (kill a client without proper leave)
- Verify offline detection kicks in after timeout
Exercise 4: Message Persistence
- Send messages to a room
- Stop the server
- Verify messages are saved to disk
- Restart the server
- Connect a new client and verify it receives message history
Common Pitfalls
| Issue | Cause | Fix |
|---|---|---|
| Messages not ordered | Missing sequence numbers | Always sequence before publishing |
| Old messages not received | Not requesting history on join | Implement replay on connect |
| Presence shows offline | Heartbeat not sent | Ensure heartbeat loop is running |
| Duplicate messages | Re-publishing saved messages | Only publish new messages, not history |
Key Takeaways
- Pub/Sub enables scalable multi-room communication
- Sequence numbers guarantee message ordering across all clients
- Presence management requires both active heartbeats and passive timeout detection
- Message persistence allows clients to reconnect and receive history
- Docker Compose simplifies deployment and testing of the complete system
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
What is Consensus?
Session 8 - Full session
Learning Objectives
- Understand the consensus problem in distributed systems
- Learn the difference between safety and liveness properties
- Explore the FLP impossibility result
- Understand why consensus algorithms are necessary
- Compare Raft and Paxos approaches
The Consensus Problem
In distributed systems, consensus is the problem of getting multiple nodes to agree on a single value. This sounds simple, but it's fundamental to building reliable distributed systems.
Why Do We Need Consensus?
Consider these scenarios:
- Leader Election: Multiple nodes need to agree on who is the leader
- Configuration Changes: All nodes must agree on a new configuration
- Replicated State Machines: All nodes must apply operations in the same order
- Distributed Transactions: All participants must agree to commit or abort
Without consensus, distributed systems can suffer from:
- Split-brain scenarios (multiple leaders)
- Inconsistent state across nodes
- Data corruption from conflicting writes
- Unavailable systems during network partitions
graph LR
subgraph "Without Consensus"
N1[Node A: value=1]
N2[Node B: value=2]
N3[Node C: value=3]
N1 --- N2 --- N3
Problem[Which value is correct?]
end
subgraph "With Consensus"
A1[Node A: value=2]
A2[Node B: value=2]
A3[Node C: value=2]
A1 --- A2 --- A3
Solved[All nodes agree]
end
Formal Definition
The consensus problem requires a system to satisfy these properties:
1. Agreement (Safety)
All correct nodes must agree on the same value.
If node A outputs v and node B outputs v', then v = v'.
2. Validity
If all correct nodes propose the same value v, then all correct nodes decide v.
The decided value must have been proposed by some node
3. Termination (Liveness)
All correct nodes eventually decide on some value.
The algorithm must make progress, not run forever
4. Integrity
Each node decides at most once.
A node cannot change its decision after deciding
Safety vs Liveness
Understanding the trade-off between safety and liveness is crucial for distributed systems:
graph TB
subgraph "Safety Properties"
S1[Agreement]
S2[Validity]
S3[Integrity]
style S1 fill:#90EE90
style S2 fill:#90EE90
style S3 fill:#90EE90
end
subgraph "Liveness Properties"
L1[Termination]
L2[Progress]
style L1 fill:#FFB6C1
style L2 fill:#FFB6C1
end
Safety["Nothing bad happens<br/>State is always consistent"]
Liveness["Something good happens<br/>System makes progress"]
S1 & S2 & S3 --> Safety
L1 & L2 --> Liveness
Safety --> Tradeoff["In networks,<br/>you can't guarantee both<br/>during partitions"]
Liveness --> Tradeoff
| Safety | Liveness |
|---|---|
| "Nothing bad happens" | "Something good happens" |
| State is always valid | System makes progress |
| No corruption, no conflicts | Operations complete eventually |
| Can be maintained during partitions | May be sacrificed during partitions |
Example: During a network partition (CAP theorem), a CP system maintains safety (no inconsistent writes) but sacrifices liveness (writes may be rejected). An AP system maintains liveness (writes succeed) but may sacrifice safety (temporary inconsistencies).
Why Consensus is Hard
Challenge 1: No Global Clock
Nodes don't share a synchronized clock, making it hard to order events:
sequenceDiagram
participant A as Node A (t=10:00:01)
participant B as Node B (t=10:00:05)
participant C as Node C (t=10:00:03)
Note over A: A proposes value=1
A->>B: send(value=1)
Note over B: B receives at t=10:00:07
Note over C: C proposes value=2
C->>B: send(value=2)
Note over B: B receives at t=10:00:08
Note over B: Which value came first?
Challenge 2: Message Loss and Delays
Messages can be lost, delayed, or reordered:
stateDiagram-v2
[*] --> Sent: Node sends message
Sent --> Delivered: Message arrives
Sent --> Lost: Message lost
Sent --> Delayed: Network slow
Delayed --> Delivered: Eventually arrives
Lost --> Retry: Node resends
Delivered --> [*]
Challenge 3: Node Failures
Nodes can crash at any time, potentially while holding critical information:
graph TB
subgraph "Cluster State"
N1[Node 1: Alive]
N2[Node 2: CRASHED<br/>Had uncommitted data]
N3[Node 3: Alive]
N4[Node 4: Alive]
N1 --- N2
N2 --- N3
N3 --- N4
end
Q[What happens to<br/>Node 2's data?]
The FLP Impossibility Result
In 1985, Fischer, Lynch, and Paterson proved the FLP Impossibility Result:
In an asynchronous network, even with only one faulty node, no deterministic consensus algorithm can guarantee both safety and termination.
What This Means
graph TB
A[Asynchronous Network] --> B[No timing assumptions]
B --> C[Messages can take arbitrarily long]
C --> D[Cannot distinguish slow node from crashed node]
D --> E[Cannot guarantee termination]
E --> F[FLP: Consensus impossible<br/>in pure async systems]
How We Work Around It
Real systems handle FLP by relaxing some assumptions:
- Partial Synchrony: Assume networks are eventually synchronous
- Randomization: Use randomized algorithms (e.g., randomized election timeouts)
- Failure Detectors: Use unreliable failure detectors
- Timeouts: Assume messages arrive within some time bound
Key Insight: Raft works in "partially synchronous" systems—networks may behave asynchronously for a while, but eventually become synchronous.
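To make the randomization workaround concrete, here is a minimal sketch of the randomized election timeout Raft uses (the 150-300 ms range follows the Raft paper's suggestion):
// Each follower waits a different random interval before starting an
// election, which makes repeated split votes unlikely.
function electionTimeout(baseMs = 150, spreadMs = 150): number {
  return baseMs + Math.random() * spreadMs;
}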
Real-World Consensus Scenarios
Scenario 1: Distributed Configuration
All nodes must agree on cluster membership:
sequenceDiagram
autonumber
participant N1 as Node 1
participant N2 as Node 2
participant N3 as Node 3
participant N4 as New Node
N4->>N1: Request to join
N1->>N2: Propose add Node 4
N1->>N3: Propose add Node 4
N2->>N1: Vote YES
N3->>N1: Vote YES
N1->>N2: Commit: add Node 4
N1->>N3: Commit: add Node 4
N1->>N4: You're in!
Note over N1,N4: All nodes now agree<br/>cluster has 4 members
Scenario 2: Replicated State Machine
All replicas must apply operations in the same order:
graph LR
C[Client] --> L[Leader]
subgraph "Replicated Log"
L1[Leader: SET x=1]
F1[Follower 1: SET x=1]
F2[Follower 2: SET x=1]
F3[Follower 3: SET x=1]
L1 --- F1 --- F2 --- F3
end
subgraph "State Machine"
S1[Leader: x=1]
S2[Follower 1: x=1]
S3[Follower 2: x=1]
S4[Follower 3: x=1]
end
L --> L1
F1 --> S2
F2 --> S3
F3 --> S4
Consensus Algorithms: Raft vs Paxos
Paxos (1998)
Paxos was the first practical consensus algorithm, but it's notoriously difficult to understand:
Phase 1a (Prepare): Proposer chooses proposal number n, sends Prepare(n)
Phase 1b (Promise): Acceptor promises not to accept proposals < n
Phase 2a (Accept): Proposer sends Accept(n, value)
Phase 2b (Accepted): Acceptor accepts if no higher proposal seen
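To make the two phases concrete, here is a minimal sketch of the acceptor side of single-decree Paxos (message shapes and field names are assumptions for illustration):
interface Prepare { n: number; }
interface Accept { n: number; value: unknown; }
class Acceptor {
  private promisedN = -1;
  private acceptedN = -1;
  private acceptedValue: unknown = null;
  // Phase 1b: promise to ignore proposals numbered below n, and report any
  // previously accepted proposal so the proposer can adopt its value.
  onPrepare(msg: Prepare) {
    if (msg.n > this.promisedN) {
      this.promisedN = msg.n;
      return { promised: true, acceptedN: this.acceptedN, acceptedValue: this.acceptedValue };
    }
    return { promised: false };
  }
  // Phase 2b: accept unless a higher-numbered promise was made since.
  onAccept(msg: Accept) {
    if (msg.n >= this.promisedN) {
      this.promisedN = msg.n;
      this.acceptedN = msg.n;
      this.acceptedValue = msg.value;
      return { accepted: true };
    }
    return { accepted: false };
  }
}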
Pros:
- Proven correct
- Handles any number of failures
- Minimal message complexity
Cons:
- Extremely difficult to understand
- Hard to implement correctly
- Multi-Paxos adds complexity
- No leader by default
Raft (2014)
Raft was designed specifically for understandability:
graph TB
subgraph "Raft Components"
LE[Leader Election]
LR[Log Replication]
SM[State Machine]
Safety[Safety Properties]
LE --> LR
LR --> SM
Safety --> LE
Safety --> LR
end
Pros:
- Designed for understandability
- Clear separation of concerns
- Strong leader simplifies logic
- Practical implementation guidance
- Widely adopted
Cons:
- Leader can be bottleneck
- Not as optimized as Multi-Paxos variants
When Do You Need Consensus?
Use consensus when:
| Scenario | Need Consensus? | Reason |
|---|---|---|
| Single-node database | No | No distributed state |
| Multi-master replication | Yes | Must agree on write order |
| Leader election | Yes | Must agree on who is leader |
| Configuration management | Yes | All nodes need same config |
| Distributed lock service | Yes | Must agree on lock holder |
| Load balancer state | No | Stateless, can be rebuilt |
| Cache invalidation | Sometimes | Depends on consistency needs |
When You DON'T Need Consensus
- Read-only systems: No state to agree on
- Eventual consistency is enough: Last-write-wins suffices
- Conflict-free replicated data types (CRDTs): Mathematically resolve conflicts (see the sketch below)
- Single source of truth: Centralized authority
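For instance, a grow-only counter (G-Counter) converges without any coordination. This sketch keeps one slot per node and merges by element-wise maximum (the names are illustrative, not from a specific library):
class GCounter {
  private counts: Map<string, number> = new Map();
  constructor(private nodeId: string) {}
  increment(): void {
    this.counts.set(this.nodeId, (this.counts.get(this.nodeId) || 0) + 1);
  }
  value(): number {
    let sum = 0;
    for (const c of this.counts.values()) sum += c;
    return sum;
  }
  // Merging is commutative, associative, and idempotent, so replicas can
  // exchange state in any order and still converge.
  merge(other: GCounter): void {
    for (const [id, c] of other.counts) {
      this.counts.set(id, Math.max(this.counts.get(id) || 0, c));
    }
  }
}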
Simple Consensus Example
Let's look at a simplified consensus scenario: agreeing on a counter value.
TypeScript Example
// A simple consensus simulation
interface Proposal {
value: number;
proposerId: string;
}
class ConsensusNode {
private proposals: Map<string, Proposal> = new Map();
private decidedValue?: number;
private nodeId: string;
constructor(nodeId: string) {
this.nodeId = nodeId;
}
// Propose a value
  propose(value: number): void {
    const proposal: Proposal = {
      value,
      proposerId: this.nodeId
    };
    this.proposals.set(this.nodeId, proposal);
    this.broadcastProposal(proposal);
    this.checkConsensus();
  }
// Receive a proposal from another node
receiveProposal(proposal: Proposal): void {
this.proposals.set(proposal.proposerId, proposal);
this.checkConsensus();
}
// Check if we have consensus
private checkConsensus(): void {
if (this.decidedValue !== undefined) return;
const values = Array.from(this.proposals.values()).map(p => p.value);
const counts = new Map<number, number>();
for (const value of values) {
counts.set(value, (counts.get(value) || 0) + 1);
}
// Simple majority consensus
for (const [value, count] of counts.entries()) {
if (count > Math.floor(this.proposals.size / 2)) {
this.decidedValue = value;
console.log(`Node ${this.nodeId} decided on value: ${value}`);
return;
}
}
}
private broadcastProposal(proposal: Proposal): void {
// In a real system, this would send to other nodes
console.log(`Node ${this.nodeId} broadcasting proposal: ${proposal.value}`);
}
}
// Example usage: broadcastProposal is only a stub, so deliver proposals by hand
const node1 = new ConsensusNode('node1');
const node2 = new ConsensusNode('node2');
const node3 = new ConsensusNode('node3');
node1.propose(42);
node2.propose(42);
node3.propose(99); // Minority, should lose
node1.receiveProposal({ value: 42, proposerId: 'node2' });
node1.receiveProposal({ value: 99, proposerId: 'node3' });
// node1 now sees 42 in 2 of 3 proposals: a majority, so it decides 42
Python Example
from dataclasses import dataclass
from typing import Optional, Dict
@dataclass
class Proposal:
value: int
proposer_id: str
class ConsensusNode:
def __init__(self, node_id: str):
self.node_id = node_id
self.proposals: Dict[str, Proposal] = {}
self.decided_value: Optional[int] = None
def propose(self, value: int) -> None:
"""Propose a value to the group."""
proposal = Proposal(value, self.node_id)
self.proposals[self.node_id] = proposal
self._broadcast_proposal(proposal)
self._check_consensus()
def receive_proposal(self, proposal: Proposal) -> None:
"""Receive a proposal from another node."""
self.proposals[proposal.proposer_id] = proposal
self._check_consensus()
def _check_consensus(self) -> None:
"""Check if we have consensus on a value."""
if self.decided_value is not None:
return
if not self.proposals:
return
# Count occurrences of each value
counts = {}
for proposal in self.proposals.values():
counts[proposal.value] = counts.get(proposal.value, 0) + 1
# Simple majority consensus
total_nodes = len(self.proposals)
for value, count in counts.items():
if count > total_nodes // 2:
self.decided_value = value
print(f"Node {self.node_id} decided on value: {value}")
return
def _broadcast_proposal(self, proposal: Proposal) -> None:
"""Broadcast proposal to other nodes."""
print(f"Node {self.node_id} broadcasting proposal: {proposal.value}")
# Example usage: _broadcast_proposal is only a stub, so deliver proposals by hand
if __name__ == "__main__":
    node1 = ConsensusNode("node1")
    node2 = ConsensusNode("node2")
    node3 = ConsensusNode("node3")
    node1.propose(42)
    node2.propose(42)
    node3.propose(99)  # Minority, should lose
    node1.receive_proposal(Proposal(42, "node2"))
    node1.receive_proposal(Proposal(99, "node3"))
    # node1 now sees 42 in 2 of 3 proposals: a majority, so it decides 42
Common Pitfalls
| Pitfall | Description | Solution |
|---|---|---|
| Split Brain | Multiple leaders think they're in charge | Use quorum-based voting |
| Stale Reads | Reading from nodes that haven't received updates | Read from leader or use quorum reads |
| Network Partition Handling | Nodes can't communicate but continue operating | Require quorum for operations |
| Partial Failures | Some nodes fail, others continue | Design for fault tolerance |
| Clock Skew | Different clocks cause ordering issues | Use logical clocks such as Lamport timestamps (sketch below) |
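A minimal Lamport clock sketch showing the three rules (local event, send, receive); the class shape is an assumption for illustration:
class LamportClock {
  private time = 0;
  tick(): number {                  // rule 1: increment on each local event
    return ++this.time;
  }
  send(): number {                  // rule 2: increment and attach to the message
    return ++this.time;
  }
  receive(remote: number): number { // rule 3: jump past the sender's clock
    this.time = Math.max(this.time, remote) + 1;
    return this.time;
  }
}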
Summary
Key Takeaways
- Consensus is the problem of getting multiple distributed nodes to agree on a single value
- Safety ensures nothing bad happens (agreement, validity, integrity)
- Liveness ensures something good happens (termination, progress)
- FLP Impossibility proves consensus is impossible in pure asynchronous systems
- Real systems work around FLP using partial synchrony and timeouts
- Raft was designed for understandability, unlike the complex Paxos algorithm
Next Session
In the next session, we'll dive into the Raft algorithm itself:
- Raft's design philosophy
- Node states (Follower, Candidate, Leader)
- How leader election works
- How log replication maintains consistency
Exercises
- Safety vs Liveness: Give an example of a system that prioritizes safety over liveness, and one that does the opposite.
- FLP Scenario: Describe a scenario where FLP would cause problems in a real distributed system.
- Consensus Need: For each of these systems, explain whether they need consensus and why:
  - A distributed key-value store
  - A CDN (content delivery network)
  - A distributed task queue
  - A blockchain system
- Simple Consensus: Extend the simple consensus example to handle node failures (a node stops responding).
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
The Raft Algorithm
Session 9, Part 1 - 25 minutes
Learning Objectives
- Understand Raft's design philosophy
- Learn the three states of a Raft node
- Explore how Raft handles consensus through leader election and log replication
- Understand the concept of terms in Raft
- Learn Raft's safety properties
Raft Design Philosophy
Raft was designed by Diego Ongaro and John Ousterhout in 2014 with a specific goal: understandability. Unlike Paxos, which was notoriously difficult to understand and implement correctly, Raft separates the consensus problem into clear, manageable subproblems.
Core Design Principles
- Strong Leader: Raft uses a strong leader approach—all log entries flow through the leader
- Leader Completeness: Once a log entry is committed, it stays in the log of all future leaders
- Decomposition: Break consensus into three subproblems:
- Leader election
- Log replication
- Safety
Why "Raft"?
The name is a play on words: a raft is built from logs, just as the algorithm is built around replicated logs. The authors also reportedly liked it as a way to escape the island of Paxos.
Raft Overview
graph TB
subgraph "Raft Consensus"
Client[Client]
subgraph "Cluster"
L[Leader]
F1[Follower 1]
F2[Follower 2]
F3[Follower 3]
L --> F1
L --> F2
L --> F3
end
Client -->|Write Request| L
L -->|AppendEntries| F1 & F2 & F3
F1 & F2 & F3 -->|Ack| L
L -->|Response| Client
end
Key Concepts
| Concept | Description |
|---|---|
| Leader | The only node that handles client requests and appends entries to the log |
| Follower | Passive nodes that replicate the leader's log |
| Candidate | A node campaigning to become leader during an election |
| Term | Raft's logical clock: time is divided into consecutive terms of arbitrary length, each beginning with an election |
| Log | A sequence of entries containing commands to apply to the state machine |
Node States
Each Raft node can be in one of three states:
stateDiagram-v2
[*] --> Follower: Node starts
Follower --> Candidate: Election timeout expires<br/>no valid RPC received
Candidate --> Leader: Receives votes from majority
Candidate --> Follower: Discovers current leader<br/>or higher term
Leader --> Follower: Discovers higher term
Follower --> Follower: Receives valid AppendEntries or RequestVote<br/>from leader or candidate
note right of Follower
- Responds to RPCs
- No outgoing RPCs
- Election timeout running
end note
note right of Candidate
- Requesting votes
- Election timeout running
- Can become leader or follower
end note
note right of Leader
- Handles all client requests
- Sends heartbeats to followers
- No timeout (active)
end note
State Descriptions
Follower
- Default state for all nodes
- Passively receives entries from the leader
- Responds to RPCs (RequestVote, AppendEntries)
- If no communication for election timeout, becomes candidate
Candidate
- Campaigning to become leader
- Increments current term
- Votes for itself
- Sends RequestVote RPCs to all other nodes
- Becomes leader if it receives votes from majority
- Returns to follower if it discovers current leader or higher term
Leader
- Handles all client requests
- Sends AppendEntries RPCs to all followers (heartbeats)
- Commits entries once replicated to majority
- Steps down if it discovers a higher term
Terms
A term is Raft's logical time mechanism:
timeline
title Raft Terms
Term 1 : Leader A elected
: Normal operation
: Leader A crashes
Term 2 : Election begins
: Split vote!
: Timeout, new election
Term 3 : Leader B elected
: Normal operation
Term Properties
- Monotonically Increasing: Terms always go up, never down
- Current Term: Each node stores the current term number
- Term Transitions:
- Nodes increment term when becoming candidate
- Nodes update term when receiving higher-term message
- When the term changes, the node becomes a follower (see the sketch below)
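These rules are mechanical enough to sketch in a few lines of Python. This is a standalone sketch, not code from the course repo; `node` is assumed to expose `current_term`, `state`, and `voted_for`:
# Term-update rule that every RPC handler applies before anything else.
def observe_term(node, message_term: int) -> None:
    if message_term > node.current_term:
        node.current_term = message_term  # adopt the newer term
        node.state = "follower"           # step down immediately
        node.voted_for = None             # votes are tracked per term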
Term in Messages
sequenceDiagram
participant C as Candidate
participant F1 as Follower (term=3)
participant F2 as Follower (term=4)
C->>F1: RequestVote(term=5)
Note over F1: Sees higher term
F1-->>C: Vote YES (updates to term=5)
C->>F2: RequestVote(term=5)
Note over F2: Already at higher term
F2-->>C: Vote NO (my term is higher)
Raft's Two-Phase Approach
Raft achieves consensus through two main phases:
Phase 1: Leader Election
sequenceDiagram
autonumber
participant F1 as Follower 1
participant F2 as Follower 2
participant F3 as Follower 3
Note over F1,F3: Election timeout expires
F1->>F1: Becomes Candidate (term=1)
F1->>F2: RequestVote(term=1)
F1->>F3: RequestVote(term=1)
F2-->>F1: Grant vote (term=1)
F3-->>F1: Grant vote (term=1)
Note over F1: Won majority!
F1->>F1: Becomes Leader
F1->>F2: AppendEntries (heartbeat)
F1->>F3: AppendEntries (heartbeat)
Phase 2: Log Replication
sequenceDiagram
autonumber
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
C->>L: SET x=5
L->>L: Append to log (index=10, term=1)
L->>F1: AppendEntries(entry: SET x=5)
L->>F2: AppendEntries(entry: SET x=5)
F1-->>L: Success (replicated)
F2-->>L: Success (replicated)
Note over L: Majority replicated!<br/>Commit entry
L->>L: Apply to state machine: x=5
L-->>C: Response: OK
Safety Properties
Raft guarantees several important safety properties:
1. Election Safety
At most one leader can be elected per term.
How: Each node votes at most once per term, and a candidate needs majority of votes.
graph TB
subgraph "Same Term - Only One Leader"
T[Term 5]
C1[Candidate A: 2 votes]
C2[Candidate B: 1 vote]
C1 -->|wins majority| L[Leader A]
style L fill:#90EE90
end
2. Leader Append-Only
A leader never overwrites or deletes entries in its log; it only appends.
How: Leaders always append new entries to the end of their log.
3. Log Matching
If two logs contain an entry with the same index and term, then all preceding entries are identical.
graph LR
subgraph "Leader's Log"
L1[index 1, term 1: SET a=1]
L2[index 2, term 1: SET b=2]
L3[index 3, term 2: SET c=3]
L1 --> L2 --> L3
end
subgraph "Follower's Log"
F1[index 1, term 1: SET a=1]
F2[index 2, term 1: SET b=2]
F3[index 3, term 2: SET c=3]
F4[index 4, term 2: SET d=4]
F1 --> F2 --> F3 --> F4
end
Match[Entries 1-3 match!<br/>Follower may have extra]
4. Leader Completeness
If a log entry is committed in a given term, it will be present in the logs of all leaders for higher terms.
How: A candidate must have all committed entries before it can win an election.
5. State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
Raft RPCs
Raft uses two main RPC types:
RequestVote RPC
interface RequestVoteArgs {
term: number; // Candidate's term
candidateId: string; // Candidate requesting vote
lastLogIndex: number; // Index of candidate's last log entry
lastLogTerm: number; // Term of candidate's last log entry
}
interface RequestVoteReply {
term: number; // Current term (for candidate to update)
voteGranted: boolean; // True if candidate received vote
}
Voting Rules:
- If `term < currentTerm`: deny the vote
- Grant the vote only if `votedFor` is null or already equals `candidateId`, and the candidate's log is at least as up-to-date as ours
AppendEntries RPC
interface AppendEntriesArgs {
term: number; // Leader's term
leaderId: string; // So follower can redirect clients
prevLogIndex: number; // Index of log entry preceding new ones
prevLogTerm: number; // Term of prevLogIndex entry
entries: LogEntry[]; // Log entries to store (empty for heartbeat)
leaderCommit: number; // Leader's commit index
}
interface AppendEntriesReply {
term: number; // Current term (for leader to update)
success: boolean; // True if follower had entry matching prevLogIndex
}
Used for both:
- Log replication: Sending new entries
- Heartbeats: Empty entries to maintain authority
Log Completeness Property
When voting, nodes compare log completeness:
graph TB
subgraph "Comparing Logs"
A[Candidate Log]
B[Follower Log]
A --> A1[Last index: 10, term: 5]
B --> B1[Last index: 9, term: 5]
Result[A's log is more up-to-date<br/>because index 10 > 9]
end
subgraph "Tie-Breaking Rule"
C[Candidate: last term=5]
D[Follower: last term=6]
Result2[Follower is more up-to-date<br/>because term 6 > 5]
end
Up-to-date comparison:
- Compare the term of the last entries
- If terms differ, the log with the higher term is more up-to-date
- If terms are same, the log with the longer length is more up-to-date
Randomized Election Timeouts
Raft uses randomized election timeouts to prevent split votes:
timeline
title Randomized Timeouts Prevent Split Votes
Node1 : 150ms timeout
Node2 : 300ms timeout
Node3 : 200ms timeout
Node1 : Timeout! Becomes candidate
Node1 : Wins election before Node2/3 timeout
Node2 & Node3 : Receive heartbeat, reset timeouts
Without randomization: All followers timeout simultaneously → all become candidates → split vote → no leader elected.
With randomization: Only one follower times out first → becomes candidate → likely to win election.
TypeScript Implementation Structure
// Type definitions for Raft
type NodeState = 'follower' | 'candidate' | 'leader';
interface LogEntry {
index: number;
term: number;
command: { key: string; value: any };
}
interface RaftNode {
// Persistent state
currentTerm: number;
votedFor: string | null;
log: LogEntry[];
// Volatile state
commitIndex: number;
lastApplied: number;
state: NodeState;
// Leader-only volatile state
nextIndex: number[];
matchIndex: number[];
}
class RaftNodeImpl implements RaftNode {
currentTerm: number = 0;
votedFor: string | null = null;
log: LogEntry[] = [];
commitIndex: number = 0;
lastApplied: number = 0;
state: NodeState = 'follower';
nextIndex: number[] = [];
matchIndex: number[] = [];
// Handle RequestVote RPC
requestVote(args: RequestVoteArgs): RequestVoteReply {
if (args.term > this.currentTerm) {
this.currentTerm = args.term;
this.state = 'follower';
this.votedFor = null;
}
const logOk = this.isLogAtLeastAsUpToDate(args.lastLogIndex, args.lastLogTerm);
const voteOk = (this.votedFor === null || this.votedFor === args.candidateId);
if (args.term === this.currentTerm && voteOk && logOk) {
this.votedFor = args.candidateId;
return { term: this.currentTerm, voteGranted: true };
}
return { term: this.currentTerm, voteGranted: false };
}
// Handle AppendEntries RPC
appendEntries(args: AppendEntriesArgs): AppendEntriesReply {
if (args.term > this.currentTerm) {
this.currentTerm = args.term;
this.state = 'follower';
}
if (args.term !== this.currentTerm) {
return { term: this.currentTerm, success: false };
}
    // Check if log has entry at prevLogIndex with prevLogTerm
    if (this.log[args.prevLogIndex]?.term !== args.prevLogTerm) {
      return { term: this.currentTerm, success: false };
    }
    // Append new entries, truncating any conflicting suffix first
    for (const entry of args.entries) {
      if (this.log[entry.index] && this.log[entry.index].term !== entry.term) {
        // Conflict: drop this entry and everything after it
        this.log.length = entry.index;
      }
      this.log[entry.index] = entry;
    }
// Update commit index
if (args.leaderCommit > this.commitIndex) {
this.commitIndex = Math.min(args.leaderCommit, this.log.length - 1);
}
return { term: this.currentTerm, success: true };
}
private isLogAtLeastAsUpToDate(lastLogIndex: number, lastLogTerm: number): boolean {
const myLastEntry = this.log[this.log.length - 1];
const myLastTerm = myLastEntry?.term ?? 0;
const myLastIndex = this.log.length - 1;
if (lastLogTerm !== myLastTerm) {
return lastLogTerm > myLastTerm;
}
return lastLogIndex >= myLastIndex;
}
}
Python Implementation Structure
from dataclasses import dataclass, field
from typing import Optional, List
from enum import Enum
class NodeState(Enum):
FOLLOWER = "follower"
CANDIDATE = "candidate"
LEADER = "leader"
@dataclass
class LogEntry:
index: int
term: int
command: dict
@dataclass
class RequestVoteArgs:
term: int
candidate_id: str
last_log_index: int
last_log_term: int
@dataclass
class RequestVoteReply:
term: int
vote_granted: bool
@dataclass
class AppendEntriesArgs:
term: int
leader_id: str
prev_log_index: int
prev_log_term: int
entries: List[LogEntry]
leader_commit: int
@dataclass
class AppendEntriesReply:
term: int
success: bool
class RaftNode:
def __init__(self, node_id: str, peers: List[str]):
# Persistent state
self.current_term: int = 0
self.voted_for: Optional[str] = None
self.log: List[LogEntry] = []
# Volatile state
self.commit_index: int = 0
self.last_applied: int = 0
self.state: NodeState = NodeState.FOLLOWER
# Leader-only state
self.next_index: dict[str, int] = {}
self.match_index: dict[str, int] = {}
self.node_id = node_id
self.peers = peers
def request_vote(self, args: RequestVoteArgs) -> RequestVoteReply:
"""Handle RequestVote RPC."""
if args.term > self.current_term:
self.current_term = args.term
self.state = NodeState.FOLLOWER
self.voted_for = None
log_ok = self._is_log_at_least_as_up_to_date(
args.last_log_index, args.last_log_term
)
vote_ok = (self.voted_for is None or self.voted_for == args.candidate_id)
if args.term == self.current_term and vote_ok and log_ok:
self.voted_for = args.candidate_id
return RequestVoteReply(self.current_term, True)
return RequestVoteReply(self.current_term, False)
def append_entries(self, args: AppendEntriesArgs) -> AppendEntriesReply:
"""Handle AppendEntries RPC."""
if args.term > self.current_term:
self.current_term = args.term
self.state = NodeState.FOLLOWER
if args.term != self.current_term:
return AppendEntriesReply(self.current_term, False)
        # Check if log has entry at prev_log_index with prev_log_term.
        # Guard against prev_log_index < 0 (empty-log case): a negative
        # list index would silently wrap around in Python.
        if args.prev_log_index >= 0:
            if len(self.log) <= args.prev_log_index:
                return AppendEntriesReply(self.current_term, False)
            if self.log[args.prev_log_index].term != args.prev_log_term:
                return AppendEntriesReply(self.current_term, False)
# Append new entries
for entry in args.entries:
if len(self.log) > entry.index:
if self.log[entry.index].term != entry.term:
# Conflict: delete from this point
self.log = self.log[:entry.index]
if len(self.log) <= entry.index:
self.log.append(entry)
# Update commit index
if args.leader_commit > self.commit_index:
self.commit_index = min(args.leader_commit, len(self.log) - 1)
return AppendEntriesReply(self.current_term, True)
def _is_log_at_least_as_up_to_date(self, last_index: int, last_term: int) -> bool:
"""Check if candidate's log is at least as up-to-date as ours."""
if not self.log:
return True
my_last_entry = self.log[-1]
my_last_term = my_last_entry.term
my_last_index = len(self.log) - 1
if last_term != my_last_term:
return last_term > my_last_term
return last_index >= my_last_index
Summary
Key Takeaways
- Raft was designed for understandability, separating consensus into clear subproblems
- Three node states: Follower → Candidate → Leader
- Terms provide a logical clock and prevent stale leaders
- Two main RPCs: RequestVote (election) and AppendEntries (replication + heartbeat)
- Randomized timeouts prevent split votes during elections
- Five safety properties guarantee correctness: election safety, append-only, log matching, leader completeness, and state machine safety
Next Session
In the next session, we'll dive into Leader Election:
- How elections are triggered
- The election algorithm in detail
- Handling split votes
- Leader election examples
Exercises
- State Transitions: Draw the state transition diagram for a node that starts as follower, becomes candidate, wins the election as leader, then discovers a higher term.
- Term Logic: If a node receives an AppendEntries with term=7 but its current term is 9, what should it do?
- Log Comparison: Compare these two logs and determine which is more up-to-date:
  - Log A: last index=15, last term=5
  - Log B: last index=12, last term=7
- Split Vote: Describe a scenario where a split vote could occur, and how Raft prevents it from persisting.
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
Raft Leader Election
Session 9, Part 2 - 45 minutes
Learning Objectives
- Understand how Raft elects a leader democratically
- Implement the RequestVote RPC
- Handle election timeouts and randomized intervals
- Prevent split votes with election safety
- Build a working leader election system
Concept: Democratic Leader Election
In the previous chapter, we learned about Raft's design philosophy. Now let's dive into the leader election mechanism — the democratic process by which nodes agree on who should lead.
Why Do We Need a Leader?
Without a Leader:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node A │ │ Node B │ │ Node C │
│ │ │ │ │ │
│ "I'm │ │ "No, │ │ "Both │
│ leader!" │ │ I am!" │ │ wrong!" │
└─────────┘ └─────────┘ └─────────┘
Chaos! Split brain! Confusion!
With Raft Leader Election:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node A │ │ Node B │ │ Node C │
│ │ │ │ │ │
│ "I │ │ "I │ │ "I vote │
│ vote │---> │ vote │---> │ for │
│ for B" │ │ for B" │ │ B" │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────┴───────────────┘
│
▼
┌──────────┐
│ Node B │
│ = LEADER │
└──────────┘
Key Insight: Nodes vote for each other. The node with majority votes becomes leader.
State Transitions During Election
Raft nodes cycle through three states during leader election:
stateDiagram-v2
[*] --> Follower: Start
Follower --> Candidate: Election timeout
Follower --> Follower: Receive valid AppendEntries
Follower --> Follower: Discover higher term
Candidate --> Leader: Receive majority votes
Candidate --> Candidate: Split vote (timeout)
Candidate --> Follower: Discover higher term
Candidate --> Follower: Receive valid AppendEntries
Leader --> Follower: Discover higher term
note right of Follower
- Votes for at most one candidate per term
- Resets election timeout on heartbeat
end note
note right of Candidate
- Increments current term
- Votes for self
- Sends RequestVote to all nodes
- Randomized timeout prevents deadlock
end note
note right of Leader
- Sends heartbeats (empty AppendEntries)
- Handles client requests
- Replicates log entries
end note
The Election Algorithm Step by Step
Step 1: Follower Timeout
When a follower doesn't hear from a leader within the election timeout:
Time ────────────────────────────────────────────────────────>
Node A: [waiting...] [waiting...] ⏱️ TIMEOUT! → Become Candidate
Node B: [waiting...] [waiting...] [waiting...]
Node C: [waiting...] [waiting...] [waiting...]
Step 2: Become Candidate
The node transitions to candidate state:
sequenceDiagram
participant C as Candidate (Node A)
participant A as All Nodes
C->>C: Increment term (e.g., term = 4)
C->>C: Vote for self
C->>A: Send RequestVote(term=4) to all
Note over C: Wait for votes...
par Each follower processes RequestVote
A->>A: If term < currentTerm: reject
A->>A: If already voted for another candidate: reject
A->>A: If candidate log is up-to-date: grant vote
end
A-->>C: Send vote response
alt Majority votes received
C->>C: Become LEADER
else Split vote
C->>C: Wait for timeout, then retry
end
Step 3: RequestVote RPC
The RequestVote RPC is the ballot paper in Raft's election:
graph LR
subgraph RequestVote RPC
C[term] --> D["Candidate's term"]
E[candidateId] --> F["Node requesting vote"]
G[lastLogIndex] --> H["Index of candidate's last log entry"]
I[lastLogTerm] --> J["Term of candidate's last log entry"]
end
subgraph Response
K[term] --> L["Current term (for candidate to update)"]
M[voteGranted] --> N["true if follower voted"]
end
Voting Rule: A follower grants its vote only if all of the following hold (see the sketch below):
- The candidate's term is at least the follower's currentTerm (a higher term first makes the follower adopt it)
- The follower hasn't already voted for a different candidate in this term
- The candidate's log is at least as up-to-date as the follower's
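As a compact Python sketch of that rule (an illustration, assuming the caller has already applied the term-update rule for a higher incoming term):
# Vote decision sketch. Assumes a higher incoming term was already adopted
# (current_term updated, voted_for cleared) before this is called.
def should_grant_vote(msg_term: int, current_term: int,
                      voted_for, candidate_id: str,
                      cand_last_term: int, cand_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    if msg_term < current_term:
        return False  # stale candidate
    if voted_for not in (None, candidate_id):
        return False  # already voted for someone else this term
    # Log up-to-date check: higher last term wins; ties broken by length
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index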
Randomized Election Timeouts
The Split Vote Problem
Without randomization, simultaneous elections cause deadlocks:
Bad: Fixed timeouts cause repeated split votes
Node A: timeout at T=100 → Candidate, gets 1 vote
Node B: timeout at T=100 → Candidate, gets 1 vote
Node C: timeout at T=100 → Candidate, gets 1 vote
Result: Nobody wins! Election timeout...
Same thing repeats forever!
Solution: Randomized Intervals
Each node picks a random timeout within a range:
gantt
title Election Timeouts (Randomized: 150-300ms)
dateFormat X
axisFormat %L
Node A :a1, 0, 180
Node B :b1, 0, 220
Node C :c1, 0, 160
Node A becomes Candidate :milestone, m1, 180, 0s
Node C becomes Candidate :milestone, m2, 160, 0s
Node C times out first and starts election. Node A and B reset their timeouts when they receive RequestVote, allowing Node C to gather votes.
Rule of thumb: split votes become rare once the randomization range is wide relative to the network round-trip time, because two nodes are then unlikely to time out before the first candidate can gather votes. The Raft paper's timing guidance is broadcastTime ≪ electionTimeout ≪ MTBF. The simulation sketch below makes the effect of widening the range concrete.
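A quick Monte Carlo sketch (parameters are illustrative assumptions, not measurements) estimates how often the two earliest timeouts land within one round-trip of each other, a rough proxy for a split vote:
import random

def split_vote_rate(nodes: int, lo: float, hi: float,
                    rtt: float = 10.0, trials: int = 100_000) -> float:
    """Fraction of trials where the two earliest timeouts fall within one
    round-trip (rtt) of each other -- a rough proxy for a split vote."""
    splits = 0
    for _ in range(trials):
        timeouts = sorted(random.uniform(lo, hi) for _ in range(nodes))
        if timeouts[1] - timeouts[0] < rtt:
            splits += 1
    return splits / trials

print(split_vote_rate(3, 150, 300))   # narrower range: more collisions
print(split_vote_rate(3, 150, 1000))  # wider range: fewer collisions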
TypeScript Implementation
Let's build a working Raft leader election system:
Core Types
// types/raft.ts
export type NodeState = 'follower' | 'candidate' | 'leader';
export interface LogEntry {
index: number;
term: number;
command: unknown;
}
export interface RaftNodeConfig {
id: string;
peers: string[]; // List of peer node IDs
electionTimeoutMin: number; // Minimum timeout in ms
electionTimeoutMax: number; // Maximum timeout in ms
}
export interface RequestVoteArgs {
term: number;
candidateId: string;
lastLogIndex: number;
lastLogTerm: number;
}
export interface RequestVoteReply {
term: number;
voteGranted: boolean;
}
export interface AppendEntriesArgs {
term: number;
leaderId: string;
prevLogIndex: number;
prevLogTerm: number;
entries: LogEntry[];
leaderCommit: number;
}
export interface AppendEntriesReply {
term: number;
success: boolean;
}
Raft Node Implementation
// raft-node.ts
import { RaftNodeConfig, NodeState, LogEntry, RequestVoteArgs, RequestVoteReply } from './types';
export class RaftNode {
private state: NodeState = 'follower';
private currentTerm: number = 0;
private votedFor: string | null = null;
private log: LogEntry[] = [];
// Election timeout
private electionTimer: NodeJS.Timeout | null = null;
private lastHeartbeat: number = Date.now();
// Leader-only state
private leaderId: string | null = null;
constructor(private config: RaftNodeConfig) {
this.startElectionTimer();
}
// ========== Public API ==========
getState(): NodeState {
return this.state;
}
getCurrentTerm(): number {
return this.currentTerm;
}
getLeader(): string | null {
return this.leaderId;
}
// ========== RPC Handlers ==========
/**
* Invoked by candidates to gather votes
*/
requestVote(args: RequestVoteArgs): RequestVoteReply {
const reply: RequestVoteReply = {
term: this.currentTerm,
voteGranted: false
};
// Rule 1: If candidate's term is lower, reject
if (args.term < this.currentTerm) {
return reply;
}
// Rule 2: If candidate's term is higher, update and become follower
if (args.term > this.currentTerm) {
this.becomeFollower(args.term);
reply.term = args.term;
}
// Rule 3: If we already voted for someone else this term, reject
if (this.votedFor !== null && this.votedFor !== args.candidateId) {
return reply;
}
// Rule 4: Check if candidate's log is at least as up-to-date as ours
const lastEntry = this.log.length > 0 ? this.log[this.log.length - 1] : null;
const lastLogIndex = lastEntry ? lastEntry.index : 0;
const lastLogTerm = lastEntry ? lastEntry.term : 0;
const logIsUpToDate =
(args.lastLogTerm > lastLogTerm) ||
(args.lastLogTerm === lastLogTerm && args.lastLogIndex >= lastLogIndex);
if (!logIsUpToDate) {
return reply;
}
// Grant vote
this.votedFor = args.candidateId;
reply.voteGranted = true;
this.resetElectionTimer();
console.log(`Node ${this.config.id} voted for ${args.candidateId} in term ${args.term}`);
return reply;
}
/**
* Invoked by leader to assert authority (heartbeat or log replication)
*/
receiveHeartbeat(term: number, leaderId: string): void {
if (term >= this.currentTerm) {
if (term > this.currentTerm) {
this.becomeFollower(term);
}
this.leaderId = leaderId;
this.resetElectionTimer();
}
}
// ========== State Transitions ==========
private becomeFollower(term: number): void {
this.state = 'follower';
this.currentTerm = term;
this.votedFor = null;
this.leaderId = null;
this.resetElectionTimer();
console.log(`Node ${this.config.id} became follower in term ${term}`);
}
  private becomeCandidate(): void {
    this.state = 'candidate';
    this.currentTerm += 1;
    this.votedFor = this.config.id;
    this.leaderId = null;
    // Restart the randomized timer so a split vote leads to a fresh election
    this.resetElectionTimer();
    console.log(`Node ${this.config.id} became candidate in term ${this.currentTerm}`);
    // Start election
    this.startElection();
  }
private becomeLeader(): void {
this.state = 'leader';
this.leaderId = this.config.id;
console.log(`Node ${this.config.id} became LEADER in term ${this.currentTerm}`);
// Start sending heartbeats
this.startHeartbeats();
}
// ========== Election Logic ==========
private startElectionTimer(): void {
if (this.electionTimer) {
clearTimeout(this.electionTimer);
}
const timeout = this.getRandomElectionTimeout();
    this.electionTimer = setTimeout(() => {
      // Followers start an election; candidates retry after a split vote.
      // Only a leader ignores the timeout.
      if (this.state !== 'leader') {
        console.log(`Node ${this.config.id} election timeout`);
        this.becomeCandidate();
      }
    }, timeout);
}
private resetElectionTimer(): void {
this.startElectionTimer();
}
private getRandomElectionTimeout(): number {
const { electionTimeoutMin, electionTimeoutMax } = this.config;
return Math.floor(
Math.random() * (electionTimeoutMax - electionTimeoutMin + 1)
) + electionTimeoutMin;
}
private async startElection(): Promise<void> {
const args: RequestVoteArgs = {
term: this.currentTerm,
candidateId: this.config.id,
lastLogIndex: this.log.length > 0 ? this.log[this.log.length - 1].index : 0,
lastLogTerm: this.log.length > 0 ? this.log[this.log.length - 1].term : 0
};
    let votesReceived = 1; // Vote for self
    // Cluster size = peers + self; majority is a strict majority of that
    const majority = Math.floor((this.config.peers.length + 1) / 2) + 1;
// Send RequestVote to all peers
const promises = this.config.peers.map(peerId =>
this.sendRequestVote(peerId, args)
);
const responses = await Promise.allSettled(promises);
// Count votes
for (const result of responses) {
if (result.status === 'fulfilled' && result.value.voteGranted) {
votesReceived++;
}
}
// Check if we won
if (votesReceived >= majority && this.state === 'candidate') {
this.becomeLeader();
}
}
// ========== Network Simulation ==========
private async sendRequestVote(
peerId: string,
args: RequestVoteArgs
): Promise<RequestVoteReply> {
// In a real implementation, this would be an HTTP/gRPC call
// For this example, we simulate by calling directly
// In the full example below, we'll use actual HTTP
return {
term: 0,
voteGranted: false
};
}
private startHeartbeats(): void {
// Leader sends periodic heartbeats
// Implementation in full example
}
stop(): void {
if (this.electionTimer) {
clearTimeout(this.electionTimer);
}
}
}
HTTP Server with Raft
// server.ts
import express, { Request, Response } from 'express';
import { RaftNode } from './raft-node';
import { RequestVoteArgs, RequestVoteReply } from './types';
export class RaftServer {
private app: express.Application;
private node: RaftNode;
private server: any;
constructor(
private nodeId: string,
private port: number,
peers: string[]
) {
this.app = express();
this.app.use(express.json());
this.node = new RaftNode({
id: nodeId,
peers: peers,
electionTimeoutMin: 150,
electionTimeoutMax: 300
});
this.setupRoutes();
}
private setupRoutes(): void {
// RequestVote RPC endpoint
this.app.post('/raft/request-vote', (req: Request, res: Response) => {
const args: RequestVoteArgs = req.body;
const reply: RequestVoteReply = this.node.requestVote(args);
res.json(reply);
});
// Heartbeat endpoint
this.app.post('/raft/heartbeat', (req: Request, res: Response) => {
const { term, leaderId } = req.body;
this.node.receiveHeartbeat(term, leaderId);
res.json({ success: true });
});
// Status endpoint
this.app.get('/status', (req: Request, res: Response) => {
res.json({
id: this.nodeId,
state: this.node.getState(),
term: this.node.getCurrentTerm(),
leader: this.node.getLeader()
});
});
}
async start(): Promise<void> {
this.server = this.app.listen(this.port, () => {
console.log(`Node ${this.nodeId} listening on port ${this.port}`);
});
}
stop(): void {
this.node.stop();
if (this.server) {
this.server.close();
}
}
getNode(): RaftNode {
return this.node;
}
}
Python Implementation
The same logic in Python:
# raft_node.py
import asyncio
import random
from dataclasses import dataclass
from enum import Enum
from typing import Optional, List
from datetime import datetime, timedelta
class NodeState(Enum):
FOLLOWER = "follower"
CANDIDATE = "candidate"
LEADER = "leader"
@dataclass
class LogEntry:
index: int
term: int
command: dict
@dataclass
class RequestVoteArgs:
term: int
candidate_id: str
last_log_index: int
last_log_term: int
@dataclass
class RequestVoteReply:
term: int
vote_granted: bool
class RaftNode:
def __init__(self, node_id: str, peers: List[str],
election_timeout_min: int = 150,
election_timeout_max: int = 300):
self.node_id = node_id
self.peers = peers
# Persistent state
self.current_term = 0
self.voted_for: Optional[str] = None
self.log: List[LogEntry] = []
# Volatile state
self.state = NodeState.FOLLOWER
self.leader_id: Optional[str] = None
# Election timeout
self.election_timeout_min = election_timeout_min
self.election_timeout_max = election_timeout_max
self.election_task: Optional[asyncio.Task] = None
self.heartbeat_task: Optional[asyncio.Task] = None
# Start election timer
self.start_election_timer()
async def request_vote(self, args: RequestVoteArgs) -> RequestVoteReply:
"""Handle RequestVote RPC from candidate"""
reply = RequestVoteReply(
term=self.current_term,
vote_granted=False
)
# Rule 1: Reject if candidate's term is lower
if args.term < self.current_term:
return reply
# Rule 2: Update to higher term and become follower
if args.term > self.current_term:
await self.become_follower(args.term)
reply.term = args.term
# Rule 3: Reject if already voted for another candidate
if self.voted_for is not None and self.voted_for != args.candidate_id:
return reply
# Rule 4: Check if candidate's log is up-to-date
last_entry = self.log[-1] if self.log else None
last_log_index = last_entry.index if last_entry else 0
last_log_term = last_entry.term if last_entry else 0
log_is_up_to_date = (
args.last_log_term > last_log_term or
(args.last_log_term == last_log_term and
args.last_log_index >= last_log_index)
)
if not log_is_up_to_date:
return reply
# Grant vote
self.voted_for = args.candidate_id
reply.vote_granted = True
self.reset_election_timer()
print(f"Node {self.node_id} voted for {args.candidate_id} in term {args.term}")
return reply
async def receive_heartbeat(self, term: int, leader_id: str):
"""Handle heartbeat from leader"""
if term >= self.current_term:
if term > self.current_term:
await self.become_follower(term)
self.leader_id = leader_id
self.reset_election_timer()
async def become_follower(self, term: int):
"""Transition to follower state"""
self.state = NodeState.FOLLOWER
self.current_term = term
self.voted_for = None
self.leader_id = None
# Cancel heartbeat task if running
if self.heartbeat_task:
self.heartbeat_task.cancel()
self.heartbeat_task = None
self.reset_election_timer()
print(f"Node {self.node_id} became follower in term {term}")
    async def become_candidate(self):
        """Transition to candidate state and start election"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        self.leader_id = None
        # Restart the randomized timer so a split vote leads to a fresh election
        self.start_election_timer()
        print(f"Node {self.node_id} became candidate in term {self.current_term}")
        await self.start_election()
async def become_leader(self):
"""Transition to leader state"""
self.state = NodeState.LEADER
self.leader_id = self.node_id
print(f"Node {self.node_id} became LEADER in term {self.current_term}")
self.start_heartbeats()
def start_election_timer(self):
"""Start or reset the election timeout timer"""
if self.election_task:
self.election_task.cancel()
timeout = self.get_random_election_timeout()
self.election_task = asyncio.create_task(self.election_timeout(timeout))
def reset_election_timer(self):
"""Reset the election timeout timer"""
self.start_election_timer()
def get_random_election_timeout(self) -> int:
"""Get random timeout within configured range"""
return random.randint(
self.election_timeout_min,
self.election_timeout_max
)
    async def election_timeout(self, timeout_ms: int):
        """Wait for timeout, then start an election unless we are the leader"""
        try:
            await asyncio.sleep(timeout_ms / 1000)
            # Followers start an election; candidates retry after a split vote
            if self.state != NodeState.LEADER:
                print(f"Node {self.node_id} election timeout")
                await self.become_candidate()
        except asyncio.CancelledError:
            pass  # Timer was reset
async def start_election(self):
"""Start leader election by sending RequestVote to all peers"""
args = RequestVoteArgs(
term=self.current_term,
candidate_id=self.node_id,
last_log_index=self.log[-1].index if self.log else 0,
last_log_term=self.log[-1].term if self.log else 0
)
        votes_received = 1  # Vote for self
        # Cluster size = peers + self; majority is a strict majority of that
        majority = (len(self.peers) + 1) // 2 + 1
# Send RequestVote to all peers concurrently
tasks = [
self.send_request_vote(peer_id, args)
for peer_id in self.peers
]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Count votes
for result in results:
if isinstance(result, RequestVoteReply) and result.vote_granted:
votes_received += 1
# Check if we won the election
if votes_received >= majority and self.state == NodeState.CANDIDATE:
await self.become_leader()
async def send_request_vote(self, peer_id: str, args: RequestVoteArgs) -> RequestVoteReply:
"""Send RequestVote RPC to peer (simulated)"""
# In real implementation, use HTTP/aiohttp
# For this example, return mock response
return RequestVoteReply(term=0, vote_granted=False)
def start_heartbeats(self):
"""Leader sends periodic heartbeats"""
if self.heartbeat_task:
self.heartbeat_task.cancel()
self.heartbeat_task = asyncio.create_task(self.send_heartbeats())
    async def send_heartbeats(self):
        """Send empty AppendEntries (heartbeats) to all followers"""
        while self.state == NodeState.LEADER:
            for peer_id in self.peers:
                pass  # In a real implementation, POST an empty AppendEntries here
            await asyncio.sleep(0.05)  # Heartbeat interval: one round every 50ms
def stop(self):
"""Stop the node"""
if self.election_task:
self.election_task.cancel()
if self.heartbeat_task:
self.heartbeat_task.cancel()
Flask Server with Raft
# server.py
from flask import Flask, request, jsonify
from raft_node import RaftNode, RequestVoteArgs
import asyncio
app = Flask(__name__)
class RaftServer:
def __init__(self, node_id: str, port: int, peers: list):
self.node_id = node_id
self.port = port
self.node = RaftNode(node_id, peers)
self.app = app
self.setup_routes()
def setup_routes(self):
        @self.app.route('/raft/request-vote', methods=['POST'])
        def request_vote():
            args = RequestVoteArgs(**request.json)
            # Note: asyncio.run creates a fresh event loop per request, which
            # does not mix well with RaftNode's background timer tasks; an
            # async framework (aiohttp, FastAPI) is a better fit in practice.
            reply = asyncio.run(self.node.request_vote(args))
return jsonify({
'term': reply.term,
'voteGranted': reply.vote_granted
})
@self.app.route('/raft/heartbeat', methods=['POST'])
def heartbeat():
data = request.json
asyncio.run(self.node.receive_heartbeat(
data['term'], data['leaderId']
))
return jsonify({'success': True})
@self.app.route('/status', methods=['GET'])
def status():
return jsonify({
'id': self.node_id,
'state': self.node.state.value,
'term': self.node.current_term,
'leader': self.node.leader_id
})
def run(self):
self.app.run(port=self.port, debug=False)
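# main.py -- minimal entrypoint sketch (an assumption, not shown in the
# course repo): wires the NODE_ID/PORT/PEERS environment variables from
# the docker-compose files below into RaftServer. PEERS is assumed to be
# comma-separated, e.g. "node2:4002,node3:4003".
import os

from server import RaftServer

if __name__ == "__main__":
    node_id = os.environ.get("NODE_ID", "node1")
    port = int(os.environ.get("PORT", "4001"))
    peers = [p for p in os.environ.get("PEERS", "").split(",") if p]
    RaftServer(node_id, port, peers).run()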
Docker Compose Setup
Let's deploy a 3-node Raft cluster:
# docker-compose.yml
version: '3.8'
services:
node1:
build:
context: ./examples/04-consensus
dockerfile: Dockerfile.typescript
container_name: raft-node1
environment:
- NODE_ID=node1
- PORT=3001
- PEERS=node2:3002,node3:3003
ports:
- "3001:3001"
networks:
- raft-network
node2:
build:
context: ./examples/04-consensus
dockerfile: Dockerfile.typescript
container_name: raft-node2
environment:
- NODE_ID=node2
- PORT=3002
- PEERS=node1:3001,node3:3003
ports:
- "3002:3002"
networks:
- raft-network
node3:
build:
context: ./examples/04-consensus
dockerfile: Dockerfile.typescript
container_name: raft-node3
environment:
- NODE_ID=node3
- PORT=3003
- PEERS=node1:3001,node2:3002
ports:
- "3003:3003"
networks:
- raft-network
networks:
raft-network:
driver: bridge
Running the Example
TypeScript Version
# 1. Build and start the cluster
cd distributed-systems-course/examples/04-consensus
docker-compose up
# 2. Watch the election happen in the logs
# You'll see nodes transition from follower → candidate → leader
# 3. Check the status of each node
curl http://localhost:3001/status
curl http://localhost:3002/status
curl http://localhost:3003/status
# 4. Kill the leader and watch re-election
docker-compose stop node1 # If node1 was leader
# Watch the logs to see a new leader elected!
# 5. Clean up
docker-compose down
Python Version
# 1. Build and start the cluster
cd distributed-systems-course/examples/04-consensus
docker-compose -f docker-compose.python.yml up
# 2-5. Same as above, using ports 4001-4003 for Python nodes
Exercises
Exercise 1: Observe Election Safety
Run the cluster and answer these questions:
- How long does it take for a leader to be elected?
- What happens if you start nodes at different times?
- Can you observe a split vote scenario? (Hint: cause a network partition)
Exercise 2: Implement Pre-Vote
Pre-vote is an optimization that prevents disrupting a stable leader:
- Research the pre-vote mechanism
- Modify the RequestVote handler to check if leader is alive first
- Test that pre-vote prevents unnecessary elections
Exercise 3: Election Timeout Tuning
Experiment with different timeout ranges:
- Try 50-100ms: What happens? (Hint: too many elections)
- Try 500-1000ms: What happens? (Hint: slow leader failover)
- Find the optimal range for a 3-node cluster
Exercise 4: Network Partition Simulation
Simulate a network partition:
- Start the cluster and wait for leader election
- Isolate node1 from the network (using Docker network isolation; example commands below)
- Observe: Does node1 think it's still leader?
- Reconnect: Does the cluster recover correctly?
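One way to create and heal the partition, assuming the container names from the compose file above (Compose may prefix the network name with the project name, so look it up first):
# Find the cluster network (Compose usually prefixes it with the project name)
docker network ls | grep raft-network

# Detach node1 from the cluster network, then watch a new leader emerge
docker network disconnect <network-name> raft-node1

# Reattach node1; it should rejoin as a follower and catch up
docker network connect <network-name> raft-node1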
Summary
In this chapter, you learned:
- Why leader election matters: Prevents split-brain and confusion
- Raft's democratic process: Nodes vote for each other
- State transitions: Follower → Candidate → Leader
- RequestVote RPC: The ballot paper of Raft elections
- Randomized timeouts: Prevent split votes and deadlocks
- Election safety: At most one leader per term
Next Chapter: Log Replication — Once we have a leader, how do we safely replicate data across the cluster?
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
Log Replication
Session 10, Part 1 - 30 minutes
Learning Objectives
- Understand how Raft replicates logs across nodes
- Learn the log matching property that ensures consistency
- Implement the AppendEntries RPC
- Handle log consistency conflicts
- Understand commit index and state machine application
Concept: Keeping Everyone in Sync
Once a leader is elected, it needs to replicate client commands to all followers. This is the log replication phase of Raft.
The Challenge
Client sends "SET x = 5" to Leader
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Leader │ │ Follower │ │ Follower │
│ │ │ A │ │ B │
└────┬─────┘ └──────────┘ └──────────┘
│
│ How do we ensure ALL nodes
│ have the SAME command log?
│
│ What if network fails?
│ What if follower crashes?
▼
┌─────────────────────────────────────────┐
│ Log Replication Protocol │
└─────────────────────────────────────────┘
Log Structure
Each node maintains a log of commands. A log entry contains:
interface LogEntry {
index: number; // Position in the log (starts at 1)
term: number; // Term when entry was received
command: string; // The actual command (e.g., "SET x = 5")
}
@dataclass
class LogEntry:
index: int # Position in the log (starts at 1)
term: int # Term when entry was received
command: str # The actual command (e.g., "SET x = 5")
Visual Log Representation
Node 1 (Leader) Node 2 (Follower) Node 3 (Follower)
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Index │ Term │ Cmd│ │ Index │ Term │ Cmd│ │ Index │ Term │ Cmd│
├───────┼──────┼────┤ ├───────┼──────┼────┤ ├───────┼──────┼────┤
│ 1 │ 1 │SET │ │ 1 │ 1 │SET │ │ 1 │ 1 │SET │
│ 2 │ 2 │SET │ │ 2 │ 2 │SET │ │ 2 │ 2 │SET │
│ 3 │ 2 │SET │ │ 3 │ 2 │SET │ │ │ │ │
│ 4 │ 2 │SET │ │ │ │ │ │ │ │ │
└───────┴──────┴────┘ └───────┴──────┴────┘ └───────┴──────┴────┘
The Log Matching Property
This is Raft's key safety guarantee. If two logs contain an entry with the same index and term, then all preceding entries are identical and in the same order.
Log Matching Property
┌────────────────────────────────────────────────────────┐
│                                                        │
│  If logA[i].term == logB[i].term at the same index i   │
│                                                        │
│  THEN:                                                 │
│    logA[k] == logB[k] for all k <= i                   │
│                                                        │
└────────────────────────────────────────────────────────┘
Example:
Node A: [1,1] [2,1] [3,2] [4,2] [5,2]
Node B: [1,1] [2,1] [3,2] [4,2] [5,3] [6,3]
                 └─ Same index 3, term 2:
                    entries 1-3 are therefore IDENTICAL
This property allows Raft to efficiently detect and fix inconsistencies.
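To make that concrete, here is a small Python sketch that computes how much of two logs is guaranteed identical from (index, term) pairs alone:
# Sketch: highest index at which two logs are guaranteed identical by the
# Log Matching property. Entries expose .index and .term, sorted by index
# (starting at 1); 0 means no guaranteed common prefix.
def guaranteed_common_prefix(log_a, log_b) -> int:
    best = 0
    for a, b in zip(log_a, log_b):
        if a.index == b.index and a.term == b.term:
            best = a.index
        else:
            break
    return best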
AppendEntries RPC
The leader uses AppendEntries to replicate log entries and also as a heartbeat.
RPC Specification
interface AppendEntriesRequest {
term: number; // Leader's term
leaderId: string; // So follower can redirect clients
prevLogIndex: number; // Index of log entry immediately preceding new ones
prevLogTerm: number; // Term of prevLogIndex entry
entries: LogEntry[]; // Log entries to store (empty for heartbeat)
leaderCommit: number; // Leader's commit index
}
interface AppendEntriesResponse {
term: number; // Current term, for leader to update itself
success: boolean; // True if follower contained entry matching prevLogIndex/term
}
@dataclass
class AppendEntriesRequest:
term: int # Leader's term
leader_id: str # So follower can redirect clients
prev_log_index: int # Index of log entry immediately preceding new ones
prev_log_term: int # Term of prevLogIndex entry
entries: List[LogEntry] # Log entries to store (empty for heartbeat)
leader_commit: int # Leader's commit index
@dataclass
class AppendEntriesResponse:
term: int # Current term, for leader to update itself
success: bool # True if follower contained entry matching prevLogIndex/term
Log Replication Flow
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
participant F3 as Follower 3
C->>L: SET x = 5
L->>L: Append to log (uncommitted)
L->>F1: AppendEntries(entries=[SET x=5], prevLogIndex=2, prevLogTerm=1)
L->>F2: AppendEntries(entries=[SET x=5], prevLogIndex=2, prevLogTerm=1)
L->>F3: AppendEntries(entries=[SET x=5], prevLogIndex=2, prevLogTerm=1)
F1->>F1: Append to log, reply success
F2->>F2: Append to log, reply success
F3->>F3: Append to log, reply success
Note over L: Majority reached (leader + 2 followers = 3/4)
L->>L: Commit index = 3
L->>L: Apply to state machine: x = 5
L->>C: Return success (x = 5)
L->>F1: AppendEntries(entries=[], leaderCommit=3)
L->>F2: AppendEntries(entries=[], leaderCommit=3)
L->>F3: AppendEntries(entries=[], leaderCommit=3)
F1->>F1: Apply committed entries
F2->>F2: Apply committed entries
F3->>F3: Apply committed entries
Handling Consistency Conflicts
When a follower's log conflicts with the leader's, the leader resolves it:
graph TD
A[Leader sends AppendEntries] --> B{Follower checks<br/>prevLogIndex/term}
B -->|Match found| C[Append new entries<br/>Return success=true]
B -->|No match| D[Return success=false]
D --> E[Leader decrements<br/>nextIndex for follower]
E --> F{Retry with<br/>earlier log position?}
F -->|Yes| A
F -->|No match at index 0| G[Append leader's<br/>entire log]
C --> H[Follower updates<br/>commit index if needed]
H --> I[Apply committed entries<br/>to state machine]
Conflict Example
Before Conflict Resolution:
Leader: [1,1] [2,2] [3,2]
Follower:[1,1] [2,1] [3,1] [4,3] ← Diverged at index 2!
Step 1: Leader sends AppendEntries(prevLogIndex=2, prevLogTerm=2)
Follower: No match! (has term 1, not 2) → Return success=false
Step 2: Leader decrements nextIndex, sends AppendEntries(prevLogIndex=1, prevLogTerm=1)
Follower: Match! → Return success=true
Step 3: Leader sends entries starting from index 2
Follower overwrites [2,1] [3,1] [4,3] with [2,2] [3,2]
After Conflict Resolution:
Leader: [1,1] [2,2] [3,2]
Follower:[1,1] [2,2] [3,2] ← Now consistent!
Commit Index
The commit index tracks which log entries are committed (durable and safe to apply).
// Leader rule: an entry from the *current* term is committed once it is
// stored on a majority of servers. commitIndex tracks the highest such entry.
// (Sketched as a method on the leader node; isMajorityReplicated counts
// followers via matchIndex -- a sketch of it follows the Python version.)
updateCommitIndex(): void {
  const N = this.log.length;
  // Find the largest index i such that:
  // 1. a majority of nodes have log entries up to i, and
  // 2. log[i].term === currentTerm (safety rule!)
  for (let i = N; i > this.commitIndex; i--) {
    if (this.log[i - 1].term === this.currentTerm && this.isMajorityReplicated(i)) {
      this.commitIndex = i;
      break;
    }
  }
}
commit_index: int = 0 # Index of highest committed entry
# Leader rule: An entry from current term is committed
# once it's stored on a majority of servers
def update_commit_index(self) -> None:
N = len(self.log)
# Find the largest N such that:
# 1. A majority of nodes have log entries up to N
# 2. log[N].term == currentTerm (safety rule!)
for i in range(N, self.commit_index, -1):
if self.log[i - 1].term == self.current_term and self.is_majority_replicated(i):
self.commit_index = i
break
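# Sketch of the is_majority_replicated helper that both snippets above rely
# on but do not define. Assumes self.match_index maps each follower's id to
# the highest log index known to be replicated on that follower.
def is_majority_replicated(self, index: int) -> bool:
    count = 1  # the leader itself always has the entry
    for match in self.match_index.values():
        if match >= index:
            count += 1
    cluster_size = len(self.match_index) + 1  # followers + the leader
    return count > cluster_size // 2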
Safety Rule: Only Commit Current Term Entries
graph LR
A[Entry from<br/>previous term] -->|Can be<br/>committed| B[When current<br/>term entry exists]
C[Entry from<br/>current term] -->|Can be<br/>committed| D[When replicated<br/>to majority]
B --> E[Applied to<br/>state machine]
D --> E
style B fill:#f99
style D fill:#9f9
Why? Prevents a leader from committing uncommitted entries from previous terms that could be overwritten.
TypeScript Implementation
Let's extend our Raft implementation with log replication:
// types.ts
export enum NodeState {
  Follower = 'follower',
  Candidate = 'candidate',
  Leader = 'leader',
}
export interface LogEntry {
index: number;
term: number;
command: string;
}
export interface AppendEntriesRequest {
term: number;
leaderId: string;
prevLogIndex: number;
prevLogTerm: number;
entries: LogEntry[];
leaderCommit: number;
}
export interface AppendEntriesResponse {
term: number;
success: boolean;
}
// raft-node.ts
export class RaftNode {
private log: LogEntry[] = [];
private commitIndex = 0;
private lastApplied = 0;
// For each follower, track next log index to send
private nextIndex: Map<string, number> = new Map();
private matchIndex: Map<string, number> = new Map();
// ... (previous code from leader election)
/**
* Handle AppendEntries RPC from leader
*/
handleAppendEntries(req: AppendEntriesRequest): AppendEntriesResponse {
// Reply false if term < currentTerm
if (req.term < this.currentTerm) {
return { term: this.currentTerm, success: false };
}
// Update current term if needed
if (req.term > this.currentTerm) {
this.currentTerm = req.term;
this.state = NodeState.Follower;
this.votedFor = null;
}
// Reset election timeout
this.resetElectionTimeout();
// Check log consistency
if (req.prevLogIndex > 0) {
if (this.log.length < req.prevLogIndex) {
return { term: this.currentTerm, success: false };
}
const prevEntry = this.log[req.prevLogIndex - 1];
if (prevEntry.term !== req.prevLogTerm) {
return { term: this.currentTerm, success: false };
}
}
// Append new entries
if (req.entries.length > 0) {
// Find first conflicting entry
let insertIndex = req.prevLogIndex;
for (const entry of req.entries) {
if (insertIndex < this.log.length) {
const existing = this.log[insertIndex];
if (existing.index === entry.index && existing.term === entry.term) {
// Already matches, skip
insertIndex++;
continue;
}
// Conflict! Delete from here and append
this.log = this.log.slice(0, insertIndex);
}
this.log.push(entry);
insertIndex++;
}
}
// Update commit index
if (req.leaderCommit > this.commitIndex) {
this.commitIndex = Math.min(req.leaderCommit, this.log.length);
this.applyCommittedEntries();
}
return { term: this.currentTerm, success: true };
}
/**
* Apply committed entries to state machine
*/
private applyCommittedEntries(): void {
while (this.lastApplied < this.commitIndex) {
this.lastApplied++;
const entry = this.log[this.lastApplied - 1];
this.stateMachine.apply(entry);
console.log(`Node ${this.nodeId} applied: ${entry.command}`);
}
}
/**
* Leader: replicate log to followers
*/
private replicateLog(): void {
if (this.state !== NodeState.Leader) return;
for (const followerId of this.clusterConfig.peerIds) {
const nextIdx = this.nextIndex.get(followerId) || 1;
const prevLogIndex = nextIdx - 1;
const prevLogTerm = prevLogIndex > 0 ? this.log[prevLogIndex - 1].term : 0;
const entries = this.log.slice(nextIdx - 1);
const req: AppendEntriesRequest = {
term: this.currentTerm,
leaderId: this.nodeId,
prevLogIndex,
prevLogTerm,
entries,
leaderCommit: this.commitIndex,
};
this.sendAppendEntries(followerId, req);
}
}
/**
* Leader: handle AppendEntries response
*/
private handleAppendEntriesResponse(
followerId: string,
resp: AppendEntriesResponse,
req: AppendEntriesRequest
): void {
if (this.state !== NodeState.Leader) return;
if (resp.term > this.currentTerm) {
// Follower has higher term, step down
this.currentTerm = resp.term;
this.state = NodeState.Follower;
this.votedFor = null;
return;
}
if (resp.success) {
// Update match index and next index
const lastIndex = req.prevLogIndex + req.entries.length;
this.matchIndex.set(followerId, lastIndex);
this.nextIndex.set(followerId, lastIndex + 1);
// Try to commit more entries
this.updateCommitIndex();
} else {
// Follower's log is inconsistent, backtrack
const currentNext = this.nextIndex.get(followerId) || 1;
this.nextIndex.set(followerId, Math.max(1, currentNext - 1));
// Retry immediately
setTimeout(() => this.replicateLog(), 50);
}
}
/**
* Leader: update commit index if majority has entry
*/
private updateCommitIndex(): void {
if (this.state !== NodeState.Leader) return;
const N = this.log.length;
// Find the largest N such that a majority have log entries up to N
for (let i = N; i > this.commitIndex; i--) {
if (this.log[i - 1].term !== this.currentTerm) {
// Only commit entries from current term
continue;
}
let count = 1; // Leader has it
for (const matchIdx of this.matchIndex.values()) {
if (matchIdx >= i) count++;
}
      const majority = Math.floor((this.clusterConfig.peerIds.length + 1) / 2) + 1; // peers + self
if (count >= majority) {
this.commitIndex = i;
this.applyCommittedEntries();
break;
}
}
}
/**
* Client: submit command to cluster
*/
async submitCommand(command: string): Promise<void> {
if (this.state !== NodeState.Leader) {
throw new Error('Not a leader. Redirect to actual leader.');
}
// Append to local log
const entry: LogEntry = {
index: this.log.length + 1,
term: this.currentTerm,
command,
};
this.log.push(entry);
// Replicate to followers
this.replicateLog();
// Wait for commit
await this.waitForCommit(entry.index);
}
private async waitForCommit(index: number): Promise<void> {
return new Promise((resolve) => {
const check = () => {
if (this.commitIndex >= index) {
resolve();
} else {
setTimeout(check, 50);
}
};
check();
});
}
}
Python Implementation
# types.py
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class LogEntry:
index: int
term: int
command: str
@dataclass
class AppendEntriesRequest:
term: int
leader_id: str
prev_log_index: int
prev_log_term: int
entries: List[LogEntry]
leader_commit: int
@dataclass
class AppendEntriesResponse:
term: int
success: bool
# raft_node.py
import asyncio
from enum import Enum
from typing import List, Dict, Optional
class NodeState(Enum):
FOLLOWER = "follower"
CANDIDATE = "candidate"
LEADER = "leader"
class RaftNode:
def __init__(self, node_id: str, peer_ids: List[str]):
self.node_id = node_id
self.peer_ids = peer_ids
# Persistent state
self.current_term = 0
self.voted_for: Optional[str] = None
self.log: List[LogEntry] = []
# Volatile state
self.commit_index = 0
self.last_applied = 0
self.state = NodeState.FOLLOWER
# Leader state
self.next_index: Dict[str, int] = {}
self.match_index: Dict[str, int] = {}
# State machine
self.state_machine = StateMachine()
# Election timeout
self.election_timeout: Optional[asyncio.Task] = None
async def handle_append_entries(self, req: AppendEntriesRequest) -> AppendEntriesResponse:
"""Handle AppendEntries RPC from leader"""
# Reply false if term < currentTerm
if req.term < self.current_term:
return AppendEntriesResponse(term=self.current_term, success=False)
# Update current term if needed
if req.term > self.current_term:
self.current_term = req.term
self.state = NodeState.FOLLOWER
self.voted_for = None
# Reset election timeout
self.reset_election_timeout()
# Check log consistency
if req.prev_log_index > 0:
if len(self.log) < req.prev_log_index:
return AppendEntriesResponse(term=self.current_term, success=False)
prev_entry = self.log[req.prev_log_index - 1]
if prev_entry.term != req.prev_log_term:
return AppendEntriesResponse(term=self.current_term, success=False)
# Append new entries
if req.entries:
# Find first conflicting entry
insert_index = req.prev_log_index
for entry in req.entries:
if insert_index < len(self.log):
existing = self.log[insert_index]
if existing.index == entry.index and existing.term == entry.term:
# Already matches, skip
insert_index += 1
continue
# Conflict! Delete from here and append
self.log = self.log[:insert_index]
self.log.append(entry)
insert_index += 1
# Update commit index
if req.leader_commit > self.commit_index:
self.commit_index = min(req.leader_commit, len(self.log))
await self.apply_committed_entries()
return AppendEntriesResponse(term=self.current_term, success=True)
async def apply_committed_entries(self):
"""Apply committed entries to state machine"""
while self.last_applied < self.commit_index:
self.last_applied += 1
entry = self.log[self.last_applied - 1]
self.state_machine.apply(entry)
print(f"Node {self.node_id} applied: {entry.command}")
async def replicate_log(self):
"""Leader: replicate log to followers"""
if self.state != NodeState.LEADER:
return
for follower_id in self.peer_ids:
next_idx = self.next_index.get(follower_id, 1)
prev_log_index = next_idx - 1
prev_log_term = self.log[prev_log_index - 1].term if prev_log_index > 0 else 0
entries = self.log[next_idx - 1:]
req = AppendEntriesRequest(
term=self.current_term,
leader_id=self.node_id,
prev_log_index=prev_log_index,
prev_log_term=prev_log_term,
entries=entries,
leader_commit=self.commit_index
)
await self.send_append_entries(follower_id, req)
async def handle_append_entries_response(
self,
follower_id: str,
resp: AppendEntriesResponse,
req: AppendEntriesRequest
):
"""Leader: handle AppendEntries response"""
if self.state != NodeState.LEADER:
return
if resp.term > self.current_term:
# Follower has higher term, step down
self.current_term = resp.term
self.state = NodeState.FOLLOWER
self.voted_for = None
return
if resp.success:
# Update match index and next index
last_index = req.prev_log_index + len(req.entries)
self.match_index[follower_id] = last_index
self.next_index[follower_id] = last_index + 1
# Try to commit more entries
await self.update_commit_index()
else:
# Follower's log is inconsistent, backtrack
current_next = self.next_index.get(follower_id, 1)
self.next_index[follower_id] = max(1, current_next - 1)
# Retry immediately
asyncio.create_task(self.replicate_log())
async def update_commit_index(self):
"""Leader: update commit index if majority has entry"""
if self.state != NodeState.LEADER:
return
N = len(self.log)
# Find the largest N such that a majority have log entries up to N
for i in range(N, self.commit_index, -1):
if self.log[i - 1].term != self.current_term:
# Only commit entries from current term
continue
count = 1 # Leader has it
for match_idx in self.match_index.values():
if match_idx >= i:
count += 1
            majority = (len(self.peer_ids) + 1) // 2 + 1  # peers + self
if count >= majority:
self.commit_index = i
await self.apply_committed_entries()
break
async def submit_command(self, command: str) -> None:
"""Client: submit command to cluster"""
if self.state != NodeState.LEADER:
raise Exception("Not a leader. Redirect to actual leader.")
# Append to local log
entry = LogEntry(
index=len(self.log) + 1,
term=self.current_term,
command=command
)
self.log.append(entry)
# Replicate to followers
await self.replicate_log()
# Wait for commit
await self._wait_for_commit(entry.index)
async def _wait_for_commit(self, index: int):
"""Wait for an entry to be committed"""
while self.commit_index < index:
await asyncio.sleep(0.05)
# state_machine.py
from typing import Dict

from types import LogEntry  # the types module defined earlier in this chapter

class StateMachine:
    """Simple key-value store state machine"""
    def __init__(self):
        self.data: Dict[str, str] = {}
def apply(self, entry: LogEntry):
"""Apply a committed log entry to the state machine"""
        parts = entry.command.split()
        # Commands look like "SET x = 5" (4 tokens) or "DELETE x" (2 tokens)
        if parts[0] == "SET" and len(parts) == 4 and parts[2] == "=":
            key, value = parts[1], parts[3]
            self.data[key] = value
            print(f"Applied: {key} = {value}")
elif parts[0] == "DELETE" and len(parts) == 2:
key = parts[1]
if key in self.data:
del self.data[key]
print(f"Deleted: {key}")
Testing Log Replication
TypeScript Test
// test-log-replication.ts
// Sketch: assumes becomeLeader() and getLog() are exposed for testing and
// that the nodes are wired to a real transport so replication can complete.
async function testLogReplication() {
const nodes = [
new RaftNode('node1', ['node2', 'node3']),
new RaftNode('node2', ['node1', 'node3']),
new RaftNode('node3', ['node1', 'node2']),
];
// Simulate leader election (node1 wins)
await nodes[0].becomeLeader();
// Submit command to leader
await nodes[0].submitCommand('SET x = 5');
// Verify all nodes have the entry
for (const node of nodes) {
const entry = node.getLog()[0];
console.log(`${node.nodeId}: ${entry.command}`);
}
}
Python Test
# test_log_replication.py
# Sketch: assumes become_leader() and get_log() are exposed for testing and
# that the nodes are wired to a real transport so replication can complete.
import asyncio

async def test_log_replication():
nodes = [
RaftNode('node1', ['node2', 'node3']),
RaftNode('node2', ['node1', 'node3']),
RaftNode('node3', ['node1', 'node2']),
]
# Simulate leader election (node1 wins)
await nodes[0].become_leader()
# Submit command to leader
await nodes[0].submit_command('SET x = 5')
# Verify all nodes have the entry
for node in nodes:
entry = node.get_log()[0]
print(f"{node.node_id}: {entry.command}")
asyncio.run(test_log_replication())
Exercises
Exercise 1: Basic Log Replication
- Start a 3-node cluster
- Elect a leader
- Submit SET x 10 to the leader
- Verify the entry is on all nodes
- Check commit index advancement
Expected Result: The entry appears on all nodes after being committed.
Exercise 2: Conflict Resolution
- Start a 3-node cluster
- Create a log divergence (manually edit follower logs)
- Have the leader replicate new entries
- Observe how the follower's log is corrected
Expected Result: The follower's conflicting entries are overwritten.
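One hedged way to script step 2, reusing the test cluster from above. The entry's term of 99 is deliberately bogus so it conflicts with whatever the leader sends next:
# Hypothetical sketch for forcing divergence on a follower (Exercise 2).
# Reuses the nodes list from the Python test above.
async def force_divergence(nodes):
    follower = nodes[1]
    follower.log.append(
        LogEntry(index=len(follower.log) + 1, term=99, command="SET ghost 1")
    )
    await nodes[0].submit_command("SET y 2")  # leader replication repairs the log
    print([e.command for e in follower.log])  # the ghost entry is gone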
Exercise 3: Commit Index Safety
- Start a 5-node cluster
- Partition the network (2 nodes isolated)
- Submit commands to the leader
- Verify entries are committed with majority (3 nodes)
- Heal the partition
- Verify isolated nodes catch up
Expected Result: Commands commit with 3 nodes, isolated nodes catch up after healing.
Exercise 4: State Machine Application
- Implement a key-value store state machine
- Submit multiple SET commands
- Verify the state machine applies them in order
- Kill and restart a node
- Verify the state machine is rebuilt from the log
Expected Result: State machine reflects all committed commands, even after restart.
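Step 5 is just deterministic replay. A minimal sketch, assuming node is a restarted RaftNode whose log was recovered (the in-memory implementation shown in this session forgets it, so you will need to add persistence first):
# Rebuild the state machine by replaying the committed prefix of the log
sm = StateMachine()
for entry in node.log[:node.commit_index]:
    sm.apply(entry)  # same entries, same order -> same state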
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Committing previous term entries | Entries get lost | Only commit entries from current term |
| Not applying entries in order | Inconsistent state | Apply from lastApplied+1 to commitIndex |
| Infinite conflict resolution loop | CPU spike | Ensure nextIndex doesn't go below 1 |
| Applying uncommitted entries | Data loss on leader crash | Never apply before commitIndex |
Key Takeaways
- Log replication ensures all nodes execute the same commands in the same order
- AppendEntries RPC handles both replication and heartbeats
- Log matching property enables efficient conflict resolution
- Commit index tracks which entries are safely replicated
- State machine applies committed entries deterministically
Next: Complete Consensus System Implementation →
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
Consensus System Implementation
Session 10, Part 2 - 60 minutes
Learning Objectives
- Build a complete Raft-based consensus system
- Implement a state machine abstraction (key-value store)
- Create client APIs for get/set operations
- Deploy and test the full system with Docker Compose
- Verify safety and liveness properties
Overview: Putting It All Together
In the previous chapters, we implemented Raft's two core components:
- Leader Election (Session 9) - Democratic voting to select a leader
- Log Replication (Session 10, Part 1) - Replicating commands across nodes
Now we combine them into a complete consensus system - a distributed key-value store that provides strong consistency guarantees.
┌─────────────────────────────────────────────────────────┐
│                  Raft Consensus System                  │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Client ──→ Leader ──→ Log Replication ──→ Followers   │
│                │             │               │          │
│                ▼             ▼               ▼          │
│             Leader Election (if needed)                 │
│                │             │               │          │
│                ▼             ▼               ▼          │
│      State Machine (all nodes apply same commands)      │
│                                                         │
└─────────────────────────────────────────────────────────┘
System Architecture
graph TB
subgraph "Client Layer"
C1[Client 1]
C2[Client 2]
end
subgraph "Raft Cluster"
N1[Node 1: Leader]
N2[Node 2: Follower]
N3[Node 3: Follower]
end
subgraph "State Machine Layer"
SM1[KV Store 1]
SM2[KV Store 2]
SM3[KV Store 3]
end
C1 -->|SET/GET| N1
C2 -->|SET/GET| N1
N1 <-->|AppendEntries RPC| N2
N1 <-->|AppendEntries RPC| N3
N2 <-->|RequestVote RPC| N3
N1 --> SM1
N2 --> SM2
N3 --> SM3
style N1 fill:#9f9
style N2 fill:#fc9
style N3 fill:#fc9
Complete TypeScript Implementation
Project Structure
typescript-raft/
├── package.json
├── tsconfig.json
├── src/
│ ├── types.ts # Shared types
│ ├── state-machine.ts # KV store state machine
│ ├── raft-node.ts # Complete Raft implementation
│ ├── server.ts # HTTP API server
│ └── index.ts # Entry point
└── docker-compose.yml
package.json
{
"name": "typescript-raft-kv-store",
"version": "1.0.0",
"description": "Distributed key-value store using Raft consensus",
"main": "dist/index.js",
"scripts": {
"build": "tsc",
"start": "node dist/index.js",
"dev": "ts-node src/index.ts"
},
"dependencies": {
"express": "^4.18.2",
"axios": "^1.6.0"
},
"devDependencies": {
"@types/express": "^4.17.21",
"@types/node": "^20.10.0",
"ts-node": "^10.9.2",
"typescript": "^5.3.3"
}
}
types.ts
// Node states
export enum NodeState {
FOLLOWER = 'follower',
CANDIDATE = 'candidate',
LEADER = 'leader'
}
// Log entry
export interface LogEntry {
index: number;
term: number;
command: string;
}
// RequestVote RPC
export interface RequestVoteRequest {
term: number;
candidateId: string;
lastLogIndex: number;
lastLogTerm: number;
}
export interface RequestVoteResponse {
term: number;
voteGranted: boolean;
}
// AppendEntries RPC
export interface AppendEntriesRequest {
term: number;
leaderId: string;
prevLogIndex: number;
prevLogTerm: number;
entries: LogEntry[];
leaderCommit: number;
}
export interface AppendEntriesResponse {
term: number;
success: boolean;
}
// Client commands
export interface SetCommand {
type: 'SET';
key: string;
value: string;
}
export interface GetCommand {
type: 'GET';
key: string;
}
export interface DeleteCommand {
type: 'DELETE';
key: string;
}
export type Command = SetCommand | GetCommand | DeleteCommand;
state-machine.ts
import { LogEntry } from './types';
/**
* Key-Value Store State Machine
* Applies committed log entries to build consistent state
*/
export class KVStoreStateMachine {
private data: Map<string, string> = new Map();
/**
* Apply a committed log entry to the state machine
*/
apply(entry: LogEntry): void {
try {
const command = JSON.parse(entry.command);
switch (command.type) {
case 'SET':
this.data.set(command.key, command.value);
console.log(`[State Machine] SET ${command.key} = ${command.value}`);
break;
case 'DELETE':
if (this.data.has(command.key)) {
this.data.delete(command.key);
console.log(`[State Machine] DELETE ${command.key}`);
}
break;
case 'GET':
// Read-only commands don't modify state
break;
default:
console.warn(`[State Machine] Unknown command type: ${command.type}`);
}
} catch (error) {
console.error(`[State Machine] Failed to apply entry:`, error);
}
}
/**
* Get a value from the state machine
*/
get(key: string): string | undefined {
return this.data.get(key);
}
/**
* Get all key-value pairs
*/
getAll(): Record<string, string> {
return Object.fromEntries(this.data);
}
/**
* Clear the state machine (for testing)
*/
clear(): void {
this.data.clear();
}
}
raft-node.ts
import {
NodeState,
LogEntry,
RequestVoteRequest,
RequestVoteResponse,
AppendEntriesRequest,
AppendEntriesResponse,
Command
} from './types';
import { KVStoreStateMachine } from './state-machine';
import axios from 'axios';
interface ClusterConfig {
nodeId: string;
peerIds: string[];
electionTimeoutMin: number;
electionTimeoutMax: number;
heartbeatInterval: number;
}
export class RaftNode {
// Configuration
private config: ClusterConfig;
// Persistent state (survives restarts)
private currentTerm: number = 0;
private votedFor: string | null = null;
private log: LogEntry[] = [];
// Volatile state (reset on restart)
private commitIndex: number = 0;
private lastApplied: number = 0;
private state: NodeState = NodeState.FOLLOWER;
// Leader state (reset on election)
private nextIndex: Map<string, number> = new Map();
private matchIndex: Map<string, number> = new Map();
// Components
private stateMachine: KVStoreStateMachine;
private leaderId: string | null = null;
// Timers
private electionTimer: NodeJS.Timeout | null = null;
private heartbeatTimer: NodeJS.Timeout | null = null;
constructor(config: ClusterConfig) {
this.config = config;
this.stateMachine = new KVStoreStateMachine();
this.resetElectionTimeout();
}
// ========== Public API ==========
/**
* Client: Submit a command to the cluster
*/
async submitCommand(command: Command): Promise<any> {
// Redirect to leader if not leader
    if (this.state !== NodeState.LEADER) {
if (this.leaderId) {
throw new Error(`Not a leader. Please redirect to ${this.leaderId}`);
}
throw new Error('No leader known. Please retry.');
}
// Handle GET commands (read-only, no consensus needed)
if (command.type === 'GET') {
return this.stateMachine.get(command.key);
}
// Append to local log
const entry: LogEntry = {
index: this.log.length + 1,
term: this.currentTerm,
command: JSON.stringify(command)
};
this.log.push(entry);
// Replicate to followers
this.replicateLog();
// Wait for commit
await this.waitForCommit(entry.index);
// Return result
if (command.type === 'SET') {
return { key: command.key, value: command.value };
} else if (command.type === 'DELETE') {
return { key: command.key, deleted: true };
}
}
/**
* Start the node (begin election timeout)
*/
start(): void {
this.resetElectionTimeout();
}
/**
* Stop the node (clear timers)
*/
stop(): void {
if (this.electionTimer) clearTimeout(this.electionTimer);
if (this.heartbeatTimer) clearTimeout(this.heartbeatTimer);
}
// ========== RPC Handlers ==========
/**
* Handle RequestVote RPC
*/
handleRequestVote(req: RequestVoteRequest): RequestVoteResponse {
// If term < currentTerm, reject
if (req.term < this.currentTerm) {
return { term: this.currentTerm, voteGranted: false };
}
// If term > currentTerm, update and become follower
if (req.term > this.currentTerm) {
this.currentTerm = req.term;
this.state = NodeState.FOLLOWER;
this.votedFor = null;
}
    // Grant the vote only if we haven't voted this term (or already voted
    // for this candidate) AND the candidate's log is at least as up-to-date
    // as ours
const logOk = req.lastLogTerm > this.getLastLogTerm() ||
(req.lastLogTerm === this.getLastLogTerm() && req.lastLogIndex >= this.log.length);
const canVote = this.votedFor === null || this.votedFor === req.candidateId;
if (canVote && logOk) {
this.votedFor = req.candidateId;
this.resetElectionTimeout();
return { term: this.currentTerm, voteGranted: true };
}
return { term: this.currentTerm, voteGranted: false };
}
/**
* Handle AppendEntries RPC
*/
handleAppendEntries(req: AppendEntriesRequest): AppendEntriesResponse {
// If term < currentTerm, reject
if (req.term < this.currentTerm) {
return { term: this.currentTerm, success: false };
}
// Recognize leader
this.leaderId = req.leaderId;
// If term > currentTerm, update and become follower
if (req.term > this.currentTerm) {
this.currentTerm = req.term;
this.state = NodeState.FOLLOWER;
this.votedFor = null;
}
// Reset election timeout
this.resetElectionTimeout();
// Check log consistency
if (req.prevLogIndex > 0) {
if (this.log.length < req.prevLogIndex) {
return { term: this.currentTerm, success: false };
}
const prevEntry = this.log[req.prevLogIndex - 1];
if (prevEntry.term !== req.prevLogTerm) {
return { term: this.currentTerm, success: false };
}
}
// Append new entries
if (req.entries.length > 0) {
let insertIndex = req.prevLogIndex;
for (const entry of req.entries) {
if (insertIndex < this.log.length) {
const existing = this.log[insertIndex];
if (existing.index === entry.index && existing.term === entry.term) {
insertIndex++;
continue;
}
// Conflict! Delete from here
this.log = this.log.slice(0, insertIndex);
}
this.log.push(entry);
insertIndex++;
}
}
// Update commit index
if (req.leaderCommit > this.commitIndex) {
this.commitIndex = Math.min(req.leaderCommit, this.log.length);
this.applyCommittedEntries();
}
return { term: this.currentTerm, success: true };
}
// ========== Private Methods ==========
/**
* Start election (convert to candidate)
*/
private startElection(): void {
this.state = NodeState.CANDIDATE;
this.currentTerm++;
this.votedFor = this.config.nodeId;
this.leaderId = null;
console.log(`[Node ${this.config.nodeId}] Starting election for term ${this.currentTerm}`);
// Request votes from peers
const req: RequestVoteRequest = {
term: this.currentTerm,
candidateId: this.config.nodeId,
lastLogIndex: this.log.length,
lastLogTerm: this.getLastLogTerm()
};
let votesReceived = 1; // Vote for self
    const majority = Math.floor((this.config.peerIds.length + 1) / 2) + 1; // cluster size = peers + self
for (const peerId of this.config.peerIds) {
this.sendRequestVote(peerId, req).then(resp => {
if (resp.voteGranted) {
votesReceived++;
if (votesReceived >= majority && this.state === NodeState.CANDIDATE) {
this.becomeLeader();
}
} else if (resp.term > this.currentTerm) {
this.currentTerm = resp.term;
this.state = NodeState.FOLLOWER;
this.votedFor = null;
}
}).catch(() => {
// Peer unavailable, ignore
});
}
// Reset election timeout for next round
this.resetElectionTimeout();
}
/**
* Become leader after winning election
*/
private becomeLeader(): void {
this.state = NodeState.LEADER;
this.leaderId = this.config.nodeId;
console.log(`[Node ${this.config.nodeId}] Became leader for term ${this.currentTerm}`);
// Initialize leader state
for (const peerId of this.config.peerIds) {
this.nextIndex.set(peerId, this.log.length + 1);
this.matchIndex.set(peerId, 0);
}
// Start sending heartbeats
this.startHeartbeats();
}
/**
* Send heartbeats to all followers
*/
private startHeartbeats(): void {
if (this.heartbeatTimer) clearInterval(this.heartbeatTimer);
this.heartbeatTimer = setInterval(() => {
if (this.state === NodeState.LEADER) {
this.replicateLog();
}
}, this.config.heartbeatInterval);
}
/**
* Replicate log to followers (also sends heartbeats)
*/
private replicateLog(): void {
if (this.state !== NodeState.LEADER) return;
for (const followerId of this.config.peerIds) {
const nextIdx = this.nextIndex.get(followerId) || 1;
const prevLogIndex = nextIdx - 1;
const prevLogTerm = prevLogIndex > 0 ? this.log[prevLogIndex - 1].term : 0;
const entries = this.log.slice(nextIdx - 1);
const req: AppendEntriesRequest = {
term: this.currentTerm,
leaderId: this.config.nodeId,
prevLogIndex,
prevLogTerm,
entries,
leaderCommit: this.commitIndex
};
this.sendAppendEntries(followerId, req).then(resp => {
if (this.state !== NodeState.LEADER) return;
if (resp.term > this.currentTerm) {
this.currentTerm = resp.term;
this.state = NodeState.FOLLOWER;
this.votedFor = null;
if (this.heartbeatTimer) clearInterval(this.heartbeatTimer);
return;
}
if (resp.success) {
const lastIndex = prevLogIndex + entries.length;
this.matchIndex.set(followerId, lastIndex);
this.nextIndex.set(followerId, lastIndex + 1);
this.updateCommitIndex();
} else {
const currentNext = this.nextIndex.get(followerId) || 1;
this.nextIndex.set(followerId, Math.max(1, currentNext - 1));
}
}).catch(() => {
// Follower unavailable, will retry
});
}
}
/**
* Update commit index if majority has entry
*/
private updateCommitIndex(): void {
if (this.state !== NodeState.LEADER) return;
const N = this.log.length;
    const majority = Math.floor((this.config.peerIds.length + 1) / 2) + 1; // cluster size = peers + self
for (let i = N; i > this.commitIndex; i--) {
if (this.log[i - 1].term !== this.currentTerm) continue;
let count = 1; // Leader has it
for (const matchIdx of this.matchIndex.values()) {
if (matchIdx >= i) count++;
}
if (count >= majority) {
this.commitIndex = i;
this.applyCommittedEntries();
break;
}
}
}
/**
* Apply committed entries to state machine
*/
private applyCommittedEntries(): void {
while (this.lastApplied < this.commitIndex) {
this.lastApplied++;
const entry = this.log[this.lastApplied - 1];
this.stateMachine.apply(entry);
}
}
/**
* Wait for an entry to be committed
*/
private async waitForCommit(index: number): Promise<void> {
return new Promise((resolve) => {
const check = () => {
if (this.commitIndex >= index) {
resolve();
} else {
setTimeout(check, 50);
}
};
check();
});
}
/**
* Reset election timeout with random value
*/
private resetElectionTimeout(): void {
if (this.electionTimer) clearTimeout(this.electionTimer);
const timeout = this.randomTimeout();
this.electionTimer = setTimeout(() => {
if (this.state !== NodeState.LEADER) {
this.startElection();
}
}, timeout);
}
private randomTimeout(): number {
const min = this.config.electionTimeoutMin;
const max = this.config.electionTimeoutMax;
return Math.floor(Math.random() * (max - min + 1)) + min;
}
private getLastLogTerm(): number {
if (this.log.length === 0) return 0;
return this.log[this.log.length - 1].term;
}
// ========== Network Layer (simplified) ==========
  private async sendRequestVote(peerId: string, req: RequestVoteRequest): Promise<RequestVoteResponse> {
    const url = `http://${peerId}/raft/request-vote`;
    const response = await axios.post(url, req, { timeout: 1000 });
    return response.data;
  }
  private async sendAppendEntries(peerId: string, req: AppendEntriesRequest): Promise<AppendEntriesResponse> {
    const url = `http://${peerId}/raft/append-entries`;
    const response = await axios.post(url, req, { timeout: 1000 });
    return response.data;
  }
// ========== Debug Methods ==========
getState() {
return {
nodeId: this.config.nodeId,
state: this.state,
term: this.currentTerm,
leaderId: this.leaderId,
logLength: this.log.length,
commitIndex: this.commitIndex,
stateMachine: this.stateMachine.getAll()
};
}
}
server.ts
import express from 'express';
import { RaftNode } from './raft-node';
import { Command } from './types';
export function createServer(node: RaftNode): express.Application {
const app = express();
app.use(express.json());
// Raft RPC endpoints
app.post('/raft/request-vote', (req, res) => {
const response = node.handleRequestVote(req.body);
res.json(response);
});
app.post('/raft/append-entries', (req, res) => {
const response = node.handleAppendEntries(req.body);
res.json(response);
});
// Client API endpoints
app.get('/kv/:key', (req, res) => {
const command: Command = { type: 'GET', key: req.params.key };
node.submitCommand(command)
.then(value => res.json({ key: req.params.key, value }))
.catch(err => res.status(500).json({ error: err.message }));
});
app.post('/kv', (req, res) => {
const command: Command = { type: 'SET', key: req.body.key, value: req.body.value };
node.submitCommand(command)
.then(result => res.json(result))
.catch(err => res.status(500).json({ error: err.message }));
});
app.delete('/kv/:key', (req, res) => {
const command: Command = { type: 'DELETE', key: req.params.key };
node.submitCommand(command)
.then(result => res.json(result))
.catch(err => res.status(500).json({ error: err.message }));
});
// Debug endpoint
app.get('/debug', (req, res) => {
res.json(node.getState());
});
return app;
}
index.ts
import { RaftNode } from './raft-node';
import { createServer } from './server';
const NODE_ID = process.env.NODE_ID || 'node1';
const PEER_IDS = process.env.PEER_IDS?.split(',') || [];
const PORT = parseInt(process.env.PORT || '3000');
const node = new RaftNode({
nodeId: NODE_ID,
peerIds: PEER_IDS,
electionTimeoutMin: 150,
electionTimeoutMax: 300,
heartbeatInterval: 50
});
node.start();
const app = createServer(node);
app.listen(PORT, () => {
console.log(`Node ${NODE_ID} listening on port ${PORT}`);
console.log(`Peers: ${PEER_IDS.join(', ')}`);
});
docker-compose.yml
version: '3.8'
services:
node1:
build: .
container_name: raft-node1
environment:
- NODE_ID=node1
- PEER_IDS=node2:3000,node3:3000
- PORT=3000
ports:
- "3001:3000"
node2:
build: .
container_name: raft-node2
environment:
- NODE_ID=node2
- PEER_IDS=node1:3000,node3:3000
- PORT=3000
ports:
- "3002:3000"
node3:
build: .
container_name: raft-node3
environment:
- NODE_ID=node3
- PEER_IDS=node1:3000,node2:3000
- PORT=3000
ports:
- "3003:3000"
Dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
RUN npm prune --omit=dev
EXPOSE 3000
CMD ["npm", "start"]
Complete Python Implementation
Project Structure
python-raft/
├── requirements.txt
├── src/
│ ├── types.py # Shared types
│ ├── state_machine.py # KV store state machine
│ ├── raft_node.py # Complete Raft implementation
│ ├── server.py # Flask API server
│ └── __init__.py
├── app.py # Entry point
└── docker-compose.yml
requirements.txt
flask==3.0.0
requests==2.31.0
gunicorn==21.2.0
types.py
from enum import Enum
from dataclasses import dataclass
from typing import List, Optional, Union
class NodeState(Enum):
FOLLOWER = "follower"
CANDIDATE = "candidate"
LEADER = "leader"
@dataclass
class LogEntry:
index: int
term: int
command: str
@dataclass
class RequestVoteRequest:
term: int
candidate_id: str
last_log_index: int
last_log_term: int
@dataclass
class RequestVoteResponse:
term: int
vote_granted: bool
@dataclass
class AppendEntriesRequest:
term: int
leader_id: str
prev_log_index: int
prev_log_term: int
entries: List[LogEntry]
leader_commit: int
@dataclass
class AppendEntriesResponse:
term: int
success: bool
@dataclass
class SetCommand:
type: str = 'SET'
key: str = ''
value: str = ''
@dataclass
class GetCommand:
type: str = 'GET'
key: str = ''
@dataclass
class DeleteCommand:
type: str = 'DELETE'
key: str = ''
Command = Union[SetCommand, GetCommand, DeleteCommand]
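These command dataclasses travel through the log as JSON strings. A small sketch of the round trip that submit_command and the state machine perform:
# How a client command becomes (and is recovered from) a log entry payload
import json

cmd = SetCommand(key="x", value="5")
entry = LogEntry(index=1, term=1, command=json.dumps(cmd.__dict__))
print(json.loads(entry.command))  # {'type': 'SET', 'key': 'x', 'value': '5'}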
state_machine.py
from typing import Dict, Optional
import json
from .types import LogEntry
class KVStoreStateMachine:
"""Key-Value Store State Machine"""
def __init__(self):
self.data: Dict[str, str] = {}
def apply(self, entry: LogEntry) -> None:
"""Apply a committed log entry to the state machine"""
try:
command = json.loads(entry.command)
if command['type'] == 'SET':
self.data[command['key']] = command['value']
print(f"[State Machine] SET {command['key']} = {command['value']}")
elif command['type'] == 'DELETE':
if command['key'] in self.data:
del self.data[command['key']]
print(f"[State Machine] DELETE {command['key']}")
elif command['type'] == 'GET':
# Read-only, no state change
pass
except Exception as e:
print(f"[State Machine] Failed to apply entry: {e}")
def get(self, key: str) -> Optional[str]:
"""Get a value from the state machine"""
return self.data.get(key)
def get_all(self) -> Dict[str, str]:
"""Get all key-value pairs"""
return dict(self.data)
def clear(self) -> None:
"""Clear the state machine (for testing)"""
self.data.clear()
raft_node.py
import asyncio
import random
import json
import time
from typing import Any, Dict, List, Optional
from .types import (
NodeState, LogEntry, RequestVoteRequest, RequestVoteResponse,
AppendEntriesRequest, AppendEntriesResponse, Command
)
from .state_machine import KVStoreStateMachine
import requests
class ClusterConfig:
nodeId: str
peer_ids: List[str]
election_timeout_min: int
election_timeout_max: int
heartbeat_interval: int
def __init__(self, node_id: str, peer_ids: List[str],
election_timeout_min: int = 150,
election_timeout_max: int = 300,
heartbeat_interval: int = 50):
self.nodeId = node_id
self.peer_ids = peer_ids
self.election_timeout_min = election_timeout_min
self.election_timeout_max = election_timeout_max
self.heartbeat_interval = heartbeat_interval
class RaftNode:
def __init__(self, config: ClusterConfig):
self.config = config
self.state_machine = KVStoreStateMachine()
# Persistent state
self.current_term = 0
self.voted_for: Optional[str] = None
self.log: List[LogEntry] = []
# Volatile state
self.commit_index = 0
self.last_applied = 0
self.state = NodeState.FOLLOWER
self.leader_id: Optional[str] = None
# Leader state
self.next_index: Dict[str, int] = {}
self.match_index: Dict[str, int] = {}
        # Timers
        self.last_heartbeat = time.monotonic()  # stamped on valid AppendEntries / granted votes
        self.election_task: Optional[asyncio.Task] = None
        self.heartbeat_task: Optional[asyncio.Task] = None
# ========== Public API ==========
    async def submit_command(self, command: Command) -> Any:
"""Client: Submit a command to the cluster"""
# Redirect to leader if not leader
if self.state != NodeState.LEADER:
if self.leader_id:
raise Exception(f"Not a leader. Please redirect to {self.leader_id}")
raise Exception("No leader known. Please retry.")
# Handle GET commands (read-only)
if command.type == 'GET':
return self.state_machine.get(command.key)
# Append to local log
entry = LogEntry(
index=len(self.log) + 1,
term=self.current_term,
command=json.dumps(command.__dict__)
)
self.log.append(entry)
# Replicate to followers
await self.replicate_log()
# Wait for commit
await self._wait_for_commit(entry.index)
# Return result
if command.type == 'SET':
return {"key": command.key, "value": command.value}
elif command.type == 'DELETE':
return {"key": command.key, "deleted": True}
    def start(self):
        """Start the node (requires a running event loop)"""
        self.election_task = asyncio.create_task(self._election_loop())
def stop(self):
"""Stop the node"""
if self.election_task:
self.election_task.cancel()
if self.heartbeat_task:
self.heartbeat_task.cancel()
# ========== RPC Handlers ==========
def handle_request_vote(self, req: RequestVoteRequest) -> RequestVoteResponse:
"""Handle RequestVote RPC"""
if req.term < self.current_term:
return RequestVoteResponse(term=self.current_term, vote_granted=False)
if req.term > self.current_term:
self.current_term = req.term
self.state = NodeState.FOLLOWER
self.voted_for = None
log_ok = (req.last_log_term > self._get_last_log_term() or
(req.last_log_term == self._get_last_log_term() and
req.last_log_index >= len(self.log)))
can_vote = self.voted_for is None or self.voted_for == req.candidate_id
        if can_vote and log_ok:
            self.voted_for = req.candidate_id
            self.last_heartbeat = time.monotonic()  # granting a vote defers our own election
            return RequestVoteResponse(term=self.current_term, vote_granted=True)
return RequestVoteResponse(term=self.current_term, vote_granted=False)
def handle_append_entries(self, req: AppendEntriesRequest) -> AppendEntriesResponse:
"""Handle AppendEntries RPC"""
if req.term < self.current_term:
return AppendEntriesResponse(term=self.current_term, success=False)
        # Recognize leader and reset the election timeout window
        self.leader_id = req.leader_id
        self.last_heartbeat = time.monotonic()
if req.term > self.current_term:
self.current_term = req.term
self.state = NodeState.FOLLOWER
self.voted_for = None
# Check log consistency
if req.prev_log_index > 0:
if len(self.log) < req.prev_log_index:
return AppendEntriesResponse(term=self.current_term, success=False)
prev_entry = self.log[req.prev_log_index - 1]
if prev_entry.term != req.prev_log_term:
return AppendEntriesResponse(term=self.current_term, success=False)
# Append new entries
if req.entries:
insert_index = req.prev_log_index
for entry in req.entries:
if insert_index < len(self.log):
existing = self.log[insert_index]
if existing.index == entry.index and existing.term == entry.term:
insert_index += 1
continue
self.log = self.log[:insert_index]
self.log.append(entry)
insert_index += 1
# Update commit index
if req.leader_commit > self.commit_index:
self.commit_index = min(req.leader_commit, len(self.log))
self._apply_committed_entries()
return AppendEntriesResponse(term=self.current_term, success=True)
# ========== Private Methods ==========
    async def _election_loop(self):
        """Election timeout loop"""
        while True:
            timeout = self._random_timeout() / 1000
            await asyncio.sleep(timeout)
            # Start an election only if we are not the leader AND we have not
            # heard from a leader (or granted a vote) within the timeout window
            if self.state != NodeState.LEADER and time.monotonic() - self.last_heartbeat >= timeout:
                await self._start_election()
async def _start_election(self):
"""Start election (convert to candidate)"""
self.state = NodeState.CANDIDATE
self.current_term += 1
self.voted_for = self.config.nodeId
self.leader_id = None
print(f"[Node {self.config.nodeId}] Starting election for term {self.current_term}")
req = RequestVoteRequest(
term=self.current_term,
candidate_id=self.config.nodeId,
last_log_index=len(self.log),
last_log_term=self._get_last_log_term()
)
votes_received = 1 # Vote for self
        majority = (len(self.config.peer_ids) + 1) // 2 + 1  # cluster size = peers + self
tasks = []
for peer_id in self.config.peer_ids:
tasks.append(self._send_request_vote(peer_id, req))
results = await asyncio.gather(*tasks, return_exceptions=True)
for result in results:
if isinstance(result, RequestVoteResponse):
if result.vote_granted:
votes_received += 1
if votes_received >= majority and self.state == NodeState.CANDIDATE:
self._become_leader()
elif result.term > self.current_term:
self.current_term = result.term
self.state = NodeState.FOLLOWER
self.voted_for = None
def _become_leader(self):
"""Become leader after winning election"""
self.state = NodeState.LEADER
self.leader_id = self.config.nodeId
print(f"[Node {self.config.nodeId}] Became leader for term {self.current_term}")
# Initialize leader state
for peer_id in self.config.peer_ids:
self.next_index[peer_id] = len(self.log) + 1
self.match_index[peer_id] = 0
        # Start heartbeats (keep a handle so stop() can cancel the task)
        self.heartbeat_task = asyncio.create_task(self._heartbeat_loop())
async def _heartbeat_loop(self):
"""Send heartbeats to followers"""
while self.state == NodeState.LEADER:
await self.replicate_log()
await asyncio.sleep(self.config.heartbeat_interval / 1000)
async def replicate_log(self):
"""Replicate log to followers"""
if self.state != NodeState.LEADER:
return
        tasks = []
        meta = []  # (follower_id, prev_log_index, entry_count) captured per request
        for follower_id in self.config.peer_ids:
            next_idx = self.next_index.get(follower_id, 1)
            prev_log_index = next_idx - 1
            prev_log_term = self.log[prev_log_index - 1].term if prev_log_index > 0 else 0
            entries = self.log[next_idx - 1:]
            req = AppendEntriesRequest(
                term=self.current_term,
                leader_id=self.config.nodeId,
                prev_log_index=prev_log_index,
                prev_log_term=prev_log_term,
                entries=entries,
                leader_commit=self.commit_index
            )
            tasks.append(self._send_append_entries(follower_id, req))
            meta.append((follower_id, prev_log_index, len(entries)))
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for (follower_id, prev_idx, n_entries), result in zip(meta, results):
            if isinstance(result, AppendEntriesResponse):
                if result.term > self.current_term:
                    self.current_term = result.term
                    self.state = NodeState.FOLLOWER
                    self.voted_for = None
                    return
                if result.success:
                    # Use the indices captured when the request was built;
                    # the log may have grown while the RPC was in flight
                    last_index = prev_idx + n_entries
                    self.match_index[follower_id] = last_index
                    self.next_index[follower_id] = last_index + 1
                    await self._update_commit_index()
                else:
                    current_next = self.next_index.get(follower_id, 1)
                    self.next_index[follower_id] = max(1, current_next - 1)
async def _update_commit_index(self):
"""Update commit index if majority has entry"""
if self.state != NodeState.LEADER:
return
N = len(self.log)
        majority = (len(self.config.peer_ids) + 1) // 2 + 1  # cluster size = peers + self
for i in range(N, self.commit_index, -1):
if self.log[i - 1].term != self.current_term:
continue
count = 1 # Leader has it
for match_idx in self.match_index.values():
if match_idx >= i:
count += 1
if count >= majority:
self.commit_index = i
self._apply_committed_entries()
break
def _apply_committed_entries(self):
"""Apply committed entries to state machine"""
while self.last_applied < self.commit_index:
self.last_applied += 1
entry = self.log[self.last_applied - 1]
self.state_machine.apply(entry)
async def _wait_for_commit(self, index: int):
"""Wait for an entry to be committed"""
while self.commit_index < index:
await asyncio.sleep(0.05)
def _random_timeout(self) -> int:
"""Generate random election timeout"""
return random.randint(
self.config.election_timeout_min,
self.config.election_timeout_max
)
def _get_last_log_term(self) -> int:
"""Get the term of the last log entry"""
if not self.log:
return 0
return self.log[-1].term
# ========== Network Layer ==========
    async def _send_request_vote(self, peer_id: str, req: RequestVoteRequest) -> RequestVoteResponse:
        """Send RequestVote RPC to peer (camelCase keys match the wire format)"""
        url = f"http://{peer_id}/raft/request-vote"
        data = {
            'term': req.term,
            'candidateId': req.candidate_id,
            'lastLogIndex': req.last_log_index,
            'lastLogTerm': req.last_log_term
        }
        try:
            response = requests.post(url, json=data, timeout=1)
            return RequestVoteResponse(**response.json())
        except requests.RequestException:
            return RequestVoteResponse(term=self.current_term, vote_granted=False)
async def _send_append_entries(self, peer_id: str, req: AppendEntriesRequest) -> AppendEntriesResponse:
"""Send AppendEntries RPC to peer"""
url = f"http://{peer_id}/raft/append-entries"
try:
data = {
'term': req.term,
'leaderId': req.leader_id,
'prevLogIndex': req.prev_log_index,
'prevLogTerm': req.prev_log_term,
'entries': [e.__dict__ for e in req.entries],
'leaderCommit': req.leader_commit
}
response = requests.post(url, json=data, timeout=1)
return AppendEntriesResponse(**response.json())
        except requests.RequestException:
            return AppendEntriesResponse(term=self.current_term, success=False)
# ========== Debug Methods ==========
def get_state(self) -> dict:
"""Get node state for debugging"""
return {
'nodeId': self.config.nodeId,
'state': self.state.value,
'term': self.current_term,
'leaderId': self.leader_id,
'logLength': len(self.log),
'commitIndex': self.commit_index,
'stateMachine': self.state_machine.get_all()
}
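The "at least as up-to-date" comparison in handle_request_vote is worth internalizing: it is a lexicographic comparison of (last_log_term, last_log_index). A standalone sketch of the same rule, with made-up example values:
# The voting rule: compare (last_log_term, last_log_index) lexicographically
def log_ok(cand_term: int, cand_index: int, my_term: int, my_index: int) -> bool:
    return cand_term > my_term or (cand_term == my_term and cand_index >= my_index)

assert log_ok(3, 5, 2, 9)       # higher last term wins despite a shorter log
assert not log_ok(2, 9, 3, 5)   # lower last term loses despite a longer log
assert log_ok(2, 5, 2, 5)       # identical logs count as up-to-date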
server.py
import asyncio
from flask import Flask, request, jsonify
from .raft_node import RaftNode
from .types import (
    RequestVoteRequest, AppendEntriesRequest, LogEntry,
    SetCommand, GetCommand, DeleteCommand
)
def create_server(node: RaftNode):
    app = Flask(__name__)
    # Raft RPC endpoints
    @app.route('/raft/request-vote', methods=['POST'])
    def request_vote():
        # Field names arrive in camelCase, matching the wire format
        data = request.json
        req = RequestVoteRequest(
            term=data['term'],
            candidate_id=data['candidateId'],
            last_log_index=data['lastLogIndex'],
            last_log_term=data['lastLogTerm']
        )
        response = node.handle_request_vote(req)
        return jsonify(response.__dict__)
@app.route('/raft/append-entries', methods=['POST'])
def append_entries():
# Convert request to proper format
data = request.json
entries = [LogEntry(**e) for e in data.get('entries', [])]
req = AppendEntriesRequest(
term=data['term'],
leader_id=data['leaderId'],
prev_log_index=data['prevLogIndex'],
prev_log_term=data['prevLogTerm'],
entries=entries,
leader_commit=data['leaderCommit']
)
response = node.handle_append_entries(req)
return jsonify(response.__dict__)
# Client API endpoints
@app.route('/kv/<key>', methods=['GET'])
def get_key(key):
command = GetCommand(key=key)
try:
value = asyncio.run(node.submit_command(command))
return jsonify({'key': key, 'value': value})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/kv', methods=['POST'])
def set_key():
command = SetCommand(key=request.json['key'], value=request.json['value'])
try:
result = asyncio.run(node.submit_command(command))
return jsonify(result)
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/kv/<key>', methods=['DELETE'])
def delete_key(key):
command = DeleteCommand(key=key)
try:
result = asyncio.run(node.submit_command(command))
return jsonify(result)
except Exception as e:
return jsonify({'error': str(e)}), 500
# Debug endpoint
@app.route('/debug', methods=['GET'])
def debug():
return jsonify(node.get_state())
return app
app.py
import asyncio
import os
import threading
from src.raft_node import RaftNode, ClusterConfig
from src.server import create_server
NODE_ID = os.getenv('NODE_ID', 'node1')
PEER_IDS = os.getenv('PEER_IDS', '').split(',') if os.getenv('PEER_IDS') else []
PORT = int(os.getenv('PORT', '5000'))
config = ClusterConfig(
node_id=NODE_ID,
peer_ids=PEER_IDS
)
node = RaftNode(config)
# node.start() creates asyncio tasks, so it must run inside a running event
# loop. Flask is synchronous, so run the node's loop in a daemon thread
# (a simplification for this course).
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()
loop.call_soon_threadsafe(node.start)
app = create_server(node)
if __name__ == '__main__':
app.run(host='0.0.0.0', port=PORT)
docker-compose.yml (Python)
version: '3.8'
services:
node1:
build: .
container_name: python-raft-node1
environment:
- NODE_ID=node1
- PEER_IDS=node2:5000,node3:5000
- PORT=5000
ports:
- "5001:5000"
node2:
build: .
container_name: python-raft-node2
environment:
- NODE_ID=node2
- PEER_IDS=node1:5000,node3:5000
- PORT=5000
ports:
- "5002:5000"
node3:
build: .
container_name: python-raft-node3
environment:
- NODE_ID=node3
- PEER_IDS=node1:5000,node2:5000
- PORT=5000
ports:
- "5003:5000"
Dockerfile (Python)
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["gunicorn", "-b", "0.0.0.0:5000", "app:app"]
Running the System
TypeScript
# Build
npm run build
# Run with Docker Compose
docker-compose up
# Test the cluster
curl -X POST http://localhost:3001/kv -H "Content-Type: application/json" -d '{"key":"foo","value":"bar"}'
curl http://localhost:3001/kv/foo
curl http://localhost:3002/debug # Check node state
Python
# Run with Docker Compose
docker-compose up
# Test the cluster
curl -X POST http://localhost:5001/kv -H "Content-Type: application/json" -d '{"key":"foo","value":"bar"}'
curl http://localhost:5001/kv/foo
curl http://localhost:5002/debug # Check node state
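Since only the leader accepts writes, clients need to discover it. A hedged sketch of a tiny Python client that simply tries each node until one accepts, assuming the /kv API, error responses, and ports 5001-5003 shown above:
# Hypothetical client helper: try each node until one accepts the write
import requests

NODES = ["http://localhost:5001", "http://localhost:5002", "http://localhost:5003"]

def set_key(key: str, value: str) -> dict:
    last_error = None
    for base in NODES:
        resp = requests.post(f"{base}/kv", json={"key": key, "value": value}, timeout=2)
        if resp.ok:
            return resp.json()                      # this node was the leader
        last_error = resp.json().get("error")       # "Not a leader..." -> try next
    raise RuntimeError(f"No leader accepted the write: {last_error}")

print(set_key("foo", "bar"))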
Exercises
Exercise 1: Basic Operations
- Start the 3-node cluster
- Wait for leader election
- SET key=value on the leader
- GET the key from all nodes
- Verify all nodes return the same value
Expected Result: All nodes return the committed value.
Exercise 2: Leader Failover
- Start the cluster and write some data
- Kill the leader container
- Observe a new leader being elected
- Continue writing data
- Restart the old leader
- Verify it catches up
Expected Result: System continues operating with new leader, old leader rejoins as follower.
Exercise 3: Network Partition
- Start a 5-node cluster
- Isolate 2 nodes (simulate partition)
- Verify majority (3 nodes) can still commit
- Heal the partition
- Verify isolated nodes catch up
Expected Result: Majority side continues, minority cannot commit, rejoin works.
Exercise 4: Persistence Test
- Write data to the cluster
- Stop all nodes
- Restart all nodes
- Verify data is recovered
Expected Result: All data survives restart.
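Note that the implementations in this chapter keep their "persistent" state in memory only, so this exercise requires you to add real persistence first. A minimal sketch under assumed names (save on every change to term, vote, or log; load before start(); file path is hypothetical):
# Hypothetical persistence helpers for Exercise 4; not part of the code above
import json
import os

def save_state(node: RaftNode, path: str = "raft-state.json") -> None:
    with open(path, "w") as f:
        json.dump({
            "current_term": node.current_term,
            "voted_for": node.voted_for,
            "log": [e.__dict__ for e in node.log],
        }, f)

def load_state(node: RaftNode, path: str = "raft-state.json") -> None:
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)
        node.current_term = data["current_term"]
        node.voted_for = data["voted_for"]
        node.log = [LogEntry(**e) for e in data["log"]]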
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Reading from followers | Stale reads | Always read from leader or implement lease reads |
| No heartbeats | Unnecessary elections | Ensure heartbeat timer runs continuously |
| Client timeout | Failed writes | Wait for commit, don't return immediately |
| Split brain | Multiple leaders | Randomized timeouts + voting rules prevent this |
Key Takeaways
- Complete Raft combines leader election + log replication for consensus
- State machine applies committed commands deterministically
- Client API provides transparent access to the distributed system
- Failover is automatic - new leader elected when old one fails
- Safety guarantees ensure no conflicting commits
Congratulations! You've completed the Consensus System. You now understand one of the hardest concepts in distributed systems!
Next: Reference Materials →
🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
Docker Setup
This guide covers installing Docker and Docker Compose for running the course examples.
Installing Docker
Linux
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
macOS
Download Docker Desktop from docker.com
Windows
Download Docker Desktop from docker.com
Verify Installation
docker --version
docker-compose --version
Running Course Examples
Each chapter includes a Docker Compose file:
cd examples/01-queue
docker-compose up
Common Commands
# Start services
docker-compose up
# Start in background
docker-compose up -d
# View logs
docker-compose logs
# Stop services
docker-compose down
# Rebuild after code changes
docker-compose up --build
Troubleshooting
See Troubleshooting for common issues.
Troubleshooting
Common issues and solutions when working with the course examples.
Docker Issues
Port Already in Use
Error: bind: address already in use
Solution: Change the port in docker-compose.yml or stop the conflicting service.
Permission Denied
Error: permission denied while trying to connect to the Docker daemon
Solution: Add your user to the docker group:
sudo usermod -aG docker $USER
newgrp docker
Build Issues
TypeScript: Module Not Found
Solution: Install dependencies:
npm install
Python: Module Not Found
Solution: Install dependencies:
pip install -r requirements.txt
Runtime Issues
Connection Refused
Solution: Check that all services are running:
docker-compose ps
Node Can't Connect to Peers
Solution: Verify network configuration in docker-compose.yml. Ensure all nodes are on the same network.
Getting Help
If you encounter issues not covered here:
- Check the Docker logs: docker-compose logs
- Verify your Docker installation: docker --version
- See Further Reading for additional resources
Further Reading
Resources for deepening your understanding of distributed systems.
Books
| Title | Author | Focus |
|---|---|---|
| Designing Data-Intensive Applications | Martin Kleppmann | Modern database and distributed system design |
| Distributed Systems: Principles and Paradigms | Tanenbaum & van Steen | Academic foundations |
| Introduction to Reliable Distributed Programming | Cachin, Guerraoui, Rodrigues | Formal foundations |
Papers
Foundational
- Brewer, E. A. (2000). "Towards robust distributed systems"
- Gilbert, S. & Lynch, N. (2002). "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services"
- Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). "Impossibility of distributed consensus with one faulty process"
Consensus
- Ongaro, D. & Ousterhout, J. (2014). "In Search of an Understandable Consensus Algorithm (Raft)"
- Lamport, L. (2001). "Paxos Made Simple"
Online Resources
- The Raft Consensus Algorithm
- Jepsen: Distributed Systems Safety Analysis
- Distributed Systems Reading List
Video Lectures
- MIT 6.824: Distributed Systems
- Stanford CS244B: Distributed Systems
Practice
- Build your own distributed system from scratch
- Contribute to open-source distributed databases
- Participate in distributed systems hackathons