# Data Partitioning
Session 3, Part 1 - 25 minutes
## Learning Objectives
- Understand what data partitioning (sharding) is
- Compare hash-based vs range-based partitioning
- Learn how partitioning affects query performance
- Recognize the trade-offs of different partitioning strategies
## What is Partitioning?
Data partitioning (also called sharding) is the process of splitting your data across multiple nodes based on a partitioning key. Each node holds a subset of the total data.
```mermaid
graph TB
    subgraph "Application View"
        App["Your Application"]
        Data[("All Data")]
        App --> Data
    end
    subgraph "Reality: Partitioned Storage"
        Node1["Node 1<br/>Keys: user_1<br/>user_4<br/>user_7"]
        Node2["Node 2<br/>Keys: user_2<br/>user_5<br/>user_8"]
        Node3["Node 3<br/>Keys: user_3<br/>user_6<br/>user_9"]
    end
    App -->|"read/write"| Node1
    App -->|"read/write"| Node2
    App -->|"read/write"| Node3
    style Node1 fill:#e1f5fe
    style Node2 fill:#e1f5fe
    style Node3 fill:#e1f5fe
```
## Why Partition Data?
| Benefit | Description |
|---|---|
| Scalability | Store more data than fits on one machine |
| Performance | Distribute load across multiple nodes |
| Availability | A failed node makes only its partition unavailable, not the entire dataset |
## The Partitioning Challenge
The key question is: How do we decide which data goes on which node?
```mermaid
graph LR
    Key["user:12345"] --> Router{Partitioning<br/>Function}
    Router -->|"hash(key) % N"| N1[Node 1]
    Router --> N2[Node 2]
    Router --> N3[Node 3]
    style Router fill:#ff9,stroke:#333,stroke-width:3px
```
## Partitioning Strategies

### 1. Hash-Based Partitioning
Apply a hash function to the key, then take the result modulo the number of nodes:

```
node = hash(key) % number_of_nodes
```
```mermaid
graph TB
    subgraph "Hash-Based Partitioning (3 nodes)"
        Key1["user:alice"] --> H1["hash() % 3"]
        Key2["user:bob"] --> H2["hash() % 3"]
        Key3["user:carol"] --> H3["hash() % 3"]
        H1 -->|"= 1"| N1[Node 1]
        H2 -->|"= 2"| N2[Node 2]
        H3 -->|"= 0"| N0[Node 0]
        style N1 fill:#c8e6c9
        style N2 fill:#c8e6c9
        style N0 fill:#c8e6c9
    end
```
TypeScript Example:
```typescript
function getNode(key: string, totalNodes: number): number {
  // Simple 32-bit string hash (31 * hash + charCode for each character)
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash << 5) - hash + key.charCodeAt(i);
    hash = hash | 0; // Keep the value a 32-bit integer
  }
  return Math.abs(hash) % totalNodes;
}

// Examples: each key maps deterministically to one of nodes 0, 1, or 2
console.log(getNode('user:alice', 3));
console.log(getNode('user:bob', 3));
console.log(getNode('user:carol', 3));
```
Python Example:
```python
import hashlib

def get_node(key: str, total_nodes: int) -> int:
    """Determine which node should store this key."""
    # Use a stable hash: Python's built-in hash() is randomized per process
    # for strings, so it cannot be used to route keys to nodes.
    hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return hash_value % total_nodes

# Examples: each key maps deterministically to one of nodes 0, 1, or 2
print(get_node('user:alice', 3))
print(get_node('user:bob', 3))
print(get_node('user:carol', 3))
```
Advantages:
- ✅ Even data distribution
- ✅ Simple to implement
- ✅ No hotspots (assuming a good hash function)
Disadvantages:
- ❌ Cannot do efficient range queries
- ❌ Rebalancing is expensive when adding/removing nodes
### 2. Range-Based Partitioning
Assign key ranges to each node:
```mermaid
graph TB
    subgraph "Range-Based Partitioning (3 nodes)"
        R1["Node 1<br/>a-m"]
        R2["Node 2<br/>n-s"]
        R3["Node 3<br/>t-z"]
        Key1["alice"] --> R1
        Key2["bob"] --> R1
        Key3["nancy"] --> R2
        Key4["steve"] --> R2
        Key5["tom"] --> R3
        Key6["zoe"] --> R3
        style R1 fill:#c8e6c9
        style R2 fill:#c8e6c9
        style R3 fill:#c8e6c9
    end
```
TypeScript Example:
```typescript
interface Range {
  start: string;
  end: string;
  node: number;
}

// Ranges cover first letters a-m, n-s, and t-z
const ranges: Range[] = [
  { start: 'a', end: 'm', node: 1 },
  { start: 'n', end: 's', node: 2 },
  { start: 't', end: 'z', node: 3 }
];

function getNodeByRange(key: string): number {
  // Compare the key's first character: a whole-key comparison would reject
  // keys like 'zoe', which sorts after the single letter 'z'.
  const first = key[0];
  for (const range of ranges) {
    if (first >= range.start && first <= range.end) {
      return range.node;
    }
  }
  throw new Error(`No range found for key: ${key}`);
}

// Examples
console.log(getNodeByRange('alice')); // => 1
console.log(getNodeByRange('nancy')); // => 2
console.log(getNodeByRange('tom'));   // => 3
```
Python Example:
```python
from typing import List, Tuple

# Ranges cover first letters a-m, n-s, and t-z
ranges: List[Tuple[str, str, int]] = [
    ('a', 'm', 1),
    ('n', 's', 2),
    ('t', 'z', 3),
]

def get_node_by_range(key: str) -> int:
    """Determine which node holds this key, based on its first letter."""
    first = key[0]  # compare the first character so 'zoe' still falls in t-z
    for start, end, node in ranges:
        if start <= first <= end:
            return node
    raise ValueError(f"No range found for key: {key}")

# Examples
print(get_node_by_range('alice'))  # => 1
print(get_node_by_range('nancy'))  # => 2
print(get_node_by_range('tom'))    # => 3
```
Advantages:
- ✅ Efficient range queries
- ✅ Can optimize for data access patterns
Disadvantages:
- ❌ Risk of uneven distribution (hotspots)
- ❌ Complex to load balance
## The Rebalancing Problem
What happens when you add or remove nodes?
```mermaid
stateDiagram-v2
    [*] --> Stable: 3 Nodes
    Stable --> Rebalancing: Add Node 4
    Rebalancing --> Stable: Move 25% of data
    Stable --> Rebalancing: Remove Node 2
    Rebalancing --> Stable: Redistribute data
```
### Simple Modulo Hashing Problem
With `hash(key) % N`, changing N from 3 to 4 moves most keys to different nodes (for illustration, assume `hash('user:k') = k`):
| Key | hash % 3 | hash % 4 | Moved? |
|---|---|---|---|
| user:1 | 1 | 1 | ❌ |
| user:2 | 2 | 2 | ❌ |
| user:3 | 0 | 3 | ✅ |
| user:4 | 1 | 0 | ✅ |
| user:5 | 2 | 1 | ✅ |
| user:6 | 0 | 2 | ✅ |
Four of the six keys above moved (67%), and with more keys the fraction approaches 75%: a key stays put only when `hash % 3 == hash % 4`, i.e. when `hash % 12` is 0, 1, or 2.
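A quick way to see this is to re-assign a batch of synthetic keys under both node counts. This is a minimal sketch, not part of the lesson's examples; the key names and the MD5-based hash are arbitrary choices:

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic hash so the result is reproducible across runs."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user:{i}" for i in range(10_000)]

# A key stays on the same node only when hash % 3 == hash % 4,
# i.e. when hash % 12 is 0, 1, or 2 -- roughly 1 key in 4.
moved = sum(1 for k in keys if stable_hash(k) % 3 != stable_hash(k) % 4)
print(f"{moved / len(keys):.0%} of keys moved")  # prints roughly 75%
```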
### Consistent Hashing (Advanced)
A technique to minimize data movement when nodes change:
```mermaid
graph TB
    subgraph "Hash Ring"
        Ring["Virtual Ring (0 - 2^32)"]
        N1["Node 1<br/>position: 100"]
        N2["Node 2<br/>position: 500"]
        N3["Node 3<br/>position: 900"]
        K1["Key A<br/>hash: 150"]
        K2["Key B<br/>hash: 600"]
        K3["Key C<br/>hash: 950"]
    end
    Ring --> N1
    Ring --> N2
    Ring --> N3
    K1 -->|"clockwise"| N2
    K2 -->|"clockwise"| N3
    K3 -->|"clockwise"| N1
    style Ring fill:#f9f,stroke:#333,stroke-width:2px
```
Key Idea: Each key is assigned to the first node clockwise from its hash position.
When adding/removing a node, only keys in that node's range move.
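Below is a minimal sketch of that idea using a sorted list as the ring. The `ConsistentHashRing` class, the MD5-based `stable_hash`, and the 2^32 ring size are illustrative assumptions, not a reference implementation:

```python
import bisect
import hashlib
from typing import List, Tuple

RING_SIZE = 2 ** 32  # positions on the ring, as in the diagram above

def stable_hash(value: str) -> int:
    """Deterministic hash mapped onto the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class ConsistentHashRing:
    def __init__(self, nodes: List[str]) -> None:
        # Each node sits on the ring at the hash of its name.
        self._ring: List[Tuple[int, str]] = sorted(
            (stable_hash(name), name) for name in nodes
        )

    def get_node(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        position = stable_hash(key)
        index = bisect.bisect_left(self._ring, (position, ""))
        return self._ring[index % len(self._ring)][1]  # wrap past the last node

    def add_node(self, name: str) -> None:
        # Only keys between the new node and its predecessor move to it.
        bisect.insort(self._ring, (stable_hash(name), name))

# Usage: adding a fourth node re-maps only the keys that now land on it.
ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
before = {f"user:{i}": ring.get_node(f"user:{i}") for i in range(1_000)}
ring.add_node("node-4")
moved = sum(1 for key, node in before.items() if ring.get_node(key) != node)
print(f"{moved / len(before):.0%} of keys moved")
```

Production systems typically place many virtual nodes per physical node on the ring, so the share of data each node owns, and the amount that moves on a membership change, stays close to the ideal 1/N.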
## Query Patterns and Partitioning
Your query patterns should influence your partitioning strategy:
### Common Query Patterns
| Query Type | Best Partitioning | Example |
|---|---|---|
| Key-value lookups | Hash-based | Get user by ID |
| Range scans | Range-based | Users registered last week |
| Multi-key access | Composite hash | Orders by customer |
| Geographic queries | Location-based | Nearby restaurants |
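The "composite hash" row is worth a concrete sketch. The snippet below is an illustration rather than any particular database's API (the partition count, helper names, and key layout are all assumptions): orders are hashed by customer_id so one customer's data stays together, while created_at acts as a sort key within the partition so a per-customer date range hits a single node.

```python
import hashlib
from typing import Tuple

NUM_PARTITIONS = 8  # fixed partition count, chosen only for this sketch

def partition_for(customer_id: str) -> int:
    """Hash part: all orders for one customer land on the same partition."""
    digest = int(hashlib.md5(customer_id.encode()).hexdigest(), 16)
    return digest % NUM_PARTITIONS

def composite_key(customer_id: str, created_at: str) -> Tuple[int, str, str]:
    """Range part: within the partition, orders sort by timestamp."""
    return (partition_for(customer_id), customer_id, created_at)

# "Orders for cust-42 in March" touches a single partition, and the
# ISO-8601 timestamps sort lexicographically for the range scan.
print(composite_key("cust-42", "2024-03-05T10:00:00Z"))
print(composite_key("cust-42", "2024-03-20T18:30:00Z"))
```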
### Example: User Data Partitioning
```mermaid
graph TB
    subgraph "Application: Social Network"
        Query1["Get User Profile<br/>SELECT * FROM users WHERE id = ?"]
        Query2["List Friends<br/>SELECT * FROM friends WHERE user_id = ?"]
        Query3["Timeline Posts<br/>SELECT * FROM posts WHERE created_at > ?"]
    end
    subgraph "Partitioning Decision"
        Query1 -->|"hash(user_id)"| Hash[Hash-Based]
        Query2 -->|"hash(user_id)"| Hash
        Query3 -->|"range(created_at)"| Range[Range-Based]
    end
    subgraph "Result"
        Hash --> H["User data & friends<br/>partitioned by user_id"]
        Range --> R["Posts partitioned<br/>by date range"]
    end
```
## Trade-offs Summary
| Strategy | Distribution | Range Queries | Rebalancing | Complexity |
|---|---|---|---|---|
| Hash-based | Even | Poor | Expensive | Low |
| Range-based | Potentially uneven | Excellent | Moderate | Medium |
| Consistent hashing | Even | Poor | Minimal | High |
## Real-World Examples
| System | Partitioning Strategy | Notes |
|---|---|---|
| Redis Cluster | Hash slots (16384 slots) | CRC16(key) mod 16384; explicitly not consistent hashing |
| Cassandra | Token-aware (hash ring) | Configurable partitioner |
| MongoDB | Shard key (ranged or hashed) | Range-based by default; hashed sharding also available |
| DynamoDB | Hash + range (composite) | Supports composite keys |
| PostgreSQL | No built-in sharding | Native table partitioning is single-node; extensions like Citus shard across nodes |
## Summary

### Key Takeaways
- Partitioning splits data across multiple nodes for scalability
- Hash-based gives even distribution but poor range queries
- Range-based enables range scans but can create hotspots
- Rebalancing is a key challenge when nodes change
- Query patterns should drive your partitioning strategy
### Check Your Understanding
- Why is hash-based partitioning better for even distribution?
- When would you choose range-based over hash-based?
- What happens to data placement when you add a new node with simple modulo hashing?
- How does consistent hashing minimize data movement?
### 🧠 Chapter Quiz
Test your mastery of these concepts! These questions will challenge your understanding and reveal any gaps in your knowledge.
## What's Next
Now that we understand how to partition data, let's explore the fundamental trade-offs in distributed data systems: CAP Theorem