Danube Load Manager & Rebalancing Architecture
Danube's Load Manager is a distributed system that ensures optimal topic placement across broker clusters. It combines intelligent topic assignment for new topics with automated rebalancing to maintain cluster health as workloads evolve. This creates a self-optimizing messaging infrastructure that adapts to changing conditions without manual intervention.
This page explains what the Load Manager is, why it's essential for production Danube clusters, how it assigns topics to brokers, and how automated rebalancing keeps your cluster balanced over time.
What is the Load Manager?
The Load Manager is a critical subsystem in Danube that handles the distribution of topics across brokers in a cluster. Think of it as the "traffic controller" that decides which broker should handle which topic, ensuring that no single broker becomes overwhelmed while others sit idle.
Core Responsibilities
- Topic Assignment - Places new topics on the least loaded broker using intelligent ranking algorithms
- Load Monitoring - Continuously tracks broker resource utilization (CPU, memory, throughput, topic count)
- Cluster Rebalancing - Automatically moves topics between brokers to maintain balanced load distribution
- Failover Management - Reassigns topics from failed brokers to healthy ones
- Resource Optimization - Prevents hotspots and ensures efficient cluster utilization
Why It Matters
Without intelligent load management, clusters suffer from:
- Hotspots - One broker handles most traffic while others are idle
- Resource waste - Underutilized brokers mean wasted infrastructure costs
- Performance degradation - Overloaded brokers cause latency spikes and message backlogs
- Manual toil - Operators spend time manually moving topics to fix imbalances
- Scaling challenges - Hard to add/remove brokers without disrupting service
The Load Manager solves these problems automatically, keeping your cluster healthy 24/7.
High-Level Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ETCD Metadata Store β
β β’ Broker registrations β’ Topic assignments β
β β’ Load reports β’ Rebalancing history β
ββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββ
β
β Watch events
β
ββββββββββββββββββΌββββββββββββββββ ββββββββββββββββββββββ
β Load Manager (Leader) β β Other Brokers β
β β β (Followers) β
β 1. Monitor Load Reports βββββββββββ€ β’ Publish loads β
β 2. Calculate Rankings β β β’ Watch events β
β 3. Assign New Topics β β β’ Execute unloads β
β 4. Detect Imbalance (CV) β ββββββββββββββββββββββ
β 5. Select Topics to Move β
β 6. Execute Rebalancing β
β β
β Components: β
β ββ Rankings Calculator β
β ββ Imbalance Detector β
β ββ Candidate Selector β
β ββ Rebalancing Executor β
βββββββββββββββββββββββββββββββββββ
Flow for New Topic Assignment:
1. Topic created β Added to /unassigned/ path
2. Load Manager (leader) watches unassigned topics
3. Calculates broker rankings (least loaded first)
4. Assigns topic to top-ranked broker
5. Broker receives assignment and loads topic
Flow for Automated Rebalancing:
1. Load Manager checks cluster balance every N minutes
2. Calculates imbalance (coefficient of variation)
3. If CV > threshold β Select topics to move
4. Creates unassigned marker with target broker hint
5. Deletes assignment from overloaded broker
6. Topic moves to underloaded broker
7. Records move in history for rate limiting
Component Roles
Load Manager (Leader-Only)
Centralized coordinator that assigns topics and executes rebalancing. Only the elected cluster leader runs these operations to ensure consistency.
Broker Load Reports
Every broker publishes resource utilization metrics (CPU, memory, disk I/O, network I/O) and per-topic statistics (throughput, connections, backlog) to ETCD every 30 seconds.
Rankings Calculator
Scores all brokers based on their current load using configurable strategies (Fair, Balanced, WeightedLoad). Lower scores mean less loaded.
Imbalance Detector
Calculates statistical measures (coefficient of variation, standard deviation) to determine if cluster needs rebalancing.
Rebalancing Executor
Orchestrates topic moves using graceful unload workflow. Enforces rate limits and cooldowns to prevent disruption.
ETCD Metadata Store
Single source of truth for cluster state. Stores broker registrations, topic assignments, load reports, and rebalancing history.
Core Concepts
Brokers and Broker States
A broker is a server instance running the Danube broker process. Each broker has a unique ID and maintains:
- Topic assignments - Which topics it currently serves
- Resource metrics - CPU, memory, disk I/O, network I/O usage
- Topic metrics - Per-topic throughput, connections, backlog
Broker states:
- Active - Normal operation, can receive new topics
- Draining - Preparing to shut down, topics being unloaded
- Drained - All topics moved, ready for removal
Only active brokers are considered for topic assignment and rebalancing calculations.
Load Reports
Every broker generates a LoadReport every 30 seconds containing:
System-Level Metrics:
- CPU usage percentage (0-100%)
- Memory usage percentage (0-100%)
- Disk I/O throughput (bytes/second)
- Network I/O throughput (bytes/second)
Topic-Level Metrics:
- Message rate (messages/second)
- Byte rate (bytes/second, converted to Mbps)
- Producer count
- Consumer count
- Subscription count
- Backlog messages
Aggregate Metrics:
- Total throughput (Mbps across all topics)
- Total message rate
- Total lag/backlog
These reports are published to ETCD where the Load Manager consumes them for ranking and rebalancing decisions.
Broker Rankings
Rankings are a sorted list of (broker_id, load_score) pairs ordered by load (ascending = less loaded first). The Load Manager maintains these rankings in-memory and updates them whenever new load reports arrive.
Rankings determine:
- Which broker receives the next topic assignment
- Which brokers are overloaded (candidates for offloading)
- Which brokers are underloaded (candidates for receiving topics)
The ranking algorithm is configurable (see Assignment Strategies).
Coefficient of Variation (CV)
The coefficient of variation measures cluster balance:
Interpretation:
- CV < 20% - Excellent balance, loads are very similar
- CV 20-30% - Good balance, acceptable variance
- CV 30-40% - Moderate imbalance, consider rebalancing
- CV > 40% - Significant imbalance, rebalancing recommended
The Load Manager uses CV thresholds to trigger automated rebalancing based on the configured aggressiveness level.
Assignment Strategies
Three strategies control how new topics are assigned:
| Strategy | Algorithm | Best For | Overhead |
|---|---|---|---|
| Fair | Topic count only | Development, testing, predictable placement | Lowest |
| Balanced | Weighted: topic_load (30%) + CPU (35%) + Memory (35%) | General production, mixed workloads | Medium |
| WeightedLoad | Adaptive bottleneck detection | Variable workloads, auto-optimization | Highest |
Fair Strategy:
Simple topic counting. Broker with fewest topics gets the next assignment. Ignores actual resource usage and topic throughput.
Balanced Strategy (RECOMMENDED):
Multi-factor scoring combining weighted topic load (message rate, connections, backlog) with system resources. Most reliable for production clusters.
WeightedLoad Strategy:
Smart algorithm that detects which resource is under most pressure (CPU, memory, throughput, etc.) and prioritizes it in scoring. Adapts automatically to workload patterns.
Rebalancing Aggressiveness
Three aggressiveness levels control automated rebalancing behavior:
| Level | CV Threshold | Check Interval | Max Moves/Hour | Best For |
|---|---|---|---|---|
| Conservative | 40% | 10 minutes | 5 | Stable clusters, minimal disruption |
| Balanced | 30% | 5 minutes | 10 | General production use |
| Aggressive | 20% | 2 minutes | 20 | Fast-changing workloads, test clusters |
Higher aggressiveness means:
- More frequent balance checks
- Lower tolerance for imbalance
- More topic moves per hour
- Faster response to load changes
- Higher metadata store overhead
Topic Blacklists
Administrators can configure topic patterns that should never be rebalanced. This is useful for:
- Critical topics that must stay on specific brokers
- Low-latency topics where moves would cause disruption
- Topics with large backlogs (expensive to move)
- System topics used for coordination
Blacklist patterns support wildcards:
blacklist_topics:
- "/system/*" # All system topics
- "/default/critical-*" # Critical workload topics
- "/analytics/logs" # Specific high-volume topic
How It Works: Topic Assignment
When a new topic is created, it must be assigned to a broker. This process ensures the topic lands on the broker best suited to handle it.
Step-by-Step Flow
1. Topic Creation
A producer creates a topic by sending a message to a topic that doesn't exist yet. The broker creates the topic metadata in ETCD under /topics/{namespace}/{topic}.
2. Unassigned Entry Creation
The broker creates an entry in ETCD at:
This signals to the Load Manager that a topic needs assignment. The value is empty (no marker data).
3. Load Manager Detection
The Load Manager (running on the leader broker) watches the /cluster/unassigned/ path. When a new entry appears, it triggers the assignment workflow.
4. Rankings Calculation
The Load Manager retrieves the latest load reports for all active brokers and calculates rankings using the configured strategy:
Fair Strategy:
Balanced Strategy:
weighted_topic_load = (count Γ 0.2 + throughput Γ 0.3 +
connections Γ 0.3 + backlog Γ 0.2)
score = (weighted_topic_load Γ 0.3) + (CPU Γ 0.35) + (Memory Γ 0.35)
WeightedLoad Strategy:
1. Normalize all metrics (0.0 to 1.0)
2. Find bottleneck (max utilization metric)
3. If bottleneck > 70% β Weight it heavily (50%)
4. Otherwise β Balance evenly (25% each metric)
Brokers are sorted by score (ascending), so the least loaded broker appears first.
5. Broker Selection
The Load Manager selects brokers from the top of the rankings (lowest scores). To prevent all assignments going to one broker when multiple brokers have similar loads, it uses round-robin within threshold:
1. Get minimum score from rankings
2. Find all brokers within 10% of minimum score
3. Round-robin through these candidates
4. Track last selected broker to ensure rotation
This ensures even distribution when brokers have comparable loads.
6. Assignment Creation
The Load Manager writes the assignment to ETCD:
The target broker watches this path and receives the assignment event.
7. Topic Loading
The target broker:
- Creates the topic instance locally
- Initializes subscriptions and storage
- Begins accepting producer connections
- Starts reporting topic metrics in load reports
8. Cleanup
The Load Manager deletes the unassigned entry:
This prevents duplicate assignments.
9. Internal State Update
The Load Manager updates its in-memory broker usage map:
This ensures the next assignment sees the updated load, even before the broker's next load report arrives.
Assignment Visualization
New Topic Created
β
βΌ
βββββββββββββββββββ
β /topics/ns/top β Topic metadata created
βββββββββββββββββββ
β
βΌ
βββββββββββββββββββ
β /unassigned/... β Unassigned marker created
βββββββββββββββββββ
β
β Watch event
βΌ
ββββββββββββββββββββββββββββββββββββββββ
β Load Manager (Leader) β
β β
β 1. Read load reports from ETCD β
β 2. Calculate broker rankings β
β Broker 1: score=20 β least β
β Broker 2: score=25 β
β Broker 3: score=45 β most β
β 3. Select Broker 1 (round-robin) β
ββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββ
β /brokers/1/ns/top β Assignment created
βββββββββββββββββββββββ
β
β Watch event
βΌ
βββββββββββββββββββ
β Broker 1 β Loads topic, starts serving
βββββββββββββββββββ
β
βΌ
βββββββββββββββββββ
β DELETE /unassigned/... β Cleanup
βββββββββββββββββββ
Special Cases
Broker Unload (Manual):
When an administrator unloads a broker, topics create unassigned markers with:
The Load Manager excludes the source broker when selecting the target.
Broker Failure:
When a broker crashes, the leader detects the registration deletion in ETCD and:
- Deletes all assignments from the failed broker
- Creates unassigned markers for all topics (no hint)
- Normal assignment workflow reassigns topics to healthy brokers
All Brokers Equally Loaded:
When all brokers have identical scores, the Load Manager uses stable round-robin rotation to distribute topics evenly.
How It Works: Automated Rebalancing
As workloads change over time, clusters can become imbalanced. Some brokers handle more traffic than others, leading to hotspots and inefficiency. Automated rebalancing solves this by continuously monitoring cluster health and moving topics to restore balance.
Rebalancing Lifecycle
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Rebalancing Loop (Leader) β
β β
β Every N minutes (configured by check_interval): β
β β
β 1. Check Leadership β
β ββ If follower β Skip cycle β
β ββ If leader β Continue β
β β
β 2. Check Configuration β
β ββ If rebalancing disabled β Skip cycle β
β ββ If enabled β Continue β
β β
β 3. Check Cluster Health β
β ββ If < 2 brokers β Skip cycle β
β ββ If >= 2 brokers β Continue β
β β
β 4. Calculate Imbalance Metrics β
β ββ Get broker rankings β
β ββ Calculate mean, std_deviation, CV β
β ββ Identify overloaded/underloaded brokers β
β β
β 5. Decide If Rebalancing Needed β
β ββ If CV < threshold β Cluster balanced, skip β
β ββ If CV > threshold β Continue β
β β
β 6. Select Topics to Move β
β ββ For each overloaded broker β
β ββ Filter blacklisted topics β
β ββ Filter recently moved topics (cooldown) β
β ββ Filter topics too young (min_topic_age) β
β ββ Sort by load (lightest first) β
β ββ Select lightest topic β
β β
β 7. Select Target Brokers β
β ββ Find underloaded brokers from rankings β
β ββ Exclude source broker β
β ββ Select least loaded active broker β
β β
β 8. Execute Rebalancing Moves β
β ββ Check rate limit (max_moves_per_hour) β
β ββ Create unassigned marker with target hint β
β ββ Delete assignment from source broker β
β ββ Wait for topic reassignment β
β ββ Record move in history β
β ββ Apply cooldown delay β
β ββ Repeat for next move β
β β
β 9. Update Metrics & Logs β
β ββ Log cycle completion and move count β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Imbalance Detection
The Load Manager calculates statistical measures to determine if rebalancing is needed:
Metrics Calculated:
- Mean load - Average load across all active brokers
- Standard deviation - Spread of loads around the mean
- Coefficient of variation (CV) -
(std_dev / mean) Γ 100% - Max load - Highest loaded broker
- Min load - Lowest loaded broker
Broker Classification:
- Overloaded - Load > (mean + 1 std_dev)
- Underloaded - Load < (mean - 1 std_dev) AND load < (mean Γ 0.5)
- Normal - Between overloaded and underloaded
Example:
Broker 1: load = 45
Broker 2: load = 30
Broker 3: load = 25
mean = 33.33
std_dev = 8.50
CV = (8.50 / 33.33) = 0.255 = 25.5%
Overloaded: load > 41.83 β Broker 1
Underloaded: load < 25.00 AND load < 16.67 β None
Normal: Broker 2, Broker 3
If CV > threshold (based on aggressiveness level), rebalancing is triggered.
Candidate Selection
Once imbalance is detected, the Load Manager selects topics to move:
Selection Criteria:
- Source brokers - Only overloaded brokers (load > mean + std_dev)
- Topic age - Skip topics younger than
min_topic_age_seconds(default 5 minutes) - Blacklist - Skip topics matching blacklist patterns
- Cooldown - Skip topics moved in the last
cooldown_seconds(default 60 seconds) - Load score - Prefer lightest topics (easier to move, less disruption)
Topic Load Scoring:
load_score = (topic_count Γ 0.2) +
(throughput_mbps Γ 0.3) +
(connections Γ 0.3) +
(backlog/10000 Γ 0.2)
Topics are sorted by score (ascending), and the lightest topic is selected first.
Move Limits:
- Maximum 1 topic moved per rebalancing cycle (single-move design)
- Maximum N moves per hour (rate limiting, configurable)
- Cooldown delay between individual moves (prevents oscillations)
Target Broker Selection
For each topic to move, the Load Manager selects a target broker:
- Get rankings - Sorted list of brokers by load (ascending)
- Filter active - Exclude brokers in draining/drained state
- Exclude source - Can't move topic to same broker
- Select first - Pick the least loaded broker meeting criteria
If no suitable target exists (e.g., only one broker active), the move is skipped with a warning.
Move Execution
The Load Manager executes moves using the existing graceful unload workflow:
Step 1: Create Unassigned Marker
Path: /cluster/unassigned/{namespace}/{topic}
Value: {
"reason": "rebalance",
"from_broker": 12345,
"to_broker": 67890,
"timestamp": 1234567890
}
The to_broker hint tells the assignment logic to prefer broker 67890 (if active).
Step 2: Delete Source Assignment
This triggers the source broker's topic watcher, which:
- Seals the topic (no new producers)
- Drains in-flight messages
- Unloads the topic from memory
- Releases resources
Step 3: Target Receives Assignment
The Load Manager's assignment logic sees the unassigned marker and:
- Checks if target broker (67890) is active
- If active β Assign to target broker (respecting hint)
- If not active β Select alternative broker from rankings
Step 4: Topic Loads on Target
The target broker:
- Receives assignment from ETCD watch
- Loads topic metadata
- Initializes subscriptions
- Accepts producer reconnections
- Begins serving traffic
Step 5: Record Move in History
The Load Manager records the move:
Path: /cluster/rebalancing_history/{timestamp}
Value: {
"topic": "/default/my-topic",
"from_broker": 12345,
"to_broker": 67890,
"reason": "LoadImbalance",
"estimated_load": 15.3,
"timestamp": 1234567890
}
This history is used for:
- Rate limiting (counting moves in last hour)
- Cooldown enforcement (preventing rapid re-moves)
- Audit trail (troubleshooting and analysis)
- Metrics (tracking rebalancing activity)
Step 6: Cooldown Delay
After a successful move, the Load Manager waits for cooldown_seconds before the next move. This prevents:
- Rapid oscillations (topic bouncing between brokers)
- Metadata store overload (too many writes)
- Cluster instability (constant topic movements)
Rebalancing Visualization
Imbalance Detected (CV > threshold)
β
βΌ
ββββββββββββββββββββββββββββββββββββ
β Select Topics to Move β
β β
β Broker 1 (overloaded): β
β ββ topic-A: load = 5.2 β
β ββ topic-B: load = 8.1 β
β ββ topic-C: load = 12.3 β
β β
β Select lightest: topic-A β
ββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββ
β Select Target Broker β
β β
β Rankings: β
β Broker 3: load = 20 β target β
β Broker 2: load = 30 β
β Broker 1: load = 45 β
ββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββ
β Create Move Plan β
β topic-A: Broker 1 β Broker 3 β
βββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββ
β Execute Move β
β 1. Create unassigned marker β
β with target hint β
β 2. Delete assignment β
β from Broker 1 β
βββββββββββββββββββββββββββββββ
β
ββββββββββββββββββββ¬βββββββββββββββββββ
βΌ βΌ βΌ
ββββββββββββ ββββββββββββββ ββββββββββββ
β Broker 1 β βLoad Managerβ β Broker 3 β
β β β β β β
β Unload β β Assign to β β Load β
β topic-A β β Broker 3 β β topic-A β
ββββββββββββ ββββββββββββββ ββββββββββββ
β β β
ββββββββββββββββββββ΄βββββββββββββββββββ
β
βΌ
βββββββββββββββββββ
β Record in β
β History β
β Apply Cooldown β
βββββββββββββββββββ
Safety Mechanisms
Automated rebalancing includes multiple safety mechanisms to prevent disruption:
1. Single Move Per Cycle
Only one topic is moved per rebalancing check cycle. This prevents:
- Overshooting (moving too many topics at once)
- Cascading imbalances (fixing one problem creates another)
- Metadata store overload
2. Rate Limiting
Maximum moves per hour enforced via history tracking. Prevents rebalancing storms during cluster instability.
3. Cooldown Period
Topics can't be moved again within cooldown_seconds (default 60s). Prevents oscillations where topics bounce back and forth.
4. Minimum Topic Age
Topics younger than min_topic_age_seconds (default 300s) are never moved. Allows topics to stabilize after creation before considering them for rebalancing.
5. Blacklist
Critical topics can be permanently excluded from rebalancing using pattern matching.
6. Leader-Only Execution
Only the elected cluster leader performs rebalancing. Prevents multiple brokers from executing conflicting moves.
7. Active Broker Check
Target brokers must be in "active" state. Draining or drained brokers are excluded.
8. Graceful Unload
Topic moves use the same graceful unload workflow as manual operations:
- Seal topic (no new producers)
- Drain in-flight messages
- Unload cleanly
9. Metrics Logging
All rebalancing activity is logged with full context for troubleshooting and auditing.
Configuration and Tuning
Basic Configuration
Load Manager configuration in config/danube_broker.yml:
load_manager:
# How often brokers report load to ETCD
load_report_interval_seconds: 30
# Assignment strategy for new topics
# Options: fair, balanced, weighted_load
assignment_strategy: balanced
# Automated rebalancing settings
rebalancing:
enabled: false # Start disabled, enable after testing
aggressiveness: balanced # conservative | balanced | aggressive
check_interval_seconds: 300 # How often to check balance (5 min)
max_moves_per_hour: 10 # Rate limit
cooldown_seconds: 60 # Wait between moves
min_brokers_for_rebalance: 2 # Need at least 2 brokers
min_topic_age_seconds: 300 # Don't move young topics (5 min)
blacklist_topics: [] # Topics to never rebalance
Assignment Strategy Selection
| Scenario | Recommended Strategy | Reason |
|---|---|---|
| Development/testing | fair |
Simple, predictable, low overhead |
| Production (general) | balanced |
Best balance of accuracy and overhead |
| Heterogeneous hardware | balanced |
Accounts for CPU/memory differences |
| Uniform hardware | fair or balanced |
Fair works well when brokers are identical |
| Variable workloads | weighted_load |
Adapts to changing resource bottlenecks |
| High-throughput topics | balanced or weighted_load |
Considers throughput in scoring |
Aggressiveness Tuning
Conservative - Use when:
- Cluster is mostly stable
- Topic moves are expensive (large backlogs)
- Minimizing disruption is critical
- Resources are plentiful
Balanced - Use when:
- General production workloads
- Normal mix of topic sizes
- Moderate load variability
- Default choice for most clusters
Aggressive - Use when:
- Workloads change frequently
- Fast response to imbalance is needed
- Testing/development environments
- Cluster has capacity for frequent moves
Load Report Interval Tuning
Lower intervals (5-10s):
- Faster response to load changes
- More accurate rankings
- Higher ETCD traffic
- Better for testing
Higher intervals (30-60s):
- Lower overhead
- Suitable for stable production clusters
- Slower response to changes
- Industry standard
Recommendation: Start with 30s, reduce only if needed for fast-changing workloads.
Rate Limiting Tuning
The max_moves_per_hour setting prevents rebalancing storms:
Conservative:
max_moves_per_hour: 5- Maximum 5 topics moved per hour
- Best for large topics with backlogs
Balanced:
max_moves_per_hour: 10- Moderate rebalancing speed
- Good for most clusters
Aggressive:
max_moves_per_hour: 20- Faster rebalancing
- Suitable for test clusters or small topics
Blacklist Patterns
Exclude critical topics from rebalancing:
blacklist_topics:
# System topics
- "/system/*"
# Critical low-latency topics
- "/default/critical-*"
# High-volume topics (expensive to move)
- "/analytics/logs"
- "/metrics/*"
# Topics with specific placement requirements
- "/pinned/special-topic"
Metrics and Monitoring
Load Manager Metrics
The Load Manager exposes Prometheus metrics for observability:
Cluster Health Metrics:
danube_cluster_balance_cv # Current coefficient of variation
danube_cluster_balance_mean_load # Mean broker load
danube_cluster_balance_max_load # Max broker load
danube_cluster_balance_min_load # Min broker load
danube_cluster_balance_std_dev # Standard deviation
danube_cluster_active_brokers # Count of active brokers
Rebalancing Metrics:
danube_rebalancing_checks_total # Total balance check cycles
danube_rebalancing_moves_total # Total topic moves executed
danube_rebalancing_failures_total # Failed move attempts
danube_rebalancing_cycle_duration_seconds # Duration of rebalancing cycles
Topic Assignment Metrics:
danube_topic_assignments_total # Total topic assignments
labels: broker_id, action (assign/unassign)
Broker Load Metrics:
danube_broker_topics_owned # Topics per broker
danube_broker_cpu_usage # CPU percentage
danube_broker_memory_usage # Memory percentage
danube_broker_throughput_mbps # Aggregate throughput
Use Cases and Best Practices
Scenario 1: Adding Brokers to a Cluster
Situation:
You start with 3 brokers, each handling 100 topics. You add 2 new brokers to scale capacity.
Without Rebalancing:
- New brokers sit idle (0 topics each)
- Old brokers remain at 100 topics
- Cluster is imbalanced (CV = 100%)
- New capacity is wasted
With Rebalancing:
- New brokers join and register
- Load Manager detects imbalance (CV > threshold)
- Selects 40 lightest topics from old brokers
- Moves them to new brokers over time (rate limited)
- Final state: 60 topics per broker (balanced)
- CV drops to < 10%
Best Practice:
- Enable rebalancing before adding brokers
- Use aggressive mode temporarily for faster rebalancing
- Monitor rebalancing progress via metrics
- Return to balanced mode after completion
Scenario 2: Handling Broker Failure
Situation:
A broker crashes, and its 150 topics need reassignment.
Without Load Manager:
- Manual intervention required
- Topics go offline until reassigned
- Risk of creating new hotspots
With Load Manager:
- Leader detects broker deregistration
- Deletes all assignments from failed broker
- Creates 150 unassigned markers
- Assignment workflow distributes topics across remaining brokers
- Topics reassigned in < 1 minute
Best Practice:
- Ensure at least 3 brokers for redundancy
- Use balanced or weighted_load strategy for failover
- Monitor failover metrics
- Set appropriate
min_brokers_for_rebalance
Scenario 3: Workload Changes Over Time
Situation:
Topics created in the morning (low traffic) grow to high traffic in the afternoon, creating imbalance.
Without Rebalancing:
- Initial assignment is fair
- As traffic grows, some brokers become hotspots
- Performance degrades on overloaded brokers
With Rebalancing:
- Load reports show increasing CPU/throughput on some brokers
- CV rises above threshold
- Rebalancing moves high-traffic topics to underloaded brokers
- Cluster remains balanced throughout the day
Best Practice:
- Use balanced or weighted_load strategy (accounts for throughput)
- Enable rebalancing for dynamic workloads
- Set check_interval to match workload variability (5-10 minutes)
- Use balanced aggressiveness
Scenario 4: Heterogeneous Hardware
Situation:
Cluster has mix of hardware: 2 high-memory brokers, 3 standard brokers.
Without Intelligence:
- Fair strategy distributes topics evenly
- High-memory brokers are underutilized
- Standard brokers may run out of memory
With Balanced Strategy:
- Memory usage is factored into scoring (35% weight)
- High-memory brokers receive more topics
- Standard brokers stay within capacity
- Resources are optimized
Best Practice:
- Always use balanced or weighted_load for heterogeneous clusters
- Monitor per-broker resource utilization
- Adjust load_report_interval if needed (lower = more responsive)
Troubleshooting
Rebalancing Not Triggering
Symptoms:
- CV is high but no moves occur
- Logs show "cluster is balanced"
Possible Causes:
- Rebalancing disabled: Check
rebalancing.enabled: true - Not leader: Only leader executes rebalancing
- CV below threshold: Increase aggressiveness or lower threshold
- Too few brokers: Check
min_brokers_for_rebalance - All topics blacklisted: Review blacklist patterns
Topics Moving Too Frequently
Symptoms:
- Topics move multiple times per hour
- Cluster never stabilizes
- High metadata store traffic
Possible Causes:
- Cooldown too short: Increase
cooldown_seconds - Threshold too low: Use less aggressive mode
- Topic age check disabled: Set
min_topic_age_seconds > 0 - Workload is truly unstable
Imbalance After Rebalancing
Symptoms:
- Rebalancing executes but CV remains high
- Moves don't reduce imbalance
Possible Causes:
- Moving wrong topics (heavy topics not moving)
- New traffic creating imbalance faster than moves fix it
- Assignment strategy doesn't match rebalancing goals
Solutions:
- Review topic load scores (ensure heaviest topics are candidates)
- Increase
max_moves_per_hour - Switch to weighted_load strategy
- Check for blacklisted heavy topics
Summary
The Danube Load Manager is a sophisticated system that keeps broker clusters healthy and balanced automatically. By combining intelligent topic assignment with proactive rebalancing, it ensures:
- Optimal resource utilization - No broker is overloaded or idle
- High availability - Automatic failover when brokers crash
- Performance consistency - Even load distribution prevents hotspots
- Operational simplicity - Self-optimizing cluster reduces manual intervention
- Scalability - Easy to add/remove brokers without manual rebalancing
The system is production-ready with multiple safety mechanisms, configurable strategies, and comprehensive monitoring. Start with defaults (balanced strategy, disabled rebalancing) and gradually tune based on your workload characteristics.