Load Balancing System Documentation¶
Table of Contents¶
- Load Balancing Overview
- Load Balancing Strategies
- Health Score Algorithm
- Speed Testing Integration
- Failover Logic
- Configuration and Tuning
- Monitoring and Metrics
- Advanced Features
- API Reference
- Troubleshooting
Load Balancing Overview¶
Purpose and Benefits¶
The VPN Exit Controller implements intelligent load balancing to distribute traffic across multiple VPN exit nodes within each country. This provides several key benefits:
- High Availability: Automatic failover when nodes become unhealthy
- Performance Optimization: Route traffic to the fastest available nodes
- Scalability: Automatic scaling based on connection load
- Resource Efficiency: Optimal utilization of compute resources
- Geographic Distribution: Balanced load across different VPN servers
Integration with Failover Systems¶
The load balancer works closely with the failover manager to ensure service continuity:
# Example: Load balancer + failover integration
if not healthy_nodes:
    # Trigger failover to a different VPN server
    await failover_manager.handle_node_failure(node_id, "no_healthy_nodes")
    # Recheck for healthy nodes after failover
    healthy_nodes = self._get_healthy_nodes_for_country(country)
Load Balancing Strategies¶
The system supports five distinct load balancing strategies, each optimized for different scenarios:
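The five strategies are referenced elsewhere in this document through a `LoadBalancingStrategy` enum (see the stats code in Monitoring and Metrics). A minimal sketch consistent with the strategy names returned by the API might look like:

```python
from enum import Enum

class LoadBalancingStrategy(Enum):
    """Strategy identifiers as exposed by the /api/load-balancer endpoints."""
    ROUND_ROBIN = "round_robin"
    LEAST_CONNECTIONS = "least_connections"
    WEIGHTED_LATENCY = "weighted_latency"
    RANDOM = "random"
    HEALTH_SCORE = "health_score"
```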
1. Round Robin Strategy¶
Purpose: Simple, fair distribution of connections across all healthy nodes.
Algorithm:
async def _round_robin_select(self, nodes: List[Dict], country: str) -> Dict:
    """Round-robin selection"""
    if country not in self.round_robin_counters:
        self.round_robin_counters[country] = 0
    selected_index = self.round_robin_counters[country] % len(nodes)
    self.round_robin_counters[country] += 1
    return nodes[selected_index]
Best For:
- Evenly distributed workloads
- Testing scenarios
- When all nodes have similar performance characteristics

Characteristics:
- Maintains per-country counters
- Guarantees fair distribution
- No performance consideration
2. Least Connections Strategy¶
Purpose: Route new connections to the node with the fewest active connections.
Algorithm:
async def _least_connections_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select the node with the fewest connections"""
    node_connections = []
    for node in nodes:
        connection_count = redis_manager.get_connection_count(node['id'])
        node_connections.append((node, connection_count))
    # Sort by connection count (ascending)
    node_connections.sort(key=lambda x: x[1])
    return node_connections[0][0]
Best For:
- Long-lived connections
- Scenarios where connection duration varies significantly
- Optimizing connection distribution
Monitoring:
# Check connection counts via API
curl -u admin:password http://localhost:8080/api/load-balancer/stats
3. Weighted Latency Strategy¶
Purpose: Route traffic based on server latency with weighted randomization.
Algorithm:
async def _weighted_latency_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select based on weighted latency scores"""
    node_scores = []
    for node in nodes:
        # Get server latency from Redis (default to 100 ms when unknown)
        server_health = redis_manager.get_server_health(node.get('vpn_server', ''))
        latency = server_health.get('latency', 100) if server_health else 100
        # Lower latency = higher weight
        weight = max(1, 200 - latency)  # Weight between 1 and 199
        node_scores.append((node, weight))
    # Weighted random selection
    total_weight = sum(score[1] for score in node_scores)
    random_point = random.uniform(0, total_weight)
    current_weight = 0
    for node, weight in node_scores:
        current_weight += weight
        if current_weight >= random_point:
            return node
    # Fallback guards against floating-point rounding at the upper boundary
    return node_scores[-1][0]
Weight Calculation:
- Latency 50ms → Weight 150
- Latency 100ms → Weight 100
- Latency 150ms → Weight 50
- Latency 200ms+ → Weight 1
Best For:
- Latency-sensitive applications
- Real-time communications
- Gaming or streaming workloads
4. Random Strategy¶
Purpose: Randomly distribute connections for simple load distribution.
Algorithm:
async def _random_select(self, nodes: List[Dict], country: str) -> Dict:
    """Random selection"""
    return random.choice(nodes)
Best For:
- Simple load distribution
- Development and testing
- When other strategies are not applicable
5. Health Score Strategy (Default)¶
Purpose: Select nodes based on comprehensive health scores considering multiple factors.
Algorithm:
async def _health_score_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select based on comprehensive health score"""
    node_scores = []
    for node in nodes:
        score = await self._calculate_node_health_score(node)
        node_scores.append((node, score))
    # Sort by score (descending - higher is better)
    node_scores.sort(key=lambda x: x[1], reverse=True)
    return node_scores[0][0]
Health Score Algorithm¶
The health score algorithm provides a comprehensive assessment of node performance by weighing multiple factors:
Score Calculation¶
async def _calculate_node_health_score(self, node: Dict) -> float:
    """Calculate comprehensive health score for a node"""
    score = 100.0  # Start with a perfect score

    # Factor 1: Server latency (40% weight)
    server_health = redis_manager.get_server_health(node.get('vpn_server', ''))
    if server_health:
        latency = server_health.get('latency', 100)
        # Score: 50ms=100, 100ms=75, 150ms and above=50
        latency_score = max(50, 100 - (latency - 50) * 0.5)
        score = score * 0.6 + latency_score * 0.4

    # Factor 2: Connection count (30% weight)
    connection_count = redis_manager.get_connection_count(node['id'])
    # Penalize high connection counts
    connection_penalty = min(20, connection_count * 2)
    connection_score = max(60, 100 - connection_penalty)
    score = score * 0.7 + connection_score * 0.3

    # Factor 3: CPU usage (20% weight)
    stats = node.get('stats', {})
    cpu_percent = stats.get('cpu_percent', 0)
    cpu_score = max(60, 100 - cpu_percent)
    score = score * 0.8 + cpu_score * 0.2

    # Factor 4: Memory usage (10% weight)
    memory_mb = stats.get('memory_mb', 0)
    # Penalize usage above 300 MB
    memory_penalty = max(0, (memory_mb - 300) / 10)
    memory_score = max(70, 100 - memory_penalty)
    score = score * 0.9 + memory_score * 0.1

    return score
Scoring Factors¶
| Factor | Weight | Description | Range |
|---|---|---|---|
| Server Latency | 40% | Network latency to VPN server | 50-100 |
| Connection Count | 30% | Number of active connections | 60-100 |
| CPU Usage | 20% | Container CPU utilization | 60-100 |
| Memory Usage | 10% | Container memory consumption | 70-100 |
Score Interpretation¶
- 90-100: Excellent performance, optimal for routing
- 80-89: Good performance, suitable for most traffic
- 70-79: Acceptable performance, may experience delays
- 60-69: Poor performance, consider failover
- <60: Critical issues, automatic failover triggered
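The banding above can be captured in a small helper; this is an illustrative sketch (the function name is hypothetical, not from the source):

```python
def interpret_health_score(score: float) -> str:
    """Map a numeric health score to the bands described above."""
    if score >= 90:
        return "excellent"   # optimal for routing
    if score >= 80:
        return "good"        # suitable for most traffic
    if score >= 70:
        return "acceptable"  # may experience delays
    if score >= 60:
        return "poor"        # consider failover
    return "critical"        # automatic failover triggered
```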
Health Score Thresholds¶
# Configuration examples
EXCELLENT_THRESHOLD = 90.0
GOOD_THRESHOLD = 80.0
ACCEPTABLE_THRESHOLD = 70.0
POOR_THRESHOLD = 60.0
CRITICAL_THRESHOLD = 50.0
Speed Testing Integration¶
Speed testing provides crucial data for load balancing decisions through comprehensive performance evaluation.
Speed Test Components¶
Download Speed Testing¶
async def _test_download_speed(self, node_id: str, test_url: str) -> Dict:
    """Test download speed by downloading a file inside the container"""
    # Build curl command to measure download speed
    curl_cmd = [
        "curl", "-s", "-w",
        "%{time_total},%{speed_download},%{size_download}",
        "-o", "/dev/null",
        "--max-time", "60",  # 60 second timeout
        test_url
    ]
    container = self.docker_manager.client.containers.get(node_id)
    result = container.exec_run(curl_cmd, demux=False)
    output = result.output.decode().strip()
    # Parse the curl write-out and convert bytes/sec to Mbps
    time_total, speed_download, size_download = output.split(',')
    mbps = (float(speed_download) * 8) / (1024 * 1024)
    return {
        'mbps': mbps,
        'time_seconds': float(time_total),
        'size_bytes': float(size_download)
    }
Latency Testing¶
async def _test_latency(self, node_id: str) -> Dict:
    """Test latency to multiple endpoints"""
    ping_endpoints = [
        "https://www.google.com",
        "https://www.cloudflare.com",
        "https://www.github.com",
        "https://httpbin.org/ip"
    ]
    # Test each endpoint and record successful measurements
    latency_tests = []
    successful_tests = []
    for endpoint in ping_endpoints:
        # Measure TCP connect time via curl's %{time_connect}
        # (_measure_connect_time is a helper wrapping that curl call)
        connect_time = await self._measure_connect_time(node_id, endpoint)
        if connect_time is None:
            latency_tests.append({'endpoint': endpoint, 'success': False})
            continue
        latency_ms = float(connect_time) * 1000
        successful_tests.append(latency_ms)
        latency_tests.append({'endpoint': endpoint, 'success': True, 'latency_ms': latency_ms})
    if not successful_tests:
        return {'avg_latency': None, 'tests': latency_tests}
    avg_latency = sum(successful_tests) / len(successful_tests)
    return {'avg_latency': avg_latency, 'tests': latency_tests}
Speed Test Scheduling¶
# Automatic speed testing
async def schedule_speed_tests():
    """Run speed tests on all nodes every hour"""
    while True:
        try:
            results = await speed_tester.test_all_nodes("1MB")
            logger.info(f"Speed tests completed: {len(results)} nodes tested")
        except Exception as e:
            logger.error(f"Speed test cycle failed: {e}")
        await asyncio.sleep(3600)  # 1 hour interval
Historical Data Usage¶
Speed test results are stored in Redis with time-series data:
def _store_speed_test_result(self, node_id: str, result: Dict):
    """Store speed test result in Redis"""
    # Store latest result (1 hour TTL)
    key = f"speedtest:{node_id}:latest"
    redis_manager.client.setex(key, 3600, json.dumps(result))
    # Store in history (keep last 24 hours)
    history_key = f"speedtest:{node_id}:history"
    timestamp = datetime.utcnow().timestamp()
    redis_manager.client.zadd(history_key, {json.dumps(result): timestamp})
    # Remove entries older than 24 hours
    cutoff = (datetime.utcnow() - timedelta(hours=24)).timestamp()
    redis_manager.client.zremrangebyscore(history_key, 0, cutoff)
Performance Trend Analysis¶
def analyze_performance_trends(node_id: str) -> Dict:
    """Analyze performance trends over time"""
    history = speed_tester.get_speed_test_history(node_id, hours=24)
    if len(history) < 2:
        return {"trend": "insufficient_data"}
    speeds = [h['download_mbps'] for h in history if 'download_mbps' in h]
    latencies = [h['latency_ms'] for h in history if 'latency_ms' in h]
    if len(speeds) < 2 or len(latencies) < 2:
        return {"trend": "insufficient_data"}
    # Compare the newest sample against the oldest to classify the trend
    speed_trend = "improving" if speeds[-1] > speeds[0] else "degrading"
    latency_trend = "improving" if latencies[-1] < latencies[0] else "degrading"
    return {
        "speed_trend": speed_trend,
        "latency_trend": latency_trend,
        "avg_speed_24h": sum(speeds) / len(speeds),
        "avg_latency_24h": sum(latencies) / len(latencies)
    }
Failover Logic¶
The failover system ensures service continuity when nodes become unhealthy or disconnected.
Automatic Failover Triggers¶
- VPN Connection Failure: Node loses connection to VPN server
- High Resource Usage: CPU > 90% or Memory > 1GB for 5 minutes
- Network Connectivity Issues: Cannot reach test endpoints
- Container Health Check Failure: Docker health checks fail
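The four triggers above can be combined into a single predicate; the sketch below is hypothetical (field names like `vpn_connected` and `endpoints_reachable` are assumptions, and a real implementation would track the 5-minute sustained-usage window rather than a point-in-time sample):

```python
def should_trigger_failover(node: dict) -> bool:
    """Evaluate the documented failover triggers against a node snapshot (assumed shape)."""
    stats = node.get("stats", {})
    if not node.get("vpn_connected", True):
        return True  # VPN connection failure
    if stats.get("cpu_percent", 0) > 90 or stats.get("memory_mb", 0) > 1024:
        return True  # high resource usage (real check requires 5 min sustained)
    if not node.get("endpoints_reachable", True):
        return True  # network connectivity issues
    if node.get("health_status") == "unhealthy":
        return True  # Docker health check failure
    return False
```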
Failover Process¶
async def handle_node_failure(self, node_id: str, failure_reason: str) -> bool:
    """Handle a failed node by attempting failover to a different server"""
    # Skip if a failover is already in progress for this node
    if node_id in self.failover_in_progress:
        return False
    self.failover_in_progress.add(node_id)
    try:
        # Get node details
        node = self.docker_manager.get_node_details(node_id)
        country = node['country']
        current_server = node['server']
        # Check failover limits
        if not self._can_failover(node_id):
            return False
        # Get an alternative server
        new_server = await self._get_alternative_server(country, current_server)
        if not new_server:
            return False
        # Perform the failover
        success = await self._perform_failover(node_id, country, new_server)
        # Record the attempt
        self._record_failover_attempt(node_id, country, current_server, new_server, success)
        return success
    finally:
        self.failover_in_progress.discard(node_id)
Failover Constraints¶
class FailoverManager:
    def __init__(self):
        self.max_failover_attempts = 3  # Max attempts per hour
        self.failover_cooldown = 300    # 5 minutes between attempts
        self.failover_history = {}      # Track attempts per node
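The `_can_failover` check referenced in the failover process above is not shown in this document; a plausible sketch honoring these constraints (the standalone function and history-as-timestamp-list shape are assumptions) could be:

```python
import time

def can_failover(history: dict, node_id: str,
                 max_attempts: int = 3, cooldown: int = 300) -> bool:
    """Check the per-node attempt limit (per hour) and cooldown window."""
    now = time.time()
    # Only attempts within the last hour count toward the limit
    attempts = [t for t in history.get(node_id, []) if now - t < 3600]
    if len(attempts) >= max_attempts:
        return False  # hourly attempt limit reached
    if attempts and now - max(attempts) < cooldown:
        return False  # still inside the cooldown window
    return True
```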
Server Selection for Failover¶
async def _get_alternative_server(self, country: str, exclude_server: str) -> Optional[str]:
    """Get an alternative server for failover"""
    # Get all servers for the country
    servers = vpn_server_manager.get_servers_for_country(country)
    # Filter out the current and blacklisted servers
    available_servers = [
        s for s in servers
        if s['hostname'] != exclude_server
        and not redis_manager.is_server_blacklisted(s['hostname'])
    ]
    # Sort by health score (best first)
    available_servers.sort(key=lambda s: s.get('health_score', 50), reverse=True)
    # Health-check the top 3 candidates and return the first that responds
    for server in available_servers[:3]:
        success, latency = await vpn_server_manager.health_check_server(server['hostname'])
        if success:
            return server['hostname']
    # If none passed the health check, fall back to the best-ranked candidate
    return available_servers[0]['hostname'] if available_servers else None
Recovery Procedures¶
- Immediate Recovery: Stop failed container, start new one with different server
- Graceful Recovery: Wait for existing connections to drain before switching
- Rollback Recovery: Return to previous working server if new server fails
Configuration and Tuning¶
Load Balancing Parameters¶
# /opt/vpn-exit-controller/.env
LOAD_BALANCER_STRATEGY=health_score
LOAD_BALANCER_ENABLED=true
MAX_NODES_PER_COUNTRY=3
AUTO_SCALE_ENABLED=true
SCALE_UP_THRESHOLD=50 # connections per node
SCALE_DOWN_THRESHOLD=10 # connections per node
Health Check Intervals¶
# Configuration in services/metrics_collector.py
class MetricsCollector:
    def __init__(self, interval_seconds: int = 30):  # Collect every 30 seconds
        self.interval = interval_seconds
Performance Thresholds¶
# CPU and memory thresholds for scaling decisions
CPU_THRESHOLD_HIGH = 80.0 # Scale up trigger
CPU_THRESHOLD_LOW = 20.0 # Scale down trigger
MEMORY_THRESHOLD_HIGH = 500 # MB, scale up trigger
MEMORY_THRESHOLD_LOW = 300 # MB, scale down trigger
Auto-scaling Configuration¶
async def start_additional_node_if_needed(self, country: str) -> bool:
    """Start an additional node if load is high"""
    nodes = self._get_healthy_nodes_for_country(country)
    if not nodes:
        return False
    # Check whether we need more capacity
    total_connections = sum(redis_manager.get_connection_count(n['id']) for n in nodes)
    avg_connections_per_node = total_connections / len(nodes)
    # Start a new node if the average exceeds 50 connections per node and fewer than 3 nodes exist
    if avg_connections_per_node > 50 and len(nodes) < 3:
        logger.info(f"High load detected for {country}, starting additional node")
        # Start new node...
        return True
    return False
Tuning Recommendations¶
| Scenario | Strategy | Max Nodes | Thresholds |
|---|---|---|---|
| High Traffic | health_score | 5 | Scale up: 30 conn/node |
| Low Latency | weighted_latency | 3 | Scale up: 20 conn/node |
| Cost Optimized | least_connections | 2 | Scale up: 80 conn/node |
| Testing | round_robin | 3 | Scale up: 50 conn/node |
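As an example, the "Low Latency" profile from the table above maps onto the `.env` parameters shown earlier (this is an illustrative fragment, not a complete configuration):

```shell
# /opt/vpn-exit-controller/.env — low-latency profile (per the tuning table)
LOAD_BALANCER_STRATEGY=weighted_latency
MAX_NODES_PER_COUNTRY=3
SCALE_UP_THRESHOLD=20  # connections per node
```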
Monitoring and Metrics¶
Key Metrics Collection¶
The system continuously collects metrics for load balancing decisions:
class MetricsCollector:
    """Background service that continuously collects metrics from all nodes"""

    async def _collect_node_metrics(self, node_id: str):
        """Collect metrics for a single node"""
        # Get detailed node info (includes Docker stats)
        node_details = self.docker_manager.get_node_details(node_id)
        # Check for anomalies
        if node_details.get('stats'):
            stats = node_details['stats']
            # Alert on high resource usage
            if stats.get('cpu_percent', 0) > 80:
                logger.warning(f"High CPU usage on node {node_id}: {stats['cpu_percent']:.1f}%")
            if stats.get('memory_mb', 0) > 500:
                logger.warning(f"High memory usage on node {node_id}: {stats['memory_mb']:.1f}MB")
Load Balancing Statistics¶
async def get_load_balancing_stats(self) -> Dict:
    """Get comprehensive load balancing statistics"""
    # async because health scores are computed with await below
    stats = {
        'strategies': [s.value for s in LoadBalancingStrategy],
        'round_robin_counters': self.round_robin_counters,
        'countries': {}
    }
    # Get stats per country
    all_nodes = self.docker_manager.list_nodes()
    countries = set(n['country'] for n in all_nodes)
    for country in countries:
        nodes = self._get_healthy_nodes_for_country(country)
        total_connections = sum(redis_manager.get_connection_count(n['id']) for n in nodes)
        stats['countries'][country] = {
            'node_count': len(nodes),
            'total_connections': total_connections,
            'avg_connections_per_node': total_connections / len(nodes) if nodes else 0,
            'nodes': [
                {
                    'id': n['id'],
                    'server': n.get('vpn_server', 'unknown'),
                    'connections': redis_manager.get_connection_count(n['id']),
                    'tailscale_ip': n.get('tailscale_ip'),
                    'cpu_percent': n.get('stats', {}).get('cpu_percent', 0),
                    'health_score': await self._calculate_node_health_score(n)
                }
                for n in nodes
            ]
        }
    return stats
Performance Monitoring¶
# Monitor load balancing in real-time
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq
# Get speed test summary
curl -u admin:password http://localhost:8080/api/speed-test/summary | jq
# Monitor metrics
curl -u admin:password http://localhost:8080/api/metrics/current | jq
Alert Conditions¶
| Condition | Threshold | Action |
|---|---|---|
| High CPU Usage | >80% for 5 min | Scale up or failover |
| High Memory | >500MB | Scale up or failover |
| High Connection Count | >100 per node | Scale up |
| Low Speed | <10 Mbps | Investigate/failover |
| High Latency | >200ms | Switch strategy or failover |
| Node Down | Health check fails | Immediate failover |
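The alert table above can be evaluated mechanically; the sketch below is illustrative (the metric-dict shape and function name are assumptions, and the duration qualifiers such as "for 5 min" are omitted for brevity):

```python
def evaluate_alerts(metrics: dict) -> list:
    """Map a node's collected metrics to the actions in the alert table (assumed shape)."""
    alerts = []
    if metrics.get("cpu_percent", 0) > 80:
        alerts.append("scale_up_or_failover")       # high CPU usage
    if metrics.get("memory_mb", 0) > 500:
        alerts.append("scale_up_or_failover")       # high memory
    if metrics.get("connections", 0) > 100:
        alerts.append("scale_up")                   # high connection count
    if metrics.get("speed_mbps", 100) < 10:
        alerts.append("investigate_or_failover")    # low speed
    if metrics.get("latency_ms", 0) > 200:
        alerts.append("switch_strategy_or_failover")  # high latency
    if not metrics.get("healthy", True):
        alerts.append("immediate_failover")         # node down
    return alerts
```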
Reporting and Analysis¶
# Generate load balancing report
async def generate_load_balancing_report(hours: int = 24) -> Dict:
    """Generate comprehensive load balancing report"""
    report = {
        'period_hours': hours,
        'generated_at': datetime.utcnow().isoformat(),
        'summary': {},
        'by_country': {},
        'performance_trends': {},
        'recommendations': []
    }
    # Analyze each country
    countries = get_all_countries()
    for country in countries:
        nodes = get_nodes_for_country(country)
        if not nodes:
            continue  # avoid division by zero for empty countries
        # Calculate statistics
        total_connections = sum(get_connection_count(n['id']) for n in nodes)
        avg_speed = calculate_avg_speed(nodes, hours)
        avg_latency = calculate_avg_latency(nodes, hours)
        report['by_country'][country] = {
            'node_count': len(nodes),
            'total_connections': total_connections,
            'avg_speed_mbps': avg_speed,
            'avg_latency_ms': avg_latency,
            'failover_events': count_failover_events(country, hours)
        }
        # Generate recommendations
        if avg_speed < 20:
            report['recommendations'].append(f"Consider adding more nodes to {country} - low speed detected")
        if total_connections / len(nodes) > 50:
            report['recommendations'].append(f"Scale up {country} - high load detected")
    return report
Advanced Features¶
Connection Affinity/Sticky Sessions¶
class ConnectionAffinity:
    """Manage connection affinity for consistent routing"""

    def __init__(self):
        self.client_node_map = {}     # client_ip -> node_id
        self.affinity_timeout = 3600  # 1 hour

    async def get_affinity_node(self, client_ip: str, country: str) -> Optional[str]:
        """Get node with existing affinity for client"""
        affinity_key = f"affinity:{client_ip}:{country}"
        node_id = redis_manager.client.get(affinity_key)
        if node_id:
            # Check if the node is still healthy
            healthy, _ = docker_manager.check_container_health(node_id)
            if healthy:
                # Refresh the affinity timeout
                redis_manager.client.expire(affinity_key, self.affinity_timeout)
                return node_id
            else:
                # Remove stale affinity
                redis_manager.client.delete(affinity_key)
        return None

    async def set_affinity(self, client_ip: str, country: str, node_id: str):
        """Set client affinity to a specific node"""
        affinity_key = f"affinity:{client_ip}:{country}"
        redis_manager.client.setex(affinity_key, self.affinity_timeout, node_id)
Geographic Routing Preferences¶
class GeographicRouter:
    """Route based on geographic preferences"""

    REGION_PREFERENCES = {
        'americas': ['us', 'ca', 'br'],
        'europe': ['de', 'uk', 'fr', 'nl'],
        'asia': ['jp', 'sg', 'hk', 'au'],
        'africa': ['za'],
        'oceania': ['au', 'nz']
    }

    async def get_preferred_country(self, client_region: str, requested_country: str) -> str:
        """Get preferred country based on client region"""
        # Return the requested country if it is available and healthy
        if self.is_country_healthy(requested_country):
            return requested_country
        # Find an alternative in the same region
        preferred_countries = self.REGION_PREFERENCES.get(client_region, [])
        for country in preferred_countries:
            if self.is_country_healthy(country):
                logger.info(f"Routing {client_region} client to {country} instead of {requested_country}")
                return country
        # Fall back to any healthy country
        return self.get_any_healthy_country()
Custom Load Balancing Rules¶
import fnmatch
from uuid import uuid4
from datetime import datetime

class CustomLoadBalancingRules:
    """Implement custom load balancing rules"""

    def __init__(self):
        self.rules = []

    def add_rule(self, rule: Dict):
        """Add custom routing rule"""
        self.rules.append({
            'id': str(uuid4()),
            'name': rule['name'],
            'condition': rule['condition'],
            'action': rule['action'],
            'priority': rule.get('priority', 100),
            'enabled': rule.get('enabled', True)
        })

    async def evaluate_rules(self, context: Dict) -> Optional[str]:
        """Evaluate rules and return target node"""
        # Sort by priority (lowest value wins)
        active_rules = sorted(
            [r for r in self.rules if r['enabled']],
            key=lambda x: x['priority']
        )
        for rule in active_rules:
            if self._matches_condition(rule['condition'], context):
                return await self._execute_action(rule['action'], context)
        return None

    def _matches_condition(self, condition: Dict, context: Dict) -> bool:
        """Check if context matches rule condition"""
        # Example conditions:
        # {"source_device": "iPhone", "domain": "*.streaming.com"}
        # {"time_range": "09:00-17:00", "country": "us"}
        # {"client_ip_range": "192.168.1.0/24"}
        for key, value in condition.items():
            if key == 'source_device':
                if value not in context.get('user_agent', ''):
                    return False
            elif key == 'domain':
                if not fnmatch.fnmatch(context.get('domain', ''), value):
                    return False
            elif key == 'time_range':
                current_time = datetime.now().strftime('%H:%M')
                start, end = value.split('-')
                if not (start <= current_time <= end):
                    return False
        return True
API-based Load Balancing Control¶
# Extended API endpoints for advanced control
@router.post("/rules")
async def create_load_balancing_rule(rule: CustomRule, user=Depends(verify_auth)):
    """Create custom load balancing rule"""
    custom_rules.add_rule(rule.dict())
    return {"status": "rule_created", "rule": rule}

@router.put("/strategy/{country}")
async def set_country_strategy(
    country: str,
    strategy: LoadBalancingStrategy,
    user=Depends(verify_auth)
):
    """Set load balancing strategy for a specific country"""
    load_balancer.set_country_strategy(country, strategy)
    return {"country": country, "strategy": strategy.value}

@router.post("/rebalance/{country}")
async def force_rebalance(country: str, user=Depends(verify_auth)):
    """Force rebalancing of connections in a country"""
    result = await load_balancer.rebalance_country(country)
    return {"country": country, "rebalanced_connections": result}

@router.get("/prediction/{country}")
async def get_load_prediction(country: str, hours: int = 1, user=Depends(verify_auth)):
    """Get load prediction for the next N hours"""
    prediction = await load_balancer.predict_load(country, hours)
    return prediction
API Reference¶
Load Balancer Endpoints¶
Get Load Balancing Statistics¶
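Request (this endpoint also appears in the monitoring commands earlier in this document):

```shell
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq
```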
Response:
{
  "strategies": ["round_robin", "least_connections", "weighted_latency", "random", "health_score"],
  "round_robin_counters": {"us": 5, "uk": 2},
  "countries": {
    "us": {
      "node_count": 2,
      "total_connections": 45,
      "avg_connections_per_node": 22.5,
      "nodes": [
        {
          "id": "container_123",
          "server": "us5063.nordvpn.com",
          "connections": 25,
          "tailscale_ip": "100.73.33.15",
          "cpu_percent": 45.2,
          "health_score": 87.3
        }
      ]
    }
  }
}
Get Best Node for Country¶
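The request path for this endpoint is not shown elsewhere in this document; a plausible form, by analogy with the other load-balancer routes, would be (verify against the actual route definitions):

```shell
# Hypothetical path — confirm against the API router before use
curl -u admin:password http://localhost:8080/api/load-balancer/best-node/us | jq
```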
Response:
{
  "selected_node": {
    "id": "container_123",
    "country": "us",
    "server": "us5063.nordvpn.com",
    "tailscale_ip": "100.73.33.15",
    "health_score": 87.3
  },
  "strategy": "health_score",
  "country": "us"
}
Scale Up Country¶
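Request (this path is used in the troubleshooting commands later in this document):

```shell
curl -X POST -u admin:password http://localhost:8080/api/load-balancer/scale-up/us
```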
Scale Down Country¶
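The scale-down path is not shown elsewhere in this document; by analogy with the scale-up route, it would likely be (confirm against the actual route definitions):

```shell
# Hypothetical path — confirm against the API router before use
curl -X POST -u admin:password http://localhost:8080/api/load-balancer/scale-down/us
```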
Get Available Strategies¶
Response:
{
  "strategies": [
    {
      "name": "round_robin",
      "description": "Distributes requests evenly across all healthy nodes"
    },
    {
      "name": "least_connections",
      "description": "Routes to the node with fewest active connections"
    },
    {
      "name": "weighted_latency",
      "description": "Routes based on server latency with weighted randomization"
    },
    {
      "name": "random",
      "description": "Randomly selects from available healthy nodes"
    },
    {
      "name": "health_score",
      "description": "Routes to node with best overall health score (CPU, memory, latency, connections)"
    }
  ]
}
Troubleshooting¶
Common Issues¶
1. No Healthy Nodes Available¶
Symptoms:
- API returns 404 "No healthy nodes available"
- Load balancer cannot route traffic
Diagnosis:
# Check node health
curl -u admin:password http://localhost:8080/api/nodes/list | jq '.[] | select(.status == "running")'
# Check container health
docker ps --filter "label=vpn-exit-node"
# Check VPN connections
docker exec <container_id> curl -s ipinfo.io
Solutions:
1. Restart unhealthy containers: docker restart <container_id>
2. Check VPN credentials in /opt/vpn-exit-controller/configs/auth.txt
3. Verify network connectivity: docker exec <container_id> ping 8.8.8.8
4. Force failover: curl -X POST http://localhost:8080/api/failover/force/<node_id>
2. Load Imbalance¶
Symptoms:
- One node has significantly more connections than others
- Performance degradation on overloaded nodes
Diagnosis:
# Check connection distribution
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq '.countries'
# Check strategy
curl -u admin:password http://localhost:8080/api/config | jq '.load_balancer'
Solutions:
1. Switch to the least_connections strategy
2. Force rebalancing: curl -X POST http://localhost:8080/api/load-balancer/rebalance/<country>
3. Increase the connection drain timeout
4. Add more nodes: curl -X POST http://localhost:8080/api/load-balancer/scale-up/<country>
3. Frequent Failovers¶
Symptoms:
- High number of failover events in logs
- Unstable node assignments
Diagnosis:
# Check failover history
curl -u admin:password http://localhost:8080/api/failover/status | jq
# Check server health
curl -u admin:password http://localhost:8080/api/speed-test/summary | jq
Solutions:
1. Increase the failover cooldown period
2. Check VPN server stability
3. Review health check thresholds
4. Blacklist problematic servers
4. Poor Performance¶
Symptoms:
- Slow connection speeds
- High latency
Diagnosis:
# Run speed tests
curl -X POST -u admin:password http://localhost:8080/api/speed-test/run-all
# Check health scores
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq '.countries[].nodes[].health_score'
Solutions:
1. Switch to the weighted_latency strategy
2. Add more nodes in the region
3. Use different VPN servers
4. Check for network congestion
Debug Commands¶
# Enable debug logging
export LOG_LEVEL=DEBUG
# Check Redis data
redis-cli
> KEYS speedtest:*
> KEYS affinity:*
> KEYS server_health:*
# Monitor load balancer decisions
journalctl -u vpn-controller -f | grep "load_balancer"
# Test specific node
curl -X POST -u admin:password http://localhost:8080/api/speed-test/node/<node_id>
# Force strategy change
curl -X PUT -u admin:password http://localhost:8080/api/load-balancer/strategy/<country> \
-H "Content-Type: application/json" \
-d '{"strategy": "health_score"}'
Performance Optimization Tips¶
- Strategy Selection:
  - Use health_score for general purpose
  - Use weighted_latency for latency-sensitive apps
  - Use least_connections for long-lived connections
- Resource Tuning:
  - Monitor CPU/memory usage patterns
  - Adjust scaling thresholds based on traffic
  - Set appropriate connection limits
- Network Optimization:
  - Choose VPN servers close to users
  - Monitor and blacklist slow servers
  - Use multiple servers per country
- Monitoring:
  - Set up alerts for health score < 70
  - Monitor failover frequency
  - Track connection distribution
This comprehensive load balancing system ensures optimal performance, reliability, and scalability for the VPN Exit Controller infrastructure.