Load Balancing System Documentation

Table of Contents

  1. Load Balancing Overview
  2. Load Balancing Strategies
  3. Health Score Algorithm
  4. Speed Testing Integration
  5. Failover Logic
  6. Configuration and Tuning
  7. Monitoring and Metrics
  8. Advanced Features
  9. API Reference
  10. Troubleshooting

Load Balancing Overview

Purpose and Benefits

The VPN Exit Controller implements intelligent load balancing to distribute traffic across multiple VPN exit nodes within each country. This provides several key benefits:

  • High Availability: Automatic failover when nodes become unhealthy
  • Performance Optimization: Route traffic to the fastest available nodes
  • Scalability: Automatic scaling based on connection load
  • Resource Efficiency: Optimal utilization of compute resources
  • Geographic Distribution: Balanced load across different VPN servers

Integration with Failover Systems

The load balancer works closely with the failover manager to ensure service continuity:

# Example: Load balancer + failover integration
if not healthy_nodes:
    # Trigger failover to different VPN server
    await failover_manager.handle_node_failure(node_id, "no_healthy_nodes")

# Recheck for healthy nodes after failover
healthy_nodes = self._get_healthy_nodes_for_country(country)

Load Balancing Strategies

The system supports five distinct load balancing strategies, each optimized for different scenarios:

1. Round Robin Strategy

Purpose: Simple, fair distribution of connections across all healthy nodes.

Algorithm:

async def _round_robin_select(self, nodes: List[Dict], country: str) -> Dict:
    """Round-robin selection"""
    if country not in self.round_robin_counters:
        self.round_robin_counters[country] = 0

    selected_index = self.round_robin_counters[country] % len(nodes)
    self.round_robin_counters[country] += 1

    return nodes[selected_index]

Best For:

  • Evenly distributed workloads
  • Testing scenarios
  • When all nodes have similar performance characteristics

Characteristics:

  • Maintains per-country counters
  • Guarantees fair distribution
  • No performance consideration
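As a standalone illustration (the node names are made up), the per-country counter logic cycles through the node list in order and wraps around:

```python
# Standalone sketch of the per-country round-robin counter logic
counters = {}

def round_robin(nodes, country):
    """Pick the next node for a country, advancing that country's counter."""
    counters.setdefault(country, 0)
    index = counters[country] % len(nodes)
    counters[country] += 1
    return nodes[index]

nodes = ["node-a", "node-b", "node-c"]
picks = [round_robin(nodes, "us") for _ in range(4)]
print(picks)  # ['node-a', 'node-b', 'node-c', 'node-a']
```

Because the counter is keyed by country, traffic to one country never skews the rotation for another.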

2. Least Connections Strategy

Purpose: Route new connections to the node with the fewest active connections.

Algorithm:

async def _least_connections_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select node with least connections"""
    node_connections = []

    for node in nodes:
        connection_count = redis_manager.get_connection_count(node['id'])
        node_connections.append((node, connection_count))

    # Sort by connection count (ascending)
    node_connections.sort(key=lambda x: x[1])
    return node_connections[0][0]

Best For:

  • Long-lived connections
  • Scenarios where connection duration varies significantly
  • Optimizing connection distribution

Monitoring:

# Check connection counts via API
curl -u admin:password http://localhost:8080/api/load-balancer/stats

3. Weighted Latency Strategy

Purpose: Route traffic based on server latency with weighted randomization.

Algorithm:

async def _weighted_latency_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select based on weighted latency scores"""
    node_scores = []

    for node in nodes:
        # Get server latency from Redis
        server_health = redis_manager.get_server_health(node.get('vpn_server', ''))
        latency = server_health.get('latency', 100) if server_health else 100

        # Lower latency = higher weight
        weight = max(1, 200 - latency)  # Weight between 1-199
        node_scores.append((node, weight))

    # Weighted random selection
    total_weight = sum(score[1] for score in node_scores)
    random_point = random.uniform(0, total_weight)

    current_weight = 0
    for node, weight in node_scores:
        current_weight += weight
        if current_weight >= random_point:
            return node

Weight Calculation:

  • Latency 50ms → Weight 150
  • Latency 100ms → Weight 100
  • Latency 150ms → Weight 50
  • Latency 200ms+ → Weight 1
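A self-contained sketch of the weight mapping (the latencies are hypothetical) shows how selection probabilities follow directly from the weights:

```python
def latency_weight(latency_ms: float) -> int:
    """Lower latency -> higher weight, floored at 1 (same formula as above)."""
    return max(1, int(200 - latency_ms))

# Hypothetical measured latencies per node
latencies = {"node-a": 50, "node-b": 100, "node-c": 150}
weights = {n: latency_weight(ms) for n, ms in latencies.items()}
total = sum(weights.values())  # 300

# Expected share of selections per node under weighted random choice
shares = {n: w / total for n, w in weights.items()}
print(shares)  # node-a: 0.5, node-b: ~0.333, node-c: ~0.167
```

So a node at 50ms receives roughly three times as much traffic as a node at 150ms, rather than all of it: the randomization keeps slower nodes warm without overloading the fastest one.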

Best For:

  • Latency-sensitive applications
  • Real-time communications
  • Gaming or streaming workloads

4. Random Strategy

Purpose: Randomly distribute connections for simple load distribution.

Algorithm:

async def _random_select(self, nodes: List[Dict], country: str) -> Dict:
    """Random selection"""
    return random.choice(nodes)

Best For:

  • Simple load distribution
  • Development and testing
  • When other strategies are not applicable

5. Health Score Strategy (Default)

Purpose: Select nodes based on comprehensive health scores considering multiple factors.

Algorithm:

async def _health_score_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select based on comprehensive health score"""
    node_scores = []

    for node in nodes:
        score = await self._calculate_node_health_score(node)
        node_scores.append((node, score))

    # Sort by score (descending - higher is better)
    node_scores.sort(key=lambda x: x[1], reverse=True)
    return node_scores[0][0]

Health Score Algorithm

The health score algorithm provides a comprehensive assessment of node performance by weighing multiple factors:

Score Calculation

async def _calculate_node_health_score(self, node: Dict) -> float:
    """Calculate comprehensive health score for a node"""
    score = 100.0  # Start with perfect score

    # Factor 1: Server latency (40% weight)
    server_health = redis_manager.get_server_health(node.get('vpn_server', ''))
    if server_health:
        latency = server_health.get('latency', 100)
        # Score: 50ms=100, 100ms=75, 200ms=50 (clamped to a minimum of 50)
        latency_score = max(50, 100 - (latency - 50) * 0.5)
        score = score * 0.6 + latency_score * 0.4

    # Factor 2: Connection count (30% weight) 
    connection_count = redis_manager.get_connection_count(node['id'])
    # Penalize high connection counts
    connection_penalty = min(20, connection_count * 2)
    connection_score = max(60, 100 - connection_penalty)
    score = score * 0.7 + connection_score * 0.3

    # Factor 3: CPU usage (20% weight)
    stats = node.get('stats', {})
    cpu_percent = stats.get('cpu_percent', 0)
    cpu_score = max(60, 100 - cpu_percent)
    score = score * 0.8 + cpu_score * 0.2

    # Factor 4: Memory usage (10% weight)
    memory_mb = stats.get('memory_mb', 0)
    # Penalize if using > 300MB
    memory_penalty = max(0, (memory_mb - 300) / 10)
    memory_score = max(70, 100 - memory_penalty)
    score = score * 0.9 + memory_score * 0.1

    return score

Scoring Factors

Factor Weight Description Range
Server Latency 40% Network latency to VPN server 50-100
Connection Count 30% Number of active connections 60-100
CPU Usage 20% Container CPU utilization 60-100
Memory Usage 10% Container memory consumption 70-100
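The blended calculation above can be replayed on hypothetical node stats with a standalone replica of the formula, to see how a concrete score comes out:

```python
def health_score(latency_ms: float, connections: int,
                 cpu_percent: float, memory_mb: float) -> float:
    """Standalone replica of _calculate_node_health_score, for illustration."""
    score = 100.0
    # Factor 1: latency (blended at 40%)
    latency_score = max(50, 100 - (latency_ms - 50) * 0.5)
    score = score * 0.6 + latency_score * 0.4
    # Factor 2: connection count (blended at 30%)
    connection_score = max(60, 100 - min(20, connections * 2))
    score = score * 0.7 + connection_score * 0.3
    # Factor 3: CPU usage (blended at 20%)
    cpu_score = max(60, 100 - cpu_percent)
    score = score * 0.8 + cpu_score * 0.2
    # Factor 4: memory usage (blended at 10%)
    memory_score = max(70, 100 - max(0, (memory_mb - 300) / 10))
    score = score * 0.9 + memory_score * 0.1
    return score

# 100ms latency, 10 connections, 40% CPU, 350MB memory
print(round(health_score(100, 10, 40, 350), 2))  # 82.94
```

An otherwise idle node at the 50ms latency floor scores a full 100; the example node above lands in the "good" band.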

Score Interpretation

  • 90-100: Excellent performance, optimal for routing
  • 80-89: Good performance, suitable for most traffic
  • 70-79: Acceptable performance, may experience delays
  • 60-69: Poor performance, consider failover
  • <60: Critical issues, automatic failover triggered

Health Score Thresholds

# Configuration examples
EXCELLENT_THRESHOLD = 90.0
GOOD_THRESHOLD = 80.0
ACCEPTABLE_THRESHOLD = 70.0
POOR_THRESHOLD = 60.0
CRITICAL_THRESHOLD = 50.0
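A small helper (the function name is an assumption, not part of the system's API) maps a score onto the interpretation bands above:

```python
def classify_health_score(score: float) -> str:
    """Map a 0-100 health score to the interpretation bands (illustrative)."""
    if score >= 90.0:
        return "excellent"
    if score >= 80.0:
        return "good"
    if score >= 70.0:
        return "acceptable"
    if score >= 60.0:
        return "poor"
    return "critical"

print(classify_health_score(87.3))  # good
```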

Speed Testing Integration

Speed testing provides crucial data for load balancing decisions through comprehensive performance evaluation.

Speed Test Components

Download Speed Testing

async def _test_download_speed(self, node_id: str, test_url: str) -> Dict:
    """Test download speed by downloading a file inside the container"""

    # Create curl command to test download speed
    curl_cmd = [
        "curl", "-s", "-w", 
        "%{time_total},%{speed_download},%{size_download}",
        "-o", "/dev/null",
        "--max-time", "60",  # 60 second timeout
        test_url
    ]

    container = self.docker_manager.client.containers.get(node_id)
    result = container.exec_run(curl_cmd, demux=False)
    output = result.output.decode().strip()

    # Parse results and convert bytes/sec to Mbps
    time_total, speed_download, size_download = output.split(',')
    mbps = (float(speed_download) * 8) / (1024 * 1024)

    return {
        'mbps': mbps,
        'time_seconds': float(time_total),
        'size_bytes': float(size_download)
    }

Latency Testing

async def _test_latency(self, node_id: str) -> Dict:
    """Test latency to multiple endpoints"""

    ping_endpoints = [
        "https://www.google.com",
        "https://www.cloudflare.com",
        "https://www.github.com",
        "https://httpbin.org/ip"
    ]

    container = self.docker_manager.client.containers.get(node_id)

    # Test each endpoint and collect successful measurements
    latency_tests = []
    for endpoint in ping_endpoints:
        # Use curl to measure TCP connection time
        curl_cmd = ["curl", "-s", "-o", "/dev/null", "-w", "%{time_connect}",
                    "--max-time", "10", endpoint]
        result = container.exec_run(curl_cmd, demux=False)
        if result.exit_code == 0:
            connect_time = result.output.decode().strip()
            latency_ms = float(connect_time) * 1000
            latency_tests.append({'endpoint': endpoint, 'latency_ms': latency_ms})

    successful = [t['latency_ms'] for t in latency_tests]
    avg_latency = sum(successful) / len(successful) if successful else None
    return {'avg_latency': avg_latency, 'tests': latency_tests}

Speed Test Scheduling

# Automatic speed testing
async def schedule_speed_tests():
    """Run speed tests on all nodes every hour"""
    while True:
        try:
            results = await speed_tester.test_all_nodes("1MB")
            logger.info(f"Speed tests completed: {len(results)} nodes tested")
        except Exception as e:
            logger.error(f"Speed test cycle failed: {e}")

        await asyncio.sleep(3600)  # 1 hour interval

Historical Data Usage

Speed test results are stored in Redis with time-series data:

def _store_speed_test_result(self, node_id: str, result: Dict):
    """Store speed test result in Redis"""

    # Store latest result (1 hour TTL)
    key = f"speedtest:{node_id}:latest"
    redis_manager.client.setex(key, 3600, json.dumps(result))

    # Store in history (keep last 24 hours)
    history_key = f"speedtest:{node_id}:history"
    timestamp = datetime.utcnow().timestamp()
    redis_manager.client.zadd(history_key, {json.dumps(result): timestamp})

    # Remove old entries (older than 24 hours)
    cutoff = (datetime.utcnow() - timedelta(hours=24)).timestamp()
    redis_manager.client.zremrangebyscore(history_key, 0, cutoff)
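Reading the history back is the mirror image of the write path. The document does not show `get_speed_test_history` (used by the trend analysis below), so the following is an assumed sketch; the Redis client is passed in explicitly here purely so the function is self-contained:

```python
import json
from datetime import datetime, timedelta

def get_speed_test_history(client, node_id: str, hours: int = 24) -> list:
    """Return results newer than the cutoff from the per-node sorted set."""
    history_key = f"speedtest:{node_id}:history"
    cutoff = (datetime.utcnow() - timedelta(hours=hours)).timestamp()
    # Members are JSON results, scored by their timestamp
    entries = client.zrangebyscore(history_key, cutoff, "+inf")
    return [json.loads(e) for e in entries]
```

Because writes score each member by timestamp and prune anything older than 24 hours, a score-range read like this returns results already in chronological order.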

Performance Trend Analysis

def analyze_performance_trends(node_id: str) -> Dict:
    """Analyze performance trends over time"""

    history = speed_tester.get_speed_test_history(node_id, hours=24)
    if len(history) < 2:
        return {"trend": "insufficient_data"}

    speeds = [h['download_mbps'] for h in history if 'download_mbps' in h]
    latencies = [h['latency_ms'] for h in history if 'latency_ms' in h]
    if len(speeds) < 2 or len(latencies) < 2:
        return {"trend": "insufficient_data"}

    # Compare the newest measurement to the oldest
    speed_trend = "improving" if speeds[-1] > speeds[0] else "degrading"
    latency_trend = "improving" if latencies[-1] < latencies[0] else "degrading"

    return {
        "speed_trend": speed_trend,
        "latency_trend": latency_trend,
        "avg_speed_24h": sum(speeds) / len(speeds),
        "avg_latency_24h": sum(latencies) / len(latencies)
    }

Failover Logic

The failover system ensures service continuity when nodes become unhealthy or disconnected.

Automatic Failover Triggers

  1. VPN Connection Failure: Node loses connection to VPN server
  2. High Resource Usage: CPU > 90% or Memory > 1GB for 5 minutes
  3. Network Connectivity Issues: Cannot reach test endpoints
  4. Container Health Check Failure: Docker health checks fail
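Trigger 2 requires the condition to hold continuously for five minutes, not just on a single sample. A minimal sketch of such a sustained-threshold check (an illustration, not the actual implementation):

```python
class SustainedThreshold:
    """Fire only once a breach has persisted for `duration` seconds."""

    def __init__(self, duration: float = 300.0):  # 5 minutes
        self.duration = duration
        self.breach_start = None

    def update(self, breached: bool, now: float) -> bool:
        """Feed one sample; True once the breach has lasted long enough."""
        if not breached:
            self.breach_start = None  # any recovery resets the timer
            return False
        if self.breach_start is None:
            self.breach_start = now
        return now - self.breach_start >= self.duration
```

A caller would feed it each metrics sample, e.g. `checker.update(stats['cpu_percent'] > 90, time.time())`, so a single CPU spike never triggers failover on its own.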

Failover Process

async def handle_node_failure(self, node_id: str, failure_reason: str) -> bool:
    """Handle a failed node by attempting failover to a different server"""

    # Check if failover already in progress
    if node_id in self.failover_in_progress:
        return False

    self.failover_in_progress.add(node_id)

    try:
        # Get node details and alternative server
        node = self.docker_manager.get_node_details(node_id)
        country = node['country']
        current_server = node['server']

        # Check failover limits
        if not self._can_failover(node_id):
            return False

        # Get alternative server
        new_server = await self._get_alternative_server(country, current_server)
        if not new_server:
            return False

        # Perform failover
        success = await self._perform_failover(node_id, country, new_server)

        # Record attempt
        self._record_failover_attempt(node_id, country, current_server, new_server, success)

        return success

    finally:
        self.failover_in_progress.discard(node_id)

Failover Constraints

class FailoverManager:
    def __init__(self):
        self.max_failover_attempts = 3      # Max attempts per hour
        self.failover_cooldown = 300        # 5 minutes between attempts
        self.failover_history = {}          # Track attempts per node
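`handle_node_failure` above calls `self._can_failover`, which is not shown in this document. A plausible sketch that enforces both constraints (an assumption about the implementation; the explicit `now` parameter is added here only for testability):

```python
import time

class FailoverManager:
    def __init__(self):
        self.max_failover_attempts = 3      # Max attempts per hour
        self.failover_cooldown = 300        # 5 minutes between attempts
        self.failover_history = {}          # node_id -> attempt timestamps

    def _can_failover(self, node_id: str, now: float = None) -> bool:
        """Allow failover only under the hourly limit and outside the cooldown."""
        now = time.time() if now is None else now
        # Keep only attempts from the last hour
        attempts = [t for t in self.failover_history.get(node_id, [])
                    if now - t < 3600]
        self.failover_history[node_id] = attempts
        if len(attempts) >= self.max_failover_attempts:
            return False  # hourly attempt limit reached
        if attempts and now - attempts[-1] < self.failover_cooldown:
            return False  # still within cooldown of the last attempt
        return True
```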

Server Selection for Failover

async def _get_alternative_server(self, country: str, exclude_server: str) -> Optional[str]:
    """Get an alternative server for failover"""

    # Get all servers for the country
    servers = vpn_server_manager.get_servers_for_country(country) 

    # Filter out current and blacklisted servers
    available_servers = [
        s for s in servers 
        if s['hostname'] != exclude_server 
        and not redis_manager.is_server_blacklisted(s['hostname'])
    ]

    # Sort by health score
    available_servers.sort(key=lambda s: s.get('health_score', 50), reverse=True)

    # Test top 3 servers
    for server in available_servers[:3]:
        success, latency = await vpn_server_manager.health_check_server(server['hostname'])
        if success:
            return server['hostname']

    # Fall back to the best-ranked server even though its check failed
    return available_servers[0]['hostname'] if available_servers else None

Recovery Procedures

  1. Immediate Recovery: Stop failed container, start new one with different server
  2. Graceful Recovery: Wait for existing connections to drain before switching
  3. Rollback Recovery: Return to previous working server if new server fails

Configuration and Tuning

Load Balancing Parameters

# /opt/vpn-exit-controller/.env
LOAD_BALANCER_STRATEGY=health_score
LOAD_BALANCER_ENABLED=true
MAX_NODES_PER_COUNTRY=3
AUTO_SCALE_ENABLED=true
SCALE_UP_THRESHOLD=50  # connections per node
SCALE_DOWN_THRESHOLD=10  # connections per node

Health Check Intervals

# Configuration in services/metrics_collector.py
class MetricsCollector:
    def __init__(self, interval_seconds: int = 30):  # Collect every 30 seconds
        self.interval = interval_seconds

Performance Thresholds

# CPU and memory thresholds for scaling decisions
CPU_THRESHOLD_HIGH = 80.0    # Scale up trigger
CPU_THRESHOLD_LOW = 20.0     # Scale down trigger
MEMORY_THRESHOLD_HIGH = 500  # MB, scale up trigger  
MEMORY_THRESHOLD_LOW = 300   # MB, scale down trigger

Auto-scaling Configuration

async def start_additional_node_if_needed(self, country: str) -> bool:
    """Start additional node if load is high"""
    nodes = self._get_healthy_nodes_for_country(country)

    if not nodes:
        return False

    # Check if we need more capacity
    total_connections = sum(redis_manager.get_connection_count(n['id']) for n in nodes)
    avg_connections_per_node = total_connections / len(nodes)

    # Start new node if average > 50 connections per node and < 3 nodes
    if avg_connections_per_node > 50 and len(nodes) < 3:
        logger.info(f"High load detected for {country}, starting additional node")
        # Start new node...
        return True

    return False

Tuning Recommendations

Scenario Strategy Max Nodes Thresholds
High Traffic health_score 5 Scale up: 30 conn/node
Low Latency weighted_latency 3 Scale up: 20 conn/node
Cost Optimized least_connections 2 Scale up: 80 conn/node
Testing round_robin 3 Scale up: 50 conn/node

Monitoring and Metrics

Key Metrics Collection

The system continuously collects metrics for load balancing decisions:

class MetricsCollector:
    """Background service that continuously collects metrics from all nodes"""

    async def _collect_node_metrics(self, node_id: str):
        """Collect metrics for a single node"""

        # Get detailed node info (includes Docker stats)
        node_details = self.docker_manager.get_node_details(node_id)

        # Check for anomalies
        if node_details.get('stats'):
            stats = node_details['stats']

            # Alert on high resource usage
            if stats.get('cpu_percent', 0) > 80:
                logger.warning(f"High CPU usage on node {node_id}: {stats['cpu_percent']:.1f}%")

            if stats.get('memory_mb', 0) > 500:
                logger.warning(f"High memory usage on node {node_id}: {stats['memory_mb']:.1f}MB")

Load Balancing Statistics

async def get_load_balancing_stats(self) -> Dict:
    """Get comprehensive load balancing statistics"""

    stats = {
        'strategies': [s.value for s in LoadBalancingStrategy],
        'round_robin_counters': self.round_robin_counters,
        'countries': {}
    }

    # Get stats per country
    all_nodes = self.docker_manager.list_nodes()
    countries = set(n['country'] for n in all_nodes)

    for country in countries:
        nodes = self._get_healthy_nodes_for_country(country)
        total_connections = sum(redis_manager.get_connection_count(n['id']) for n in nodes)

        stats['countries'][country] = {
            'node_count': len(nodes),
            'total_connections': total_connections,
            'avg_connections_per_node': total_connections / len(nodes) if nodes else 0,
            'nodes': [
                {
                    'id': n['id'],
                    'server': n.get('vpn_server', 'unknown'),
                    'connections': redis_manager.get_connection_count(n['id']),
                    'tailscale_ip': n.get('tailscale_ip'),
                    'cpu_percent': n.get('stats', {}).get('cpu_percent', 0),
                    'health_score': await self._calculate_node_health_score(n)
                }
                for n in nodes
            ]
        }

    return stats

Performance Monitoring

# Monitor load balancing in real-time
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq

# Get speed test summary
curl -u admin:password http://localhost:8080/api/speed-test/summary | jq

# Monitor metrics
curl -u admin:password http://localhost:8080/api/metrics/current | jq

Alert Conditions

Condition Threshold Action
High CPU Usage >80% for 5 min Scale up or failover
High Memory >500MB Scale up or failover
High Connection Count >100 per node Scale up
Low Speed <10 Mbps Investigate/failover
High Latency >200ms Switch strategy or failover
Node Down Health check fails Immediate failover
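The alert table maps directly to a check like the following (thresholds from the table; the stat keys are assumptions, and the five-minute persistence requirement for CPU is omitted for brevity):

```python
def check_alerts(stats: dict) -> list:
    """Evaluate the alert conditions above against one node's stats."""
    alerts = []
    if stats.get('cpu_percent', 0) > 80:
        alerts.append('high_cpu')
    if stats.get('memory_mb', 0) > 500:
        alerts.append('high_memory')
    if stats.get('connections', 0) > 100:
        alerts.append('high_connection_count')
    if stats.get('download_mbps', float('inf')) < 10:
        alerts.append('low_speed')
    if stats.get('latency_ms', 0) > 200:
        alerts.append('high_latency')
    return alerts

print(check_alerts({'cpu_percent': 85.0, 'latency_ms': 250}))
# ['high_cpu', 'high_latency']
```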

Reporting and Analysis

# Generate load balancing report
async def generate_load_balancing_report(hours: int = 24) -> Dict:
    """Generate comprehensive load balancing report"""

    report = {
        'period_hours': hours,
        'generated_at': datetime.utcnow().isoformat(),
        'summary': {},
        'by_country': {},
        'performance_trends': {},
        'recommendations': []
    }

    # Analyze each country
    countries = get_all_countries()

    for country in countries:
        nodes = get_nodes_for_country(country)

        # Calculate statistics
        total_connections = sum(get_connection_count(n['id']) for n in nodes)
        avg_speed = calculate_avg_speed(nodes, hours)
        avg_latency = calculate_avg_latency(nodes, hours)

        report['by_country'][country] = {
            'node_count': len(nodes),
            'total_connections': total_connections,
            'avg_speed_mbps': avg_speed,
            'avg_latency_ms': avg_latency,
            'failover_events': count_failover_events(country, hours)
        }

        # Generate recommendations
        if avg_speed < 20:
            report['recommendations'].append(f"Consider adding more nodes to {country} - low speed detected")

        if nodes and total_connections / len(nodes) > 50:
            report['recommendations'].append(f"Scale up {country} - high load detected")

    return report

Advanced Features

Connection Affinity/Sticky Sessions

class ConnectionAffinity:
    """Manage connection affinity for consistent routing"""

    def __init__(self):
        self.client_node_map = {}  # client_ip -> node_id
        self.affinity_timeout = 3600  # 1 hour

    async def get_affinity_node(self, client_ip: str, country: str) -> Optional[str]:
        """Get node with existing affinity for client"""

        affinity_key = f"affinity:{client_ip}:{country}"
        node_id = redis_manager.client.get(affinity_key)

        if node_id:
            # Check if node is still healthy
            healthy, _ = docker_manager.check_container_health(node_id)
            if healthy:
                # Refresh affinity timeout
                redis_manager.client.expire(affinity_key, self.affinity_timeout)
                return node_id
            else:
                # Remove stale affinity
                redis_manager.client.delete(affinity_key)

        return None

    async def set_affinity(self, client_ip: str, country: str, node_id: str):
        """Set client affinity to specific node"""
        affinity_key = f"affinity:{client_ip}:{country}"
        redis_manager.client.setex(affinity_key, self.affinity_timeout, node_id)

Geographic Routing Preferences

class GeographicRouter:
    """Route based on geographic preferences"""

    REGION_PREFERENCES = {
        'americas': ['us', 'ca', 'br'],
        'europe': ['de', 'uk', 'fr', 'nl'],
        'asia': ['jp', 'sg', 'hk', 'au'],
        'africa': ['za'],
        'oceania': ['au', 'nz']
    }

    async def get_preferred_country(self, client_region: str, requested_country: str) -> str:
        """Get preferred country based on client region"""

        # Return requested country if available and healthy
        if self.is_country_healthy(requested_country):
            return requested_country

        # Find alternative in same region
        preferred_countries = self.REGION_PREFERENCES.get(client_region, [])

        for country in preferred_countries:
            if self.is_country_healthy(country):
                logger.info(f"Routing {client_region} client to {country} instead of {requested_country}")
                return country

        # Fallback to any healthy country
        return self.get_any_healthy_country()

Custom Load Balancing Rules

import fnmatch
from datetime import datetime
from uuid import uuid4

class CustomLoadBalancingRules:
    """Implement custom load balancing rules"""

    def __init__(self):
        self.rules = []

    def add_rule(self, rule: Dict):
        """Add custom routing rule"""
        self.rules.append({
            'id': str(uuid4()),
            'name': rule['name'],
            'condition': rule['condition'],
            'action': rule['action'],
            'priority': rule.get('priority', 100),
            'enabled': rule.get('enabled', True)
        })

    async def evaluate_rules(self, context: Dict) -> Optional[str]:
        """Evaluate rules and return target node"""

        # Sort by priority
        active_rules = sorted(
            [r for r in self.rules if r['enabled']], 
            key=lambda x: x['priority']
        )

        for rule in active_rules:
            if self._matches_condition(rule['condition'], context):
                return await self._execute_action(rule['action'], context)

        return None

    def _matches_condition(self, condition: Dict, context: Dict) -> bool:
        """Check if context matches rule condition"""

        # Example conditions:
        # {"source_device": "iPhone", "domain": "*.streaming.com"}
        # {"time_range": "09:00-17:00", "country": "us"}
        # {"client_ip_range": "192.168.1.0/24"}

        for key, value in condition.items():
            if key == 'source_device':
                if context.get('user_agent', '').find(value) == -1:
                    return False

            elif key == 'domain':
                if not fnmatch.fnmatch(context.get('domain', ''), value):
                    return False

            elif key == 'time_range':
                current_time = datetime.now().strftime('%H:%M')
                start, end = value.split('-')
                if not (start <= current_time <= end):
                    return False

        return True

API-based Load Balancing Control

# Extended API endpoints for advanced control

@router.post("/rules")
async def create_load_balancing_rule(rule: CustomRule, user=Depends(verify_auth)):
    """Create custom load balancing rule"""
    custom_rules.add_rule(rule.dict())
    return {"status": "rule_created", "rule": rule}

@router.put("/strategy/{country}")
async def set_country_strategy(
    country: str, 
    strategy: LoadBalancingStrategy,
    user=Depends(verify_auth)
):
    """Set load balancing strategy for specific country"""
    load_balancer.set_country_strategy(country, strategy)
    return {"country": country, "strategy": strategy.value}

@router.post("/rebalance/{country}")
async def force_rebalance(country: str, user=Depends(verify_auth)):
    """Force rebalancing of connections in a country"""
    result = await load_balancer.rebalance_country(country)
    return {"country": country, "rebalanced_connections": result}

@router.get("/prediction/{country}")
async def get_load_prediction(country: str, hours: int = 1, user=Depends(verify_auth)):
    """Get load prediction for next N hours"""
    prediction = await load_balancer.predict_load(country, hours)
    return prediction

API Reference

Load Balancer Endpoints

Get Load Balancing Statistics

GET /api/load-balancer/stats
Authorization: Basic <credentials>

Response:

{
  "strategies": ["round_robin", "least_connections", "weighted_latency", "random", "health_score"],
  "round_robin_counters": {"us": 5, "uk": 2},
  "countries": {
    "us": {
      "node_count": 2,
      "total_connections": 45,
      "avg_connections_per_node": 22.5,
      "nodes": [
        {
          "id": "container_123",
          "server": "us5063.nordvpn.com",
          "connections": 25,
          "tailscale_ip": "100.73.33.15",
          "cpu_percent": 45.2,
          "health_score": 87.3
        }
      ]
    }
  }
}

Get Best Node for Country

GET /api/load-balancer/best-node/{country}?strategy=health_score
Authorization: Basic <credentials>

Response:

{
  "selected_node": {
    "id": "container_123",
    "country": "us",
    "server": "us5063.nordvpn.com",
    "tailscale_ip": "100.73.33.15",
    "health_score": 87.3
  },
  "strategy": "health_score",
  "country": "us"
}

Scale Up Country

POST /api/load-balancer/scale-up/{country}
Authorization: Basic <credentials>

Scale Down Country

POST /api/load-balancer/scale-down/{country}
Authorization: Basic <credentials>

Get Available Strategies

GET /api/load-balancer/strategies
Authorization: Basic <credentials>

Response:

{
  "strategies": [
    {
      "name": "round_robin",
      "description": "Distributes requests evenly across all healthy nodes"
    },
    {
      "name": "least_connections", 
      "description": "Routes to the node with fewest active connections"
    },
    {
      "name": "weighted_latency",
      "description": "Routes based on server latency with weighted randomization"
    },
    {
      "name": "random",
      "description": "Randomly selects from available healthy nodes"
    },
    {
      "name": "health_score",
      "description": "Routes to node with best overall health score (CPU, memory, latency, connections)"
    }
  ]
}

Troubleshooting

Common Issues

1. No Healthy Nodes Available

Symptoms:

  • API returns 404 "No healthy nodes available"
  • Load balancer cannot route traffic

Diagnosis:

# Check node health
curl -u admin:password http://localhost:8080/api/nodes/list | jq '.[] | select(.status == "running")'

# Check container health
docker ps --filter "label=vpn-exit-node"

# Check VPN connections
docker exec <container_id> curl -s ipinfo.io

Solutions:

  1. Restart unhealthy containers: docker restart <container_id>
  2. Check VPN credentials in /opt/vpn-exit-controller/configs/auth.txt
  3. Verify network connectivity: docker exec <container_id> ping 8.8.8.8
  4. Force failover: curl -X POST http://localhost:8080/api/failover/force/<node_id>

2. Load Imbalance

Symptoms:

  • One node has significantly more connections than others
  • Performance degradation on overloaded nodes

Diagnosis:

# Check connection distribution
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq '.countries'

# Check strategy
curl -u admin:password http://localhost:8080/api/config | jq '.load_balancer'

Solutions:

  1. Switch to least_connections strategy
  2. Force rebalancing: curl -X POST http://localhost:8080/api/load-balancer/rebalance/<country>
  3. Increase connection drain timeout
  4. Add more nodes: curl -X POST http://localhost:8080/api/load-balancer/scale-up/<country>

3. Frequent Failovers

Symptoms:

  • High number of failover events in logs
  • Unstable node assignments

Diagnosis:

# Check failover history
curl -u admin:password http://localhost:8080/api/failover/status | jq

# Check server health
curl -u admin:password http://localhost:8080/api/speed-test/summary | jq

Solutions:

  1. Increase failover cooldown period
  2. Check VPN server stability
  3. Review health check thresholds
  4. Blacklist problematic servers

4. Poor Performance

Symptoms:

  • Slow connection speeds
  • High latency

Diagnosis:

# Run speed tests
curl -X POST -u admin:password http://localhost:8080/api/speed-test/run-all

# Check health scores
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq '.countries[].nodes[].health_score'

Solutions:

  1. Switch to weighted_latency strategy
  2. Add more nodes in region
  3. Use different VPN servers
  4. Check network congestion

Debug Commands

# Enable debug logging
export LOG_LEVEL=DEBUG

# Check Redis data
redis-cli
> KEYS speedtest:*
> KEYS affinity:*
> KEYS server_health:*

# Monitor load balancer decisions
journalctl -u vpn-controller -f | grep "load_balancer"

# Test specific node
curl -X POST -u admin:password http://localhost:8080/api/speed-test/node/<node_id>

# Force strategy change
curl -X PUT -u admin:password http://localhost:8080/api/load-balancer/strategy/<country> \
  -H "Content-Type: application/json" \
  -d '{"strategy": "health_score"}'

Performance Optimization Tips

  1. Strategy Selection:
     • Use health_score for general purpose
     • Use weighted_latency for latency-sensitive apps
     • Use least_connections for long-lived connections

  2. Resource Tuning:
     • Monitor CPU/memory usage patterns
     • Adjust scaling thresholds based on traffic
     • Set appropriate connection limits

  3. Network Optimization:
     • Choose VPN servers close to users
     • Monitor and blacklist slow servers
     • Use multiple servers per country

  4. Monitoring:
     • Set up alerts for health score < 70
     • Monitor failover frequency
     • Track connection distribution
This comprehensive load balancing system ensures optimal performance, reliability, and scalability for the VPN Exit Controller infrastructure.