Load Balancing System Documentation

Table of Contents

  1. Load Balancing Overview
  2. Load Balancing Strategies
  3. Health Score Algorithm
  4. Speed Testing Integration
  5. Failover Logic
  6. Configuration and Tuning
  7. Monitoring and Metrics
  8. Advanced Features
  9. API Reference
  10. Troubleshooting

Load Balancing Overview

Purpose and Benefits

The VPN Exit Controller implements intelligent load balancing to distribute traffic across multiple VPN exit nodes within each country. This provides several key benefits:

  • High Availability: Automatic failover when nodes become unhealthy
  • Performance Optimization: Route traffic to the fastest available nodes
  • Scalability: Automatic scaling based on connection load
  • Resource Efficiency: Optimal utilization of compute resources
  • Geographic Distribution: Balanced load across different VPN servers

Integration with Failover Systems

The load balancer works closely with the failover manager to ensure service continuity:

# Example: Load balancer + failover integration
if not healthy_nodes:
    # Trigger failover to different VPN server
    await failover_manager.handle_node_failure(node_id, "no_healthy_nodes")

# Recheck for healthy nodes after failover
healthy_nodes = self._get_healthy_nodes_for_country(country)

Load Balancing Strategies

The system supports five distinct load balancing strategies, each optimized for different scenarios:

1. Round Robin Strategy

Purpose: Simple, fair distribution of connections across all healthy nodes.

Algorithm:

async def _round_robin_select(self, nodes: List[Dict], country: str) -> Dict:
    """Round-robin selection"""
    if country not in self.round_robin_counters:
        self.round_robin_counters[country] = 0

    selected_index = self.round_robin_counters[country] % len(nodes)
    self.round_robin_counters[country] += 1

    return nodes[selected_index]

Best For:

  • Evenly distributed workloads
  • Testing scenarios
  • When all nodes have similar performance characteristics

Characteristics:

  • Maintains per-country counters
  • Guarantees fair distribution
  • No performance consideration
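As a standalone illustration (the node names are made up), the per-country counter logic cycles through the node list in order and wraps around:

```python
# Standalone sketch of the per-country round-robin counter logic
counters = {}

def round_robin(nodes, country):
    """Pick the next node for a country, advancing that country's counter."""
    counters.setdefault(country, 0)
    index = counters[country] % len(nodes)
    counters[country] += 1
    return nodes[index]

nodes = ["node-a", "node-b", "node-c"]
picks = [round_robin(nodes, "us") for _ in range(4)]
print(picks)  # ['node-a', 'node-b', 'node-c', 'node-a']
```

Because the counter is keyed by country, traffic to one country never skews the rotation for another.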

2. Least Connections Strategy

Purpose: Route new connections to the node with the fewest active connections.

Algorithm:

async def _least_connections_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select node with least connections"""
    node_connections = []

    for node in nodes:
        connection_count = redis_manager.get_connection_count(node['id'])
        node_connections.append((node, connection_count))

    # Sort by connection count (ascending)
    node_connections.sort(key=lambda x: x[1])
    return node_connections[0][0]

Best For:

  • Long-lived connections
  • Scenarios where connection duration varies significantly
  • Optimizing connection distribution

Monitoring:

# Check connection counts via API
curl -u admin:password http://localhost:8080/api/load-balancer/stats

3. Weighted Latency Strategy

Purpose: Route traffic based on server latency with weighted randomization.

Algorithm:

async def _weighted_latency_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select based on weighted latency scores"""
    node_scores = []

    for node in nodes:
        # Get server latency from Redis
        server_health = redis_manager.get_server_health(node.get('vpn_server', ''))
        latency = server_health.get('latency', 100) if server_health else 100

        # Lower latency = higher weight
        weight = max(1, 200 - latency)  # Weight between 1-199
        node_scores.append((node, weight))

    # Weighted random selection
    total_weight = sum(score[1] for score in node_scores)
    random_point = random.uniform(0, total_weight)

    current_weight = 0
    for node, weight in node_scores:
        current_weight += weight
        if current_weight >= random_point:
            return node

Weight Calculation:

  • Latency 50ms → Weight 150
  • Latency 100ms → Weight 100
  • Latency 150ms → Weight 50
  • Latency 200ms+ → Weight 1
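A self-contained sketch of the weight mapping (the latencies are hypothetical) shows how selection probabilities follow directly from the weights:

```python
def latency_weight(latency_ms: float) -> int:
    """Lower latency -> higher weight, floored at 1 (same formula as above)."""
    return max(1, int(200 - latency_ms))

# Hypothetical measured latencies per node
latencies = {"node-a": 50, "node-b": 100, "node-c": 150}
weights = {n: latency_weight(ms) for n, ms in latencies.items()}
total = sum(weights.values())  # 300

# Expected share of selections per node under weighted random choice
shares = {n: w / total for n, w in weights.items()}
print(shares)  # node-a: 0.5, node-b: ~0.333, node-c: ~0.167
```

So a node at 50ms receives roughly three times as much traffic as a node at 150ms, rather than all of it: the randomization keeps slower nodes warm without overloading the fastest one.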

Best For:

  • Latency-sensitive applications
  • Real-time communications
  • Gaming or streaming workloads

4. Random Strategy

Purpose: Randomly distribute connections for simple load distribution.

Algorithm:

async def _random_select(self, nodes: List[Dict], country: str) -> Dict:
    """Random selection"""
    return random.choice(nodes)

Best For:

  • Simple load distribution
  • Development and testing
  • When other strategies are not applicable

5. Health Score Strategy (Default)

Purpose: Select nodes based on comprehensive health scores considering multiple factors.

Algorithm:

async def _health_score_select(self, nodes: List[Dict], country: str) -> Dict:
    """Select based on comprehensive health score"""
    node_scores = []

    for node in nodes:
        score = await self._calculate_node_health_score(node)
        node_scores.append((node, score))

    # Sort by score (descending - higher is better)
    node_scores.sort(key=lambda x: x[1], reverse=True)
    return node_scores[0][0]

Health Score Algorithm

The health score algorithm provides a comprehensive assessment of node performance by weighing multiple factors:

Score Calculation

async def _calculate_node_health_score(self, node: Dict) -> float:
    """Calculate comprehensive health score for a node"""
    score = 100.0  # Start with perfect score

    # Factor 1: Server latency (40% weight)
    server_health = redis_manager.get_server_health(node.get('vpn_server', ''))
    if server_health:
        latency = server_health.get('latency', 100)
        # Score: 50ms=100, 100ms=75, 200ms=50 (clamped to a minimum of 50)
        latency_score = max(50, 100 - (latency - 50) * 0.5)
        score = score * 0.6 + latency_score * 0.4

    # Factor 2: Connection count (30% weight) 
    connection_count = redis_manager.get_connection_count(node['id'])
    # Penalize high connection counts
    connection_penalty = min(20, connection_count * 2)
    connection_score = max(60, 100 - connection_penalty)
    score = score * 0.7 + connection_score * 0.3

    # Factor 3: CPU usage (20% weight)
    stats = node.get('stats', {})
    cpu_percent = stats.get('cpu_percent', 0)
    cpu_score = max(60, 100 - cpu_percent)
    score = score * 0.8 + cpu_score * 0.2

    # Factor 4: Memory usage (10% weight)
    memory_mb = stats.get('memory_mb', 0)
    # Penalize if using > 300MB
    memory_penalty = max(0, (memory_mb - 300) / 10)
    memory_score = max(70, 100 - memory_penalty)
    score = score * 0.9 + memory_score * 0.1

    return score

Scoring Factors

Factor Weight Description Range
Server Latency 40% Network latency to VPN server 50-100
Connection Count 30% Number of active connections 60-100
CPU Usage 20% Container CPU utilization 60-100
Memory Usage 10% Container memory consumption 70-100
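The blended calculation above can be replayed on hypothetical node stats with a standalone replica of the formula, to see how a concrete score comes out:

```python
def health_score(latency_ms: float, connections: int,
                 cpu_percent: float, memory_mb: float) -> float:
    """Standalone replica of _calculate_node_health_score, for illustration."""
    score = 100.0
    # Factor 1: latency (blended at 40%)
    latency_score = max(50, 100 - (latency_ms - 50) * 0.5)
    score = score * 0.6 + latency_score * 0.4
    # Factor 2: connection count (blended at 30%)
    connection_score = max(60, 100 - min(20, connections * 2))
    score = score * 0.7 + connection_score * 0.3
    # Factor 3: CPU usage (blended at 20%)
    cpu_score = max(60, 100 - cpu_percent)
    score = score * 0.8 + cpu_score * 0.2
    # Factor 4: memory usage (blended at 10%)
    memory_score = max(70, 100 - max(0, (memory_mb - 300) / 10))
    score = score * 0.9 + memory_score * 0.1
    return score

# 100ms latency, 10 connections, 40% CPU, 350MB memory
print(round(health_score(100, 10, 40, 350), 2))  # 82.94
```

An otherwise idle node at the 50ms latency floor scores a full 100; the example node above lands in the "good" band.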

Score Interpretation

  • 90-100: Excellent performance, optimal for routing
  • 80-89: Good performance, suitable for most traffic
  • 70-79: Acceptable performance, may experience delays
  • 60-69: Poor performance, consider failover
  • <60: Critical issues, automatic failover triggered

Health Score Thresholds

# Configuration examples
EXCELLENT_THRESHOLD = 90.0
GOOD_THRESHOLD = 80.0
ACCEPTABLE_THRESHOLD = 70.0
POOR_THRESHOLD = 60.0
CRITICAL_THRESHOLD = 50.0
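A small helper (the function name is an assumption, not part of the system's API) maps a score onto the interpretation bands above:

```python
def classify_health_score(score: float) -> str:
    """Map a 0-100 health score to the interpretation bands (illustrative)."""
    if score >= 90.0:
        return "excellent"
    if score >= 80.0:
        return "good"
    if score >= 70.0:
        return "acceptable"
    if score >= 60.0:
        return "poor"
    return "critical"

print(classify_health_score(87.3))  # good
```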

Speed Testing Integration

Speed testing provides crucial data for load balancing decisions through comprehensive performance evaluation.

Speed Test Components

Download Speed Testing

async def _test_download_speed(self, node_id: str, test_url: str) -> Dict:
    """Test download speed by downloading a file inside the container"""

    # Create curl command to test download speed
    curl_cmd = [
        "curl", "-s", "-w", 
        "%{time_total},%{speed_download},%{size_download}",
        "-o", "/dev/null",
        "--max-time", "60",  # 60 second timeout
        test_url
    ]

    container = self.docker_manager.client.containers.get(node_id)
    result = container.exec_run(curl_cmd, demux=False)
    output = result.output.decode().strip()

    # Parse results and convert bytes/sec to Mbps
    time_total, speed_download, size_download = output.split(',')
    mbps = (float(speed_download) * 8) / (1024 * 1024)

    return {
        'mbps': mbps,
        'time_seconds': float(time_total),
        'size_bytes': float(size_download)
    }

Latency Testing

async def _test_latency(self, node_id: str) -> Dict:
    """Test latency to multiple endpoints"""

    ping_endpoints = [
        "https://www.google.com",
        "https://www.cloudflare.com",
        "https://www.github.com",
        "https://httpbin.org/ip"
    ]

    container = self.docker_manager.client.containers.get(node_id)

    # Test each endpoint and collect successful measurements
    latency_tests = []
    for endpoint in ping_endpoints:
        # Use curl to measure TCP connection time
        curl_cmd = ["curl", "-s", "-o", "/dev/null", "-w", "%{time_connect}",
                    "--max-time", "10", endpoint]
        result = container.exec_run(curl_cmd, demux=False)
        if result.exit_code == 0:
            connect_time = result.output.decode().strip()
            latency_ms = float(connect_time) * 1000
            latency_tests.append({'endpoint': endpoint, 'latency_ms': latency_ms})

    successful = [t['latency_ms'] for t in latency_tests]
    avg_latency = sum(successful) / len(successful) if successful else None
    return {'avg_latency': avg_latency, 'tests': latency_tests}

Speed Test Scheduling

# Automatic speed testing
async def schedule_speed_tests():
    """Run speed tests on all nodes every hour"""
    while True:
        try:
            results = await speed_tester.test_all_nodes("1MB")
            logger.info(f"Speed tests completed: {len(results)} nodes tested")
        except Exception as e:
            logger.error(f"Speed test cycle failed: {e}")

        await asyncio.sleep(3600)  # 1 hour interval

Historical Data Usage

Speed test results are stored in Redis with time-series data:

def _store_speed_test_result(self, node_id: str, result: Dict):
    """Store speed test result in Redis"""

    # Store latest result (1 hour TTL)
    key = f"speedtest:{node_id}:latest"
    redis_manager.client.setex(key, 3600, json.dumps(result))

    # Store in history (keep last 24 hours)
    history_key = f"speedtest:{node_id}:history"
    timestamp = datetime.utcnow().timestamp()
    redis_manager.client.zadd(history_key, {json.dumps(result): timestamp})

    # Remove old entries (older than 24 hours)
    cutoff = (datetime.utcnow() - timedelta(hours=24)).timestamp()
    redis_manager.client.zremrangebyscore(history_key, 0, cutoff)
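Reading the history back is the mirror image of the write path. The document does not show `get_speed_test_history` (used by the trend analysis below), so the following is an assumed sketch; the Redis client is passed in explicitly here purely so the function is self-contained:

```python
import json
from datetime import datetime, timedelta

def get_speed_test_history(client, node_id: str, hours: int = 24) -> list:
    """Return results newer than the cutoff from the per-node sorted set."""
    history_key = f"speedtest:{node_id}:history"
    cutoff = (datetime.utcnow() - timedelta(hours=hours)).timestamp()
    # Members are JSON results, scored by their timestamp
    entries = client.zrangebyscore(history_key, cutoff, "+inf")
    return [json.loads(e) for e in entries]
```

Because writes score each member by timestamp and prune anything older than 24 hours, a score-range read like this returns results already in chronological order.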

Performance Trend Analysis

def analyze_performance_trends(node_id: str) -> Dict:
    """Analyze performance trends over time"""

    history = speed_tester.get_speed_test_history(node_id, hours=24)
    if len(history) < 2:
        return {"trend": "insufficient_data"}

    speeds = [h['download_mbps'] for h in history if 'download_mbps' in h]
    latencies = [h['latency_ms'] for h in history if 'latency_ms' in h]
    if len(speeds) < 2 or len(latencies) < 2:
        return {"trend": "insufficient_data"}

    # Compare the newest measurement to the oldest
    speed_trend = "improving" if speeds[-1] > speeds[0] else "degrading"
    latency_trend = "improving" if latencies[-1] < latencies[0] else "degrading"

    return {
        "speed_trend": speed_trend,
        "latency_trend": latency_trend,
        "avg_speed_24h": sum(speeds) / len(speeds),
        "avg_latency_24h": sum(latencies) / len(latencies)
    }

Failover Logic

The failover system ensures service continuity when nodes become unhealthy or disconnected.

Automatic Failover Triggers

  1. VPN Connection Failure: Node loses connection to VPN server
  2. High Resource Usage: CPU > 90% or Memory > 1GB for 5 minutes
  3. Network Connectivity Issues: Cannot reach test endpoints
  4. Container Health Check Failure: Docker health checks fail
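Trigger 2 requires the condition to hold continuously for five minutes, not just on a single sample. A minimal sketch of such a sustained-threshold check (an illustration, not the actual implementation):

```python
class SustainedThreshold:
    """Fire only once a breach has persisted for `duration` seconds."""

    def __init__(self, duration: float = 300.0):  # 5 minutes
        self.duration = duration
        self.breach_start = None

    def update(self, breached: bool, now: float) -> bool:
        """Feed one sample; True once the breach has lasted long enough."""
        if not breached:
            self.breach_start = None  # any recovery resets the timer
            return False
        if self.breach_start is None:
            self.breach_start = now
        return now - self.breach_start >= self.duration
```

A caller would feed it each metrics sample, e.g. `checker.update(stats['cpu_percent'] > 90, time.time())`, so a single CPU spike never triggers failover on its own.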

Failover Process

async def handle_node_failure(self, node_id: str, failure_reason: str) -> bool:
    """Handle a failed node by attempting failover to a different server"""

    # Check if failover already in progress
    if node_id in self.failover_in_progress:
        return False

    self.failover_in_progress.add(node_id)

    try:
        # Get node details and alternative server
        node = self.docker_manager.get_node_details(node_id)
        country = node['country']
        current_server = node['server']

        # Check failover limits
        if not self._can_failover(node_id):
            return False

        # Get alternative server
        new_server = await self._get_alternative_server(country, current_server)
        if not new_server:
            return False

        # Perform failover
        success = await self._perform_failover(node_id, country, new_server)

        # Record attempt
        self._record_failover_attempt(node_id, country, current_server, new_server, success)

        return success

    finally:
        self.failover_in_progress.discard(node_id)

Failover Constraints

class FailoverManager:
    def __init__(self):
        self.max_failover_attempts = 3      # Max attempts per hour
        self.failover_cooldown = 300        # 5 minutes between attempts
        self.failover_history = {}          # Track attempts per node
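`handle_node_failure` above calls `self._can_failover`, which is not shown in this document. A plausible sketch that enforces both constraints (an assumption about the implementation; the explicit `now` parameter is added here only for testability):

```python
import time

class FailoverManager:
    def __init__(self):
        self.max_failover_attempts = 3      # Max attempts per hour
        self.failover_cooldown = 300        # 5 minutes between attempts
        self.failover_history = {}          # node_id -> attempt timestamps

    def _can_failover(self, node_id: str, now: float = None) -> bool:
        """Allow failover only under the hourly limit and outside the cooldown."""
        now = time.time() if now is None else now
        # Keep only attempts from the last hour
        attempts = [t for t in self.failover_history.get(node_id, [])
                    if now - t < 3600]
        self.failover_history[node_id] = attempts
        if len(attempts) >= self.max_failover_attempts:
            return False  # hourly attempt limit reached
        if attempts and now - attempts[-1] < self.failover_cooldown:
            return False  # still within cooldown of the last attempt
        return True
```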

Server Selection for Failover

async def _get_alternative_server(self, country: str, exclude_server: str) -> Optional[str]:
    """Get an alternative server for failover"""

    # Get all servers for the country
    servers = vpn_server_manager.get_servers_for_country(country) 

    # Filter out current and blacklisted servers
    available_servers = [
        s for s in servers 
        if s['hostname'] != exclude_server 
        and not redis_manager.is_server_blacklisted(s['hostname'])
    ]

    # Sort by health score
    available_servers.sort(key=lambda s: s.get('health_score', 50), reverse=True)

    # Test top 3 servers
    for server in available_servers[:3]:
        success, latency = await vpn_server_manager.health_check_server(server['hostname'])
        if success:
            return server['hostname']

    # Fall back to the best-ranked server even though its check failed
    return available_servers[0]['hostname'] if available_servers else None

Recovery Procedures

  1. Immediate Recovery: Stop failed container, start new one with different server
  2. Graceful Recovery: Wait for existing connections to drain before switching
  3. Rollback Recovery: Return to previous working server if new server fails

Configuration and Tuning

Load Balancing Parameters

# /opt/vpn-exit-controller/.env
LOAD_BALANCER_STRATEGY=health_score
LOAD_BALANCER_ENABLED=true
MAX_NODES_PER_COUNTRY=3
AUTO_SCALE_ENABLED=true
SCALE_UP_THRESHOLD=50  # connections per node
SCALE_DOWN_THRESHOLD=10  # connections per node

Health Check Intervals

# Configuration in services/metrics_collector.py
class MetricsCollector:
    def __init__(self, interval_seconds: int = 30):  # Collect every 30 seconds
        self.interval = interval_seconds

Performance Thresholds

# CPU and memory thresholds for scaling decisions
CPU_THRESHOLD_HIGH = 80.0    # Scale up trigger
CPU_THRESHOLD_LOW = 20.0     # Scale down trigger
MEMORY_THRESHOLD_HIGH = 500  # MB, scale up trigger  
MEMORY_THRESHOLD_LOW = 300   # MB, scale down trigger

Auto-scaling Configuration

async def start_additional_node_if_needed(self, country: str) -> bool:
    """Start additional node if load is high"""
    nodes = self._get_healthy_nodes_for_country(country)

    if not nodes:
        return False

    # Check if we need more capacity
    total_connections = sum(redis_manager.get_connection_count(n['id']) for n in nodes)
    avg_connections_per_node = total_connections / len(nodes)

    # Start new node if average > 50 connections per node and < 3 nodes
    if avg_connections_per_node > 50 and len(nodes) < 3:
        logger.info(f"High load detected for {country}, starting additional node")
        # Start new node...
        return True

    return False

Tuning Recommendations

Scenario Strategy Max Nodes Thresholds
High Traffic health_score 5 Scale up: 30 conn/node
Low Latency weighted_latency 3 Scale up: 20 conn/node
Cost Optimized least_connections 2 Scale up: 80 conn/node
Testing round_robin 3 Scale up: 50 conn/node

Monitoring and Metrics

Key Metrics Collection

The system continuously collects metrics for load balancing decisions:

class MetricsCollector:
    """Background service that continuously collects metrics from all nodes"""

    async def _collect_node_metrics(self, node_id: str):
        """Collect metrics for a single node"""

        # Get detailed node info (includes Docker stats)
        node_details = self.docker_manager.get_node_details(node_id)

        # Check for anomalies
        if node_details.get('stats'):
            stats = node_details['stats']

            # Alert on high resource usage
            if stats.get('cpu_percent', 0) > 80:
                logger.warning(f"High CPU usage on node {node_id}: {stats['cpu_percent']:.1f}%")

            if stats.get('memory_mb', 0) > 500:
                logger.warning(f"High memory usage on node {node_id}: {stats['memory_mb']:.1f}MB")

Load Balancing Statistics

async def get_load_balancing_stats(self) -> Dict:
    """Get comprehensive load balancing statistics"""

    stats = {
        'strategies': [s.value for s in LoadBalancingStrategy],
        'round_robin_counters': self.round_robin_counters,
        'countries': {}
    }

    # Get stats per country
    all_nodes = self.docker_manager.list_nodes()
    countries = set(n['country'] for n in all_nodes)

    for country in countries:
        nodes = self._get_healthy_nodes_for_country(country)
        total_connections = sum(redis_manager.get_connection_count(n['id']) for n in nodes)

        stats['countries'][country] = {
            'node_count': len(nodes),
            'total_connections': total_connections,
            'avg_connections_per_node': total_connections / len(nodes) if nodes else 0,
            'nodes': [
                {
                    'id': n['id'],
                    'server': n.get('vpn_server', 'unknown'),
                    'connections': redis_manager.get_connection_count(n['id']),
                    'tailscale_ip': n.get('tailscale_ip'),
                    'cpu_percent': n.get('stats', {}).get('cpu_percent', 0),
                    'health_score': await self._calculate_node_health_score(n)
                }
                for n in nodes
            ]
        }

    return stats

Performance Monitoring

# Monitor load balancing in real-time
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq

# Get speed test summary
curl -u admin:password http://localhost:8080/api/speed-test/summary | jq

# Monitor metrics
curl -u admin:password http://localhost:8080/api/metrics/current | jq

Alert Conditions

Condition Threshold Action
High CPU Usage >80% for 5 min Scale up or failover
High Memory >500MB Scale up or failover
High Connection Count >100 per node Scale up
Low Speed <10 Mbps Investigate/failover
High Latency >200ms Switch strategy or failover
Node Down Health check fails Immediate failover
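The alert table maps directly to a check like the following (thresholds from the table; the stat keys are assumptions, and the five-minute persistence requirement for CPU is omitted for brevity):

```python
def check_alerts(stats: dict) -> list:
    """Evaluate the alert conditions above against one node's stats."""
    alerts = []
    if stats.get('cpu_percent', 0) > 80:
        alerts.append('high_cpu')
    if stats.get('memory_mb', 0) > 500:
        alerts.append('high_memory')
    if stats.get('connections', 0) > 100:
        alerts.append('high_connection_count')
    if stats.get('download_mbps', float('inf')) < 10:
        alerts.append('low_speed')
    if stats.get('latency_ms', 0) > 200:
        alerts.append('high_latency')
    return alerts

print(check_alerts({'cpu_percent': 85.0, 'latency_ms': 250}))
# ['high_cpu', 'high_latency']
```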

Reporting and Analysis

# Generate load balancing report
async def generate_load_balancing_report(hours: int = 24) -> Dict:
    """Generate comprehensive load balancing report"""

    report = {
        'period_hours': hours,
        'generated_at': datetime.utcnow().isoformat(),
        'summary': {},
        'by_country': {},
        'performance_trends': {},
        'recommendations': []
    }

    # Analyze each country
    countries = get_all_countries()

    for country in countries:
        nodes = get_nodes_for_country(country)

        # Calculate statistics
        total_connections = sum(get_connection_count(n['id']) for n in nodes)
        avg_speed = calculate_avg_speed(nodes, hours)
        avg_latency = calculate_avg_latency(nodes, hours)

        report['by_country'][country] = {
            'node_count': len(nodes),
            'total_connections': total_connections,
            'avg_speed_mbps': avg_speed,
            'avg_latency_ms': avg_latency,
            'failover_events': count_failover_events(country, hours)
        }

        # Generate recommendations
        if avg_speed < 20:
            report['recommendations'].append(f"Consider adding more nodes to {country} - low speed detected")

        if nodes and total_connections / len(nodes) > 50:
            report['recommendations'].append(f"Scale up {country} - high load detected")

    return report

Advanced Features

Connection Affinity/Sticky Sessions

class ConnectionAffinity:
    """Manage connection affinity for consistent routing"""

    def __init__(self):
        self.client_node_map = {}  # client_ip -> node_id
        self.affinity_timeout = 3600  # 1 hour

    async def get_affinity_node(self, client_ip: str, country: str) -> Optional[str]:
        """Get node with existing affinity for client"""

        affinity_key = f"affinity:{client_ip}:{country}"
        node_id = redis_manager.client.get(affinity_key)

        if node_id:
            # Check if node is still healthy
            healthy, _ = docker_manager.check_container_health(node_id)
            if healthy:
                # Refresh affinity timeout
                redis_manager.client.expire(affinity_key, self.affinity_timeout)
                return node_id
            else:
                # Remove stale affinity
                redis_manager.client.delete(affinity_key)

        return None

    async def set_affinity(self, client_ip: str, country: str, node_id: str):
        """Set client affinity to specific node"""
        affinity_key = f"affinity:{client_ip}:{country}"
        redis_manager.client.setex(affinity_key, self.affinity_timeout, node_id)

Geographic Routing Preferences

class GeographicRouter:
    """Route based on geographic preferences"""

    REGION_PREFERENCES = {
        'americas': ['us', 'ca', 'br'],
        'europe': ['de', 'uk', 'fr', 'nl'],
        'asia': ['jp', 'sg', 'hk', 'au'],
        'africa': ['za'],
        'oceania': ['au', 'nz']
    }

    async def get_preferred_country(self, client_region: str, requested_country: str) -> str:
        """Get preferred country based on client region"""

        # Return requested country if available and healthy
        if self.is_country_healthy(requested_country):
            return requested_country

        # Find alternative in same region
        preferred_countries = self.REGION_PREFERENCES.get(client_region, [])

        for country in preferred_countries:
            if self.is_country_healthy(country):
                logger.info(f"Routing {client_region} client to {country} instead of {requested_country}")
                return country

        # Fallback to any healthy country
        return self.get_any_healthy_country()

Custom Load Balancing Rules

import fnmatch
from datetime import datetime
from uuid import uuid4

class CustomLoadBalancingRules:
    """Implement custom load balancing rules"""

    def __init__(self):
        self.rules = []

    def add_rule(self, rule: Dict):
        """Add custom routing rule"""
        self.rules.append({
            'id': str(uuid4()),
            'name': rule['name'],
            'condition': rule['condition'],
            'action': rule['action'],
            'priority': rule.get('priority', 100),
            'enabled': rule.get('enabled', True)
        })

    async def evaluate_rules(self, context: Dict) -> Optional[str]:
        """Evaluate rules and return target node"""

        # Sort by priority
        active_rules = sorted(
            [r for r in self.rules if r['enabled']], 
            key=lambda x: x['priority']
        )

        for rule in active_rules:
            if self._matches_condition(rule['condition'], context):
                return await self._execute_action(rule['action'], context)

        return None

    def _matches_condition(self, condition: Dict, context: Dict) -> bool:
        """Check if context matches rule condition"""

        # Example conditions:
        # {"source_device": "iPhone", "domain": "*.streaming.com"}
        # {"time_range": "09:00-17:00", "country": "us"}
        # {"client_ip_range": "192.168.1.0/24"}

        for key, value in condition.items():
            if key == 'source_device':
                if context.get('user_agent', '').find(value) == -1:
                    return False

            elif key == 'domain':
                if not fnmatch.fnmatch(context.get('domain', ''), value):
                    return False

            elif key == 'time_range':
                current_time = datetime.now().strftime('%H:%M')
                start, end = value.split('-')
                if not (start <= current_time <= end):
                    return False

        return True

API-based Load Balancing Control

# Extended API endpoints for advanced control

@router.post("/rules")
async def create_load_balancing_rule(rule: CustomRule, user=Depends(verify_auth)):
    """Create custom load balancing rule"""
    custom_rules.add_rule(rule.dict())
    return {"status": "rule_created", "rule": rule}

@router.put("/strategy/{country}")
async def set_country_strategy(
    country: str, 
    strategy: LoadBalancingStrategy,
    user=Depends(verify_auth)
):
    """Set load balancing strategy for specific country"""
    load_balancer.set_country_strategy(country, strategy)
    return {"country": country, "strategy": strategy.value}

@router.post("/rebalance/{country}")
async def force_rebalance(country: str, user=Depends(verify_auth)):
    """Force rebalancing of connections in a country"""
    result = await load_balancer.rebalance_country(country)
    return {"country": country, "rebalanced_connections": result}

@router.get("/prediction/{country}")
async def get_load_prediction(country: str, hours: int = 1, user=Depends(verify_auth)):
    """Get load prediction for next N hours"""
    prediction = await load_balancer.predict_load(country, hours)
    return prediction

API Reference

Load Balancer Endpoints

Get Load Balancing Statistics

GET /api/load-balancer/stats
Authorization: Basic <credentials>

Response:

{
  "strategies": ["round_robin", "least_connections", "weighted_latency", "random", "health_score"],
  "round_robin_counters": {"us": 5, "uk": 2},
  "countries": {
    "us": {
      "node_count": 2,
      "total_connections": 45,
      "avg_connections_per_node": 22.5,
      "nodes": [
        {
          "id": "container_123",
          "server": "us5063.nordvpn.com",
          "connections": 25,
          "tailscale_ip": "100.73.33.15",
          "cpu_percent": 45.2,
          "health_score": 87.3
        }
      ]
    }
  }
}

Get Best Node for Country

GET /api/load-balancer/best-node/{country}?strategy=health_score
Authorization: Basic <credentials>

Response:

{
  "selected_node": {
    "id": "container_123",
    "country": "us",
    "server": "us5063.nordvpn.com",
    "tailscale_ip": "100.73.33.15",
    "health_score": 87.3
  },
  "strategy": "health_score",
  "country": "us"
}

Scale Up Country

POST /api/load-balancer/scale-up/{country}
Authorization: Basic <credentials>

Scale Down Country

POST /api/load-balancer/scale-down/{country}
Authorization: Basic <credentials>

Get Available Strategies

GET /api/load-balancer/strategies
Authorization: Basic <credentials>

Response:

{
  "strategies": [
    {
      "name": "round_robin",
      "description": "Distributes requests evenly across all healthy nodes"
    },
    {
      "name": "least_connections", 
      "description": "Routes to the node with fewest active connections"
    },
    {
      "name": "weighted_latency",
      "description": "Routes based on server latency with weighted randomization"
    },
    {
      "name": "random",
      "description": "Randomly selects from available healthy nodes"
    },
    {
      "name": "health_score",
      "description": "Routes to node with best overall health score (CPU, memory, latency, connections)"
    }
  ]
}

Troubleshooting

Common Issues

1. No Healthy Nodes Available

Symptoms:

  • API returns 404 "No healthy nodes available"
  • Load balancer cannot route traffic

Diagnosis:

# Check node health
curl -u admin:password http://localhost:8080/api/nodes/list | jq '.[] | select(.status == "running")'

# Check container health
docker ps --filter "label=vpn-exit-node"

# Check VPN connections
docker exec <container_id> curl -s ipinfo.io

Solutions:

  1. Restart unhealthy containers: docker restart <container_id>
  2. Check VPN credentials in /opt/vpn-exit-controller/configs/auth.txt
  3. Verify network connectivity: docker exec <container_id> ping 8.8.8.8
  4. Force failover: curl -X POST http://localhost:8080/api/failover/force/<node_id>

2. Load Imbalance

Symptoms:

  • One node has significantly more connections than others
  • Performance degradation on overloaded nodes

Diagnosis:

# Check connection distribution
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq '.countries'

# Check strategy
curl -u admin:password http://localhost:8080/api/config | jq '.load_balancer'

Solutions:

  1. Switch to least_connections strategy
  2. Force rebalancing: curl -X POST http://localhost:8080/api/load-balancer/rebalance/<country>
  3. Increase connection drain timeout
  4. Add more nodes: curl -X POST http://localhost:8080/api/load-balancer/scale-up/<country>

3. Frequent Failovers

Symptoms:

  • High number of failover events in logs
  • Unstable node assignments

Diagnosis:

# Check failover history
curl -u admin:password http://localhost:8080/api/failover/status | jq

# Check server health
curl -u admin:password http://localhost:8080/api/speed-test/summary | jq

Solutions:

  1. Increase failover cooldown period
  2. Check VPN server stability
  3. Review health check thresholds
  4. Blacklist problematic servers

4. Poor Performance

Symptoms:

  • Slow connection speeds
  • High latency

Diagnosis:

# Run speed tests
curl -X POST -u admin:password http://localhost:8080/api/speed-test/run-all

# Check health scores
curl -u admin:password http://localhost:8080/api/load-balancer/stats | jq '.countries[].nodes[].health_score'

Solutions:

  1. Switch to weighted_latency strategy
  2. Add more nodes in region
  3. Use different VPN servers
  4. Check network congestion

Debug Commands

# Enable debug logging
export LOG_LEVEL=DEBUG

# Check Redis data
redis-cli
> KEYS speedtest:*
> KEYS affinity:*
> KEYS server_health:*

# Monitor load balancer decisions
journalctl -u vpn-controller -f | grep "load_balancer"

# Test specific node
curl -X POST -u admin:password http://localhost:8080/api/speed-test/node/<node_id>

# Force strategy change
curl -X PUT -u admin:password http://localhost:8080/api/load-balancer/strategy/<country> \
  -H "Content-Type: application/json" \
  -d '{"strategy": "health_score"}'

Performance Optimization Tips

  1. Strategy Selection:
     • Use health_score for general purpose
     • Use weighted_latency for latency-sensitive apps
     • Use least_connections for long-lived connections

  2. Resource Tuning:
     • Monitor CPU/memory usage patterns
     • Adjust scaling thresholds based on traffic
     • Set appropriate connection limits

  3. Network Optimization:
     • Choose VPN servers close to users
     • Monitor and blacklist slow servers
     • Use multiple servers per country

  4. Monitoring:
     • Set up alerts for health score < 70
     • Monitor failover frequency
     • Track connection distribution
This comprehensive load balancing system ensures optimal performance, reliability, and scalability for the VPN Exit Controller infrastructure.