VPN Exit Controller Troubleshooting Guide¶
This guide provides step-by-step troubleshooting procedures for common issues with the VPN Exit Controller system. Use this as your primary reference when diagnosing problems.
Quick Reference¶
Essential Commands¶
# System status
systemctl status vpn-controller
systemctl status docker
docker ps -a
# View logs
journalctl -u vpn-controller -f
docker logs vpn-api
docker logs vpn-redis
# API health check
curl -u admin:Bl4ckMagic!2345erver http://localhost:8080/api/status
# Container access
pct enter 201 # From Proxmox host
Log Locations¶
- System Service:
journalctl -u vpn-controller - API Logs:
docker logs vpn-api - Redis Logs:
docker logs vpn-redis - Traefik Logs:
/opt/vpn-exit-controller/traefik/logs/traefik.log - HAProxy Logs:
/opt/vpn-exit-controller/proxy/logs/ - VPN Node Logs:
docker logs <node-container-name> - OpenVPN Logs: Inside containers at
/var/log/openvpn.log
1. Node Issues¶
1.1 VPN Nodes Failing to Start¶
Symptoms: - Containers exit immediately after starting - API shows nodes as "failed" or "stopped" - Cannot connect to VPN endpoints
Common Error Messages:
ERROR: Auth file not found at /configs/auth.txt
VPN connection timeout after 30 seconds
Cannot allocate TUN/TAP dev dynamically
Diagnostic Steps:
-
Check container status:
-
Verify auth file exists:
-
Check VPN configuration files:
-
Test OpenVPN configuration manually:
# Inside a test container docker run -it --rm --cap-add=NET_ADMIN --device=/dev/net/tun \ -v /opt/vpn-exit-controller/configs:/configs \ ubuntu:22.04 bash # Then install and test OpenVPN apt update && apt install -y openvpn openvpn --config /configs/vpn/us.ovpn --auth-user-pass /configs/auth.txt --verb 3
Resolution Steps:
-
Fix auth file issues:
-
Fix container permissions:
-
Update NordVPN configs if outdated:
1.2 NordVPN Authentication Problems¶
Symptoms: - Authentication failures in OpenVPN logs - "AUTH_FAILED" messages - Containers restart in loops
Diagnostic Steps:
-
Verify credentials:
-
Test credentials manually:
-
Check for service token vs regular credentials:
Resolution Steps:
- Use NordVPN service credentials:
- Generate service credentials from NordVPN dashboard
-
Update auth.txt with service credentials (not regular login)
-
Reset authentication:
# Clear any cached auth rm -f /opt/vpn-exit-controller/configs/auth.txt # Recreate with correct service credentials echo "service_username" > /opt/vpn-exit-controller/configs/auth.txt echo "service_password" >> /opt/vpn-exit-controller/configs/auth.txt chmod 600 /opt/vpn-exit-controller/configs/auth.txt
1.3 Tailscale Connection Issues¶
Symptoms: - Nodes start but don't appear in Tailscale admin - "tailscale up" command fails - Exit node not advertised properly
Diagnostic Steps:
-
Check Tailscale auth key:
-
Check Tailscale daemon:
-
Verify container networking:
Resolution Steps:
- Generate new auth key:
- Go to Tailscale admin console
- Generate new auth key with exit node permissions
-
Update .env file with new key
-
Restart Tailscale in container:
-
Check firewall rules:
1.4 Container Restart Loops¶
Symptoms: - Containers continuously restart - High CPU usage - "Restarting" status in docker ps
Diagnostic Steps:
-
Check restart policy:
-
Monitor restart events:
-
Check exit codes:
Resolution Steps:
- Identify root cause:
- VPN connection failures
- Tailscale authentication issues
-
Network connectivity problems
-
Temporary fix - stop restart:
-
Fix underlying issue and restart:
2. Network Issues¶
2.1 Proxy Connection Failures¶
Symptoms: - HTTP 502/503 errors from proxy endpoints - Timeouts when connecting through proxy - "upstream connect error" messages
Diagnostic Steps:
-
Check HAProxy status:
-
Test direct container connectivity:
-
Check backend health:
Resolution Steps:
-
Restart proxy containers:
-
Check proxy configuration:
-
Update backend servers:
2.2 DNS Resolution Problems¶
Symptoms: - Domain names not resolving in containers - "Name resolution failed" errors - Proxy working with IPs but not domains - "Doesn't support secure connection" errors in incognito mode - HTTPS websites failing through proxy
Diagnostic Steps:
-
Test DNS in containers:
-
Check container DNS settings:
-
Test Tailscale DNS configuration:
-
Test host DNS:
Resolution Steps:
-
Fix VPN container DNS (Primary Fix for HTTPS errors):
-
Ensure Tailscale DNS is disabled:
-
Test DNS resolution through VPN:
-
Legacy system DNS fix (if needed):
-
Restart networking if changes made:
DNS Resolution Fix Explained
The recent update implements a comprehensive DNS fix that resolves HTTPS errors in incognito mode:
- Root Cause: Tailscale's DNS resolution was interfering with HTTPS certificate validation
- Solution: Disable Tailscale DNS (
--accept-dns=false) and use NordVPN DNS servers - Implementation: Containers manually configure
/etc/resolv.confwith NordVPN DNS first, Google DNS as fallback - Result: Eliminates "doesn't support secure connection" errors and improves proxy reliability
2.3 SSL Certificate Issues¶
Symptoms: - HTTPS endpoints returning certificate errors - "SSL handshake failed" messages - Browser certificate warnings
Diagnostic Steps:
-
Check Traefik certificate status:
-
Test certificate validity:
-
Check DNS for ACME challenge:
Resolution Steps:
-
Force certificate renewal:
-
Check Cloudflare API credentials:
-
Manual certificate debug:
2.4 Port Binding Conflicts¶
Symptoms: - "Port already in use" errors - Containers failing to start - Services not accessible on expected ports
Diagnostic Steps:
-
Check port usage:
-
Check Docker port mappings:
-
Identify conflicting services:
Resolution Steps:
-
Kill conflicting processes:
-
Change service ports:
-
Restart in correct order:
3. Service Issues¶
3.1 FastAPI Application Not Starting¶
Symptoms: - systemctl shows failed status - "Connection refused" on port 8080 - Python import errors in logs
Diagnostic Steps:
-
Check systemd service status:
-
Test manual startup:
-
Check Python environment:
Resolution Steps:
-
Fix Python dependencies:
-
Check environment variables:
-
Fix import errors:
3.2 Redis Connection Problems¶
Symptoms: - "Redis connection failed" in API logs - Cache-related operations failing - High latency in API responses
Diagnostic Steps:
-
Check Redis container:
-
Test Redis connectivity:
-
Check Redis configuration:
Resolution Steps:
-
Restart Redis:
-
Clear Redis data if corrupted:
-
Check disk space:
3.3 Docker Daemon Issues¶
Symptoms: - "Cannot connect to Docker daemon" errors - Docker commands hanging - Containers not starting
Diagnostic Steps:
-
Check Docker service:
-
Test Docker functionality:
-
Check Docker socket:
Resolution Steps:
-
Restart Docker:
-
Clean up Docker resources:
-
Check disk space:
3.4 Systemd Service Failures¶
Symptoms: - Service won't start at boot - "Failed to start" messages - Service stops unexpectedly
Diagnostic Steps:
-
Check service definition:
-
View detailed logs:
-
Test service script manually:
Resolution Steps:
-
Fix service dependencies:
-
Update service configuration:
-
Reload and restart:
4. Load Balancing Issues¶
4.1 Nodes Not Being Selected Properly¶
Symptoms: - All traffic going to one node - Load balancer ignoring some nodes - Uneven traffic distribution
Diagnostic Steps:
-
Check load balancer status:
-
View node health:
-
Check HAProxy backend status:
Resolution Steps:
-
Reset load balancer:
-
Update backend weights:
-
Restart load balancing service:
4.2 Health Check Failures¶
Symptoms: - Nodes marked as unhealthy - False positive health failures - Health checks timing out
Diagnostic Steps:
-
Check health check configuration:
-
Test health endpoints manually:
-
Check health check logs:
Resolution Steps:
-
Adjust health check timeouts:
-
Fix health endpoint:
-
Restart health monitoring:
4.3 Speed Test Failures¶
Symptoms: - Speed tests returning errors - Incorrect speed measurements - Speed test endpoints unreachable
Diagnostic Steps:
-
Test speed test API:
-
Check network connectivity:
-
View speed test logs:
Resolution Steps:
-
Update speed test endpoints:
-
Increase timeout values:
-
Use alternative speed test method:
4.4 Failover Not Working¶
Symptoms: - Failed nodes still receiving traffic - No automatic failover - Manual failover not working
Diagnostic Steps:
-
Check failover configuration:
-
Test failover trigger:
-
Check monitoring service:
Resolution Steps:
-
Restart monitoring service:
-
Update failover thresholds:
-
Manual failover:
5. Proxy Issues¶
5.1 HAProxy Configuration Errors¶
Symptoms: - HAProxy fails to start - Configuration validation errors - Syntax errors in logs
Diagnostic Steps:
-
Validate configuration:
-
Check configuration file:
-
View HAProxy logs:
Resolution Steps:
-
Fix syntax errors:
-
Restore backup configuration:
-
Restart with corrected config:
5.2 Traefik Routing Problems¶
Symptoms: - 404 errors on proxy domains - Requests not reaching HAProxy - Routing rules not working
Diagnostic Steps:
-
Check Traefik dashboard:
-
View Traefik configuration:
-
Check routing rules:
Resolution Steps:
-
Update dynamic configuration:
-
Restart Traefik:
-
Verify domain DNS:
5.3 Country Routing Not Working¶
Symptoms: - All traffic routed to same country - Country-specific domains not working - Incorrect geolocation
Diagnostic Steps:
-
Test country-specific endpoints:
-
Check HAProxy ACL rules:
-
Test IP geolocation:
Resolution Steps:
-
Update HAProxy routing rules:
-
Restart proxy services:
-
Check node assignments:
5.4 Proxy Service Failures¶
Symptoms: - Squid or Dante proxy services not responding - Connection refused on ports 3128 or 1080 - Health check endpoint (port 8080) not responding - Container restart loops involving proxy services
Diagnostic Steps:
-
Check proxy service status in container:
# Check if proxy processes are running docker exec <container> ps aux | grep -E "(squid|danted)" # Check if ports are listening docker exec <container> netstat -tuln | grep -E "(3128|1080|8080)" # Test proxy services directly docker exec <container> curl -I http://localhost:3128 docker exec <container> curl -I http://localhost:8080/health -
Check proxy service logs:
-
Test proxy connectivity from host:
Resolution Steps:
-
Restart proxy services in container:
-
Restart entire container:
-
Check Squid configuration:
-
Check Dante configuration:
5.5 Authentication Failures¶
Symptoms: - 401/403 errors on proxy requests - Authentication not working - Unauthorized access attempts
Diagnostic Steps:
-
Test API authentication:
-
Check authentication configuration:
-
View authentication logs:
Resolution Steps:
-
Update credentials:
-
Restart services:
-
Clear authentication cache:
6. Performance Issues¶
6.1 Slow Proxy Speeds¶
Symptoms: - High latency through proxy - Slow download/upload speeds - Timeouts on large requests
Diagnostic Steps:
-
Test direct vs proxy speed:
-
Check network utilization:
-
Monitor container resources:
Resolution Steps:
-
Optimize HAProxy configuration:
-
Scale up resources:
-
Optimize VPN connections:
6.2 High Latency¶
Symptoms: - Slow response times - High ping times - Delayed connections
Diagnostic Steps:
-
Measure latency at different points:
-
Check routing:
-
Monitor network queues:
Resolution Steps:
-
Select closer VPN servers:
-
Optimize TCP settings:
-
Reduce proxy hops:
6.3 Memory or CPU Issues¶
Symptoms: - High CPU usage - Out of memory errors - System slowdowns
Diagnostic Steps:
-
Check system resources:
-
Monitor container resources:
-
Check for memory leaks:
Resolution Steps:
-
Increase LXC container resources:
-
Optimize container limits:
-
Clean up resources:
6.4 Connection Timeouts¶
Symptoms: - Connections dropping - Timeout errors - Incomplete transfers
Diagnostic Steps:
-
Check timeout settings:
-
Monitor connection states:
-
Test with different timeout values:
Resolution Steps:
-
Increase timeout values:
-
Optimize connection pooling:
-
Scale connection limits:
7. Monitoring Issues¶
7.1 Metrics Not Being Collected¶
Symptoms: - Empty metrics endpoints - No data in monitoring dashboards - Metrics API returning errors
Diagnostic Steps:
-
Check metrics endpoint:
-
Verify metrics collector service:
-
Check Redis for metrics data:
Resolution Steps:
-
Restart metrics collection:
-
Clear corrupted metrics:
-
Check metrics configuration:
7.2 Health Checks Failing¶
Symptoms: - All nodes showing as unhealthy - Health check endpoints not responding - False positive failures
Diagnostic Steps:
-
Test health endpoints manually:
-
Check health check configuration:
-
Monitor health check frequency:
Resolution Steps:
-
Adjust health check parameters:
-
Fix health endpoint implementation:
-
Reset health check state:
7.3 Log Analysis Techniques¶
Key Log Locations and Commands:
-
System-wide issues:
-
Network issues:
-
VPN connection issues:
-
Performance analysis:
Emergency Recovery Procedures¶
Complete System Reset¶
If the system is completely broken, follow these steps:
-
Stop all services:
-
Clean up Docker:
-
Reset configuration:
-
Restart services:
Data Recovery¶
If data is corrupted:
-
Backup current state:
-
Reset Redis data:
-
Restart with clean state:
Network Recovery¶
If networking is broken:
-
Reset network interfaces:
-
Flush iptables:
-
Restart Tailscale:
Prevention and Maintenance¶
Regular Maintenance Tasks¶
- Weekly:
- Check disk space:
df -h - Review error logs:
journalctl -u vpn-controller --since "1 week ago" | grep -i error -
Test proxy endpoints:
curl -x proxy-us.rbnk.uk:8080 http://httpbin.org/ip -
Monthly:
- Update NordVPN configurations:
./scripts/download-nordvpn-configs.sh - Clean Docker resources:
docker system prune -
Review and rotate Tailscale auth keys
-
Quarterly:
- Update system packages:
apt update && apt upgrade - Review and update SSL certificates
- Performance optimization review
Monitoring Setup¶
Set up automated monitoring:
-
Log monitoring:
-
Resource monitoring:
-
Health monitoring:
This troubleshooting guide should help you diagnose and resolve most issues with the VPN Exit Controller system. Keep it updated as you encounter new issues and solutions.