Troubleshooting TCP Health Check

To troubleshoot TCP health checks, enable connection tracing to debug the TCP connections that SWG makes to the servers in the health check list.

Go to Configuration > Troubleshooting > Enable tracing for TCP health checks.

The connection trace files are stored in /opt/mwg/log/debug/connection_tracing.

The file names start with ‘TCPH’, for example, TCPH-000001-S.txt.
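If you need to collect these traces, for example to attach them to a support case, a small script can list them. The sketch below is a hypothetical helper (`list_tcp_health_check_traces` is not part of SWG). It assumes the directory and the ‘TCPH’ prefix given above, and that it runs on the appliance with read access to that directory.

```python
from pathlib import Path

# Trace directory named in this article; adjust if your installation differs.
TRACE_DIR = Path("/opt/mwg/log/debug/connection_tracing")


def list_tcp_health_check_traces(trace_dir: Path = TRACE_DIR) -> list[Path]:
    """Return TCP health check trace files (names starting with 'TCPH'), sorted by name."""
    if not trace_dir.is_dir():
        # Tracing may not be enabled yet, so the directory can be missing.
        return []
    return sorted(p for p in trace_dir.glob("TCPH*") if p.is_file())


if __name__ == "__main__":
    for path in list_tcp_health_check_traces():
        print(path.name)
```

Sorting by name keeps the files in sequence order, because the counter in names such as TCPH-000001-S.txt is zero-padded.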

If a request to a next-hop proxy (NHP) is blocked, the Protocol.FailureDescription property contains the reason, and this value is sent to the user.

Example Scenarios

Suppose IP1:Port1 and IP2:Port2 are configured in the health check list.

IP1:Port1: Enabled

IP2:Port2: Enabled

Suppose the health checks fail for both IPs, that is, both IP1 and IP2 are unhealthy.

The same IPs are configured in the NHP list, and this list is used for live traffic.

Traffic behavior:
- SWG does not try to connect to either IP because neither is healthy.
- SWG sends the client a block page with the appropriate failure description.
Failure Description:
- No healthy proxy was found. HealthCheck Status:HealthCheckFail):NHP 1.2.3.4:9090 HealthCheck Status:HealthCheckFail):NHP 10.140.210.94:9091:badgateway:server state 1:state 9:Application response 502 badgateway.
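The traffic behavior above can be modeled as a simple loop over the NHP list. This is a simplified, hypothetical sketch, not SWG's implementation: `NextHopProxy`, `select_next_hop`, and the failure notes are illustrative names and strings (not SWG's Failure Description format), and entries outside the health check list are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NextHopProxy:
    address: str   # "IP:Port" entry from the NHP list
    healthy: bool  # current TCP health check status for this entry


def select_next_hop(
    nhp_list: list[NextHopProxy],
    connect: Callable[[str], bool],
) -> tuple[Optional[str], list[str]]:
    """Try next-hop proxies in list order, never contacting unhealthy ones.

    Returns (address that accepted the connection, per-proxy failure notes).
    An address of None means every entry was skipped or failed, so the
    client would get a block page.
    """
    failures: list[str] = []
    for nhp in nhp_list:
        if not nhp.healthy:
            failures.append(f"health check failed: {nhp.address}")
            continue  # skipped without a connection attempt
        if connect(nhp.address):
            return nhp.address, failures
        failures.append(f"connection failed: {nhp.address}")
    return None, failures


if __name__ == "__main__":
    # Scenario above: both entries unhealthy, so connect() is never called.
    proxies = [NextHopProxy("IP1:Port1", False), NextHopProxy("IP2:Port2", False)]
    chosen, notes = select_next_hop(proxies, connect=lambda addr: True)
    print(chosen, notes)
    # None ['health check failed: IP1:Port1', 'health check failed: IP2:Port2']
```

Passing `connect` in as a function keeps the model testable without a network and makes it easy to see which proxies would actually be contacted.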

Suppose IP1:Port1 and IP2:Port2 are configured in the health check list.

IP1:Port1: Enabled

IP2:Port2: Enabled

Suppose the health check succeeds for IP2:Port2 and fails for IP1:Port1, that is, IP1 is unhealthy and IP2 is healthy.

The same IPs are configured in the NHP list.

Suppose IP1:Port1 and IP2:Port2 are configured in the health check list and both are unreachable.

IP1:Port1: Disabled-Healthy

IP2:Port2: Disabled-Unhealthy

IP1 is treated as healthy, and IP2 is treated as unhealthy.

The same IPs are configured in the NHP list.

Traffic behavior:
- SWG tries to connect to IP1, and the connection fails. SWG does not attempt a connection to IP2 because it is unhealthy.
- SWG sends the client a block page with the appropriate failure description.
Failure Description:
- Connection timed out:NHP <IP1:Port1> No healthy proxy found. HealthCheck Status:HealthCheckFail):NHP <IP2:Port2>:badgateway:server state 1:state 9:Application response 502 badgateway
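In this scenario the configured status decides, even though IP1 is unreachable. The sketch below is one reading of the scenarios in this article, not SWG code: `HealthCheckMode` and `effective_health` are hypothetical names, and it assumes that a Disabled-Healthy or Disabled-Unhealthy entry keeps its configured status while an Enabled entry follows the result of its TCP health check.

```python
from enum import Enum


class HealthCheckMode(Enum):
    """Per-entry settings as listed in the scenarios above."""
    ENABLED = "Enabled"
    DISABLED_HEALTHY = "Disabled-Healthy"
    DISABLED_UNHEALTHY = "Disabled-Unhealthy"


def effective_health(mode: HealthCheckMode, probe_succeeded: bool) -> bool:
    """Return whether the entry is treated as healthy (simplified model).

    For disabled entries the probe result is ignored and the configured
    status wins; for enabled entries the health check result decides.
    """
    if mode is HealthCheckMode.DISABLED_HEALTHY:
        return True
    if mode is HealthCheckMode.DISABLED_UNHEALTHY:
        return False
    return probe_succeeded


if __name__ == "__main__":
    # Both servers unreachable: IP1 is Disabled-Healthy, IP2 is Disabled-Unhealthy.
    print(effective_health(HealthCheckMode.DISABLED_HEALTHY, probe_succeeded=False))    # True
    print(effective_health(HealthCheckMode.DISABLED_UNHEALTHY, probe_succeeded=False))  # False
```

This explains why SWG still tries IP1 and only then reports "Connection timed out" for it: a Disabled-Healthy entry is never filtered out by the health check.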

Suppose IP1:Port1 and IP2:Port2 are configured in the health check list and IP1 is unreachable.

IP1:Port1: Disabled-Healthy

IP2:Port2: Enabled

The health check fails for IP2.

IP1 is treated as healthy, and IP2 is unhealthy.

Traffic behavior:
- SWG tries to connect to IP1, and the connection fails. SWG does not attempt a connection to IP2 because it is unhealthy.
- SWG sends the client a block page with the appropriate failure description.
Failure Description:
- Connection timed out:NHP <IP1:Port1> No healthy proxy found. HealthCheck Status:HealthCheckFail):NHP <IP2:Port2>:badgateway:server state 1:state 9:Application response 502 badgateway

Suppose IP1:Port1 is configured in the health check list and IP1 is unreachable.

IP1:Port1: Disabled-Healthy

IP1 and IP2 are configured in the NHP list.

IP1 is part of the health check list. IP2 is not part of the health check list, and IP2 is also unreachable.

Traffic behavior:
- SWG tries to connect to IP1 (because it is treated as healthy), and the connection fails. SWG then attempts a connection to IP2.
- Because IP2 is not part of the health check list, the default (legacy) connection handling applies to it: SWG retries the connection to IP2 the number of times set in ‘Number of retries’. If all attempts fail, SWG marks IP2 as down for the number of seconds set in ‘After final failure wait’ and raises a dashboard alert.
  - For example: mwgappl15943511 28-Jun-2023 04:26:56 UTC WARNING: Next hop proxy 10.140.221.73:9091 has been marked as down for 10 seconds due to error 'Connection refused' (Origin: Proxy, ID: 710, 3 times within last 4 minutes)
Failure Description:
- Connection timed out:NHP 1.2.3.4:9090Connection timed out:NHP 2.2.2.2:9090:badgateway:server state 1:state 9:Application response 502 badgateway
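The default handling for an NHP outside the health check list can be sketched as retry-then-mark-down. This is a hypothetical model, not SWG code: `LegacyNextHopHandler` and its attributes are illustrative, the `alerts` list only stands in for the dashboard alert, and it assumes that ‘Number of retries’ counts retries after one initial attempt, which the settings description does not confirm.

```python
import time
from typing import Callable


class LegacyNextHopHandler:
    """Hypothetical model of default handling for an NHP outside the health check list."""

    def __init__(
        self,
        number_of_retries: int,
        after_final_failure_wait: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.number_of_retries = number_of_retries                # 'Number of retries'
        self.after_final_failure_wait = after_final_failure_wait  # 'After final failure wait' (s)
        self.clock = clock
        self.down_until: dict[str, float] = {}
        self.alerts: list[str] = []  # stands in for dashboard alerts

    def is_down(self, address: str) -> bool:
        return self.clock() < self.down_until.get(address, float("-inf"))

    def try_connect(self, address: str, connect: Callable[[str], bool]) -> bool:
        if self.is_down(address):
            return False  # still inside the wait window; not attempted
        # Assumption: one initial attempt plus 'Number of retries' retries.
        for _ in range(1 + self.number_of_retries):
            if connect(address):
                return True
        # Final failure: mark the proxy as down and record an alert.
        self.down_until[address] = self.clock() + self.after_final_failure_wait
        self.alerts.append(
            f"{address} marked as down for {self.after_final_failure_wait:g} seconds"
        )
        return False
```

Injecting `clock` lets you step time forward in a test instead of sleeping through the ‘After final failure wait’ period.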

Suppose FQDN:Port is configured in the health check list. SWG performs a DNS query to resolve the FQDN.
Suppose the DNS resolution returns multiple IP addresses, IP1, IP2, and IP3, and IP1 and IP2 are unreachable.
- SWG performs a health check for IP1, which fails, so it moves to the next IP, IP2, which also fails.
- SWG then performs the health check for IP3. It succeeds, and SWG marks the server (FQDN:Port) as healthy.
During live traffic, SWG checks the status of FQDN:Port, finds that it is healthy, and connects to IP3.
- In this scenario, SWG does not perform a DNS query for the live traffic.
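The FQDN behavior above can be sketched as "resolve once, probe each address in order, remember the first one that answers". This is a hypothetical model, not SWG's implementation: `FqdnHealthCheck`, `system_resolve`, and `tcp_probe` are illustrative names, and a "successful health check" is assumed to mean that a plain TCP connection opens within a timeout. The resolver and probe are injectable so the logic can be exercised without real DNS.

```python
import socket
from typing import Callable, Optional


def system_resolve(fqdn: str) -> list[str]:
    """Resolve an FQDN to its IP addresses, keeping resolver order, without duplicates."""
    infos = socket.getaddrinfo(fqdn, None, type=socket.SOCK_STREAM)
    return list(dict.fromkeys(info[4][0] for info in infos))


def tcp_probe(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Healthy if a TCP connection to ip:port can be opened within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


class FqdnHealthCheck:
    """Hypothetical model of a TCP health check for an FQDN:Port entry."""

    def __init__(
        self,
        fqdn: str,
        port: int,
        resolve: Callable[[str], list[str]] = system_resolve,
        probe: Callable[[str, int], bool] = tcp_probe,
    ) -> None:
        self.fqdn = fqdn
        self.port = port
        self.resolve = resolve
        self.probe = probe
        self.healthy_ip: Optional[str] = None

    def run(self) -> bool:
        """One DNS query, then probe each IP in order until one succeeds."""
        self.healthy_ip = None
        for ip in self.resolve(self.fqdn):
            if self.probe(ip, self.port):
                self.healthy_ip = ip  # remembered for live traffic
                return True
        return False

    def target_for_traffic(self) -> Optional[str]:
        """IP to use for live traffic; no new DNS query is made here."""
        return self.healthy_ip
```

Because `target_for_traffic` only returns the IP stored by the last `run`, live traffic reuses the health check's DNS answer, which matches the point that no DNS query is made for live traffic.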