Detect Server Failure using DNS Health Check
SWG operations rely heavily on stable DNS connectivity to access servers on the web. Within the SWG Management console, you have the option to configure three DNS servers (Primary, Secondary, and Tertiary). The SWG prioritizes the first DNS server in the list for resolving domain names. However, if the primary server experiences a failure, there is a delay in connectivity until the SWG detects the issue and switches to the next available DNS server (Secondary). This delay persists until the primary server is restored or manually updated via the user interface, which can result in an inconsistent browsing experience for end users, impacting overall reliability.
Implementing periodic DNS Health Checks within the SWG would address this issue by enabling the SWG to proactively detect DNS failures. With this feature, the SWG can smoothly transition from an unreachable DNS server to an available one, facilitating early recovery and minimizing the impact on user experience. By ensuring a more reliable DNS resolution process, this enhancement would improve overall user satisfaction, system reliability, and network throughput.
How SWG DNS Health Check Works
Secure Web Gateway (SWG) reads the DNS server information from the Management UI and extracts the Primary, Secondary, and Tertiary DNS addresses. These addresses are then placed inside /etc/resolv.conf to be leveraged by SWG Proxy for Domain Name resolution.
By periodically polling the health of each DNS server entry stored in /etc/resolv.conf, SWG detects any DNS server that is not responding and removes it from the DNS server list in /etc/resolv.conf, preventing unnecessary failed lookups.
If a subsequent poll finds the DNS server healthy again, it is added back to the DNS server list, as described in the DNS Server Re-induction section.
The SWG DNS health check is implemented as the dnshealthcheck service and deployed as a daemon, which periodically performs a health check against the first three nameserver entries in /etc/resolv.conf.
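To make the "first three nameserver entries" concrete, the following is a minimal sketch of how such entries could be read from the text of /etc/resolv.conf. It is an illustration only, not the dnshealthcheck implementation; the function name `first_nameservers` and the sample addresses are hypothetical.

```python
def first_nameservers(resolv_conf_text, limit=3):
    """Return up to `limit` nameserver addresses, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # resolv.conf comment lines start with '#' or ';'
        if not stripped or stripped.startswith(("#", ";")):
            continue
        parts = stripped.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
            if len(servers) == limit:
                break
    return servers


# Hypothetical file content; the addresses are placeholders.
sample = """\
# sample resolv.conf
search example.internal
nameserver 10.0.0.1
nameserver 10.0.0.2
nameserver 10.0.0.3
nameserver 10.0.0.4
"""

print(first_nameservers(sample))
```

Only the first three entries are returned, so a fourth nameserver line, if present, would not be health checked in this sketch.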
Tunable Parameters
This section describes how to tune the behavior of the health check for a given deployment.
The service reads its settings from systemd environment variables and expects the following variables to be defined in /etc/systemd/system/dnshealthcheck.service.d/vars.conf
Variable name | Mandatory | Default value | Range | Description |
---|---|---|---|---|
DnsHealthCheck_FQDN | ✔ | | N/A | This variable defines the healthcheck host name. The service tries to resolve the defined FQDN via the nameserver entries in /etc/resolv.conf. Clients are advised to select a stable FQDN according to the environment where SWG is deployed. NOTE: The service fails to start if this variable is not defined explicitly. |
DnsHealthCheck_Interval | ❌ | 5 | [1,172800] | This variable defines the frequency of the healthcheck runs in seconds. The recommended default value is 5 seconds. |
DnsHealthCheck_PrimaryFailureThreshold | ❌ | 1 | [1,20] | This variable defines the threshold value for disabling the primary DNS server. |
DnsHealthCheck_NonPrimaryFailureThreshold | ❌ | 20 | [1,20] | This variable defines the threshold value for disabling the non-primary DNS servers. |
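The defaults and ranges listed above can be expressed as a small validation routine. The sketch below is a hypothetical illustration of those rules, not the service's actual code; the function name `load_settings` and the error messages are assumptions.

```python
# (default, minimum, maximum) for each optional variable, taken from the table.
OPTIONAL = {
    "DnsHealthCheck_Interval": (5, 1, 172800),
    "DnsHealthCheck_PrimaryFailureThreshold": (1, 1, 20),
    "DnsHealthCheck_NonPrimaryFailureThreshold": (20, 1, 20),
}


def load_settings(env):
    """Validate a mapping of variable names to string values."""
    fqdn = env.get("DnsHealthCheck_FQDN", "").strip()
    if not fqdn:
        # The FQDN has no default, so it must be set explicitly.
        raise ValueError("DnsHealthCheck_FQDN is mandatory")
    settings = {"DnsHealthCheck_FQDN": fqdn}
    for name, (default, low, high) in OPTIONAL.items():
        raw = env.get(name)
        value = default if raw is None else int(raw)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low},{high}]")
        settings[name] = value
    return settings


print(load_settings({"DnsHealthCheck_FQDN": "skyhighsecurity.com"}))
```

With only the FQDN supplied, the optional variables fall back to 5, 1, and 20. A missing FQDN or an interval such as 500000 raises an error in this sketch.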
DNS Server Failure Detection
The DNS Health Check service polls the first three nameservers declared in /etc/resolv.conf at the interval defined by DnsHealthCheck_Interval. After each poll, if a change is detected, one of the following events occurs for every DNS entry that was polled.
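As an illustration of what a single poll against one nameserver might involve, the sketch below sends one DNS A-record query over UDP to a specific server and checks the reply. This is an assumption about how such a check could be done, not a description of the dnshealthcheck internals. The function names, the 2-second timeout, the IPv4-only socket, and the success criterion (response code 0 with at least one answer) are all hypothetical choices.

```python
import socket
import struct


def build_query(fqdn, txid):
    """Encode a single-question DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in fqdn.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=1 (A) and QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def is_success(response, txid):
    """True if the reply matches txid, has RCODE 0, and carries an answer."""
    if len(response) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_reply = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return rid == txid and is_reply and rcode == 0 and ancount > 0


def check_nameserver(server, fqdn, timeout=2.0, txid=0x1234):
    """Send one UDP query to `server` (IPv4) and report whether `fqdn` resolved."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(fqdn, txid), (server, 53))
            response, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and unreachable-network errors.
            return False
    return is_success(response, txid)
```

A polling loop would call `check_nameserver` once per nameserver and then sleep for DnsHealthCheck_Interval seconds before the next round.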
Primary DNS server failures
If the DNS server for which the resolution has failed is the primary DNS server, it is disabled after n consecutive failures, where n is defined by the tunable parameter DnsHealthCheck_PrimaryFailureThreshold.
NOTE: Since the primary DNS server is the first server to be queried, Skyhigh recommends a low threshold for it. Set DnsHealthCheck_PrimaryFailureThreshold to 1.
Secondary/Tertiary DNS server failures
Similarly, if the DNS server for which the resolution has failed is a non-primary DNS server (secondary or tertiary), it is disabled as well, but in that case, n is defined by the tunable parameter DnsHealthCheck_NonPrimaryFailureThreshold.
NOTE: Since the secondary and tertiary DNS servers are not the first ones to be queried, Skyhigh recommends a higher threshold for them, which limits how often the DNS configuration is reloaded into the proxy. Set DnsHealthCheck_NonPrimaryFailureThreshold to 20.
If the primary DNS server is disabled because of health check failures, the secondary DNS server becomes the first DNS server to be queried, so DnsHealthCheck_PrimaryFailureThreshold applies to it from then on. Likewise, the same rule applies to the tertiary DNS server if both the primary and secondary servers are disabled.
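The threshold rules in this section can be sketched as a small state tracker: each server has a consecutive-failure counter, and the threshold that applies depends on whether the server is currently the first active entry. The class below is a hypothetical model written for illustration, not the service's code; the name `FailureTracker` and its methods are assumptions, and re-enabling a disabled server is deliberately left out.

```python
class FailureTracker:
    """Tracks consecutive failures and disables servers that reach their threshold."""

    def __init__(self, configured, primary_threshold=1, non_primary_threshold=20):
        self.configured = list(configured)  # Primary, Secondary, Tertiary order
        self.primary_threshold = primary_threshold
        self.non_primary_threshold = non_primary_threshold
        self.failures = {server: 0 for server in self.configured}
        self.disabled = set()

    def active(self):
        """Servers that are still enabled, in configured order."""
        return [s for s in self.configured if s not in self.disabled]

    def threshold_for(self, server):
        """The first active server is held to the primary threshold."""
        active = self.active()
        if active and active[0] == server:
            return self.primary_threshold
        return self.non_primary_threshold

    def record(self, server, ok):
        """Record one poll result; a success resets the consecutive count."""
        if ok:
            self.failures[server] = 0
            return
        self.failures[server] += 1
        if (server not in self.disabled
                and self.failures[server] >= self.threshold_for(server)):
            self.disabled.add(server)


tracker = FailureTracker(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
tracker.record("10.0.0.1", ok=False)      # primary fails once and is disabled
print(tracker.active())
print(tracker.threshold_for("10.0.0.2"))  # secondary now uses the primary threshold
```

In this model, the secondary server needs 20 consecutive failures to be disabled while the primary is healthy, but only one once it has been promoted to first position.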
DNS Server Re-induction
A disabled DNS server for which resolution succeeds again is re-inducted into /etc/resolv.conf. Re-induction preserves the order configured in the SWG Management UI, so the entries always follow the Primary, Secondary, and Tertiary sequence.
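Order preservation during re-induction can be pictured as filtering the configured list rather than appending the recovered server at the end. The sketch below is a hypothetical illustration; the function names and addresses are placeholders, and it does not write to /etc/resolv.conf.

```python
def rebuild_nameservers(configured, healthy):
    """Return the healthy servers in the order configured in the Management UI."""
    return [server for server in configured if server in healthy]


def render_nameserver_lines(servers):
    """Format servers as resolv.conf nameserver lines."""
    return "".join(f"nameserver {server}\n" for server in servers)


configured = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # Primary, Secondary, Tertiary

# The secondary server is down: only primary and tertiary are listed.
print(render_nameserver_lines(rebuild_nameservers(configured, {"10.0.0.3", "10.0.0.1"})))

# The secondary server recovers: it returns to its original second position.
print(render_nameserver_lines(rebuild_nameservers(configured, set(configured))))
```

Because the result is always derived from the configured list, a recovered secondary server never ends up behind the tertiary one.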
How to Set Up DNS Healthcheck
1. Log in to the SWG shell.
2. Under /etc/systemd/system/, create a new folder called dnshealthcheck.service.d
3. In /etc/systemd/system/dnshealthcheck.service.d/, create a DNS Healthcheck configuration file called vars.conf
4. Open the DNS Healthcheck configuration file /etc/systemd/system/dnshealthcheck.service.d/vars.conf in a text editor.
5. Create the configuration based on the current deployment requirements.
For example:
[Service]
Environment="DnsHealthCheck_FQDN=skyhighsecurity.com"
Environment="DnsHealthCheck_Interval=5"
Environment="DnsHealthCheck_PrimaryFailureThreshold=1"
Environment="DnsHealthCheck_NonPrimaryFailureThreshold=20"
NOTE: Skyhigh suggests using the recommended threshold values, because other values can reduce the efficiency of the service. Contact support before making changes in a production environment.
6. Save and close the file.
7. Run systemctl daemon-reload
8. Run systemctl start dnshealthcheck
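For deployments where the drop-in file is generated rather than typed by hand, the folder and file creation steps could be scripted along the following lines. This is a sketch under stated assumptions: the helper names `render_dropin` and `write_dropin` are hypothetical, and the example writes into a temporary directory instead of /etc/systemd/system/ so that it can be tried safely. It does not run systemctl.

```python
import tempfile
from pathlib import Path


def render_dropin(settings):
    """Render a [Service] section with one Environment= line per variable."""
    lines = ["[Service]"]
    lines += [f'Environment="{name}={value}"' for name, value in settings.items()]
    return "\n".join(lines) + "\n"


def write_dropin(unit_dir, settings):
    """Create dnshealthcheck.service.d/vars.conf under unit_dir and return its path."""
    dropin_dir = Path(unit_dir) / "dnshealthcheck.service.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    target = dropin_dir / "vars.conf"
    target.write_text(render_dropin(settings))
    return target


settings = {
    "DnsHealthCheck_FQDN": "skyhighsecurity.com",
    "DnsHealthCheck_Interval": 5,
    "DnsHealthCheck_PrimaryFailureThreshold": 1,
    "DnsHealthCheck_NonPrimaryFailureThreshold": 20,
}

# A temporary directory stands in for /etc/systemd/system/ in this example.
with tempfile.TemporaryDirectory() as tmp:
    path = write_dropin(tmp, settings)
    print(path.read_text())
```

After writing the real file, the systemctl daemon-reload and systemctl start dnshealthcheck commands from the steps above are still required.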
How to Change DNS Health Check Configuration
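1. Follow steps 1 to 5 under How to Set Up DNS Healthcheck, then save and close the file.
2. Run systemctl daemon-reload so that systemd picks up the edited drop-in file.
3. Run systemctl restart dnshealthcheck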
How to Stop DNS Health Check Service
Run systemctl stop dnshealthcheck
How to Enable DNS Health Check Service at Boot
Run systemctl enable dnshealthcheck
Troubleshoot DNS Health Check
The logs generated by the dnshealthcheck service are written to /var/log/dnshealthcheck.log.
- The service loaded the configuration successfully.
Configuration:
[Service]
Environment="DnsHealthCheck_FQDN=skyhighsecurity.com"
Environment="DnsHealthCheck_Interval=5"
Environment="DnsHealthCheck_PrimaryFailureThreshold=1"
Environment="DnsHealthCheck_NonPrimaryFailureThreshold=20"
Log:
- The service failed to load the configuration due to missing a mandatory field.
Configuration:
[Service]
Environment="DnsHealthCheck_Interval=5"
Environment="DnsHealthCheck_PrimaryFailureThreshold=1"
Environment="DnsHealthCheck_NonPrimaryFailureThreshold=20"
Log:
- The service failed to load the configuration due to a value being out of range.
Configuration:
[Service]
Environment="DnsHealthCheck_FQDN=skyhighsecurity.com"
Environment="DnsHealthCheck_Interval=500000"
Environment="DnsHealthCheck_PrimaryFailureThreshold=1"
Environment="DnsHealthCheck_NonPrimaryFailureThreshold=2"
Log:
- The service loaded the configuration successfully with default values.
Configuration:
[Service]
Environment="DnsHealthCheck_FQDN=skyhighsecurity.com"
Log:
- The service disables a nameserver after the threshold is crossed.