How To Troubleshoot High Availability Issues In Sophos XG

High availability is a clustering technology that is used to maintain uninterrupted service in the event of power, hardware or software failure.

HA Configuration Modes:

Active Passive: In active-passive mode, the primary device processes all network traffic. The auxiliary device participates in the cluster but does not process any traffic. It stays in standby until the primary appliance fails, at which point it becomes the primary device and processes all network traffic.

Active Active: In active active mode, both devices will process the network traffic. All requests will go to the primary device, which will do load balancing and forward the traffic to the auxiliary device for processing when needed. If the primary device fails, the auxiliary device will become the primary device and will process all of the network traffic.

HA Prerequisites:

Both devices in the high availability cluster must have the same hardware model, hardware revision and firmware version.

Both devices must be registered.

Both devices must have the same number of interfaces.

In Active Active, both devices require a license with the same subscription modules enabled.

In Active Passive, one license is required for the primary device. No license is needed for the auxiliary device.

Cables to all the monitored ports on both devices must be connected. Connect the dedicated HA link ports of the two devices with either a crossover or a straight-through cable.

Specify the same port as the HA link port on both devices. (If you use port C, it must be the HA link port on both devices.)

On both devices, the dedicated HA link port must be a member of the DMZ zone and must have a unique IP address. The HA link port IP addresses of both peers must belong to the same subnet. The peers use this link to communicate cluster information and to synchronize with each other.

SSH must be enabled on DMZ zone.

Cellular WAN configuration is not supported in any HA mode.
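
The prerequisites that depend only on device attributes can be compared mechanically before HA is enabled. The following is a minimal Python sketch, not a Sophos tool: the `Device` record and its field names are hypothetical, and the values are assumed to be collected by hand from each appliance.

```python
from dataclasses import dataclass
from ipaddress import ip_interface
from typing import List


@dataclass
class Device:
    # Hypothetical record, filled in by hand from each appliance.
    model: str
    firmware: str
    hw_revision: str
    interface_count: int
    ha_port: str
    ha_zone: str
    ha_ip: str  # HA link address with prefix length, e.g. "10.10.10.1/24"


def prerequisite_problems(a: Device, b: Device) -> List[str]:
    """Return the unmet HA prerequisites found (empty list if none)."""
    problems = []
    if a.model != b.model:
        problems.append("hardware model differs")
    if a.firmware != b.firmware:
        problems.append("firmware version differs")
    if a.hw_revision != b.hw_revision:
        problems.append("hardware revision differs")
    if a.interface_count != b.interface_count:
        problems.append("number of interfaces differs")
    if a.ha_port != b.ha_port:
        problems.append("HA link port differs")
    for name, dev in (("first", a), ("second", b)):
        if dev.ha_zone.upper() != "DMZ":
            problems.append(f"HA link port of {name} device is not in the DMZ zone")
    ip_a, ip_b = ip_interface(a.ha_ip), ip_interface(b.ha_ip)
    if ip_a.ip == ip_b.ip:
        problems.append("HA link IP addresses are not unique")
    if ip_a.network != ip_b.network:
        problems.append("HA link IP addresses are in different subnets")
    return problems
```

An empty result only means that the listed attributes match. Registration, licensing and cabling still have to be verified on the devices themselves.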

High Availability Process:

HA process occurs in three steps:

  1. Sanity Check
  2. Prepare System For HA
  3. Configuration Sync

Sanity Check:

  1. The primary appliance connects to the peer appliance on port 22 using the configured passphrase.
  2. It checks the model number, vendor details and firmware version on both appliances.
  3. HA status should be disabled on the auxiliary appliance.
  4. There should be no alias or VLAN on the dedicated HA port.
  5. Overriding the MAC address is not supported on the dedicated port.
  6. Speed/duplex and MTU/MSS should be set to default on both appliances.
  7. SSH should be enabled on the DMZ zone for the dedicated port so that the primary appliance can connect on port 22 and push the configuration file.
  8. All monitored ports should be in the up state.

Prepare System For HA

  1. The primary appliance syncs the /conf/msync.conf file (the config file for HA) with the auxiliary appliance.
  2. The primary appliance syncs /conf/ctsyncd.conf (connection tracking for HA) with the auxiliary appliance.
  3. It generates a virtual MAC address and syncs this information with the peer appliance.
  4. The original MAC addresses are synced on both appliances.
  5. It generates the configuration file for HA.

Configuration Sync:

  1. The msyncd service performs a soft boot on the auxiliary appliance.
  2. Once the auxiliary appliance comes up, it joins the HA cluster.
  3. While the peer is joining HA, it syncs the configuration from the primary appliance, and the primary appliance remains in a freeze state. In this state it does not accept any configuration changes until the HA state becomes primary-auxiliary.

Troubleshooting HA:

Before troubleshooting, you need to know the log files relevant to HA:

  1. /log/msync.log for the HA configuration service.
  2. /log/ctsyncd.log for the conntrack synchronization service.
  3. /log/applog.log for HA configuration and status updates.
  4. /log/csc.log for the central service that manages all services.

You can check the HA status with this command: grep "ha:" /log/applog.log

The following entries appear in the primary appliance log when HA is enabled from the GUI. The primary appliance performs three steps to form the cluster: it runs the sanity check, prepares the system for HA, and finally syncs the configuration with the peer appliance.

In the primary appliance log file, “fwm:enableha successfully done” shows that the sanity check completed successfully on the local appliance.

“Enableha on peer done” shows that HA is enabled on the peer appliance.

“enableha: HA is enabled now” shows that the sanity check completed successfully on the peer appliance.

Once the sanity check has been performed on the appliances, the primary appliance changes its HA state to 5:2, meaning it first becomes a standalone device before moving to the primary-auxiliary state.
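
When a copy of /log/applog.log is reviewed away from the appliance, the same filtering and the three messages quoted above can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, and matching is done by case-insensitive substring because the exact layout of the log lines is an assumption.

```python
# Message fragments quoted in this article, mapped to what they indicate.
STAGES = [
    ("fwm:enableha successfully done", "sanity check passed on local appliance"),
    ("enableha on peer done", "HA enabled on peer appliance"),
    ("enableha: ha is enabled now", "sanity check passed on peer appliance"),
]


def ha_lines(log_text):
    """Yield the lines containing 'ha:', like grep "ha:" on the log file."""
    for line in log_text.splitlines():
        if "ha:" in line:
            yield line


def stages_seen(log_text):
    """Return the descriptions of the stages found in the log text."""
    lowered = log_text.lower()
    return [desc for fragment, desc in STAGES if fragment in lowered]
```

If `stages_seen` returns fewer than three entries, the first missing stage suggests where the HA enable sequence stopped.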

Issue-1: HA could not be enabled

Troubleshooting:

  1. Verify whether all HA prerequisites are met.
  2. Both appliances must be of same model.
  3. Both appliances should have same firmware version.
  4. Dedicated port must be in DMZ zone.
  5. SSH access should be enabled for DMZ zone on both appliances.
  6. No alias or VLAN should be configured on dedicated port.
  7. Link speed/duplex and MTU/MSS should be default on dedicated port.
  8. Check connectivity to auxiliary appliance LAN IP and dedicated port IP address. Verify that all cables are properly connected.
  9. Check whether HA is configured on the auxiliary appliance.
  10. Verify passphrase used in HA configuration.
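
The connectivity check in step 8 can also be run from a workstation that can reach the dedicated port IP address. This is a minimal Python sketch using only the standard library. It confirms that TCP port 22 accepts connections, which is what the primary appliance needs, but it does not verify the passphrase or the SSH service itself. The function name is hypothetical.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A False result points to cabling, addressing, or SSH not being enabled for the DMZ zone.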

Sample logs:

tail -f /log/applog.log | grep ha:
Feb 27 04:40:48 ha: exchangehakeys: set peer status “ssh keys exchanged” failed
Feb 27 04:40:48 ha: configurehassh: exchange keys failed
Feb 27 04:40:48 ha: enableha: configure SSH faile

In the log above, SSH is not enabled for the DMZ zone under Administration > Device access, which is why the SSH key exchange failed.

tail -f /log/applog.log | grep ha:
Sep 01 19:29:41 enableha: enableha called from GUI
Sep 01 19:29:43 enableha: peer sanity check failed !!!

If the log shows that the sanity check failed, as in the output above, either the appliance did not meet the HA prerequisites or the auxiliary device is not set with the correct passphrase.

On the auxiliary device, go to System > System Services > HA, set Initial HA Device State to Auxiliary, and define a passphrase. Then access the primary device, enter the same passphrase as configured on the auxiliary, and activate HA.
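
The two failure patterns from the sample logs can be matched automatically when scanning a longer log. The Python sketch below is a hedged illustration: it knows only the message fragments shown in this article, the causes are the ones given here, and the `diagnose` helper is hypothetical.

```python
# Fragments taken from the sample logs in this article, with the cause
# the article gives for each one.
KNOWN_ERRORS = {
    "exchange keys failed": "SSH is probably not enabled for the DMZ zone",
    "peer sanity check failed": (
        "a prerequisite is unmet or the passphrase on the auxiliary is wrong"
    ),
}


def diagnose(log_text):
    """Return (line, likely cause) pairs for known HA error lines."""
    findings = []
    for line in log_text.splitlines():
        for fragment, cause in KNOWN_ERRORS.items():
            if fragment in line:
                findings.append((line.strip(), cause))
    return findings
```

Lines that match neither fragment are ignored, so an empty result does not mean the log is free of errors.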

Issue-2: Failover happened

Troubleshooting:

A failover is triggered when a port that you have added as a monitored port goes down.

If a failover happens, you can verify the status of the monitored ports by running the “dmesg” command in the advanced shell.

Sample logs:

dmesg | grep PortE
[91050.685754] e1000e: PortE NIC Link is Down
[91056.569355] e1000e: PortE NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[94001.480011] e1000e: PortE NIC Link is Down
[94982.037735] e1000e: PortE NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[94987.696290] e1000e: PortE NIC Link is Down

To resolve this issue, check the cable connections and the status of the monitored ports, and fix any port that is not staying up.
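
To see which monitored port is flapping, the link events in saved dmesg output can be counted per port. This is a minimal Python sketch that assumes the line format shown in the sample above. Other NIC drivers may log link changes differently, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as: [91050.685754] e1000e: PortE NIC Link is Down
LINK_EVENT = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+\S+:\s+(\S+) NIC Link is (Up|Down)"
)


def link_down_counts(dmesg_text):
    """Count 'Link is Down' events per port name."""
    counts = Counter()
    for match in LINK_EVENT.finditer(dmesg_text):
        port, state = match.group(2), match.group(3)
        if state == "Down":
            counts[port] += 1
    return counts
```

A port with repeated Down events is the first candidate for a cable or switch-port check.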

Issue-3: Both devices become standalone

Troubleshooting: This issue happens when there is a problem with the dedicated HA port or with the dedicated interface link connecting the two firewalls.

To resolve it, first replace the cable that connects the two firewalls. Then disable HA, connect the dedicated interface again and configure HA.

Hope this article helps you.