Failover for the LDT Communication Group Master and the LDT GuardPoint Group
Failover for the LDT Communication Group Master
CTE clients in an LDT Communication Group have a master/replica relationship. The first client that joins the group is initially designated as the master node; the other clients are replica nodes. All communications happen through the master node, so the master node must be functional for communications to persist.
Graceful Failover for the LDT Communication Group Master
There are two cases for graceful failover:
- Reboot

  If the master node becomes unavailable because it's rebooted by the administrator or as part of a maintenance cycle, but there are other live CTE clients in the LDT Communication Group, then the LDT Communication Group automatically elects a replica CTE client to become the new master node and fails over to it.

- Power Off

  When a master node powers off, a new master is automatically elected, but if the initial master node is being decommissioned, then you should remove it from the LDT Communication Group.
Note
Any CTE client that is being decommissioned, regardless of its role in the LDT Communication Group, must be removed from the LDT Communication Group. See [Remove].
Ungraceful Shutdown/Failure for the LDT Communication Group Master
An ungraceful failure is when a CTE client fails abruptly due to a hardware or software failure: the administrator did not trigger the client to shut down, reboot, or become unreachable, but the client still transitioned into that state. This applies to the master node as well.
If an LDT Communication Group master node fails ungracefully, the failure is handled automatically: the remaining nodes elect a new master. In the meantime, LDT operations may fail for all LDT GuardPoints; operations resume once a new LDT Communication Group master node is functional.
If all, or a majority of, the CTE clients in an LDT Communication Group fail (ungraceful shutdown), ensure that all of the CTE clients in the LDT Communication Group are active (rebooted, powered on, and accessible through the network) to reestablish the cluster.
- All LDT NFS/CIFS GuardPoints must be unguarded before you remove a CTE client from the LDT Communication Group.
- Once a CTE client is removed from the LDT Communication Group, it must either be rebooted or shut down.
- Whenever a graceful exit, or failover event, is in progress for the LDT Communication Group, LDT operations will be affected across the entire LDT Communication Group, which means that GuardPoints could be affected.
Failover for the LDT GuardPoint Group Master
Gracefully
Graceful failover occurs when the primary node exits cleanly or gracefully. The LDT GuardPoint Group then automatically fails over to one of the available secondary hosts, and that secondary host is promoted to the new primary host for the LDT GuardPoint Group.
- Triggered when the GuardPoint is disabled on the primary
- Triggered when the secfs service is manually stopped on the primary
- Triggered during a reboot if secfs stops cleanly before shutdown
- Triggered when the primary node is removed from the LDT Communication Group
All CTE clients guarding the share, or trying to guard the share, depend on the primary being available.
Graceful Exits
Graceful exits occur when a node in the LDT GuardPoint Group:
- Performs a reboot
- Performs a shutdown
- Disables a GuardPoint from CipherTrust Manager
- Unguards a GuardPoint
Note
Remember that if the primary client exits gracefully, then a new primary is elected through failover.
Ungracefully
Ungraceful failover occurs when the primary node fails unexpectedly or ungracefully. An administrator then needs to manually remove the primary from the group so that a secondary member can become the primary host. This also enables the failed primary host to guard again and become a member of the group in the future.
- Triggered if the primary crashes, or reboots, without a clean secfs shutdown. A secondary node is promoted to primary after executing a voradmin command.
To identify the primary node, type:
voradmin ldt group list <absolute guardpoint> (Windows)
voradmin ldt group list <guardpoint> (Linux)
Note
You can only run this command on a host where the GuardPoint is currently guarded.
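For illustration, a hypothetical run against a GuardPoint at /test/path1 might look like the following (the hostnames, path, and output columns are assumptions modeled on the sample output later in this section; the host listed with the PRIMARY role is the primary node):

voradmin ldt group list /test/path1
Role          State    Hostname   GuardPoint Path
PRIMARY       JOINED   Host-1     /test/path1
SECONDARY     JOINED   Host-2     /test/path1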
To restore, or repair, the original node after an ungraceful failover, type the following on an LDT GuardPoint Group secondary node:
voradmin ldt group repair <failed host> <guard path>
Example: If nodes A, B, and C are part of the LDT GuardPoint Group protecting the network share /test/path1, and the primary node crashes and comes back online, run the following command on either node B or C:
# voradmin ldt group repair <primary IP> <absolute GuardPoint> (Windows)
# voradmin ldt group repair <primary IP> <GuardPoint> (Linux)
You can also use the repair command on a secondary node that is no longer guarding a GuardPoint but still shows in the Group Info list. The command repairs the host and keeps it in the group.
Example: If nodes A, B, and C are part of the LDT GuardPoint Group protecting the network share /test/path1, and the primary node crashes and does not come back online, run the following command on either node B or C to remove the primary host from the group and automatically select a secondary host as the new primary:
voradmin ldt group remove A /test/path1
If a secondary host fails ungracefully, you can use the same command to remove that host.
Note
If you plan to reboot a primary node for maintenance or a software update, a graceful primary failover is typically triggered as part of the secfs service shutdown. However, if stopping the secfs service takes too long and the system proceeds to reboot before the shutdown completes, the result can be an ungraceful failover. To ensure a graceful primary failover, Thales recommends manually stopping the secfs service before initiating the reboot. This allows the GuardPoint to be cleanly disabled and failover to proceed as expected.
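On Linux, for example, the stop-then-reboot sequence might look like the following minimal sketch (it uses the secfs stop command shown later in this section; the standard reboot command and the ordering are assumptions):

/etc/vormetric/secfs stop
reboot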
Note
Do not use the remove command in the case of a temporary failure, where the failed primary host crashes and then reboots.
Ungraceful Exits
Ungraceful exits occur when a node in the LDT GuardPoint Group:
- Is forced to power off, shut down, or reboot
- Suffers a hardware failure
- Power cycles
- Has a kernel crash
Note
Remember that if the primary host fails ungracefully, then an administrator must manually remove the primary.
Troubleshooting Ungraceful Exits
After the reboot, if you see the following logs in secfsd.log, and if secfsd -status guard displays the GuardPoint as unguarded, then the reboot did not happen gracefully and the scenario is classified as an ungraceful exit.
lgs_joining: host(xyz_1.com) is already PRIMARY for GPID(060cd923-6026-4341-81e3-f8cc367df90d) in comm server. It seems like a crash+reboot scenario. Informing others about primary failure
If this occurs, follow the node repair steps in the previous section, Failover for the LDT GuardPoint Group Master, to recover the structure.
- Linux secfsd.log: /var/log/vormetric/secfsd.log
- Windows secfsd.log: C:\ProgramData\Vormetric\DataSecurityExpert\agent\log\secfsd.log
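On Linux, for instance, you might confirm an ungraceful exit with a quick check like the following sketch (the grep pattern is illustrative and simply matches part of the log message above; secfsd -status guard and the log path are the ones documented in this section):

grep "is already PRIMARY" /var/log/vormetric/secfsd.log
secfsd -status guard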
Secondary Host Failure
If a secondary host crashes, you must first remove the failed secondary host from each LDT GuardPoint Group that it was a member of before allowing that CTE host to rejoin the same LDT GuardPoint Group after the host is restored. The following steps must be performed in exact order for each GuardPoint that was enabled on the host prior to the crash. (You can get the list of GuardPoints from CipherTrust Manager.) For each GuardPoint:
- On any CTE client that is a member of the LDT GuardPoint Group for that GuardPoint, run the voradmin ldt group list command to identify the failed host. The role of the failed host is UNRESPONSIVE within the group. Type:

  # voradmin ldt group list <GuardPoint Path>
  Role          State    Hostname   GuardPoint Path
  UNRESPONSIVE  N/A      Host-3     N/A
  PRIMARY       JOINED   Host-1     /nfs/gp
- On the primary CTE client for the GuardPoint, run the voradmin ldt group remove command to remove the failed host from the LDT GuardPoint Group, as shown in the example below. Type:

  # voradmin ldt group remove <hostname> <GuardPoint Path>
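For example, using the sample output above (Host-3, Host-1, and /nfs/gp are illustrative names), you would run the following on the primary client, Host-1, to drop the unresponsive host from the group:

# voradmin ldt group remove Host-3 /nfs/gp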
Failure Scenario Examples
Following are some basic failure scenarios and suggested recovery steps:
Secondary node crash (Non-Triad node)
Scenario
- Nodes A, B, C, D, E, and F are part of the LDT Communication Group.
- Nodes A, B, and C are the Triad nodes.
- An NFS/CIFS share is guarded on nodes D, E, and F, forming an LDT GuardPoint Group, where node D is the LDT primary.
After the LDT GuardPoint Group has been active for a while, node E, a secondary node, crashes.
Troubleshooting Scenario
If node E remains down:
- From the LDT primary (node D), verify the status of the LDT GuardPoint Group. Type:

  voradmin ldt group check <guard path>

- If node E displays as UNKNOWN, even after rebooting, check if the network share is mounted on E. If it's not mounted, fix the network connectivity. Once resolved, node E should rejoin the LDT GuardPoint Group as a secondary.
- If node E does not resume, remove the node from the LDT GuardPoint Group, as in the example below. On the LDT primary node D, type:

  voradmin ldt group remove E <guard path>
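As a concrete sketch, assuming the share guarded on nodes D, E, and F is /test/path1 (an illustrative path), the sequence run on node D might be:

voradmin ldt group check /test/path1
voradmin ldt group remove E /test/path1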
Primary Node Crash (Non-Triad Node)
Scenario
Same setup as above, with node D acting as the LDT primary. After some time, node D crashes.
Troubleshooting Scenario
- If node D does not resume, remove the node from the LDT GuardPoint Group. On any secondary node, type:

  voradmin ldt group remove D <guard path>

- If node D resumes, but fails to guard the share, repair the LDT GuardPoint Group from any of the secondary nodes (E or F), as in the example below. Type:

  voradmin ldt group repair D <guard path>
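For instance, with the illustrative share /test/path1, you would run one of the following on node E or F, depending on whether node D stays down or comes back online:

voradmin ldt group remove D /test/path1
voradmin ldt group repair D /test/path1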
Communication Group in Inconsistent State
Scenario
- Nodes A, B, C, D, and E are part of the LDT Communication Group.
- A, B, and C are Triad nodes.
- An NFS/CIFS share is guarded on nodes C, D, and E, with D as the primary and C and E as secondaries.

If nodes A and B (both Triad nodes) fail at the same time (gracefully or ungracefully), the LDT Communication Group enters an inconsistent state.
Troubleshooting Scenario
Restart the secfsd service on all nodes to restore the LDT Communication Group.
For Windows:
- Go to Control Panel > Services (local).
- Select secfsd.
- Select Stop the Service.
- Wait a few moments, then select Start the service.
For Linux:
- Type: /etc/vormetric/secfs stop
- Type: /etc/vormetric/secfs start
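To restart secfs on several Linux nodes in one pass, a minimal sketch assuming passwordless SSH access to each node (the hostnames are hypothetical) is:

for host in nodeA nodeB nodeC nodeD nodeE; do
    ssh "$host" "/etc/vormetric/secfs stop && /etc/vormetric/secfs start"
done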
Primary Node Ungraceful Reboot (Triad Node)
Scenario
- Nodes A, B, C, and D are in the LDT Communication Group.
- Nodes A, B, and C are the Triad nodes.
- An NFS/CIFS share is guarded on all nodes.
- Node A is the primary for the GuardPoint.

Node A undergoes maintenance or a software update, and is rebooted with the expectation of a graceful failover to one of the secondary nodes. However, failover does not occur, and once node A comes back up, it fails to guard the network share; the output of secfsd -status guard displays: Group Join Failed, needs LDT repair
Explanation
During the shutdown, the secfs service is stopped, which disables all GuardPoints. A successful disable operation triggers failover to another secondary node. However, if the system proceeds with shutdown before the service is cleanly stopped, it results in an ungraceful shutdown, and the failover does not occur.
Troubleshooting Scenario
- Run the repair command from any active secondary node. Type:

  voradmin ldt group repair <failed hostname> <guard path>
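For example, with node A as the failed primary and the illustrative share /test/path1, running the following on node B, C, or D repairs node A so that it can guard the share again:

voradmin ldt group repair A /test/path1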