Failover
CTE clients in an LDT Communication Group have a master/replica relationship. The first client that joins the group is initially designated as the master node; the other clients are replica nodes. All communications happen through the master node, so the master node must be functional for communications to persist.
A host can leave the LDT Communication Group in one of two ways:
- Graceful Exit: One CTE client is powered off, but there are other live CTE clients in the LDT Communication Group, so failover can occur and another CTE client in the group is elected to become the master.
- Ungraceful Failure: One or more of the CTE clients in an LDT Communication Group fail abruptly, typically due to a kernel crash, hardware failure, or power cycle.
Graceful Exit for the LDT Communication Group Master
There are two cases for graceful exits:
- Reboot: If the master node becomes unavailable because it is rebooted by the administrator or as part of a maintenance cycle, but there are other live CTE clients in the LDT Communication Group, then the LDT Communication Group automatically elects a replica CTE client to become the new master node and fails over to it.
- Power Off: When a master node powers off, a new master is automatically elected. However, if the initial master node is being decommissioned, you should remove it from the LDT Communication Group.
Any CTE client that is being decommissioned, regardless of its role in the LDT Communication Group, must be removed from the LDT Communication Group. See [Remove].
Ungraceful Shutdown/Failure for the LDT Communication Group Master
An ungraceful failure occurs when a CTE client fails abruptly due to a hardware or software failure: the administrator did not trigger the client to shut down, reboot, or become unreachable, but the client still transitioned into that state. This applies to the master node as well.
If an LDT Communication Group master node fails ungracefully, recovery is handled automatically: the remaining nodes elect a new master. In the meantime, LDT operations may fail for all LDT GuardPoints. Operations resume once the new LDT Communication Group master node is functional.
In the event that all, or a majority of, the CTE clients in an LDT Communication Group fail (ungraceful shutdown):
Ensure that all of the CTE clients in the LDT Communication Group are active (rebooted, powered on, and accessible through the network) to reestablish the cluster.
- All LDT NFS/CIFS GuardPoints must be unguarded before you remove a CTE client from the LDT Communication Group.
- Once a CTE client is removed from the LDT Communication Group, it must either be rebooted or shut down.
- Whenever a graceful exit or failover event is in progress for the LDT Communication Group, LDT operations are affected across the entire LDT Communication Group, which means that GuardPoints could be affected.
Failover for LDT GuardPoint Group
If a primary host exits gracefully, the LDT GuardPoint Group automatically fails over to a secondary host. All CTE clients guarding the share, or trying to guard the share, depend on the primary being available.
If the primary host fails ungracefully, an administrator must manually remove the primary from the group so that another host can become the primary and guarding can resume normally.
- To identify the primary node, type:
  # voradmin ldt group info /<guardpoint>     (Windows)
  # voradmin ldt group list <guardpoint>      (Linux)
- To remove a primary host from the group and automatically select another host as the primary, type:
  # voradmin ldt group remove <hostname> <guardpoint>
  Do not use this command in the case of a temporary failure, such as a crash followed by a quick reboot. If a secondary host fails ungracefully, you can use the same command to remove that host. (A worked example follows this list.)
- If a primary node crashes and reboots quickly, and is trying unsuccessfully to guard a share, you can repair it with the following command:
  # voradmin ldt group repair /<guardpoint>   (Windows)
  # voradmin ldt group repair <guardpoint>    (Linux)
  You can also use the repair command to remove a secondary node that is no longer guarding a GuardPoint but still shows in the Group Info list. This command repairs the host and keeps it in the group.
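For example, assuming a Linux client, a GuardPoint at /nfs/gp, and a primary host named Host-1 (illustrative values borrowed from the sample output later in this topic), the failover sequence after an ungraceful primary failure would look like this:

  # voradmin ldt group list /nfs/gp
  (identify which host holds the PRIMARY role)

  # voradmin ldt group remove Host-1 /nfs/gp
  (remove the failed primary so that another host in the group can become the primary)

If the primary only crashed and rebooted, use voradmin ldt group repair instead of removing it.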
Secondary Host Failure
In the event that a secondary host crashes, you must first remove that failed secondary host from each LDT GuardPoint Group that the host was a member of before allowing that CTE host to rejoin the same LDT GuardPoint Group after the host is restored. The following steps must be performed in this exact order for each GuardPoint that was enabled on the host prior to the crash. (You can get the list of GuardPoints from CipherTrust Manager.) For each GuardPoint:
1. On any CTE client that is a member of the LDT GuardPoint Group for that GuardPoint, run the voradmin ldt group list command to identify the failed host. The role of the failed host is shown as UNRESPONSIVE within the group. Type:
   # voradmin ldt group list <GuardPoint Path>
   Role          State    Hostname    GuardPoint Path
   UNRESPONSIVE  N/A      Host-3      N/A
   PRIMARY       JOINED   Host-1      /nfs/gp
2. On the primary CTE client for the GuardPoint, run the voradmin ldt group remove command to remove the failed host from the LDT GuardPoint Group. Type:
   # voradmin ldt group remove <hostname> <GuardPoint Path>
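For example, continuing the sample output above, the failed host Host-3 would be removed from the group for the GuardPoint /nfs/gp as follows (illustrative values only):

  # voradmin ldt group remove Host-3 /nfs/gp

After removal, the restored host can rejoin the same LDT GuardPoint Group, as described above.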