Detecting Loss of NAS connection to an LDT GuardPoint Group
This section describes how LDT over NFS detects the loss of a NAS (network-attached storage) connection to an LDT over NFS GuardPoint on all members of the LDT GuardPoint Group. LDT over NFS monitors LDT operations on NAS servers to detect the loss of an NFS connection and takes steps to recover from the loss of NAS access. This section also describes the procedures and considerations for recovering an LDT over NFS GuardPoint from the loss of a NAS connection on a member of the LDT GuardPoint Group.
Brief Overview of LDT Rekey Operations
Rekey operations within LDT GuardPoint Groups are executed on the primary host in coordination with the other members of the LDT GuardPoint Group. Coordination includes interlocking access to files between applications and rekey operations. It also includes the primary host updating persistent LDT metadata on behalf of application file or IO operations on any member of the LDT GuardPoint Group. Such active coordination is critical to the correct operation of LDT over NFS. Loss of the NAS connection on any member of the LDT GuardPoint Group disrupts the coordination between the primary host and the other members. LDT manages the loss of NAS access on all members of the LDT GuardPoint Group to ensure data and metadata correctness with minimal impact on production workloads.
The remainder of this section describes how LDT over NFS detects the loss of a NAS connection to an LDT over NFS GuardPoint on a member host (primary or secondary) of the group, and the procedures and considerations for recovering the GuardPoint from that loss.
Effects of loss of NAS connection on a GuardPoint
Loss of the NAS connection on any member of an LDT GuardPoint Group, including the primary host, negatively affects active LDT operations on the target NAS, the entire LDT GuardPoint Group, and applications accessing files inside the affected GuardPoints. Regardless of the LDT status (active or suspended), the connection loss on one member ripples through all members of the LDT GuardPoint Group. The following scenarios can occur when NAS access is lost while LDT operations are in the active or suspended state:
- Application file operations (read/write, truncate, rename, etc.) on files undergoing rekey can be blocked and consequently affect subsequent LDT operations on the same files. Rekey operations are delayed, or not started at all, as a result of pending file IO operations by applications.
- Pending LDT rekey IO operations do not complete and consequently block or prevent subsequent rekey operations from starting, and also block applications attempting to access those files.
- Rotation of a key will block launching rekey on the affected GuardPoints.
LDT over NFS I/O monitoring system to detect loss of NAS connection
To avoid the ripple effect of a lost NAS connection across an entire LDT GuardPoint Group, LDT monitors LDT-level operations that access LDT metadata, such as MDS and LDT rekey attributes, and that read or modify cipher-text data inside files undergoing rekey. LDT tracks the progress of each LDT operation that is in progress. If an operation fails to complete within 30 seconds, LDT declares the pending operation incomplete and takes action to recover from it.
LDT over NFS Monitoring Control Thread
LDT starts a special control thread on every CTE host to monitor LDT operations for potential timeouts due to NAS connection loss. The monitoring begins when the first LDT over NFS protected GuardPoint is enabled. This thread monitors LDT operations on all LDT protected GuardPoints on the CTE host.
The LDT over NFS IO monitoring system monitors only LDT-level operations. The control thread does not monitor application file operations on files in the GuardPoints that it is monitoring. If an application file operation hangs due to the loss of the NAS connection, the LDT over NFS monitoring control thread detects the loss only when an LDT operation starts on the same file and then times out.
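The timeout check described above can be pictured with a small sketch. This is not the product's implementation: the `PendingOp` record, the operation labels, and the function names are hypothetical, and only the 30-second limit is taken from the text. The sketch assumes the monitor keeps a list of in-flight LDT-level operations with their start times and periodically flags the ones that have been pending too long.

```python
import time
from dataclasses import dataclass

# The 30-second limit comes from the text; everything else is illustrative.
LDT_OP_TIMEOUT_SECONDS = 30.0


@dataclass(frozen=True)
class PendingOp:
    """Hypothetical record of one in-flight LDT-level operation."""
    guardpoint: str
    kind: str          # e.g. "metadata-update" or "rekey-io" (made-up labels)
    started_at: float  # seconds, read from a monotonic clock


def find_timed_out(pending, now, timeout=LDT_OP_TIMEOUT_SECONDS):
    """Return the operations that have been pending for longer than `timeout`."""
    return [op for op in pending if now - op.started_at > timeout]


def affected_guardpoints(pending, now, timeout=LDT_OP_TIMEOUT_SECONDS):
    """Return the sorted GuardPoints that have at least one timed-out operation."""
    return sorted({op.guardpoint for op in find_timed_out(pending, now, timeout)})


if __name__ == "__main__":
    now = time.monotonic()
    # Paths are placeholders, not values from a real deployment.
    pending = [
        PendingOp("/mnt/nas1/gp", "rekey-io", started_at=now - 45.0),
        PendingOp("/mnt/nas2/gp", "metadata-update", started_at=now - 5.0),
    ]
    print(affected_guardpoints(pending, now))
```

Because only LDT-level operations are tracked, a hung application read or write would not appear in `pending` at all, which matches the limitation noted above.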
Managing loss of NAS connection to a GuardPoint on the PRIMARY host
LDT takes action to recover from the loss of the NAS connection for LDT GuardPoint Groups that have multiple CTE clients as members. The action depends on whether the loss of the connection is detected on the primary host or on another member of the LDT GuardPoint Group. LDT takes no recovery action for an LDT GuardPoint Group consisting of a single member (a single node). The member of such a group is the LDT primary host for the GuardPoint; LDT operations are blocked while the NAS connection is lost and resume after the connection is restored. The recovery action for groups with multiple clients depends on the type of loss.
Managing Primary host losing NAS connection to a GuardPoint - Primary Demotion
As soon as the monitoring thread detects the loss of the NAS connection to a GuardPoint on the primary host, the control thread initiates Primary Self-Demotion to demote the primary host from the primary role and responsibility for the GuardPoint. In this scenario, LDT on the primary host starts the demotion process for the affected GuardPoint in order to elect another member of the LDT GuardPoint Group for the primary role. Election of another member triggers the failover of LDT operations from the demoted primary host to the newly elected primary host. The election and the failover process are transparent and enable LDT to continue rekey operations using the newly elected primary host for all healthy members of the LDT GuardPoint Group.
The detection and recovery action resumes LDT operations and is fully transparent. It does not involve administrator intervention except when finishing rekey operations that were in progress on the demoted primary host when the loss of the NAS connection was detected. From the perspective of the newly promoted primary host, those files are in LDT INCOMPLETE status, and resuming rekey on LDT INCOMPLETE files requires administrator intervention.
When the failover from the primary role occurs, the newly elected primary host identifies the files undergoing rekey and marks those files as LDT INCOMPLETE status. At this stage, the self-demoted primary host remains in the LDT GuardPoint Group as an active member but without access to the NAS server. Due to the loss of the NAS access, LDT on the self-demoted primary host blocks all new user attempts to access files in the GuardPoint. It then continues to participate in the transformation of the remaining files in the GuardPoint with the newly elected primary host.
Rekey Completion on GuardPoint with INCOMPLETE Files
Although the newly-elected primary host resumes rekey and transforms the remaining files in the GuardPoint, it does not resume transformation of INCOMPLETE files until the self-demoted primary host has been recovered through manual administrator intervention. Files in INCOMPLETE rekey status are therefore the last files in the GuardPoint to be rekeyed, after the administrator intervention.
Recovering self-demoted primary host
The self-demoted primary host can be recovered manually by shutting down the host so that its membership is removed from the LDT GuardPoint Group. A successful shutdown that halts the host automatically removes the host membership from the LDT GuardPoint Group. Therefore, it is critical for the self-demoted primary host to reach the halt state at the completion of the shutdown. However, the shutdown process may not reach the halt state because of IO operations left pending by the loss of the NAS connection.
Shutdown failure on self-demoted primary host
If you are using LDT over NFS with GuardPoints across multiple NAS shares, and a primary node loses the connection with a NAS server but does not reach the halt state, then the other GuardPoints on the other NAS shares that are also guarded on that self-demoted primary node must be manually removed from the LDT GuardPoint Group for recovery.
When you must manually force the host to halt, this is referred to as an ungraceful failure. Remedy this issue from the newly promoted primary server:
- Remove the self-demoted primary host from the GuardPoint. Type:
voradmin ldt group remove <hostname_of_self-demoted primary> <guardpoint_path>
Note
- When you run the voradmin ldt group remove command for other GuardPoints, from other NAS servers present on that node, and the node is the primary for any of those GuardPoints, primary failover is triggered.
- Repeat the command on any other GuardPoints on other NAS servers that are guarded by the self-demoted primary host.
- Run the recover command for the GuardPoints that were on the node that lost connection to the NAS server. Type:
# voradmin ldt demotion recover <guardpoint_pathname>
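The order of the steps above can be sketched as a small helper that only builds the command lines and never runs them. The function name, host name, and paths are hypothetical. The argument order follows the placeholders shown in the steps, so check it against the voradmin help on your system before running anything.

```python
import shlex


def ungraceful_recovery_commands(demoted_host, lost_guardpoints, other_guardpoints=()):
    """Build, in order, the voradmin commands for the steps above.

    Nothing is executed; the commands are returned as argument lists so an
    administrator can review them before running them on the new primary.
    """
    commands = []
    # First two steps: remove the self-demoted host from every GuardPoint it guards,
    # starting with the GuardPoints that lost their NAS connection.
    for guardpoint in list(lost_guardpoints) + list(other_guardpoints):
        commands.append(["voradmin", "ldt", "group", "remove", demoted_host, guardpoint])
    # Last step: recover the GuardPoints that lost their NAS connection.
    for guardpoint in lost_guardpoints:
        commands.append(["voradmin", "ldt", "demotion", "recover", guardpoint])
    return commands


if __name__ == "__main__":
    # Host name and paths are placeholders, not values from a real deployment.
    for argv in ungraceful_recovery_commands("host-a", ["/mnt/nas1/gp"], ["/mnt/nas2/gp"]):
        print(shlex.join(argv))
```

Returning argument lists rather than shell strings keeps paths with spaces intact if the commands are later passed to a process launcher.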
For more information, see the following two sections:
Self-demoted primary host reaches HALT state
If the shutdown process on the self-demoted primary host does reach the halt state, then the self-demoted primary host has already been removed from the LDT GuardPoint Group. Your next step in the recovery process depends on the status of the host's NAS connection.
If a loss of NAS connection persists, disable the GuardPoints on NAS servers that cannot be accessed on the self-demoted primary host:
- Change the guard status of the GuardPoints configured for Auto Guard to remain disabled on the self-demoted primary host.
- Do not enable GuardPoints configured for Manual Guard if the self-demoted primary host has rebooted.
You can then proceed with the following voradmin command to resume rekey operations on the files in INCOMPLETE status. This command must be executed on the newly elected primary host.
# voradmin ldt demotion recover <guardpoint_pathname>
If the loss of the NAS connection has been resolved by the time the self-demoted primary host reboots, you can allow the CTE start-up services to enable the GuardPoints configured for Auto Guard on the host, or you can manually enable a GuardPoint configured for Manual Guard. After enabling the GuardPoints, run the following command on the newly-elected primary host to resume rekey on files currently in INCOMPLETE status:
# voradmin ldt demotion recover
Upon completion of the recovery steps, the remaining files in INCOMPLETE status are rekeyed and the GuardPoint itself transitions to the rekeyed state.
Note
Self-demoted primary hosts must be rebooted after the promotion of a member to primary host.
Handling loss of the NAS connection to a GuardPoint on a non-PRIMARY host
If a non-primary host loses its NAS connection, the LDT over NFS IO monitoring thread on that host detects the loss and triggers self-isolation from the LDT GuardPoint Group.
The outcome of self-isolation is that LDT on the non-primary host blocks all users’ access to the affected GuardPoint and begins acknowledging all LDT requests from the primary host. Effectively, it does not perform subsequent operations requested by the primary host, but it sends positive responses to the primary host so that the primary host can continue rekey operations.
The steps to recover a self-isolated secondary host are the same as those for recovering a self-demoted primary host.
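The role-dependent responses described in this section can be summarized in a short sketch. The `Action` enum and the `action_on_nas_loss` function are hypothetical names used only for illustration; the mapping itself restates the text (a single-member group waits for the NAS, a primary host self-demotes, a non-primary host self-isolates).

```python
from enum import Enum


class Action(Enum):
    WAIT_FOR_NAS = "block LDT operations until the NAS connection returns"
    SELF_DEMOTE = "primary self-demotion and election of a new primary"
    SELF_ISOLATE = "self-isolation of the non-primary host"


def action_on_nas_loss(is_primary, group_size):
    """Map a host's role and group size to the response described in the text."""
    if group_size < 1:
        raise ValueError("an LDT GuardPoint Group has at least one member")
    if group_size == 1:
        # Single-member group: that member is the primary and no recovery is attempted.
        return Action.WAIT_FOR_NAS
    return Action.SELF_DEMOTE if is_primary else Action.SELF_ISOLATE


if __name__ == "__main__":
    for is_primary, size in [(True, 1), (True, 3), (False, 3)]:
        print(is_primary, size, action_on_nas_loss(is_primary, size).name)
```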
LDT over NFS demotion-related Alarms to CipherTrust Manager
LDT sends the following alarms to CipherTrust Manager during primary host self-demotion, election of a new primary host after demotion, recovery of the self-demoted primary host, and recovery and transformation of the INCOMPLETE files on the new primary host:
[CGS3347i] LDT over NFS-ALERT: Primary host of GuardPoint [GuardPoint] demoting
LDT on the primary host sends this alarm to notify of the loss of the NAS connection on the primary host, for the specified GuardPoint, and the initiation of the self-demotion process.
[CGS3348i] LDT over NFS-ALERT: Primary host of GuardPoint [GuardPoint] demoted
LDT on the self-demoted primary host sends this alarm for the specified GuardPoint when the self-demotion process completes.
[CGS3349e] LDT over NFS-ALERT: Demotion of Primary host of GuardPoint [GuardPoint] failed, error [ErrorNumber]
LDT on the primary host sends this alarm if the self-demotion attempt fails for the specified GuardPoint. The error code included in the alarm indicates the reason for failure. Notify Thales Support in the event of this alarm.
[CGS3355i] LDT over NFS-ALERT: INCOMPLETE file. After DEMOTION, Admin intervention is required for full recovery, GuardPoint [GuardPoint] objID [InodeNumber] error [ErrorNumber]
LDT on the newly-elected primary host sends this alarm for each file detected in INCOMPLETE status.
[CGS3357i] LDT over NFS-ALERT: Manual recovery invoked to recover the DEMOTED GuardPoint[GuardPoint]
LDT sends this alarm from the newly-elected primary host when the administrator executes the voradmin command to begin the recovery process.
[CGS3358e] LDT over NFS-ALERT: DEMOTED GuardPoint recovered on the newly-elected primary host while the old primary host is still alive, objID [InodeNumber]
LDT sends this alarm from the self-demoted primary host before the recovery steps, if the self-demoted host receives a request for an LDT operation on a file in INCOMPLETE status from the newly-promoted host. This alarm indicates that the newly promoted host has begun recovery of files in INCOMPLETE status without first recovering the self-demoted primary host.
[CGS3354e] LDT over NFS-ALERT: Manual recovery failed on DEMOTED GuardPoint [GuardPoint] objID [InodeNumber] error [ErrorNumber]
LDT sends this alarm from the newly elected primary host, during recovery of an INCOMPLETE file, when it encounters any errors during LDT rekey related to the recovery of the file or during the resumption of rekey on the file.
[CGS3359I] LDT-NFS-ALERT: Secondary host DEMOTED due to loss of NAS connection, GuardPoint [GuardPoint]
LDT on a self-demoted secondary host sends this alarm for the specified GuardPoint when the self-demotion process completes on the self-demoted secondary.
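When scanning collected alarm text for the messages listed above, a small parser can help. The following is a minimal sketch that assumes each alarm starts with a bracketed ID of the form `[CGS<number><letter>]`, as in the list above. The `parse_alarm` function and the `FOLLOW_UP` table are hypothetical; the table covers only the two alarms whose descriptions name an explicit follow-up. The trailing letter is kept as-is (lowercased) without interpreting its meaning.

```python
import re

# Matches the "[CGS<number><letter>] <text>" shape of the alarms listed above.
ALARM_RE = re.compile(r"^\[(CGS\d+)([A-Za-z])\]\s*(.*)$")

# Follow-ups that the alarm descriptions above state explicitly.
FOLLOW_UP = {
    "CGS3349": "self-demotion failed; notify Thales Support",
    "CGS3355": "INCOMPLETE file; administrator intervention required",
}


def parse_alarm(line):
    """Split one alarm line into ID, suffix letter, text, and known follow-up."""
    match = ALARM_RE.match(line.strip())
    if match is None:
        return None
    alarm_id, suffix, text = match.groups()
    return {
        "id": alarm_id,
        "suffix": suffix.lower(),  # the list above uses "i", "I", and "e"
        "text": text,
        "follow_up": FOLLOW_UP.get(alarm_id),
    }


if __name__ == "__main__":
    # The GuardPoint path and error number are placeholders.
    sample = ("[CGS3349e] LDT over NFS-ALERT: Demotion of Primary host of "
              "GuardPoint [/mnt/nas1/gp] failed, error [5]")
    print(parse_alarm(sample))
```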