Recovery

After a failure, the recovery process is typically straightforward. Depending on the deployment, an automated or manual recovery process might be appropriate. In either case there is no need to restart an application.

Automatic recovery

With automatic recovery, the client periodically attempts to recover a member for as long as that member is failed. The frequency of these attempts is adjustable and the number of retries can be limited. Each time a reconnection is attempted, one application command experiences a slight delay while the client attempts to recover. For this reason, the retry frequency cannot be set any faster than once per minute. Even if a manual recovery process is selected, the application does not need to be restarted: run the client recovery command, and the recovery logic inside the client makes a recovery attempt the next time the application uses the HSM. As part of recovery, any key material created while the member was offline is automatically replicated to the recovered unit.

Automatic recovery is disabled by default. Use the command hagroup retry to turn it on or off. If retry=0, automatic recovery is disabled. Any other retry value enables automatic recovery.

Failed units

Sometimes the failure of a device is permanent. In this case, the only solution is to deploy a new member to the group: remove the failed unit from the HA group, add a new device to the group, and then start the recovery process. The running clients automatically resynchronize keys to the new member and start scheduling operations to it. See Adding, Removing, Replacing, or Reconnecting HA Group Members for more information.

Manual recovery

Finally, sometimes both an HSM and application fail at the same time. If no new key material was created while an HSM was offline, the recovery is straightforward: simply return the HSM to service and then restart the application. However, if new key material was created after an HSM failed but before the application failed, a manual re-synchronization (using the hagroup synchronize command) might be required.

To perform a manual recovery, first confirm which member, or members, have the current key material (normally the unit that was online at the time the application failed), and put those members back in service with the application. Then, for each member that has stale key material (a copy of an object that was deleted, or an old copy of an object whose attributes were changed), make sure the member is not part of the HA group and delete all of its key material. Be particularly careful that the member is not part of the HA group, or the action might destroy active key material by causing an accidental synchronization during the delete operation. After the HSM is cleared of key material, rejoin it to the group, and the synchronization logic automatically repopulates the device's key material from the active units.

Usage

When a client is configured to use automatic recovery, the manual recovery commands must not be used. Invoking them can cause multiple concurrent recovery processes, which result in error codes and possible key corruption.

Most customers should enable auto-recovery in all configurations. We anticipate that the only reason you might choose manual recovery is to avoid the periodic latency that recovery attempts add to transactions. That is, each time a recovery is attempted, a single application thread experiences increased latency (a few hundred milliseconds) while the library uses that thread to attempt the reconnection.

Recovery Conditions

HA recovery is either the hands-off resumption of failed HA group members or, if autorecovery is not enabled, the manual re-introduction of a failed member. Some reasons a member might fail out of the group are:

>The appliance loses power (but regains power in less than the 2 hours that the HSM preserves its activation state).

>The network link from the unit is lost and then regained.

HA recovery takes place if the following conditions are true:

>HA autorecovery is enabled, or you detect a unit failure and manually re-introduce the unit (or its replacement)

>HA group has at least 2 nodes

>HA node is reachable (connected) at client startup

>HA node recovery retry limit is not reached. Otherwise, manual recovery is the only option to bring back the downed connection(s)

If all HA nodes fail (no links from client) no recovery is possible.
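The conditions above can be read as a single yes/no check. The following is a minimal sketch of that reading, not the client library's actual logic; the class name, field names, and function are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HaClientState:
    """Hypothetical snapshot of what a client knows about one failed member."""
    autorecovery_enabled: bool       # retry count is not 0
    manual_recovery_requested: bool  # operator re-introduced the unit by hand
    group_size: int                  # members configured in the HA group
    reachable_at_startup: bool       # member was connected when the client started
    retries_used: int                # automatic attempts made so far
    retry_limit: int                 # configured retry count
    live_members: int                # members the client can still reach


def recovery_possible(s: HaClientState) -> bool:
    """Return True if the listed conditions allow the member to be recovered."""
    if s.live_members == 0:
        return False  # all links from the client are lost
    if s.group_size < 2 or not s.reachable_at_startup:
        return False
    if s.manual_recovery_requested:
        return True   # manual recovery is still an option after retries run out
    return s.autorecovery_enabled and s.retries_used < s.retry_limit
```

In this reading, exhausting the retry limit blocks only the automatic path, which matches the statement that manual recovery is then the only option.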

The HA recovery logic makes its first attempt at recovering a failed member when your application makes a call to its HSM (the HA group). An idle client does not start the recovery-attempt process. As of release 6.22, if the retry count is not 0, then recovery is attempted after the configured HA interval expires.

On the other hand, a busy client would notice a slight pause every minute, as the library attempts to recover a dropped HA group member (or members) until the member has been reinstated or until the timeout has been reached and it stops trying. Therefore, set the number of retries according to your normal situation (the kinds and durations of network interruptions you experience, for example).

Enabling and Configuring Autorecovery

In previous releases, autorecovery was not on by default, and needed to be explicitly enabled.

Beginning with SafeNet Luna HSM release 6.0, HA autorecovery is automatically enabled when you set the recovery retry count using the LunaCM command hagroup retry. Use the command hagroup interval to specify the interval, in seconds, between each retry attempt. The default is 60 seconds.
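When choosing a retry count, it can help to work out how long the client keeps trying before it gives up. The sketch below does that arithmetic under two assumptions that are simplifications rather than documented behavior: attempts are evenly spaced at the configured interval, and the first attempt happens one interval after the failure. The function names are hypothetical; the 60-second minimum interval and the maximum of 500 retries are the limits stated elsewhere in this section.

```python
import math

MIN_INTERVAL_S = 60  # retries cannot be more frequent than once per minute
MAX_RETRIES = 500    # maximum retry count given under "Configuration settings"


def recovery_window_s(retry_count: int, interval_s: int = 60) -> int:
    """Seconds after a failure during which the client keeps retrying."""
    if not 0 <= retry_count <= MAX_RETRIES:
        raise ValueError("retry count must be between 0 and 500")
    if interval_s < MIN_INTERVAL_S:
        raise ValueError("interval must be at least 60 seconds")
    return retry_count * interval_s


def retries_to_cover(outage_s: float, interval_s: int = 60) -> int:
    """Smallest retry count whose window spans outage_s, capped at MAX_RETRIES."""
    if interval_s < MIN_INTERVAL_S:
        raise ValueError("interval must be at least 60 seconds")
    return min(MAX_RETRIES, max(1, math.ceil(outage_s / interval_s)))
```

For example, under these assumptions a typical ten-minute network interruption at the default interval needs a retry count of at least 10. If the result is capped at 500, the outage is longer than the window the retries can cover, and a longer interval or manual recovery would be needed.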

Failure of All Members

If all members of an HA group fail, all logged-in sessions are lost, and operations that were active when the last group member went down are terminated. If the client application is able to recover all that state information, then it is not necessary to restart or re-initialize in order to resume client operations with the SafeNet Luna Network HSM HA group. All sessions will be restarted without requiring a restart of the client.

Auto-insert

Automatic reintroduction or "auto-insert" is supported. A failed (and fixed, or replacement) HSM appliance can be re-introduced if the application continues without restart. Restarting the application causes it to take a fresh inventory of available HSMs, and to use only those HSMs within its HA group. You cannot [re]introduce a SafeNet Luna Network HSM that was not in the group when the application started.

Auto-insert is now the default behavior (from Client 6.2.1 and later).

1. A running client automatically detects SafeNet Luna Network HSM appliance insertion and removal to/from its configuration.

2. Connection to the new SafeNet Luna Network HSM appliance occurs only if the client HA configuration also has a new HA member or an HA member gone missing.

3. A running client does not automatically disconnect from the appliance that has been removed from its configuration until the appliance goes offline (for example, disconnected or powered down).

4. A running client uses the new HA member that is being added to the HA group configuration and does not require the client to restart to do so.

5. A running client stops attempting to use the removed HA member that is being revoked from the HA configuration and does not require the client to restart to do so.

6. When a new member is added to the HA group, entries similar to the following appear in the client HA Log:

Mon Feb  1 11:06:55 2016 : [6619] HA group: 11079656446993 detected new member member: 286668019649
 
Mon Feb  1 11:07:25 2016 : [6619] HA group: 11079656446993 recovery attempt #1 succeeded for member: 286668019649

 

7. When an HA member is removed from the HA group, entries similar to the following appear in the client HA Log:

Mon Feb  1 11:07:45 2016 : [6619] HA group: 11079656446993 member: 286668019649 revoked

 

8. When a new SafeNet Luna Network HSM appliance is registered with a client that has HA configured with “Active recovery mode”, entries similar to the following appear in the client HA Log:

Sun Jan 31 21:01:52 2016 : [3820] HA subsystem detected new server : 192.20.11.175
 
Sun Jan 31 21:01:56 2016 : [3820] HA subsystem server 192.20.11.175 connected
 

Entries like these appear only if item 3, above, is true.

9. When an existing SafeNet Luna Network HSM appliance is removed from a client that has HA configured with “Active recovery mode”, entries similar to the following appear in the client HA Log:

Tue Feb  2 15:45:12 2016 : [28001] HA subsystem detected removal of server : 192.20.11.86
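If you monitor the client HA Log for these events, a small parser can turn the entries into structured records. The sketch below is built only from the sample lines shown above, so the patterns are assumptions about the log format rather than a specification; the event names and the function are hypothetical, and the meaning of the bracketed number is not stated in this document, so it is kept as an opaque "tag".

```python
import re
from typing import Optional

# Common prefix: "<weekday> <month> <day> <time> <year> : [<number>] <message>"
PREFIX = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
    r" : \[(?P<tag>\d+)\] (?P<message>.+)$"
)

# Message shapes inferred from the sample entries; event names are illustrative.
EVENTS = [
    ("member_added", re.compile(
        r"HA group: (?P<group>\d+) detected new member member: (?P<member>\d+)")),
    ("recovery_succeeded", re.compile(
        r"HA group: (?P<group>\d+) recovery attempt #(?P<attempt>\d+)"
        r" succeeded for member: (?P<member>\d+)")),
    ("member_revoked", re.compile(
        r"HA group: (?P<group>\d+) member: (?P<member>\d+) revoked")),
    ("server_detected", re.compile(
        r"HA subsystem detected new server : (?P<server>\S+)")),
    ("server_connected", re.compile(
        r"HA subsystem server (?P<server>\S+) connected")),
    ("server_removed", re.compile(
        r"HA subsystem detected removal of server : (?P<server>\S+)")),
]


def parse_ha_log_line(line: str) -> Optional[dict]:
    """Parse one HA Log line; return None if it lacks the expected prefix."""
    m = PREFIX.match(line.strip())
    if m is None:
        return None
    record = {"timestamp": m["timestamp"], "tag": m["tag"], "event": "unknown"}
    for name, pattern in EVENTS:
        hit = pattern.fullmatch(m["message"])
        if hit:
            record["event"] = name
            record.update(hit.groupdict())
            break
    return record
```

Lines with the expected prefix but an unrecognized message are returned with the event "unknown" rather than dropped, so that log formats not covered by the samples remain visible.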

Synchronization

Synchronization of token objects is a manual process using the hagroup synchronize command. Synchronization locates any object that exists on any one physical HSM partition (that is a member of the HA group), but not on all others, and replicates that object to any partitions (among the group) where it did not exist.

This is distinct from the replication that occurs when you create or delete an object on the HA virtual slot. Creation or deletion against the virtual slot causes that change to be immediately replicated to all connected members (addition or deletion).
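The effect of synchronization can be modeled as a set union across the member partitions. The sketch below is an illustrative model only, not the behavior of the hagroup synchronize command itself; the function name and the use of plain strings as object identifiers are assumptions.

```python
from typing import Dict, Set


def objects_to_replicate(members: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each member, return the objects that exist elsewhere in the group
    but are missing from that member."""
    # Every object found on at least one member of the group.
    all_objects: Set[str] = set().union(*members.values())
    return {name: all_objects - objs for name, objs in members.items()}
```

For example, with members A = {k1, k2}, B = {k2}, and C = {} the model replicates k1 to B and both k1 and k2 to C. The same model suggests why stale key material matters during manual recovery: an object that was deleted from the active members but still exists on a stale member would be copied back to the rest of the group.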

Effect of PED Operations

PED operations block cryptographic operations, so that while a member of an HA group is performing a PED operation, it will appear to the HA group as a failed member. When the PED operation is complete, failover and recovery HA logic are invoked to return the member to normal operation.

Network failures

If network connectivity fails to one or more connected SafeNet Luna Network HSM appliances, the HA group will be restored automatically subject to timeouts and retries, as follows:

>While the client application is active, and one HA group member is connected and active, other members can automatically resume in the HA group as long as retries have not stopped.

>If all members fail or if the client does not have a network connection to at least one group member, then the client application must be restarted, unless you have recoveryMode activeEnhanced enabled.

Process interaction

Other events and processes interact at different levels and in different situations as described below.

NOTE   All references to NTLS also apply to STC. Both NTLS and STC provide secure client-appliance connections.

At the lowest communication level, the transport protocol (TCP) is responsible for making and operating the communication connection between client and appliance (whether HA is involved or not). For SafeNet Luna Network HSM, the default protocol timeout of 2 hours was much too long, so SafeNet configured that to 3 minutes when HA is involved. This means that:

>In a period of no activity by client or appliance, the appliance's TCP will wonder if the client is still there, and will send a packet after 3 minutes of silence.

>If that packet is acknowledged, the 3 minute TCP timer restarts, and the cycle repeats indefinitely.

>If the packet is not acknowledged, then TCP sends another after approximately 45 seconds, and then another after a further 45 seconds. At the two minute mark, with no response, the connection is considered dead, and higher levels are alerted to perform their cleanup.

So altogether, a total of five minutes can elapse since the last time the other participant was heard from. This is at the transport layer.
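The five-minute figure follows directly from the timings above. The short sketch below just reproduces that arithmetic; the constant and function names are hypothetical, and the values are the approximate ones given in the text, not settings read from an appliance.

```python
IDLE_BEFORE_FIRST_PROBE_S = 3 * 60   # silence before the first packet is sent
PROBE_SPACING_S = 45                 # approximate gap between follow-up packets
DEAD_AFTER_FIRST_PROBE_S = 2 * 60    # "two minute mark" after the first packet


def keepalive_timeline():
    """Return (seconds since last traffic, event) pairs for an unanswered peer."""
    first = IDLE_BEFORE_FIRST_PROBE_S
    return [
        (first, "first packet sent"),
        (first + PROBE_SPACING_S, "second packet sent"),
        (first + 2 * PROBE_SPACING_S, "third packet sent"),
        (first + DEAD_AFTER_FIRST_PROBE_S, "connection considered dead"),
    ]


for seconds, event in keepalive_timeline():
    print(f"{seconds:>3} s: {event}")
```

The last entry falls at 300 seconds, which is the five minutes described above.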

Above that level, the NTLS layer provides the connection security and some other housekeeping. Any time a client sends a request for a cryptographic operation, the HSM on the appliance begins working on that operation.

While the HSM processes the request, appliance-side NTLS sends a "keep-alive PING" every two seconds, until the HSM returns the answer, which NTLS then conveys across the link to the requesting client. Neither NTLS nor any layer above it performs any interpretation of the ping.

It simply drops a slow, steady trickle of bytes into the pipe, to keep the TCP layer active. This normally has little effect, but if your client requests a lengthy operation like an 8192-bit keygen, then the random-number-generation portion of that operation could take many minutes to complete, during which the HSM would legitimately be sending nothing back to the client. The NTLS ping ensures that the connection remains alive during long pauses.

Configuration settings

In the SafeNet configuration file, "DefaultTimeout" (default value is 500 seconds) governs how long the client will wait for a result from an HSM, for a cryptographic call. In the case of SafeNet Luna Network HSM, the copy of the config file inside the appliance is not accessible externally. The config file in the client installation is accessible to modify, but "DefaultTimeout" in that file affects only a locally connected HSM (such as might be the case if you had a SafeNet Luna Backup HSM attached to your client computer). The config file in the client has no effect on the configuration inside the network-attached SafeNet Luna Network HSM appliance, and thus can have no effect on the interaction between client and SafeNet Luna Network HSM appliance.

"ReceiveTimeout" is how long the library will wait for a dropped connection to come back.

If "ReceiveTimeout" is tripped, for a given appliance, the HA client stops talking to that appliance and deals with the remaining members of the HA group to serve your application's crypto requests.

A minute later, the HA client tries to contact the member that failed to reply.
If the connection is successfully re-established, the errant appliance resumes working in the group, being assigned application calls as needed (governed by application workload and HA logic).

If the connection is not successfully re-established, the client continues working with the remaining group members. Another minute passes, and the client once again tries the missing appliance to see if it is ready to actively resume working in the HA group.

The retries continue until the missing member resumes, or until the pre-set (by you) number of retries is reached (maximum of 500). If the retry count is reached with no success, the client stops trying that member. The failed appliance is still a member of the group (it is still in the list of HA group members maintained on the client), but the client no longer tries to send it application calls, and no longer encourages it to establish a connection. You must fix the appliance (or its network connection) and manually recover it into the group for the client to resume including it in operations.
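The retry cycle described above can be walked through with a small simulation. This is a sketch under simplifying assumptions, not the client's implementation: attempts are treated as exactly one interval apart, starting one interval after the failure, and the function name and returned labels are hypothetical.

```python
from typing import Optional, Tuple


def simulate_recovery(member_back_at_s: Optional[int],
                      retry_count: int,
                      interval_s: int = 60) -> Tuple[str, int, int]:
    """Simulate retries against one failed member.

    member_back_at_s is the time after the failure (in seconds) at which the
    member becomes reachable again, or None if it never returns.
    Returns (outcome, attempts made, seconds elapsed).
    """
    if not 0 <= retry_count <= 500:
        raise ValueError("retry count must be between 0 and 500")
    if retry_count == 0:
        return ("autorecovery disabled", 0, 0)
    for attempt in range(1, retry_count + 1):
        now = attempt * interval_s
        if member_back_at_s is not None and member_back_at_s <= now:
            return ("recovered", attempt, now)
    # Retries exhausted: the member stays in the group list but is no longer tried.
    return ("manual recovery required", retry_count, retry_count * interval_s)
```

For example, a member that returns 150 seconds after failing is picked up on the third attempt (at 180 seconds) when the retry count is 5, while a member that never returns leaves the client in the "manual recovery required" state once the count is used up.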

Active Autorecovery on a SafeNet Luna Network HSM

NOTE   All references to NTLS also apply to STC. Both NTLS and STC provide secure client-appliance connections.

Autorecovery uses the HA Active Recovery Thread (ARCT) to manage recovery from a failure. The ARCT sends a non-session-based message that is processed by NTLS. This allows recovery as soon as a failed member returns. Thus, if a failed member returns to duty before an active member fails, then synchronization occurs immediately, and the secondary member is ready to take over from the active member if that now fails.

Members can reconnect without the need to call finalize/initialize in the client application, which allows for multiple services that use a single JVM to recover connections independently.

In the event that all HA members fail to respond to the ARCT probing message, the HA slot is deemed to be unrecoverable.

The default recovery mode on a SafeNet Luna Network HSM is the basic active mode. As long as the retry count is not 0, recovery is active basic by default.

The enhanced active recovery mode is optional, and is controlled by the LunaCM hagroup recoverymode command.