Failover

When an HA group is running normally, the client library continues to schedule commands across all members as described above. The client continuously monitors the health of each member at two levels:

First, connectivity with the member is monitored at the networking layer. Disruption of the network connection invokes a failover event within a twenty-second timeout.

Second, every command sent to a device is continuously monitored for completion. Any command that fails to complete within twenty seconds also invokes a failover event. Most commands are completed within milliseconds. However, some commands can take extended periods to complete – either because the command itself is time-consuming (for example, key generation), or because the device is under extreme load. To cover these events, the HSM automatically sends “heartbeats” every two seconds for all commands that have not completed within the first two seconds. The twenty-second timer is extended every time one of these heartbeats arrives at the client, thus preventing false failover events.
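The heartbeat-extended timer can be sketched as follows. This is only an illustration of the behavior described above, not the client library's implementation; the function name command_outcome and its parameters are hypothetical, and the sketch assumes that each heartbeat restarts the full twenty-second timer.

```python
def command_outcome(heartbeat_times, completion_time, timeout=20.0):
    """Model one command sent at t=0 (all times in seconds).

    heartbeat_times: times at which heartbeats reach the client.
    completion_time: time at which the reply arrives, or None if it never does.
    Returns ("completed", t) or ("failover", t).
    """
    deadline = timeout  # the timer starts when the command is sent
    for t in sorted(heartbeat_times):
        if t > deadline:
            break  # the timer expired before this heartbeat arrived
        deadline = t + timeout  # assumption: a heartbeat restarts the full timer
    if completion_time is not None and completion_time <= deadline:
        return ("completed", completion_time)
    return ("failover", deadline)
```

Under these assumptions, a 45-second key generation that sends a heartbeat every two seconds completes normally, while a device that stops sending heartbeats after ten seconds triggers a failover event at the thirty-second mark.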

A failover event involves dropping a device from the available members in the HA group. All commands that were pending on the failed device are transparently rescheduled on the remaining members of the group. When a failure occurs, the application experiences a latency stall on the commands that were in process on the failing unit, but otherwise sees no impact on the transaction flow. Note that the least-busy scheduling algorithm automatically minimizes the number of commands that stall on a failing unit during the twenty-second timeout.
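As a rough model of this behavior, the sketch below schedules each command on the least-busy member and, when a member fails, reschedules that member's pending commands on the survivors. The class HAGroup and its methods are hypothetical names, and measuring "busy" by the count of outstanding commands is an assumption made for illustration.

```python
class HAGroup:
    """Toy model of least-busy scheduling with failover (illustrative only)."""

    def __init__(self, members):
        # pending[m] holds the commands currently outstanding on member m
        self.pending = {m: [] for m in members}

    def schedule(self, command):
        # Least-busy: pick the member with the fewest outstanding commands.
        member = min(self.pending, key=lambda m: len(self.pending[m]))
        self.pending[member].append(command)
        return member

    def fail(self, member):
        # Drop the member, then reschedule its stalled commands on the rest.
        stalled = self.pending.pop(member)
        if not self.pending:
            raise RuntimeError("no HA group members remain")
        return {cmd: self.schedule(cmd) for cmd in stalled}
```

Because new work always goes to the member with the shortest queue, a member that has stopped responding accumulates few additional commands while its timeout runs, which is why only a small share of the commands stall.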

If the primary unit fails, clients automatically select the next member in the group as the new primary. Any key management or single-part cryptographic operations are transparently restarted on a new group member. Any in-progress multi-part cryptographic operations, however, return an error code and must be restarted by the application.

As long as one HA group member remains functional, cryptographic service is maintained to an application no matter how many other group members fail. As discussed in Failover, members can also be put back into service without restarting the application.

How Do You (or Software) Know That a Member Has Failed?

When an HA Group member first fails, the HA status for the group shows "device error" for the failed member. All subsequent calls return "token not present", until the member (HSM Partition or PKI token) is returned to service.
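Software that talks to a member directly could distinguish these two conditions roughly as follows. The sketch assumes that "device error" and "token not present" correspond to the standard PKCS#11 return values CKR_DEVICE_ERROR and CKR_TOKEN_NOT_PRESENT; that mapping is an assumption, and the function classify and its labels are hypothetical.

```python
# Standard PKCS#11 return values, as defined in the Cryptoki specification.
CKR_OK = 0x00000000
CKR_DEVICE_ERROR = 0x00000030
CKR_TOKEN_NOT_PRESENT = 0x000000E0


def classify(rv):
    """Map a return value to a coarse member state (illustrative labels)."""
    if rv == CKR_OK:
        return "ok"
    if rv == CKR_DEVICE_ERROR:
        return "member just failed"
    if rv == CKR_TOKEN_NOT_PRESENT:
        return "member out of service"
    return "other error"
```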

At the library level, what happens when a device fails or doesn’t respond?

These are two separate situations. A device might not be responding because something is blocking (such as PED operations prior to HSM firmware 6.24) or because the requested operation is slow, for example an RSA keygen for a key size of 4096 bits or larger. In that case the device will likely come back. The client continues to wait as long as it receives heartbeats from the HSM (for SafeNet Network HSM, that is as long as the NTLS connection remains alive).

A failure would be an actual failure message from the HSM (or, for SafeNet Network HSM, a dropped NTLS connection). In the case of a failure, the client library drops the member and continues with the others. It tries to reconnect that member at a minimum retry rate of once per minute (configurable), up to the configured number of retries, and then stops trying that member. You can specify a number of retries from 3 to an unlimited number.
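The reconnection schedule implied by the retry interval and retry count can be sketched like this. The function reconnect_attempts and its parameters are hypothetical, and the sketch assumes the first attempt happens one full interval after the failure; None stands in for an unlimited number of retries.

```python
import itertools


def reconnect_attempts(failed_at, interval=60.0, retries=3):
    """Yield the times (in seconds) at which reconnection is attempted.

    retries=None means unlimited; otherwise the count must be at least 3.
    """
    if retries is not None and retries < 3:
        raise ValueError("retries must be at least 3, or None for unlimited")
    counter = itertools.count(1) if retries is None else range(1, retries + 1)
    for n in counter:
        # Assumption: attempts are evenly spaced, one interval apart.
        yield failed_at + n * interval
```

With the defaults, a member that fails at t=0 is retried at 60, 120, and 180 seconds, after which the client stops trying that member.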

What happens to an application if a device fails mid-operation? What if it’s a multi-part operation?

Multi-part operations do NOT fail over. The entire operation returns a failure. Your application deals with the failure in whatever way it is coded to do so.

Any operation that fails mid-point would need to be resent from the calling application. That is, if you don’t receive a ‘success’ response, then you must try again. This is obviously more likely to happen in a multi-part operation because those are longer, but a failure could conceivably happen during a single atomic operation as well.

With HA, if the library attempts to send a command to an HSM and that HSM is unavailable, the library automatically retries the command on the next HSM in the configuration after the timeout expires.

Multi-part operations would typically be block encryption or decryption, or any other command where the previous state of the HSM is critical to the processing of the next command. It is understandable that these need to be re-sent since the HSMs do not synchronize ‘internal memory state’ … only stored key material.
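One way an application might handle this is to restart the whole multi-part operation from the first block whenever any step fails. The sketch below is a minimal illustration under stated assumptions: DeviceError, FlakySession, and the encrypt_init/encrypt_update/encrypt_final method names are hypothetical stand-ins for whatever session API the application uses, and the byte reversal is a placeholder, not real encryption.

```python
class DeviceError(Exception):
    """Hypothetical error raised when a multi-part call fails."""


class FlakySession:
    """Stand-in session: the first instance fails on its second update."""
    created = 0

    def __init__(self):
        FlakySession.created += 1
        self.fail = FlakySession.created == 1
        self.updates = 0

    def encrypt_init(self):
        self.updates = 0

    def encrypt_update(self, chunk):
        self.updates += 1
        if self.fail and self.updates == 2:
            raise DeviceError("member failed mid-operation")
        return chunk[::-1]  # placeholder transform, not real encryption

    def encrypt_final(self):
        return b""


def encrypt_multipart(open_session, chunks, max_restarts=2):
    """Run init/update/final; on any failure, restart from the first chunk."""
    for attempt in range(max_restarts + 1):
        session = open_session()
        try:
            session.encrypt_init()
            out = [session.encrypt_update(c) for c in chunks]
            out.append(session.encrypt_final())
            return b"".join(out)
        except DeviceError:
            if attempt == max_restarts:
                raise
            # The intermediate state existed only on the failed HSM, so
            # nothing can be resumed: discard partial output and start over.
```

The partial output from the failed attempt is discarded rather than reused, since the replacement member has no knowledge of the blocks already processed.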

Reaction to Failures

This section looks at possible failures in an overall HA system, and what needs to be done. The assumption is that HA has been properly configured and the HA group has been seen to be functioning properly. In a complex system, it is possible to come up with any number of failure scenarios, such as this (partial) list for an HA group:

Failure at the HSM or appliance

HSM card failure

HSM re-initialization

Deactivated partition

Power failure of a member

Reboot of member

NTL failure

STC failure

Failure at the client

Power failure of the client

Reboot of client

Network keepalive failure

Failure between client and group members

Network failure near the member appliance
(so only one member might disappear from client's view)

Network failure near the client
(client loses contact with all members)

HSM-Side Failures

The categories of failure at the HSM side of an HA arrangement are temporary or permanent.

Temporary

Temporary failures, such as reboots or failures of power or network, are self-correcting. As long as you have set HA automatic recovery parameters that are sufficiently lenient, recovery is automatic shortly after the HSM partition becomes visible to the HA client.

Permanent

Permanent failures require overt intervention at the HSM end, including possibly complete physical replacement of the unit, or at least initialization of the HSM.

All that concerns the HA service is that the particular unit is gone and isn't coming back. If an entire SafeNet Network HSM unit is replaced, then you must go through the entire appliance and HSM configuration of the new unit before introducing it to the HA group. If a non-appliance HSM (one that resides in the client host computer, such as SafeNet PCIe HSM or SafeNet USB HSM) is replaced, then it must be initialized and a new partition created.

Either way, your immediate options are

to use a new name for the partition, or

to make the HA SafeNet HSM Client forget the dead member (lunacm command ha removeMember) so you can reuse the old name.

Then, you must ensure that automatic synchronization is enabled (lunacm command ha synchronize -enable), and manually introduce a new member to the group (lunacm command ha addMember). After that, you can carry on using your application with full HA redundancy.

Because your application should be using only the HA virtual slot (lunacm command ha HAOnly), your application should not have noticed that one HA group member went away, or that another one was added and synchronized. The only visible sign might have been a brief dip in performance, but only if your application was placing high demand on the HSM(s).

Client-Side Failures

For SafeNet Network HSM, any failure of the client (such as operating system problems) that does not involve corruption or removal of files on the host should resolve itself when the host computer is rebooted.

If the host seems to be working fine otherwise, but you have lost visibility of the HSMs in lunacm or your client, verify that the SafeNet drivers are running, and retry.

If that fails, reboot.

If that fails, restore your configuration from backup of your host computer.

If that fails, re-install SafeNet HSM Client, re-perform certificate exchanges, creation of HA group, adding of members, setting HAOnly, etc.

For SafeNet PCIe HSM, and SafeNet USB HSM, the client is the host of the HSMs, so if HA has been working, then any sudden failure is likely to be

OS or driver related (so your response is to restart) or

corruption of files (so your response is to re-install).

If a re-install is necessary, you will need to recreate the HA group and re-add all members and re-assert all settings (like HAOnly).

Failures Between the HSM and Client (SafeNet Network HSM only)

The only failure that could likely occur between a SafeNet Network HSM (or multiple SafeNet Network HSMs) and a client computer coordinating an HA group is a network failure. In that case, the salient factor is whether the failure occurred

near the client or

near one (or more) of the SafeNet Network HSM appliances.

If the failure occurs near the client, and you have not set up port bonding on the client, then the client would lose sight of all HA group members, and the client application would fail. The application would resume according to its timeouts and error-handling capabilities, and HA would resume automatically if the members reappeared within the recovery window that you had set.

If the failure occurs near a SafeNet Network HSM member of the HA group, then that member might disappear from the group until the network failure is cleared, but the client would still be able to see other members, and would carry on normally.

If the recovery window is exceeded, then you must manually restart HA, or use lunacm to trigger a manual recover request so that the application tries to recover again. Manual recovery performs only a single retry.
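Whether an outage ends in automatic or manual recovery can be estimated with a small calculation. This sketch assumes the recovery window is simply the retry interval multiplied by the retry count, which is an illustrative simplification; the function recovery_action and its parameters are hypothetical, and None stands in for unlimited retries.

```python
def recovery_action(outage_seconds, interval=60.0, retries=3):
    """Return "automatic" if the outage ends inside the recovery window."""
    if retries is None:
        return "automatic"  # unlimited retries: the window never closes
    window = interval * retries  # assumption: window = interval x retries
    return "automatic" if outage_seconds <= window else "manual"
```

For example, with a 60-second interval and 3 retries, a 90-second network interruption would recover automatically, while a 10-minute outage would require a manual recover request.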