Failover
When an HA group is running normally, the client library continues to schedule commands across all members as described above. The client continuously monitors the health of each member at two different levels:
>First, connectivity with the member is monitored at the networking layer. Disruption of the network connection invokes a failover event within a twenty-second timeout.
>Second, every command sent to a device is continuously monitored for completion. Any command that fails to complete within twenty seconds also invokes a failover event. Most commands complete within milliseconds. However, some commands can take extended periods to complete, either because the command itself is time-consuming (for example, key generation), or because the device is under extreme load. To cover these cases, the HSM automatically sends “heartbeats” every two seconds for all commands that have not completed within the first two seconds. The twenty-second timer is extended every time one of these heartbeats arrives at the client, thus preventing false failover events (illustrated in the sketch below).
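The interplay of the twenty-second timer and the two-second heartbeat can be pictured with a short sketch. This is illustrative pseudologic only, not the actual client library implementation; the structure and function names are invented for the example, and only the timeout values come from the description above.

    #include <stdbool.h>
    #include <time.h>

    #define FAILOVER_TIMEOUT_SECS 20   /* documented command/connection timeout */

    /* Illustrative only: the client tracks a deadline for each outstanding
     * command, and every heartbeat received for that command pushes the
     * deadline out again, so a long-running command (for example, key
     * generation) does not trigger a false failover. */
    typedef struct {
        time_t deadline;               /* when this command is declared failed */
    } pending_command_t;

    void command_sent(pending_command_t *cmd) {
        cmd->deadline = time(NULL) + FAILOVER_TIMEOUT_SECS;
    }

    /* Called whenever a reply or a two-second heartbeat arrives. */
    void heartbeat_received(pending_command_t *cmd) {
        cmd->deadline = time(NULL) + FAILOVER_TIMEOUT_SECS;
    }

    /* Polled periodically; true means the member should be failed over. */
    bool command_timed_out(const pending_command_t *cmd) {
        return time(NULL) > cmd->deadline;
    }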
A failover event involves dropping a device from the available members of the HA group. All commands that were pending on the failed device are transparently rescheduled on the remaining members of the group. When a failure occurs, the application experiences a latency stall on the commands in process on the failing unit, but otherwise sees no impact on the transaction flow. Note that the least-busy scheduling algorithm automatically minimizes the number of commands that stall on a failing unit during the twenty-second timeout.
If the primary unit fails, clients automatically select the next member in the group as the new primary. Any key management or single-part cryptographic operations are transparently restarted on another group member. Any in-progress multi-part cryptographic operations, however, return an error code and must be restarted by the application.
As long as one HA group member remains functional, cryptographic service is maintained to an application no matter how many other group members fail. Members can also be returned to service without restarting the application.
How Do You (or Software) Know That a Member Has Failed?
When an HA group member first fails, the HA status for the group shows "device error" for the failed member. All subsequent calls return "token not present" until the member (HSM partition or PKI token) is returned to service.
At the library level, what happens when a device fails or doesn’t respond?
The client library drops the member and continues with the others. It tries to reconnect the dropped member at a minimum retry rate of once per minute (configurable) for the number of times specified in the configuration file, and then stops trying that member. You can specify anywhere from 3 retries to an unlimited number.
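A conceptual sketch of that reconnection policy follows. It is not the library's actual code; the parameter names and the connectivity probe are invented for illustration, and only the once-per-minute minimum and the 3-to-unlimited retry range come from the description above.

    #include <stdbool.h>
    #include <unistd.h>

    /* Illustrative stand-ins for the values taken from the client
     * configuration file; real parameter names depend on your client. */
    #define RECOVERY_INTERVAL_SECS 60   /* minimum retry rate: once per minute  */
    #define RECOVERY_RETRY_COUNT   10   /* anywhere from 3 retries to unlimited */

    /* Hypothetical connectivity probe, standing in for the library's own
     * reconnection attempt. */
    static bool member_reachable(int member_slot)
    {
        (void)member_slot;
        return false;                   /* stub for illustration */
    }

    /* Conceptual recovery loop for one dropped member: retry at the
     * configured interval, give up after the configured number of attempts
     * (a negative count would mean "retry forever"). */
    bool try_to_recover_member(int member_slot)
    {
        for (int attempt = 0;
             RECOVERY_RETRY_COUNT < 0 || attempt < RECOVERY_RETRY_COUNT;
             attempt++) {
            if (member_reachable(member_slot)) {
                return true;            /* member rejoins the HA group */
            }
            sleep(RECOVERY_INTERVAL_SECS);
        }
        return false;                   /* library stops trying this member */
    }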
What happens to an application if a device fails mid-operation? What if it’s a multi-part operation?
Multi-part operations do not fail over. The entire operation returns a failure (CKR_DEVICE_ERROR), and your application must handle that failure in whatever way it is coded to do.
Any operation that fails midway must be re-sent by the calling application. This is more likely to happen in a multi-part operation because those take longer, but a failure could conceivably occur during a single atomic operation as well.
With HA, if the library attempts to send a command to an HSM that is unavailable, it automatically retries that command on the next HSM in the configuration after the timeout expires.
Multi-part operations are typically block encryption or decryption, or any other command where the previous state of the HSM is critical to the processing of the next command. These must be re-sent in full because the HSMs synchronize only stored key material, not internal memory state.
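As an illustration of what restarting a multi-part operation can look like at the PKCS#11 level, the sketch below re-issues an AES-CBC multi-part encryption from C_EncryptInit onward whenever CKR_DEVICE_ERROR is returned. This is generic PKCS#11, not SafeNet-specific code; the choice of mechanism, the chunk size, the single retry, and the assumption that the same session handle on the HA virtual slot can be reused after the failover are all illustrative assumptions.

    /* Sketch: restarting a multi-part PKCS#11 encryption after a failover.
     * Standard OASIS pkcs11.h header; hSession is assumed to be a session on
     * the HA virtual slot and hKey an AES key handle obtained elsewhere. */
    #define CK_PTR *
    #define CK_DECLARE_FUNCTION(returnType, name) returnType name
    #define CK_DECLARE_FUNCTION_POINTER(returnType, name) returnType (* name)
    #define CK_CALLBACK_FUNCTION(returnType, name) returnType (* name)
    #ifndef NULL_PTR
    #define NULL_PTR 0
    #endif
    #include "pkcs11.h"

    /* Run the whole multi-part operation from C_EncryptInit onward.  If a
     * member fails mid-stream, the pending call returns CKR_DEVICE_ERROR;
     * because the HSMs replicate key material but not in-flight operation
     * state, the only safe reaction is to start the operation again. */
    CK_RV encrypt_all(CK_SESSION_HANDLE hSession, CK_OBJECT_HANDLE hKey,
                      CK_BYTE *iv,                       /* 16-byte IV */
                      CK_BYTE *in, CK_ULONG inLen,
                      CK_BYTE *out, CK_ULONG *outLen)
    {
        CK_MECHANISM mech = { CKM_AES_CBC_PAD, iv, 16 };
        CK_RV rv = C_EncryptInit(hSession, &mech, hKey);
        if (rv != CKR_OK) return rv;

        CK_ULONG produced = 0, chunkOut;
        for (CK_ULONG off = 0; off < inLen; off += 4096) {
            CK_ULONG chunkIn = (inLen - off < 4096) ? inLen - off : 4096;
            chunkOut = *outLen - produced;
            rv = C_EncryptUpdate(hSession, in + off, chunkIn,
                                 out + produced, &chunkOut);
            if (rv != CKR_OK) return rv;     /* CKR_DEVICE_ERROR lands here */
            produced += chunkOut;
        }
        chunkOut = *outLen - produced;
        rv = C_EncryptFinal(hSession, out + produced, &chunkOut);
        if (rv == CKR_OK) *outLen = produced + chunkOut;
        return rv;
    }

    /* Caller-side restart: one full re-send is usually enough after a failover. */
    CK_RV encrypt_with_restart(CK_SESSION_HANDLE s, CK_OBJECT_HANDLE k,
                               CK_BYTE *iv, CK_BYTE *in, CK_ULONG inLen,
                               CK_BYTE *out, CK_ULONG *outLen)
    {
        CK_RV rv = encrypt_all(s, k, iv, in, inLen, out, outLen);
        if (rv == CKR_DEVICE_ERROR)
            rv = encrypt_all(s, k, iv, in, inLen, out, outLen);  /* re-send all */
        return rv;
    }

Single-part operations do not need this treatment; as noted above, the HA client restarts them transparently on another member.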
Reaction to Failures
This section looks at possible failures in an overall HA system, and what needs to be done. The assumption is that HA has been properly configured and was working normally before the failure occurred. In a complex system, it is possible to come up with any number of failure scenarios, such as this (partial) list for an HA group:
>Failure at the HSM or appliance
•HSM card failure
•HSM re-initialization
•Deactivated partition
•Power failure of a member
•Reboot of member
•NTL failure
>Failure at the client
•Power failure of the client
•Reboot of client
•Network keepalive failure
>Failure between client and group members
•Network failure near the member appliance (so that only one member might disappear from the client's view)
•Network failure near the client (so that the client loses contact with all members)
HSM-Side Failures
The categories of failure at the HSM side of an HA arrangement are temporary or permanent.
Temporary
Temporary failures, such as reboots or brief failures of power or network, are self-correcting. As long as you have set HA autorecovery parameters that are sufficiently lenient, recovery is automatic shortly after the HSM partition becomes visible to the HA client again.
Permanent
Permanent failures require direct intervention at the HSM end, possibly including complete physical replacement of the unit, or at least initialization of the HSM.
All that concerns the HA service is that the particular unit is gone and is not coming back. If an entire SafeNet Luna Network HSM unit is replaced, you must go through the full appliance and HSM configuration of the new unit before introducing it to the HA group. If a non-appliance HSM (one that resides in the client host computer, such as a SafeNet Luna PCIe HSM or SafeNet Luna USB HSM) is replaced, it must be initialized and a new partition created.
Either way, your immediate options are to use a new name for the partition, or to make the HA SafeNet Luna HSM Client forget the dead member (LunaCM command hagroup removemember) so you can reuse the old name. Then, you must ensure that automatic synchronization is enabled (LunaCM command hagroup synchronize -enable), and manually introduce a new member to the group (LunaCM command hagroup addmember). After that, you can carry on using your application with full HA redundancy.
Because your application should be using only the HA virtual slot (enforced with the LunaCM command hagroup haonly), it should not have noticed that one HA group member went away or that another was added and synchronized. The only visible sign might have been a brief dip in performance, and then only if your application was placing high demand on the HSM(s).
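In PKCS#11 terms, "using only the HA virtual slot" usually amounts to nothing more than the standard slot enumeration shown below. This is a generic, illustrative sketch assuming the standard OASIS pkcs11.h header; with haonly in force, the HA group's virtual slot is typically the only slot the client presents, so member replacements never surface in the slot list the application sees.

    /* Sketch: an application that targets only the HA virtual slot. */
    #define CK_PTR *
    #define CK_DECLARE_FUNCTION(returnType, name) returnType name
    #define CK_DECLARE_FUNCTION_POINTER(returnType, name) returnType (* name)
    #define CK_CALLBACK_FUNCTION(returnType, name) returnType (* name)
    #ifndef NULL_PTR
    #define NULL_PTR 0
    #endif
    #include "pkcs11.h"
    #include <stdio.h>

    int main(void)
    {
        CK_RV rv = C_Initialize(NULL_PTR);
        if (rv != CKR_OK) return 1;

        /* With HAOnly in force this normally returns exactly one slot: the
         * HA group's virtual slot.  Member slots come and go behind it
         * without the application ever seeing them. */
        CK_SLOT_ID slots[8];
        CK_ULONG slotCount = 8;
        rv = C_GetSlotList(CK_TRUE, slots, &slotCount);
        if (rv != CKR_OK || slotCount == 0) { C_Finalize(NULL_PTR); return 1; }

        CK_SESSION_HANDLE hSession;
        rv = C_OpenSession(slots[0], CKF_SERIAL_SESSION | CKF_RW_SESSION,
                           NULL_PTR, NULL_PTR, &hSession);
        if (rv == CKR_OK) {
            printf("Opened session on HA virtual slot %lu\n",
                   (unsigned long)slots[0]);
            C_CloseSession(hSession);
        }
        C_Finalize(NULL_PTR);
        return (rv == CKR_OK) ? 0 : 1;
    }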
Client-Side Failures
For SafeNet Luna Network HSM, any failure of the client (such as operating system problems) that does not involve corruption or removal of files on the host should resolve itself when the host computer is rebooted.
If the host seems to be working fine otherwise, but you have lost visibility of the HSMs in LunaCM or your client, verify that the SafeNet drivers are running, and retry. If that fails, reboot. If that fails, restore your configuration from a backup of your host computer. If that fails, re-install the SafeNet Luna HSM Client, re-perform the certificate exchanges, re-create the HA group, re-add the members, and re-assert settings such as HAOnly.
For SafeNet Luna PCIe HSM and SafeNet Luna USB HSM, the client is the host of the HSMs, so if HA has been working, any sudden failure is likely to be OS- or driver-related (restart the host) or the result of file corruption (re-install the client). If a re-install is necessary, you must recreate the HA group, re-add all members, and re-assert all settings (such as HAOnly).
Failures Between the HSM and Client (SafeNet Luna Network HSM only)
The most likely failure to occur between a SafeNet Luna Network HSM (or multiple HSMs) and a client computer coordinating an HA group is a network failure. In that case, the salient factor is whether the failure occurred near the client or near one (or more) of the SafeNet Luna Network HSM appliances.
If the failure occurs near the client, and you have not set up port bonding on the client, then the client loses sight of all HA group members and the client application fails. The application resumes according to its timeouts and error-handling capabilities, and HA resumes automatically if the members reappear within the recovery window that you have set.
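What "resumes according to its timeouts and error-handling capabilities" looks like is application-specific. One common pattern is sketched below: keep retrying a repeatable call for roughly as long as the configured recovery window before giving up. The window length, retry delay, and wrapped-operation callback are illustrative assumptions; the return codes checked are the ones described earlier in this section ("device error" and "token not present").

    /* Sketch: application-level retry while waiting for HA autorecovery.
     * Standard OASIS pkcs11.h header; the callback wraps one HSM call that
     * is safe to repeat (for example, a single-part sign or encrypt). */
    #define CK_PTR *
    #define CK_DECLARE_FUNCTION(returnType, name) returnType name
    #define CK_DECLARE_FUNCTION_POINTER(returnType, name) returnType (* name)
    #define CK_CALLBACK_FUNCTION(returnType, name) returnType (* name)
    #ifndef NULL_PTR
    #define NULL_PTR 0
    #endif
    #include "pkcs11.h"
    #include <time.h>
    #include <unistd.h>

    #define RECOVERY_WINDOW_SECS (10 * 60)   /* illustrative; match your HA settings */
    #define RETRY_DELAY_SECS     5           /* illustrative pause between attempts  */

    typedef CK_RV (*hsm_op_t)(void *ctx);    /* wraps one repeatable HSM call */

    CK_RV run_with_recovery(hsm_op_t op, void *ctx)
    {
        time_t give_up = time(NULL) + RECOVERY_WINDOW_SECS;
        for (;;) {
            CK_RV rv = op(ctx);
            /* CKR_DEVICE_ERROR / CKR_TOKEN_NOT_PRESENT are what the HA layer
             * surfaces while members are unreachable; anything else is final. */
            if (rv != CKR_DEVICE_ERROR && rv != CKR_TOKEN_NOT_PRESENT) return rv;
            if (time(NULL) >= give_up) return rv;   /* recovery window exceeded */
            sleep(RETRY_DELAY_SECS);                /* wait for autorecovery    */
        }
    }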
If the failure occurs near a SafeNet Luna Network HSM member of the HA group, that member might disappear from the group until the network failure is cleared, but the client can still see the other members and carries on normally.
If the recovery window is exceeded, then you must manually restart HA.