High Availability

SafeNet Luna Network HSM products include the capability to group multiple devices into a single logical group – known as an HA (High Availability) group. Applications see only a virtual HSM that is a consolidation of all the HSMs in your HA group. Operations and key material are automatically synchronized across all the HSMs in the group.

When an HA group is defined, cryptographic services remain available to the consuming applications as long as at least one member of the group remains functional and connected to the application server. In addition, cryptographic commands are automatically distributed across the HA group, enabling performance gains for many applications.

HSMs and appliances are unaware that they might be configured in an HA group. This allows you to configure HA on a per-application basis. The way you group your HSMs depends on your circumstances and desired performance. See HA Configuration Overview.

Once you have set up an HA group, you can configure several options:

>Standby Mode

>Load Balancing

>Failover

>Recovery

The requirements for implementing High Availability are outlined in Requirements.

HA Configuration Overview

As of SafeNet Luna HSM release 6.x, the SafeNet high availability function supports the grouping of up to thirty-two members. However, the maximum practical group size for your application is driven by a trade-off between performance and the cost of replicating key material across the entire group. The number of HSMs per group of application servers varies based on the application use case but, as depicted in HA Sample Configuration, groups of three are typical.

Figure 1: HA Sample Configuration

As performance needs grow beyond the performance capacity of three HSMs, it often makes sense to define a second independent HA group of application servers and HSMs to further isolate applications from any single point of failure. This has the added advantage of facilitating the distribution of HSM and application sets in different data centers.

Key Material Replication

Whenever an application creates key material, the HA functionality transparently replicates the key material to all members of the HA group before reporting back to the application that the new key is ready.

The HA library always starts with what it considers its primary HSM (initially the first member defined in an HA group). Once the key is created on the primary, it is automatically replicated to each member in the group. If a member fails during this process, the key replication to the failed member is aborted after the failover timeout. If any member is unavailable during the replication process (that is, the unit failed before or during the operation), the HA library keeps track of this and automatically replicates the key when that member rejoins the group.

Once the key is replicated on all active members of the HA group, a success code is returned to the application.
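From the application's point of view, replication is invisible: the application simply generates a key through the HA virtual slot. The following minimal C/PKCS#11 sketch illustrates this; the slot ID, partition password, and header name are illustrative placeholders, and error handling is omitted for brevity.

```c
/* Minimal sketch: generate an AES key through the HA virtual slot.
 * Slot ID 5 and the password "userpin" are illustrative placeholders. */
#include <stdio.h>
#include <string.h>
#include "cryptoki.h"            /* vendor PKCS#11 header; name varies by SDK */

int main(void)
{
    CK_SLOT_ID        haSlot = 5;        /* illustrative: the HA virtual slot */
    CK_SESSION_HANDLE session;
    CK_OBJECT_HANDLE  key;
    CK_MECHANISM      mech   = { CKM_AES_KEY_GEN, NULL_PTR, 0 };
    CK_ULONG          keyLen = 32;
    CK_BBOOL          ckTrue = CK_TRUE;
    CK_UTF8CHAR       pin[]  = "userpin";    /* illustrative placeholder */
    CK_ATTRIBUTE      tmpl[] = {
        { CKA_TOKEN,     &ckTrue, sizeof(ckTrue) },  /* persistent key */
        { CKA_VALUE_LEN, &keyLen, sizeof(keyLen) },
        { CKA_ENCRYPT,   &ckTrue, sizeof(ckTrue) },
        { CKA_DECRYPT,   &ckTrue, sizeof(ckTrue) },
    };

    C_Initialize(NULL_PTR);
    C_OpenSession(haSlot, CKF_SERIAL_SESSION | CKF_RW_SESSION,
                  NULL_PTR, NULL_PTR, &session);
    C_Login(session, CKU_USER, pin, (CK_ULONG)strlen((char *)pin));

    /* The HA library creates the key on the primary, replicates it to every
     * active member, and only then returns CKR_OK to the application. */
    CK_RV rv = C_GenerateKey(session, &mech, tmpl,
                             sizeof(tmpl) / sizeof(tmpl[0]), &key);
    printf("C_GenerateKey: 0x%lx\n", (unsigned long)rv);

    C_Logout(session);
    C_CloseSession(session);
    C_Finalize(NULL_PTR);
    return 0;
}
```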

Load Balancing

The default behavior of the client library is to attempt to load-balance the application’s cryptographic requests across the entire set of devices in the HA group. The top level algorithm is a round-robin scheme that is modified to favor the least busy device in the set. As each new command is processed, the SafeNet Luna Network HSM client looks at how many commands it has scheduled on every device in the group. If all devices have an equal number of outstanding commands the new command is scheduled on the next device in the list. However, if the devices have a different number of commands outstanding on them, the new command is scheduled on the device with the fewest commands queued. This modified round-robin has the advantage of biasing load away from any device currently performing a lengthy command.

The least-busy algorithm uses the number of commands outstanding on each device as its measure of load.
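The following C sketch illustrates the idea behind this modified round-robin. It is an illustration of the behavior described above, not SafeNet source code: the rotation pointer supplies the starting candidate, and any available device with strictly fewer outstanding commands takes precedence.

```c
#include <stddef.h>

typedef struct {
    int outstanding;   /* commands currently queued on this device */
    int available;     /* nonzero unless the member has failed over */
} Device;

/* Pick the round-robin candidate unless another available device has
 * strictly fewer commands outstanding; ties resolve in rotation order. */
size_t pick_device(const Device *devs, size_t n, size_t *rr_next)
{
    size_t best = n;   /* n means "none found yet" */
    size_t i;

    for (i = 0; i < n; i++) {
        size_t idx = (*rr_next + i) % n;   /* scan from the rotation point */
        if (!devs[idx].available)
            continue;
        if (best == n || devs[idx].outstanding < devs[best].outstanding)
            best = idx;                    /* bias toward the least busy */
    }
    if (best < n)
        *rr_next = (best + 1) % n;   /* next call starts after the choice */
    return best;                     /* returns n if no member is available */
}
```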

Single-part vs. multi-part operations

In addition to this least-busy bias, the type of command also affects the scheduling algorithm. Single-part (stateless) cryptographic operations are load-balanced. However, key management commands and multi-part (stateful) commands are not.

Key management commands affect the state of the keys stored in the HSM. As such, these commands are targeted at all HSMs in the group: the command is performed on the primary HSM and the result is then replicated to every member of the HA group. Key management operations are infrequent for most applications. Multi-part operations carry cryptographic context across individual commands. The cost of distributing this context to different HA group members is generally greater than the benefit, so multi-part commands are all targeted at the primary member. Because multi-part operations are also infrequent, most applications are not affected by this restriction.
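At the PKCS#11 level, the distinction is between a single self-contained C_Encrypt call and a C_EncryptInit/C_EncryptUpdate/C_EncryptFinal sequence. The following sketch is illustrative only: the open, logged-in session, the AES key handle, and the fixed IV are assumptions for the example.

```c
#include <string.h>
#include "cryptoki.h"            /* vendor PKCS#11 header; name varies by SDK */

/* Single-part: one self-contained C_Encrypt call, so the HA library is
 * free to schedule it on any active member. */
CK_RV encrypt_single(CK_SESSION_HANDLE s, CK_OBJECT_HANDLE key,
                     CK_BYTE *in, CK_ULONG inLen,
                     CK_BYTE *out, CK_ULONG *outLen)
{
    CK_BYTE      iv[16] = { 0 };   /* fixed IV for illustration only */
    CK_MECHANISM mech   = { CKM_AES_CBC_PAD, iv, sizeof(iv) };
    CK_RV rv = C_EncryptInit(s, &mech, key);
    return (rv == CKR_OK) ? C_Encrypt(s, in, inLen, out, outLen) : rv;
}

/* Multi-part: the cipher context persists across the Update calls, so
 * the whole sequence is directed at the primary member. */
CK_RV encrypt_multi(CK_SESSION_HANDLE s, CK_OBJECT_HANDLE key,
                    CK_BYTE *part1, CK_ULONG len1,
                    CK_BYTE *part2, CK_ULONG len2,
                    CK_BYTE *out, CK_ULONG outSize, CK_ULONG *outLen)
{
    CK_BYTE      iv[16] = { 0 };
    CK_MECHANISM mech   = { CKM_AES_CBC_PAD, iv, sizeof(iv) };
    CK_ULONG     used = 0, n;
    CK_RV        rv;

    if ((rv = C_EncryptInit(s, &mech, key)) != CKR_OK) return rv;
    n = outSize - used;
    if ((rv = C_EncryptUpdate(s, part1, len1, out + used, &n)) != CKR_OK) return rv;
    used += n;
    n = outSize - used;
    if ((rv = C_EncryptUpdate(s, part2, len2, out + used, &n)) != CKR_OK) return rv;
    used += n;
    n = outSize - used;
    rv = C_EncryptFinal(s, out + used, &n);
    *outLen = used + n;
    return rv;
}
```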

HA groups shared across servers

When an HA group is shared across many servers, a different initial member can be selected on each server as the group is defined; the member assigned first becomes that server's primary. Staggering the primaries in this way distributes the key management and multi-part cryptographic load more evenly across the group.

Standby Mode

By default, all members in an HA group are treated as active: they are kept current with key material and used to load-balance cryptographic services. In some deployment scenarios it makes sense to define some members as standby. Standby members are not used for load-balancing, but as key material is created it is automatically replicated to them as well as to the active units. If all active members fail, a standby unit is automatically promoted to active status.

The primary reason for using this feature is to reduce costs while improving reliability: remote HSMs with high latency can be kept out of the active set until they are needed. In the worst-case scenario, where all the active HSMs fail, the standby member automatically activates itself and keeps the application running.

Failover

A failover event involves dropping a device from the available members of the HA group. All commands that were pending on the failed device are transparently rescheduled on the remaining members of the group. When a failure occurs, the application experiences a latency stall on the commands in process on the failing unit, but otherwise sees no impact on the transaction flow. The least-busy scheduling algorithm automatically minimizes the number of commands that stall on a failing unit during the twenty-second timeout.

Lengthy commands

Most commands complete within milliseconds. However, some commands can take extended periods to complete – either because the command itself is time-consuming (for example, key generation), or because the device is under extreme load. To cover these cases, the HSM automatically sends “heartbeats” every two seconds for any command that has not completed within the first two seconds. The twenty-second timer is extended every time one of these heartbeats arrives at the client, thus preventing false failover events.
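The following C fragment models this timer logic. It is an illustration of the described behavior, not SafeNet source code: a command is declared failed only after twenty seconds with neither a completion nor a heartbeat, and each heartbeat pushes the deadline out again.

```c
#include <time.h>

#define FAILOVER_TIMEOUT_SEC 20   /* the failover window described above */

typedef struct {
    time_t last_activity;   /* time of dispatch or of the latest heartbeat */
    int    completed;
} PendingCommand;

void on_heartbeat(PendingCommand *cmd)
{
    cmd->last_activity = time(NULL);   /* heartbeat resets the 20 s window */
}

int has_timed_out(const PendingCommand *cmd)
{
    return !cmd->completed &&
           difftime(time(NULL), cmd->last_activity) > FAILOVER_TIMEOUT_SEC;
}
```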

Failure of the primary unit

If the primary unit fails, clients automatically select the next member in the group as the new primary. Any key management or single-part cryptographic operations in progress are transparently restarted on a new group member. In-progress multi-part cryptographic operations, however, must be restarted by the application.
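An application can defend against this by wrapping each multi-part sequence in a simple retry, as in the sketch below. CKR_DEVICE_ERROR in the comment is only one plausible failure code; consult your client documentation for the codes actually returned when a member drops out.

```c
#include "cryptoki.h"            /* vendor PKCS#11 header; name varies by SDK */

/* Retry a multi-part digest once if it fails mid-sequence, for example
 * with CKR_DEVICE_ERROR because the primary member dropped out. */
CK_RV digest_with_retry(CK_SESSION_HANDLE session,
                        CK_BYTE *data, CK_ULONG len,
                        CK_BYTE *digest, CK_ULONG *digestLen)
{
    CK_MECHANISM mech = { CKM_SHA256, NULL_PTR, 0 };
    CK_RV rv = CKR_GENERAL_ERROR;
    int attempt;

    for (attempt = 0; attempt < 2; attempt++) {
        rv = C_DigestInit(session, &mech);
        if (rv != CKR_OK)
            continue;                /* restart the sequence from the top */
        rv = C_DigestUpdate(session, data, len);
        if (rv == CKR_OK)
            rv = C_DigestFinal(session, digest, digestLen);
        if (rv == CKR_OK)
            break;                   /* whole multi-part sequence succeeded */
    }
    return rv;
}
```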

As long as one HA group member remains functional, cryptographic service is maintained to an application no matter how many other group members fail.

Recovery

After a failure, the recovery process is straightforward. Depending on the deployment, you can employ an automated or manual recovery process. In either case, there is no need to restart the application.

Automatic Recovery

With automatic recovery, the client library automatically performs periodic recovery attempts while a member is failed. The frequency of these checks is adjustable, and no application restart is required.

Most customers enable auto-recovery in all configurations.
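Conceptually, auto-recovery amounts to a periodic loop like the following sketch. This is illustrative only: the member structure and helper functions are assumptions, and RETRY_INTERVAL_SEC merely stands in for the adjustable check frequency.

```c
#include <stdbool.h>

#define RETRY_INTERVAL_SEC 60        /* illustrative; the real interval is configurable */

typedef struct {
    bool failed;
    /* connection details would live here */
} Member;

static bool probe_member(Member *m)  /* assumed: transport-level reconnect check */
{
    (void)m;
    return false;                    /* placeholder result */
}

static void resync_and_reinstate(Member *m)
{
    /* replicate any keys created while the member was offline,
     * then resume scheduling commands on it */
    m->failed = false;
}

/* Called once per RETRY_INTERVAL_SEC while any member is failed. */
void auto_recovery_tick(Member *members, int n)
{
    for (int i = 0; i < n; i++)
        if (members[i].failed && probe_member(&members[i]))
            resync_and_reinstate(&members[i]);
}
```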

Manual Recovery

Simply run the client recovery command; the recovery logic inside the client then makes a recovery attempt the next time the application uses the HSM. As part of recovery, any key material created while the member was offline is automatically replicated to the recovered unit.

Even if a manual recovery process is selected, the application does not need to be restarted.

Permanent failure

Sometimes a failure of a device is permanent. In this event, you only need to remove the failed unit and deploy a new member to the group. The running clients automatically resynchronize keys to the new member and start scheduling operations to it.

Requirements

The SafeNet HA and load-balancing features work on a per-client and per-partition basis. This provides considerable flexibility: for example, it is possible to define a different subset of HSMs in each client, and even in each of a client's partitions (in the event that a single client uses multiple partitions). SafeNet recommends avoiding these complex configurations and keeping the HA topography uniform for an entire HSM; that is, treat each HSM member as atomic and whole.

Network topography

The network topography of the HA group is generally not important to the proper functioning of the group. As long as the client has a network path to each member the HA logic will function. Keep in mind that having a varying range of latencies between the client and each HA member causes a command scheduling bias towards the low-latency members. It also implies that commands scheduled on the long-latency devices have a larger overall latency associated with each command.

In this case, the command latency is a characteristic of the network. To achieve uniform load distribution, ensure that latencies to each device in the group are similar (or use standby mode).

Member Configuration and Version

All members in an HA group must have the same configuration and run the same software version; running HA groups with mixed versions is unsupported. Configuring the HSMs identically ensures smooth high-availability and load-balancing operation.

SafeNet Luna HSMs come with various key management configurations: cloning mode, key-export mode, etc. HA functionality is supported with both cloning and SIM variants – provided all members in the group have the same configuration. Clients automatically and transparently use the correct secure key replication method based on the group’s configuration.

Physical and Virtual Slots

By default, the client library presents both the physical slots and the virtual slots for the HA group. Directing an application at the physical slots bypasses the high-availability and load-balancing functionality; to activate that functionality, the application must be directed at the virtual slots. A configuration setting referred to as HAonly hides the physical slots. SafeNet recommends using this setting to prevent incorrect application configurations; it also simplifies PKCS#11 slot ordering when the HA group membership changes dynamically.
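Because slot ordering can change as group membership changes, it is also safer to locate the HA virtual slot by its token label rather than by a hard-coded slot number. The following sketch shows one way to do this; the label "myHAgroup" is an illustrative placeholder for your group's label.

```c
#include <string.h>
#include "cryptoki.h"            /* vendor PKCS#11 header; name varies by SDK */

/* Find the slot whose token label matches the HA group label. */
CK_RV find_slot_by_label(const char *label, CK_SLOT_ID *out)
{
    CK_SLOT_ID slots[64];
    CK_ULONG   count = 64;
    CK_RV      rv = C_GetSlotList(CK_TRUE, slots, &count);
    if (rv != CKR_OK)
        return rv;

    for (CK_ULONG i = 0; i < count; i++) {
        CK_TOKEN_INFO info;
        if (C_GetTokenInfo(slots[i], &info) != CKR_OK)
            continue;
        /* token labels are blank-padded to 32 bytes, not NUL-terminated */
        if (memcmp(info.label, label, strlen(label)) == 0) {
            *out = slots[i];
            return CKR_OK;
        }
    }
    return CKR_SLOT_ID_INVALID;      /* no token with that label */
}
```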

Application developers should be aware that the PKCS#11 object handle model is fully virtualized by the SafeNet HA logic. The application must not assume fixed handle numbers across instances of an application: a handle's value remains consistent for the life of a process, but it might be a different value the next time the application is executed.
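For example, instead of persisting an object handle between runs, an application can re-discover its key by attribute at startup, as in this sketch; searching by CKA_LABEL is one common choice, and the label value is up to the application.

```c
#include <string.h>
#include "cryptoki.h"            /* vendor PKCS#11 header; name varies by SDK */

/* Look up a key by its CKA_LABEL each time the process starts, rather
 * than reusing a handle value saved from a previous run. */
CK_RV find_key_by_label(CK_SESSION_HANDLE session, const char *label,
                        CK_OBJECT_HANDLE *out)
{
    CK_ATTRIBUTE tmpl[] = {
        { CKA_LABEL, (CK_VOID_PTR)label, (CK_ULONG)strlen(label) },
    };
    CK_ULONG found = 0;
    CK_RV rv = C_FindObjectsInit(session, tmpl, 1);
    if (rv != CKR_OK)
        return rv;
    rv = C_FindObjects(session, out, 1, &found);
    C_FindObjectsFinal(session);
    if (rv == CKR_OK && found == 0)
        rv = CKR_GENERAL_ERROR;  /* no object with that label */
    return rv;
}
```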

For detailed instructions on setting up HA, see the Administration Guide.