HA Operational Notes

When your application uses a Luna G5 HA group, it appears to be using just one HSM: the virtual (group) HSM that hides the HA group members. Your client should not attempt to address any partition on any Luna G5 within the HA group directly. Doing so defeats the purpose of HA, and can cause disruption if you or your application changes anything on just one member of a synchronized group. Similarly, no other application or user should be permitted to address any of the HA group members individually. As long as your application addresses its requests to the virtual group Partition, the HA functionality handles all activity transparently.

The intent of Luna G5 HA at this time is to provide:

•load balancing

•operational redundancy, such that if a unit fails (or must be taken off-line for other reasons) the remaining HSMs can continue to provide service to the Client application until the failed/removed HSM (or a replacement unit) can be brought into the HA group.

Load Balancing

Load balancing is supported for single-part operations like sign, verify, encrypt, decrypt. For multi-part operations, the operation is performed in the primary Luna G5's partition and the results are then cloned to the other member(s) of the HA group.

Reconnecting an Off-line Unit

•In HA mode, if a Luna G5 goes off-line/drops-out (due to failure, maintenance, or other reason), the application load is spread over the remaining Luna G5 Partitions in the HA Group.

•When the unit is restarted, the application does not need to be stopped and restarted before the re-introduced unit can be used by the application.

•For the unit that was withdrawn (or for a replacement unit), you must re-Activate the Partition before it can be re-included into the HA Group.

Replace a Failed Group Member with a New Luna G5

1.Configure the new Luna G5, making it part of the same cloning domain as others in the HA group (at initialization, get its cloning domain from the same red domain PED Key).
If the replacement appliance must have the same name as the appliance it replaces, you will need to stop your application before introducing the new appliance.

2.Create a partition with the same characteristics as others in the HA group.

3.Determine the serial number of the failed member partition.

4.Remove the failed member from the HA group using the "lunacm" command:
lunacm:> haGroup -removeMember -group <groupNumber> -serialNum <serialnumber> -password <password>

5.Add the new partition of the new Luna G5 to the HA group using the "lunacm" command:
lunacm:> haGroup -addMember -group <groupNumber> -serialNum <serialnumber> -password <password>

6.Perform a manual re-synchronization between the members using the "lunacm" command:
lunacm:> haGroup -synchronize -group <GROUP NAME>

Upgrading, Redundancy, and Rotation

The Luna G5 HA function assumes that all Luna G5 appliances in an HA group are at the same appliance software and firmware level. Therefore, when you intend to upgrade/update any of the Luna G5 units in an HA group, or when you intend to upgrade/update the Luna G5 Client software, schedule some downtime for your application.

If the application is so critical that you cannot permit that much scheduled downtime, then you can set up a second complete set consisting of a Client computer and its associated HA group. One set can service the application load while the other set is being upgraded or otherwise maintained. For such uptime-critical applications, you would likely already have such a backup set of Client-plus-HA-group that you would rotate in and out of service during regular maintenance windows.

Using Algorithms or Features in a Mixed HA Group

While it is possible to have HSMs with different firmware versions within an HA group, this is not generally recommended. Be aware that the capability of the group (in terms of features and available algorithms) is that of the member with the oldest firmware.

For example, if you had an HA group that included HSMs with two different firmware versions, then certain capabilities that are part of the newer firmware would be unavailable to Clients connecting to the HA group. Specifically, operations that make use of newer cryptographic mechanisms and algorithms would likely fail. The client's calls might initially be assigned to a newer-firmware HSM and could therefore appear to work for a time, but if the task was load-balanced to an HSM that did not support the newer features, it would fail. Similarly, if the newer-firmware HSM dropped out of the group, operations would fail. Your Clients must not invoke those algorithms, because not every member of the group supports them. The solution is to upgrade the older units to the most recent firmware and software versions (where possible), or else to limit clients to the lowest supported feature set.
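The "lowest supported feature set" can be pictured as the intersection of what each member supports. The following is a minimal sketch of that idea only; the member names and the assignment of mechanisms to members are hypothetical, and the function is not part of any Luna client API.

```python
def group_mechanisms(members):
    """Return the mechanisms that every member of the group supports.

    `members` maps a member name to the set of mechanism names it supports.
    """
    mech_sets = [set(mechs) for mechs in members.values()]
    if not mech_sets:
        return set()
    return set.intersection(*mech_sets)


# Hypothetical members and mechanism lists, for illustration only.
members = {
    "hsm_old_firmware": {"CKM_RSA_PKCS", "CKM_AES_CBC"},
    "hsm_new_firmware": {"CKM_RSA_PKCS", "CKM_AES_CBC", "CKM_AES_GCM"},
}

safe = group_mechanisms(members)
print(sorted(safe))  # prints ['CKM_AES_CBC', 'CKM_RSA_PKCS']
```

In this model, a client that restricts itself to the returned set never issues a request that only some members can serve.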

Frequently Asked Questions

If a partition becomes full, what happens?

You can't create any more objects on it. Some scenarios are just what they seem and have no bearing on HA; this is one of them.

Are session objects replicated or only token objects?

Session objects, as well as token objects, are synchronized and replicated.

What happens to an application if a device fails mid-operation? What if it’s a multi-part operation?

Multi-part operations do NOT fail over. The entire operation returns a failure. Your application deals with the failure in whatever way it is coded to do so.

Any operation that fails mid-point would need to be resent from the calling application. That is, if you don’t receive a ‘success’ response, then you must try again. This is obviously more likely to happen in a multi-part operation because those are longer, but a failure could conceivably happen during a single atomic operation as well.

With HA, if the library attempts to send a command to an HSM and it is unavailable, it will automatically retry sending that command to the next HSM in the configuration after the timeout expires.

Multi-part operations would typically be block encryption or decryption, or any other command where the previous state of the HSM is critical to the processing of the next command. It is understandable that these need to be re-sent, since the HSMs do not synchronize 'internal memory state', only stored key material.
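Because a failed multi-part operation must be resent in full by the calling application, one way to structure the application side is a wrapper that reruns the whole operation from the start. This is a sketch under stated assumptions: `OperationFailed`, `run_with_retry`, and `make_flaky_encrypt` are hypothetical names, and the "encryption" is a stand-in that upper-cases strings.

```python
class OperationFailed(Exception):
    """Hypothetical error raised when an operation does not return success."""


def run_with_retry(operation, max_attempts=3):
    """Run `operation` from the beginning until it succeeds or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            # Each attempt restarts the whole operation; no partial state is reused.
            return operation()
        except OperationFailed as err:
            last_error = err
    raise last_error


def make_flaky_encrypt(chunks, fail_first_n):
    """Build a simulated multi-part operation that fails mid-way on early calls."""
    state = {"calls": 0}

    def encrypt_all():
        state["calls"] += 1
        out = []
        for i, chunk in enumerate(chunks):
            if state["calls"] <= fail_first_n and i == 1:
                raise OperationFailed("member dropped mid-operation")
            out.append(chunk.upper())  # stand-in for real encryption
        return out

    return encrypt_all, state


op, state = make_flaky_encrypt(["a", "b", "c"], fail_first_n=1)
print(run_with_retry(op), state["calls"])
```

The first call fails on the second chunk, so the wrapper discards the partial result and the second call processes all three chunks again.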

How many times, or for how long will a device be polled to be automatically reintroduced?

This is set when you enable the feature. You can try once per minute, up to 500 minutes.

How does the automatic reintroduction work? Why does it need a partition policy?

Logic is built into HA client code.

At the library level, what happens when a device fails or doesn’t respond?

The client library drops the member and continues with others. It will try to reconnect that member at a minimum retry rate of once per minute (configurable) for the number of times specified in the configuration file, and then stop trying that member. You can specify a number of retries from 3 to an unlimited number.
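The retry behavior described above can be modeled as a bounded loop with a fixed interval. The sketch below only simulates the timing; it does not sleep or talk to an HSM, and `try_reintroduce` and `probe` are hypothetical names rather than client library functions.

```python
def try_reintroduce(probe, interval_s=60, max_retries=3):
    """Simulate reconnect attempts at a fixed interval.

    `probe(t)` returns True if the member answers at `t` seconds after the
    failure. Returns the time of the first successful attempt, or None if
    all retries are used up (after which no further attempts are made).
    """
    for attempt in range(1, max_retries + 1):
        t = attempt * interval_s
        if probe(t):
            return t
    return None


# Hypothetical member that becomes reachable 150 seconds after it failed.
def back_at_150(t):
    return t >= 150


print(try_reintroduce(back_at_150, interval_s=60, max_retries=3))  # prints 180
print(try_reintroduce(back_at_150, interval_s=60, max_retries=2))  # prints None
```

With three retries the member is picked up on the third attempt; with two, the attempts end before the member returns.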

Under what circumstances will a device be moved out of an HA group - only in the event it cannot be contacted?

You must manually remove a member. If the device cannot be contacted, the HA client merely stops trying it (see "retries" in the previous question), but the device remains a group member until manually removed.

Can you add and remove devices to an HA group without restarting the application? If so, what caveats apply?

No, you cannot. Think of starting the application as starting a race. You cannot add in a new runner once the race is already under way. But, if you restart the race, you can.

What is the impact of the ‘haonly’ flag, and why might you wish to use it?

The "haonly" flag shows only HA slots (virtual slots) to the client applications. It does not show the physical slots. We recommend that you use "haonly" unless you have a particular reason for not using it. Having "haonly" set is the proper way for clients to deal with HA groups: it prevents the possible confusion of having both physical and virtual slots available.

Recall that automatic replication/synchronization across the group occurs only if you cause a change (keygen or other addition, or a deletion) via the virtual HA slot. If you or your application changes the content of a physical slot, the group becomes out-of-sync, and a manual re-sync is required to replicate a new object across all members. Similarly, if you delete from a physical slot directly, the next manual synchronization will cause the deleted object to be repopulated from another group member where that object was never deleted. Therefore, a lasting deletion performed on a physical slot (rather than via the virtual slot) requires that you manually delete the object from every physical slot in the group; otherwise you risk the deleted object coming back.
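The repopulation effect can be illustrated with a simplified model in which each member holds a set of object labels and a manual synchronization copies every object found on any member to all members. This is an assumption-laden sketch of the behavior described above, not the actual synchronization algorithm; `manual_sync` and the member and key names are hypothetical.

```python
def manual_sync(members):
    """Model a manual re-sync: every member ends up with the union of all objects."""
    all_objects = set().union(*members.values())
    return {name: set(all_objects) for name in members}


# Two hypothetical members that start out in sync.
members = {"member1": {"keyA", "keyB"}, "member2": {"keyA", "keyB"}}

# Delete keyB directly on one physical slot only, then re-sync.
members["member1"].discard("keyB")
members = manual_sync(members)
print(sorted(members["member1"]))  # prints ['keyA', 'keyB']

# Delete keyB on every physical slot, then re-sync.
for objects in members.values():
    objects.discard("keyB")
members = manual_sync(members)
print(sorted(members["member1"]))  # prints ['keyA']
```

Only the second deletion is lasting, because no member still holds a copy to repopulate from.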

Also, from the perspective of the Client, a member of the HA group can fail and, with "haonly" set, the slot count does not change. If "haonly" is not set, and both virtual and physical slots are visible, then failure of unit number 1 causes unit number 2 to become slot 1, and so on. That could cause problems if your application is not designed to deal gracefully with such a change.
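The slot renumbering can be sketched with a small model of what a client sees. The numbering scheme here (physical slots counted from 1, followed by the virtual slot) and the names `visible_slots` and `HA_SLOT` are assumptions made for illustration; the real slot order depends on your client configuration.

```python
HA_SLOT = "HA virtual slot"


def visible_slots(available_members, ha_only):
    """Return a {slot_number: label} view of the slots a client would see."""
    if ha_only:
        # Only the virtual slot is listed, regardless of member availability.
        return {1: HA_SLOT}
    slots = {i: name for i, name in enumerate(available_members, start=1)}
    slots[len(available_members) + 1] = HA_SLOT
    return slots


# Without "haonly": losing unit1 shifts unit2 from slot 2 to slot 1.
print(visible_slots(["unit1", "unit2"], ha_only=False))
print(visible_slots(["unit2"], ha_only=False))

# With "haonly": the view is the same before and after the failure.
print(visible_slots(["unit1", "unit2"], ha_only=True))
print(visible_slots(["unit2"], ha_only=True))
```

An application that hard-codes a slot number is unaffected in the "haonly" case, but in the other case the same number points at a different device after the failure.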

If an HA group member fails and an application restarts, it will not be possible to recover that device until you restart the application again. Why?

This is as designed. You originally had your application running with X number of members. One failed, but was not removed from the group, so retries were occurring, but the application was operating with X-1 members available. Then you restarted. When the application came up after that restart, it saw only X-1 members. Having just started, it now has no notion that the Xth member exists. The "race" has restarted with X-1 runners. You cannot add to that number within an application. To go from the number that the application now recognizes, X-1, to the new, larger number of participants X-1 +1 (or X), you must restart the application (the race) while all X members (runners) are available.

Can a PED operation on one member of an HA group lock it out from operation (PED operations block cryptographic operations)? If so, will it automatically come back into use after the operation has concluded?

Yes. Fail-over and recovery HA logic are invoked.

What if HA does not recognize partition full?

Normally, this could happen only if you are performing operations directly on physical slots, rather than via the virtual slot. If the system ever tells you that your Partition is full, but HA says otherwise, then use a tool like ckdemo that can view the "physical" slots directly (as opposed to the HA slot) on the HSM, and delete any objects that are unnecessary.
