When your application uses a Luna HSM HA group, it appears to be using just one HSM - the virtual (group) HSM that encompasses the HA group members. Your client should not attempt to directly address any partition on any Luna HSM within the HA group; doing so defeats the purpose of HA and can cause disruption if you or your application change anything on just one member of a synchronized group. Similarly, no other application or user should be permitted to address any of the HA group members individually. As long as your application addresses its requests to the virtual group partition, the HA functionality takes care of all activity transparently.
The intent of Luna SA HA is to provide the following:
•load balancing
•operational redundancy such that if an appliance fails (or must be taken off-line for other reasons) the remaining appliances can continue to provide service to the Client until the failed/removed appliance (or a replacement unit) can be brought into the HA group.
Note: Client tools such as lunacm allow you to view partitions on your HSM, including the virtual group. While some commands within lunacm may appear to support management functions on the partitions of the virtual group, as stated above, you should not use these commands as they defeat the purpose of HA and are unsupported for this purpose.
•In HA mode, if an HSM appliance goes off-line/drops-out (due to failure, maintenance, or other reason), the application load is spread over the remaining HSM Partitions on appliances in the HA Group.
•When the unit is restarted, the application does not need to be stopped and restarted before the re-introduced unit can be used by the application.
•For the unit that was withdrawn (or for a replacement unit), if it was powered off for more than a short outage, you must re-Activate the Partitions before they can be re-included in the HA Group.
The following two re-connection scenarios are available:
Scenario 1: Re-introduce a repaired or restarted member
1. Restart the failed member and verify that it has started properly.
2. Do NOT perform a manual re-synchronization between the members. Instead, use the "vtl" command:
vtl haadmin -recover -group <GROUP NAME>
Scenario 2: Replace the failed member with a new appliance
1. Configure the new Luna SA, naming it DIFFERENTLY from the failed member appliance (the name must be different to avoid any possibility of conflict between the old and new SSL certificates, which incorporate the hostnames of the respective appliances), and make it part of the same cloning domain as the others in the HA group (at initialization, get its cloning domain from the same red domain PED Key). If you require that the replacement appliance have the same name as the replaced appliance, then you must stop your application before introducing the new appliance.
2. Create a partition with the same characteristics as others in the HA group (password, autoActivation, auto MofN, client assignments, etc.).
3. Do NOT delete the failed Luna SA member from the configuration file.
4. Determine the serial number of the failed member partition.
5. Remove the failed member from the HA group using the "vtl" command:
vtl haadmin -removeMember -group <groupNumber> -serialNum <serialnumber> -password <password>
6. Retrieve the server certificate of the new Luna SA.
7. Replace the failed Luna SA with the new one using the "vtl" command:
vtl replaceServer -o <oldServerName> -n <newServerName> -c <newServerCertFile>
8. Add the new partition of the new Luna SA to the HA group using the "vtl" command:
vtl haadmin -addMember -group <group number> -serialNum <serialnumber> -password <password>
9. Do NOT perform a manual re-synchronization between the members. Instead, use the "vtl" command:
vtl haadmin -recover -group <GROUP NAME>
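To make the replacement sequence concrete, here is an example using hypothetical values (group number 1, failed-partition serial 950000111, new-partition serial 950000222, old and new appliance names mysa1 and mysa2, certificate file server_mysa2.pem, and group name myHAgroup - all illustrative, not values from your installation):
vtl haadmin -removeMember -group 1 -serialNum 950000111 -password <password>
vtl replaceServer -o mysa1 -n mysa2 -c server_mysa2.pem
vtl haadmin -addMember -group 1 -serialNum 950000222 -password <password>
vtl haadmin -recover -group myHAgroup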
See "HA Replacing a Failed Luna SA" for more discussion of replacing or re-introducing members to an existing HA group.
For Luna SA HA, we suggest that all Luna SA appliances in an HA group be at the same appliance software and firmware level. The issue is not the firmware level, per se; rather, newer firmware could contain newer algorithms that the older firmware does not support. If your client is configured to take advantage of newer/better algorithms when they become available, it might do so while one member of an HA group has new firmware but another member has not yet been updated, and therefore does not yet support the requested algorithm. The client might not be able to interpret the resulting imbalance. Therefore, when you intend to upgrade/update any of the Luna SA units in an HA group, or to upgrade/update the Luna SA Client software, schedule some downtime for your application if you anticipate a problem.
If the application is so critical that you cannot permit that much scheduled downtime, then you can set up a second complete set (a Client computer and an associated HA group). One set can service the application load while the other set is being upgraded or otherwise maintained. For such up-time-critical applications, you might already have such a backup set of Client-plus-HA-group that you would rotate in and out of service during regular maintenance windows.
Due to a problem in the TCP/IP configuration of some Solaris systems, inconvenient delays may have been experienced with some Solaris clients.
The problem occurred if an application was started on a Solaris client while one or more expected Luna SA appliances were unavailable. The Solaris client machine experienced a considerable delay (minutes) before the remaining Luna SAs could be seen and used by the application. This was a TCP/IP setup issue in Solaris, in which the system attempted to set up sockets for each expected connection, and retried the unsuccessful attempts until timeout, before permitting successful connections to proceed.
To control this problem, the client-side library now imposes a ten-second retry window per expected appliance, and then moves on. (Thus, if your Client was configured to use three Luna SA appliances, and two of them were unavailable, the Client would retry the first missing appliance for ten seconds, then the second missing appliance for a further ten seconds, for a total of twenty seconds of retries, before resuming operation with the remaining available appliance). This applies to Linux and Unix variants.
For Windows, the per-appliance timeout is 24 seconds.
If you create an object on your HA slot, and then duplicate that object in some fashion (for example, by SIM'ing [wrapping] it off and then back on again, or performing a backup/restore with the 'add' option), that object will be seen as only one object on the HA slot because HA uses the object's fingerprint to build an object list. Two objects will in fact exist on each of the physical slots and could be seen by a non-HA utility/query to the HSM.
There are TWO implications from this situation:
•One implication is that repeated duplication (perhaps an application that performs periodic backups, and restores using the 'add' option rather than 'replace') could cause the Partition to reach the maximum number of Partition objects while seemingly having fewer objects. If the system ever tells you that your Partition is full, but HA says otherwise, use a tool like ckdemo that can view the "physical" slots directly (as opposed to the HA slot) on the HSM, and delete any objects that are unnecessary.
•A second implication is that the HA feature uses object fingerprints to match different instances of an object on different physical HSMs. This can result in error messages if your application does not properly create and destroy session objects, and perhaps creates an object identical to one which has been removed in a separate concurrent session. The problem is self-correcting, but the flurry of error messages could be worrying if you don't understand where they are coming from.
While it is possible to have HSMs with different firmware versions within an HA group, this is not generally recommended. Be aware that the capability of the group (in terms of features and available algorithms) is that of the member with the oldest firmware.
For example, if you had an HA group that included HSMs with two different firmware versions, then certain capabilities that are part of the newer firmware would be unavailable to Clients connecting to the HA group. Specifically, operations that make use of newer cryptographic mechanisms and algorithms would likely fail. The client's calls might be initially assigned to a newer-firmware HSM and could therefore appear to work for a time, but if the task was load-balanced to an HSM that did not support the newer features it would fail. Similarly, if the newer-firmware HSM dropped out of the group, operations would fail. Your Clients must not invoke those algorithms because not every member of the group supports them. The solution is to upgrade the older units to the most recent firmware and software versions (where possible) or else to limit clients to only the lowest supported feature set.
Luna SA 5.x in HA can provide performance improvement for asymmetric single-part operations. Gigabit ethernet connections are recommended to maximize performance. For example, we have seen as much as a doubling of asymmetric single-part operations in a two-member group in a controlled laboratory environment (without crossing subnet boundaries, without competing traffic or other latency-inducing factors).
Multi-part operations are not load-balanced by the Luna HA due to the overhead that would be needed to perform context replication for each part of a multi-part operation.
Single-part cryptographic operations are load-balanced by the Luna HA functionality under most circumstances (see the note below on the PE1746Enabled setting; the PE1746 is the SafeXcel 1746 crypto integrated circuit within the K6 HSM - the stand-alone Luna PCI-E, and the HSM inside the Luna SA appliance). Load-balancing these operations provides both scalability (better net throughput of operations) and redundancy by supporting transparent fail-over.
The Luna client accepts a configuration file entry known as “PE1746Enabled”. This configures the way Luna HSM handles symmetric encryption and decryption operations for certain algorithms – namely ECB and CBC modes of AES and TDES. By default (beginning with release 5.4) an entry is always present in the [Misc] section of the configuration file, and its value is set to “PE1746Enabled=0”, or unset.
To enable this option, set "PE1746Enabled=1".
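For example, the entry appears in the [Misc] section of the configuration file. Shown here in Windows-style INI syntax as a sketch (the Linux/UNIX Chrystoki.conf expresses the same setting inside an equivalent Misc = { ... } block):
[Misc]
PE1746Enabled=1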
When set, this value configures the library to use fast-path cryptography directly to symmetric encryption engines. This enables high-performance bulk cryptography, but has the disadvantage of creating a direct context between the client library and the engine. This means that the library cannot easily load-balance operations across HSMs. This mode should be used only by applications that perform large data encryption operations (>1KB data sizes).
When PE1746Enabled=0, the library uses its standard command path to the HSM. The advantage of this is that all single-part cryptographic operations can be load-balanced. The disadvantage is lower performance for larger data sizes. Applications should maintain this setting whenever possible to ensure the scalability and fail-over advantages.
In summary:
•when PE1746Enabled=1 load-balancing is not used for symmetric cryptographic operations; instead all symmetric operations are directed at the client’s primary member -- you see better performance, but no scalability across HSMs.
•when PE1746Enabled=0 all single-part cryptographic operations (with data size less-than-or-equal-to 1K ) are load-balanced.
A single-part crypto operation is typically one that has small data sizes (<1KB), but whether an operation is single-part also depends on how the library makes its API calls (PKCS #11 supports explicit multi-part API calls through the use of C_EncryptUpdate and C_DecryptUpdate). When an application uses the "Update" APIs, the cryptographic operation is, by definition, multi-part. When the application does not use these APIs (that is, uses C_EncryptInit followed by C_Encrypt), the operation is single-part, up to a 64KB data size.
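To illustrate the distinction in code, here is a minimal sketch in C (not taken from the Luna SDK; it assumes an open, logged-in PKCS #11 session and an AES key handle obtained elsewhere, abbreviates error handling, and - because CBC is used - requires the plaintext length to be a multiple of the 16-byte block size):
#include "pkcs11.h"

/* Single-part: C_EncryptInit followed by one C_Encrypt call.
   Stateless from the library's point of view, so the HA client
   can load-balance it across members. */
CK_RV encrypt_single_part(CK_SESSION_HANDLE hSession, CK_OBJECT_HANDLE hKey,
                          CK_BYTE *data, CK_ULONG dataLen,
                          CK_BYTE *out, CK_ULONG *outLen)
{
    CK_BYTE iv[16] = {0};            /* example IV only; use a random IV in practice */
    CK_MECHANISM mech = {CKM_AES_CBC, iv, sizeof(iv)};
    CK_RV rv = C_EncryptInit(hSession, &mech, hKey);
    if (rv != CKR_OK) return rv;
    return C_Encrypt(hSession, data, dataLen, out, outLen);
}

/* Multi-part: the "Update" calls carry cryptographic state between
   parts, so the whole operation is pinned to one member (the primary)
   and is not load-balanced. */
CK_RV encrypt_multi_part(CK_SESSION_HANDLE hSession, CK_OBJECT_HANDLE hKey,
                         CK_BYTE *data, CK_ULONG dataLen,
                         CK_BYTE *out, CK_ULONG *outLen)
{
    CK_BYTE iv[16] = {0};
    CK_MECHANISM mech = {CKM_AES_CBC, iv, sizeof(iv)};
    CK_RV rv = C_EncryptInit(hSession, &mech, hKey);
    if (rv != CKR_OK) return rv;
    CK_ULONG total = 0, off = 0;
    while (off < dataLen) {          /* feed the data in 4KB parts */
        CK_ULONG chunk = (dataLen - off < 4096) ? (dataLen - off) : 4096;
        CK_ULONG n = *outLen - total;
        rv = C_EncryptUpdate(hSession, data + off, chunk, out + total, &n);
        if (rv != CKR_OK) return rv;
        total += n;
        off += chunk;
    }
    CK_ULONG last = *outLen - total;
    rv = C_EncryptFinal(hSession, out + total, &last);
    if (rv == CKR_OK) *outLen = total + last;
    return rv;
}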
Additionally, the HSM has a limit of 1000 contexts for SafeXcel 1746 operations, which is a consideration when many concurrent client threads are involved. (LHSM-12630)
Whenever possible, run your application with PE1746Enabled=0.
If all members of an HA group were to fail, then all logged-in sessions are gone, and operations that were active when the last group member went down are terminated. It is not currently possible, from the Luna SA perspective, to resume the client's state unassisted when the members are restarted. However, if the client application is able to recover all of that state information, then it is not necessary to restart or re-initialize in order to resume client operations with the Luna SA HA group. See "HA Recovery".
In any one HA group, always ensure that member partitions or member PKI tokens (USB-attached Luna G5 HSMs, or Luna CA4/PCM token HSMs in a USB-attached Luna DOCK2 card reader) are on different / separate appliances. Do not attempt to include more than one HSM partition or PKI token (nor one of each) from the same appliance in a single HA group. This is not a supported configuration. Allowing two partitions from one HSM, or a partition from the HSM and an attached HSM (as for PKI), into a single HA group would defeat the purpose of HA by making the Luna appliance a potential single-point-of-failure.
The client-side utility command "vtl listslot" shows all detected slots, including HSM partitions on the primary HSM, partitions on connected external HSMs, and HA virtual slots. Here is an example:
bash-3.2# ./vtl listslot
Number of slots: 11
The following slots were found:
Slot # Description Label Serial # Status
slot #1 LunaNet Slot - - Not present
slot #2 LunaNet Slot sa76_p1 150518006 Present
slot #3 LunaNet Slot sa77_p1 150475010 Present
slot #4 LunaNet Slot G5179 700179008 Present
slot #5 LunaNet Slot pki1 700180008 Present
slot #6 LunaNet Slot CA4223 300223001 Present
slot #7 LunaNet Slot CA4129 300129001 Present
slot #8 HA Virtual Card Slot - - Not present
slot #9 HA Virtual Card Slot - - Not present
slot #10 HA Virtual Card Slot ha3 343610292 Present
slot #11 HA Virtual Card Slot G5_HA 1700179008 Present
Note: The deploy/undeploy of a PKI device increments/decrements the Luna SA client slot enumeration list (slots appear or disappear from the list, and the slot numbers adjust for the change). HA group virtual slots always appear toward the end of the list, following the physical slots. The actual slot number can vary based on the currently connected external HSMs (tokens, G5).
Due to the above behavior, we generally recommend that you run the lunacm:> haGroup haonly command, or the vtl haAdmin HAOnly enable command, so that only the HA slot is visible and any confusion or improper slot use is eliminated.
If your situation requires that some HA group members be active, while others are kept synchronized, but in standby mode, see "HA Standby [optional]".
The following questions, in no particular order, represent queries that have come in from customers and from our own technical representatives in the field, about specific aspects of the workings of HA. Some are from potential customers determining whether Luna HSM appliances meet their needs.
When an HA Group member first fails, the HA status for the group shows "device error" for the failed member. All subsequent calls return "token not present", until the member (HSM Partition or PKI token) is returned to service.
The default behavior of the client library is to attempt to load-balance the application’s cryptographic requests across the entire set of devices in the HA group. The top level algorithm is a round-robin scheme that is modified to favor the least busy device in the set.
As each new command is processed, the Luna client looks at how many commands it has scheduled on every device in the group. If all devices have an equal number of outstanding commands, the new command is scheduled on the next device in the list, creating round-robin behavior. However, if the devices have different numbers of commands outstanding on them, the new command is scheduled on the device with the fewest commands queued, creating least-busy behavior. This modified round-robin has the advantage of biasing load away from any device currently performing a lengthy command. In addition to this least-busy bias, the type of command also affects the scheduling algorithm.
Single-part (stateless) cryptographic operations are load-balanced. However, multi-part (stateful) and key management commands are not load-balanced. Multi-part commands would need to carry cryptographic context across the individual command parts, creating overhead. Instead, multi-part commands are all targeted at the primary member.
Multi-part operations and key management operations are infrequent actions, so most applications are not affected by this restriction.
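As an illustration only, the least-busy round-robin described above can be sketched as follows (a simplified model of the behavior, not SafeNet's implementation, and it ignores the multi-part and key-management exceptions just noted):
#include <stddef.h>

typedef struct {
    int outstanding;                 /* commands currently queued on this member */
} ha_member_t;

static size_t next_rr = 0;           /* round-robin cursor */

/* Return the index of the member that should receive the next command. */
size_t schedule_command(ha_member_t *members, size_t count)
{
    size_t pick = next_rr % count;   /* round-robin candidate */
    /* If any member is less busy than the candidate, prefer it: this
       biases load away from a device working on a lengthy command. */
    for (size_t i = 0; i < count; i++) {
        if (members[i].outstanding < members[pick].outstanding)
            pick = i;
    }
    next_rr = (pick + 1) % count;    /* advance the cursor past the choice */
    members[pick].outstanding++;     /* decremented when the reply returns */
    return pick;
}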
HA might not load-balance, or might fail to perform fail-over properly, if replication is turned off.
This is not possible. The certificates used in replication/synchronization are not compatible. You can migrate objects from your older appliances/HSMs, but you cannot run Luna SA 4.x and Luna SA 5.x concurrently in a single HA group.
SIM replication is supported.
HA will work, but key replication must be performed manually; keys created in such an environment are not replicated automatically.
You can, but you CANNOT clone/replicate private keys.
Automatic reintroduction is supported.
A failed (and fixed, or replacement) HSM appliance can be re-introduced if the application continues without restart. Restarting the application causes it to take a fresh inventory of available HSMs, and to use only those HSMs within its HA group. You cannot [re]introduce a Luna SA that was not in the group when the application started.
Synchronization of token objects is a manual process using the “vtl” utility. Synchronization locates any object that exists on any one physical HSM partition (that is a member of the HA group), but not on all others, and replicates that object to any partitions (among the group) where it did not exist.
This is distinct from the replication that occurs when you create or delete an object on the HA virtual slot. Creation or deletion against the virtual slot causes that change to be immediately replicated to all connected members (addition OR deletion).
To illustrate, consider a group of three HSMs with partitions containing objects as follows:
| Action | HSM 1 Partition A | HSM 2 Partition B | HSM 3 Partition C | HA virtual slot |
|---|---|---|---|---|
| Objects in partition before synchronization | Keypair 1, Key 1, Certificate 1 | Keypair 1, Key 1, Certificate 2 | Key 1, Key 2, Certificate 2 | Keypair 1, Key 1, Certificate 1 |
| Perform manual re-sync; objects in physical partition after synchronization (note all partitions contain all objects) | Keypair 1, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Key 1, Key 2, Certificate 1, Certificate 2 |
| Create Keypair 2 against the HA slot (replication is immediate and automatic; all partitions contain all objects) | Keypair 1, Keypair 2, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Key 1, Key 2, Certificate 1, Certificate 2 |
| Create Keypair 3 against physical Partition A (no replication or sync occurs) | Keypair 1, Keypair 2, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 (* ONLY if Partition A is the primary member of the HA group; otherwise, the created object is NOT visible in the virtual slot until re-sync) |
| Perform manual re-sync; objects in partition after synchronization | Keypair 1, Keypair 2, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 2, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 |
| Delete Keypair 2 from the HA slot (replication is immediate and automatic; Keypair 2 is gone from all partitions) | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 |
| Delete Keypair 1 from physical Partition B (no replication or sync happens) | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 |
| Perform manual re-sync; objects in partition after synchronization (note Keypair 1 has returned to Partition B) | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 | Keypair 1, Keypair 3, Key 1, Key 2, Certificate 1, Certificate 2 |
You can't create any more objects on it. Some scenarios are just what they seem and have no bearing on HA; this is one of them.
Session objects, as well as token objects, are synchronized and replicated.
Multi-part operations do NOT fail over. The entire operation returns a failure. Your application deals with the failure in whatever way it is coded to do so.
Any operation that fails mid-point would need to be resent from the calling application. That is, if you don’t receive a ‘success’ response, then you must try again. This is obviously more likely to happen in a multi-part operation because those are longer, but a failure could conceivably happen during a single atomic operation as well.
With HA, if the library attempts to send a command to an HSM and it is unavailable, it will automatically retry sending that command to the next HSM in the configuration after the timeout expires.
Multi-part operations would typically be block encryption or decryption, or any other command where the previous state of the HSM is critical to the processing of the next command. It is understandable that these need to be re-sent since the HSMs do not synchronize ‘internal memory state’ … only stored key material.
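As a sketch of that application-level resend (a hypothetical wrapper reusing the encrypt_single_part helper from the earlier sketch; the three-attempt policy is an illustrative assumption, not a recommendation from the source):
/* Re-issue a request when no 'success' response is received. */
CK_RV encrypt_with_retry(CK_SESSION_HANDLE hSession, CK_OBJECT_HANDLE hKey,
                         CK_BYTE *data, CK_ULONG dataLen,
                         CK_BYTE *out, CK_ULONG *outLen)
{
    CK_RV rv = CKR_FUNCTION_FAILED;
    for (int attempt = 0; attempt < 3; attempt++) {
        CK_ULONG n = *outLen;        /* reset output length each attempt */
        rv = encrypt_single_part(hSession, hKey, data, dataLen, out, &n);
        if (rv == CKR_OK) {
            *outLen = n;
            return rv;               /* success response received */
        }
        /* Otherwise (e.g. CKR_DEVICE_ERROR while a member drops out),
           fall through and resend the whole operation. */
    }
    return rv;
}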
Use the "ntls show" command.
The CA extension call "CA_GetHAState" lists all active members. Also, "vtl l" lists members.
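As a rough sketch, a client could query HA state programmatically like this; the header name and struct field names below follow our reading of the SafeNet cryptoki extensions and should be verified against the headers shipped with your client before use:
#include <stdio.h>
#include "cryptoki_v2.h"             /* SafeNet extension header; name may vary */

/* List the members of the HA group behind the given virtual slot. */
void show_ha_members(CK_SLOT_ID haSlot)
{
    CK_HA_STATUS st;                 /* field names assumed from vendor header */
    CK_RV rv = CA_GetHAState(haSlot, &st);
    if (rv != CKR_OK) {
        printf("CA_GetHAState failed: 0x%lx\n", (unsigned long)rv);
        return;
    }
    printf("group serial %.20s, %u members\n",
           (const char *)st.groupSerial, (unsigned)st.listSize);
    for (CK_USHORT i = 0; i < st.listSize; i++) {
        /* An active member reports CKR_OK; a failed member reports an
           error such as CKR_DEVICE_ERROR, as described above. */
        printf("  member %.20s status 0x%lx\n",
               (const char *)st.memberList[i].memberSerial,
               (unsigned long)st.memberList[i].memberStatus);
    }
}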
While the client application is active, and one HA group member is connected and active, other members can automatically resume in the HA group as long as retries have not stopped (see below).
If all members fail or if the client does not have a network connection to at least one group member, then the client application must be restarted.
The more detailed answer is "it depends..." - various events and processes interact at different levels and in different situations. Here is a summary.
At the lowest communication level, the transport protocol (TCP) is responsible for making and operating the communication connection between client and appliance (whether HA is involved or not). For Luna SA, the default protocol timeout of 2 hours was much too long, so SafeNet configured that to 3 minutes when HA is involved. This means that:
•In a period of no activity by client or appliance, the appliance's TCP will wonder if the client is still there, and will send a packet after 3 minutes of silence.
•If that packet is acknowledged, the 3 minute TCP timer restarts, and the cycle repeats indefinitely.
•If the packet is NOT acknowledged, then TCP sends another after approximately 45 seconds, and then another after a further 45 seconds. At the two minute mark, with no response, the connection is considered dead, and higher levels are alerted to perform their cleanup.
So altogether, a total of five minutes can elapse since the last time the other participant was heard from. This is at the transport layer.
Above that level, the NTLS layer provides the connection security and some other housekeeping. Any time a client sends a request for a cryptographic operation, the HSM on the appliance begins working on that operation.
While the HSM processes the request, appliance-side NTLS sends a "keep-alive PING" every two seconds, until the HSM returns the answer, which NTLS then conveys across the link to the requesting client. Neither NTLS nor any layer above it performs any interpretation of the ping.
It simply drops a slow, steady trickle of bytes into the pipe, to keep the TCP layer active. This normally has little effect, but if your client requests a lengthy operation like (say) an 8192-bit keygen, then the random-number-generation portion of that operation could take many minutes to complete, during which the HSM would legitimately be sending nothing back to the client. The NTLS ping ensures that the connection remains alive during long pauses.
In the Luna configuration file, "DefaultTimeout" (default value: 500 seconds) governs how long the client waits for a result from an HSM for a cryptographic call. In the case of Luna SA, the copy of the config file inside the appliance is not accessible externally. The config file in the client installation can be modified, but "DefaultTimeout" in that file affects only a locally connected HSM (such as a Luna Remote Backup HSM attached to your client computer). The config file in the client has no effect on the configuration inside the network-attached Luna SA appliance, and thus has no effect on the interaction between client and Luna SA appliance.
ReceiveTimeout is how long the library will wait for a dropped connection to come back.
If the ReceiveTimeout is tripped, for a given appliance, the HA client stops talking to that appliance and deals with the remaining members of the HA group to serve your application's crypto requests.
A minute later, the HA client tries to contact the member that failed to reply.
If the connection is successfully re-established, the errant appliance resumes working in the group, being assigned application calls as needed (governed by application workload and HA logic).
If the connection is not successfully re-established, the client continues working with the remaining group members. Another minute passes, and the client once again tries the missing appliance to see if it is ready to actively resume working in the HA group.
The retries continue until the missing member resumes, or until the pre-set (by you) number of retries is reached (maximum of 500). If the retry count is reached with no success, the client stops trying that member. The failed appliance is still a member of the group (it is still in the list of HA group members maintained on the client), but the client no longer tries to send it application calls, and no longer encourages it to establish a connection. You must fix the appliance (or its network connection) and manually recover it into the group for the client to resume including it in operations.
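As a rough sketch only, the client-side timeout entries discussed above might look like the following in a configuration file. The section names, the units (some client versions express these values in milliseconds), and the values shown here are assumptions; confirm against the file shipped with your client before editing:
[Luna]
DefaultTimeout=500000    ; assumed to be milliseconds here (the stated 500-second default)
[LunaSA Client]
ReceiveTimeout=20000     ; example value only, not a documented default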
This is set when you enable the feature through vtl. Retries occur once per minute, up to a maximum of 500 retries (500 minutes).
Logic is built into HA client code.
HA allows for varying configurations. HA is purely at the client level. Your use of it depends on your creativity and understanding of your group members. In short, any one Luna SA can be part of any number of HA groups. The Luna SA itself has no concept that it is working in HA.
A given Luna SA HSM can have multiple partitions. There is no contention if the HSM that is common to two different HA groups has two separate partitions, each partition being used by only one of the HA groups. Problems can arise if a partition on one HSM is a member of more than one active HA group.
Consider two HA groups consisting of (say) appliances A and B in group One, and appliances B and C in group Two. For simplicity, assume that all three HSMs have only a single partition each, so that the partition on appliance B is a member of both groups. An object propagates between the two groups by being created in one of them. The proper way to create an object in an HA group, of course, is to address the request/command to the virtual slot, the HA slot. Behind the scenes, the new object is created on the primary member and is immediately, automatically propagated to the other member(s) of the group.
To be more specific, a request to generate a key would be issued by the client application. The request would be captured by the HA software on the client, which would try the first physical member in the HA list, as specified in the configuration file. If that operation proceeds successfully, the resulting new key is propagated to the other members of the HA group - if replication is turned on. If the keygen operation fails with a timeout - the primary HA member is busy and does not finish its current operation before the client times out the keygen request - then the client drops the request against that member and tries the next physical member of the HA group.
Alternatively, and usually not recommended, you could create an object directly on the physical slot of one member or the other. No synchronization is triggered, because you did not perform the operation via the group (virtual) slot. The next time you issue a manual synchronization request for that group, the synchronization happens.
Once synchronization/replication occurs, the AB group has the new object on both members, either because you created it in the group virtual slot and it was automatically replicated, or because you created it improperly, directly on one of the physical HSM slots, and performed a manual replication.
Next, you can perform a manual synchronization of group BC, to replicate the object within that group.
This would be a workable scenario if your practice was to keep a "master" HA group, and to temporarily make a member of that group also a member of a separate, unpopulated HA group that you were preparing to deploy. Immediately after propagating the required objects to the "child" group, you would break the connection (by changing the HA configuration of the second group to no longer include HSM B), so that the child group existed independently of the master group and was no longer affected by it. This could be repeated for each new HA group that you wished to deploy. You would add fresh (unpopulated) members to the separated child group and perform synchronization. The child group members would then take on the contents that existed in HSM C at the time HSM B (the member formerly common to both groups) was removed from the child group. You would now have two completely independent HA groups, parent and child, with identical contents.
Where a problem might arise is if you choose to not break the connection between the original [or master or "golden"] group and the other groups.
You can propagate one or more objects, using HA, but you cannot propagate deletion if all members of the current group are not present at the time of the deletion. This could show up in two scenarios:
•If a group member is off-line for any reason when an object is deleted from the group, the missing member does not have the deletion, and therefore retains the object. If that member later rejoins the group, the next synchronization causes the unwanted item to be replicated from the rejoined member to the rest of the group, undoing the deletion.
•If different groups remain connected by a common member, they could continue to replace/restore objects that you delete from any one group, as long as the object exists in the other shared group. This could result in the proliferation of outdated crypto objects that you do not wish to retain.
The original question (above) was asked in the context of using HA to replicate objects to distant HSMs. Thus, at least one member of the target group would be remotely located. This is perfectly fine in the HA context, except that network/Internet latency would likely have an impact on performance - another reason to minimize situations where an HSM is a member of more than one HA group.
The client library drops the member and continues with others. It will try to reconnect that member at a minimum retry rate of once per minute (configurable) for the number of times specified in the configuration file, and then stop trying that member. You can specify a number of retries from 3 to an unlimited number.
You must manually remove a member using “vtl”. If the device cannot be contacted, the HA client merely stops trying it (see "retries" in the previous question), but the device remains a group member until manually removed.
No, you cannot. Think of starting the application as starting a race. You cannot add in a new runner once the race is already under way. But, if you restart the race, you can.
The “haonly” flag shows only HA slots (virtual slots) to the client applications. It does not show the physical slots. We recommend that you use "haonly", unless you have particular reason for not using it. Having "haonly" set is the proper way for clients to deal with HA groups - it prevents the possible confusion of having both physical and virtual slots available.
Recall that automatic replication/synchronization across the group occurs only if you cause a change (keygen or other addition, or a deletion) via the virtual HA slot. If you/your application changes the content of a physical slot, this results in the group being out-of-sync, and requires a manual re-sync to replicate a new object across all members. Similarly, if you delete from a physical slot directly, the next manual synchronization will cause the deleted object to be repopulated from another group member where that object was never deleted. Therefore, to perform a lasting deletion from a single physical slot (if you choose not to do it via the virtual slot) requires that you manually delete from every physical slot in the group, or risk your deleted object coming back.
Also, from the perspective of the Client, a member of the HA group can fail and, with "haonly" set, the slot count does not change. If "haonly" is not set, and both virtual and physical slots are visible, then failure of unit number 1 causes unit number 2 to become slot 1, and so on. That could cause problems if your application is not designed to deal gracefully with such a change.
This is as designed. You originally had your application running with X number of members. One failed, but was not removed from the group, so retries were occurring, but the application was operating with X-1 members available. Then you restarted. When the application came up after that restart, it saw only X-1 members. Having just started, it now has no notion that the Xth member exists. The "race" has restarted with X-1 runners. You cannot add to that number within an application. To go from the number that the application now recognizes, X-1, to the new, larger number of participants X-1 +1 (or X), you must restart the application (the race) while all X members (runners) are available.
Yes. Fail-over and recovery HA logic are invoked.
In the following case, it is not so much that we explicitly support the scheme as that we did what we could, within the limits of Luna SA HA, to keep our hardware and software from hindering what the customer wished to accomplish.
Consider a customer with several Luna SA appliances in HA, serving an application that enables point-of-sale transactions. Necessarily, the customer wanted maximum possible up-time. The problem was to maintain such up-time even if all the members of the HA group went off-line, or needed maintenance, etc.
The eventual deployed scheme involved TWO HA groups - the principal operational group and a backup group.
To ensure uninterrupted up-time, the backup group included all the members of the principal group. The customer provided the fail-over mechanism to switch groups.