HA Recovery

HA recovery is either the automatic (hands-off) return of a failed member to its HA group or, if autorecovery has not been enabled, the manual re-introduction of that member. A member might drop out of the group for reasons such as:

- the appliance loses power (but regains power in less than the 2 hours that the HSM preserves its activation state)

- the network link from the unit is lost and then regained.

HA recovery takes place if all of the following are true:

- HA autorecovery is enabled, or you detect a unit failure and manually re-introduce the unit (or its replacement)

- the HA group has at least 2 nodes

- the HA node is reachable (connected) at client startup

- the HA node recovery retry limit has not been reached; once it has, manual recovery is the only way to bring back the downed connection(s).

If all HA nodes fail (no links from the client), no recovery is possible.
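The conditions above can be read as a simple decision rule. The following Python sketch is illustrative only: the HaGroupState class, its field names, and the recovery_mode function are hypothetical and are not part of the client library or of vtl. It assumes that the retry limit restricts only automatic recovery, as the list above suggests.

```python
from dataclasses import dataclass


@dataclass
class HaGroupState:
    """Hypothetical snapshot of an HA group as seen by one client."""
    autorecovery_enabled: bool
    member_count: int             # members configured in the group
    reachable_members: int        # members the client can still reach
    member_seen_at_startup: bool  # failed member was connected at client startup
    retries_used: int             # automatic recovery attempts made so far
    retry_limit: int              # configured maximum number of attempts


def recovery_mode(state: HaGroupState) -> str:
    """Return "automatic", "manual", or "none" for a failed member."""
    # A group needs at least 2 nodes for recovery to apply.
    if state.member_count < 2:
        return "none"
    # If every node has failed, nothing can be recovered.
    if state.reachable_members == 0:
        return "none"
    # The member must have been reachable when the client started.
    if not state.member_seen_at_startup:
        return "none"
    # Autorecovery applies only while the retry limit has not been reached.
    if state.autorecovery_enabled and state.retries_used < state.retry_limit:
        return "automatic"
    # Otherwise an operator must re-introduce the member by hand.
    return "manual"


if __name__ == "__main__":
    print(recovery_mode(HaGroupState(True, 2, 1, True, 0, 5)))
```

In this model, exhausting the retry limit does not rule out recovery; it only moves the member from automatic to manual recovery.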

The HA recovery logic in the library makes its first attempt at recovering a failed member when your application makes a call to its HSM (the group). That is, an idle client does not start the recovery-attempt process.

A busy client, on the other hand, would notice a slight pause every minute as the library attempts to recover a dropped HA group member (or members). The attempts continue until the member has been reinstated or until the retry limit has been reached, at which point the library stops trying. Therefore, set the number of retries to suit your normal situation (for example, the kinds and durations of network interruptions you experience).
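The behavior described above (attempts triggered only by application calls, spaced about a minute apart, and bounded by a retry limit) can be modeled with a small sketch. This is not the library's implementation: the RecoveryScheduler class and its method names are hypothetical, and the 60-second interval is taken only from the "every minute" wording above.

```python
class RecoveryScheduler:
    """Hypothetical model of call-driven, rate-limited recovery attempts."""

    def __init__(self, retry_limit, interval_s=60.0):
        self.retry_limit = retry_limit
        self.interval_s = interval_s
        self.attempts = 0
        self.last_attempt = None  # time of the most recent attempt, if any
        self.recovered = False

    def on_application_call(self, now, try_recover):
        """Run on each application call; return True if an attempt was made."""
        # Stop once the member is back or the retry limit is used up.
        if self.recovered or self.attempts >= self.retry_limit:
            return False
        # Space attempts at least interval_s apart.
        if self.last_attempt is not None and now - self.last_attempt < self.interval_s:
            return False
        self.attempts += 1
        self.last_attempt = now
        self.recovered = bool(try_recover())
        return True


def simulate(retry_limit, call_times, member_is_back):
    """Feed a series of call timestamps (in seconds) to a fresh scheduler."""
    scheduler = RecoveryScheduler(retry_limit)
    for t in call_times:
        scheduler.on_application_call(t, member_is_back)
    return scheduler


if __name__ == "__main__":
    # A busy client calling every 10 seconds for 5 minutes while the member stays down.
    busy = simulate(3, range(0, 301, 10), lambda: False)
    print(busy.attempts, busy.recovered)
    # An idle client makes no calls, so no attempt is ever started.
    idle = simulate(3, [], lambda: False)
    print(idle.attempts, idle.recovered)
```

In this model, an idle client never starts a recovery attempt, and a busy client stops trying once the retry limit is used up, which is why the limit should reflect the outages you typically see.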

HA Autorecovery vs Manual Recovery

Autorecovery is not on by default. It must be explicitly enabled with the vtl haAdmin -autorecovery command.

Use manual recovery whenever you have multiple processes or clients sharing a partition. Using automatic recovery with multiple processes sharing a partition could lead to a collision.

For practical steps to replace a failed HA group member, see "HA Replacing a Failed Luna SA".