Backups
At some point, you may want to make a backup of everything in Hadoop that is related to Data Discovery and Classification. Such a backup includes the HBase tables and all the generated files (.tar) with the information of the scans (HDFS).
To save the DDC data and create a backup, you have to perform these two steps (separately):
Back up HDFS
Back up HBase
Note
HBase backup is only necessary if you use the Intelligent Remediation functionality or have legacy scans or reports from CipherTrust Manager 2.2. Otherwise, you can skip it.
Preparing for the Backup
Execute the df -h command on the HDFS directory to be copied and make sure that there is enough space in the destination location.
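For example (the paths below are placeholders for your own environment), you can check the free space on the destination filesystem and the size of the DDC data directory:
# Free space on the local filesystem that will hold the backup
df -h /backup
# Overall HDFS capacity and usage
hdfs dfs -df -h /
# Size of the DDC data directory (replace with your configured path)
hdfs dfs -du -s -h /ciphertrust_ddc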
Note
- It is not necessary to make the backup on each node, but it is good practice to make the backup on the name node.
- You need to have root permissions.
- To run Export/Import commands, you need to switch to the hdfs user (see the example after this note).
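For example, assuming you have root or sudo access on the node, you can switch to the hdfs user like this:
# Switch to the hdfs user (requires root or sudo privileges)
sudo su - hdfs
# Confirm the current user
whoami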
Backup Approaches
HDFS
The best way to create an HDFS backup on a different cluster is to use DistCp. Please refer to HDFS Backup/Restore.
HBase
There are different approaches to creating HBase backups. The recommended ones are:
Make and restore a snapshot: this is the recommended approach. Please refer to HBase Backup/Restore using Snapshots. For deeper details, refer to the Snapshots section in the Cloudera blog "Approaches to Backup and Disaster Recovery in HBase".
Full shutdown backup (stopping the services): it is possible to make a complete backup by stopping the services and using the distcp command. For details, see the HBase documentation on Full Shutdown Backup. A rough sketch of this approach is shown below, after this list.
Export and import tables: please refer to HBase Backup/Restore using Export. For more details, refer to the Export section in the Cloudera blog "Approaches to Backup and Disaster Recovery in HBase".
Caution
This option does not produce a clone of the table, and inconsistencies could appear.
For more information about these options, refer to the official Cloudera documentation.
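As a rough sketch of the full shutdown approach (the HBase root directory /hbase and the host name are assumptions; check the hbase.rootdir value in your configuration), you would stop HBase and then copy its root directory with distcp:
# Stop HBase through your cluster manager before copying its root directory
# Copy the HBase root directory to a backup location (host and paths are placeholders)
hadoop distcp \
hdfs://tdp.contoso.com:8020/hbase \
hdfs://tdp.contoso.com:8020/backup/hbase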
HDFS Backup/Restore
The best way to create an HDFS backup on a different cluster is to use DistCp.
Create HDFS Backup
The most common use of DistCp is an inter-cluster copy:
hadoop distcp \
hdfs://nn1:8020/source \
hdfs://nn2:8020/destination
Where:
nn1 is the name node where your data is located
the source is the folder where the .tar files are (that is, the folder indicated in the path field when HDFS is configured in DDC)
nn2 is the name node where you want to save your data (nn2 can be the same as nn1)
the destination is the folder to which the .tar files will be copied and saved
Example command:
hadoop distcp \
hdfs://tdp.contoso.com:8020/ciphertrust_ddc \
hdfs://tdp.contoso.com:8020/backup/ciphertrust_ddc
You can find more information on using DistCp on the official webpage or on the Cloudera blog.
Note
The destination folder should be created before executing the distcp command (see the example after this note). Note that DistCp requires absolute paths.
These actions are performed on the active NameNode as the hdfs user.
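For example, to create the destination folder used in the example above (run as the hdfs user):
# Create the backup destination folder before running distcp
hdfs dfs -mkdir -p /backup/ciphertrust_ddc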
Restore HDFS Backup
To restore a backup, you also use the DistCp command; the arguments are simply reversed compared to the backup command:
hadoop distcp \
hdfs://nn2:8020/destination \
hdfs://nn1:8020/source
Where:
nn1 is the name node where your data is located
the source is the folder where the .tar files are (that is, the folder indicated in the path field when HDFS is configured in DDC)
nn2 is the name node where you want to save your data (nn2 can be the same as nn1)
the destination is the folder to which the .tar files will be copied and saved
Example command:
hadoop distcp \
hdfs://tdp.contoso.com:8020/backup/ciphertrust_ddc \
hdfs://tdp.contoso.com:8020/ciphertrust_ddc
Note
If a file already exists in the destination, it is skipped. New files (not yet backed up) will still be there, and deleted files will be restored.
If you want to completely restore the folder (and only keep the files that were there when the copy was made), you have to execute a command to delete the files first; this action is left to your discretion (see the sketch after this note).
These actions must be performed on the NameNode as the hdfs user.
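If you need an exact restore, one possible sketch (using the same example paths as above) is to let DistCp mirror the backup onto the live folder: the -update option copies changed files, and -delete removes files in the target that are not present in the backup.
# Mirror the backup onto the live folder; -delete removes files not present in the backup
hadoop distcp -update -delete \
hdfs://tdp.contoso.com:8020/backup/ciphertrust_ddc \
hdfs://tdp.contoso.com:8020/ciphertrust_ddc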
HBase Backup/Restore using Snapshots
If you do not want to stop the services, you can use the HBase snapshot, list_snapshots, and restore_snapshot commands.
To run these commands, open an SSH session on the HBase Master node. Start by listing all the tables of the original HBase. Note that DDC_SCHEMA1 is the schema name defined in the PQS configuration (PQS tab of Hadoop Services in Settings).
hbase shell
$ hbase(main):001:0> list
TABLE
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT
DDC_SCHEMA1:DATA_OBJECT_REPORT
DDC_SCHEMA1:SCAN_EXECUTION_REPORT
SYSTEM:CATALOG
SYSTEM:FUNCTION
SYSTEM:LOG
SYSTEM:MUTEX
SYSTEM:SEQUENCE
SYSTEM:STATS
9 row(s)
Creating a Backup
Take a snapshot for each table. To do that, execute the HBase Take a Snapshot Shell command. Example:
snapshot 'myTable', 'myTableSnapshot-1'
Take snapshots with the commands below:
$ hbase(main):002:0> snapshot 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT', 'myTableSnapshot-datastore_summary_report'
$ hbase(main):003:0> snapshot 'DDC_SCHEMA1:DATA_OBJECT_REPORT', 'myTableSnapshot-data_object_report'
$ hbase(main):004:0> snapshot 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT', 'myTableSnapshot-scan_execution_report'
and list the snapshots:
$ hbase(main):005:0> list_snapshots
SNAPSHOT TABLE + CREATION TIME
myTableSnapshot-datastore_summary_report DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT (2020-11-24 06:36:13 -0800)
myTableSnapshot-data_object_report DDC_SCHEMA1:DATA_OBJECT_REPORT (2020-11-24 06:37:05 -0800)
myTableSnapshot-scan_execution_report DDC_SCHEMA1:SCAN_EXECUTION_REPORT (2020-11-24 06:37:12 -0800)
3 row(s)
Restoring the Backup
Restore the snapshot for the tables that you want. For details, check the official HBase documentation on restoring a snapshot.
Restore the backup by executing the commands as follows:
$ hbase(main):006:0> disable 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT'
$ hbase(main):007:0> restore_snapshot 'myTableSnapshot-datastore_summary_report'
$ hbase(main):008:0> enable 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT'
$ hbase(main):009:0> disable 'DDC_SCHEMA1:DATA_OBJECT_REPORT'
$ hbase(main):010:0> restore_snapshot 'myTableSnapshot-data_object_report'
$ hbase(main):011:0> enable 'DDC_SCHEMA1:DATA_OBJECT_REPORT'
$ hbase(main):012:0> disable 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT'
$ hbase(main):013:0> restore_snapshot 'myTableSnapshot-scan_execution_report'
$ hbase(main):014:0> enable 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT'
HBase Backup/Restore using Export
The HBase tables can be exported to HDFS using the export command. After that, you can use DistCp to store the data somewhere else (see the sketch after the export commands below).
To export the tables related to DDC, you need to know the schema that you are using in DDC. The tables are the same for each schema: DATASTORE_SUMMARY_REPORT, DATA_OBJECT_REPORT, and SCAN_EXECUTION_REPORT.
How to use Export
First, list all the tables of the origin HBase using the shell:
hbase shell
$ hbase(main):001:0> list
TABLE
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT
DDC_SCHEMA1:DATA_OBJECT_REPORT
DDC_SCHEMA1:SCAN_EXECUTION_REPORT
SYSTEM:CATALOG
SYSTEM:FUNCTION
SYSTEM:LOG
SYSTEM:MUTEX
SYSTEM:SEQUENCE
SYSTEM:STATS
9 row(s)
$ hbase(main):001:0> quit
Export the tables to an HDFS directory by executing the HBase export command:
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
<tablename> <outputdir>
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_datastore_summary_report
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
DDC_SCHEMA1:DATA_OBJECT_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_data_object_report
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
DDC_SCHEMA1:SCAN_EXECUTION_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_scan_execution_report
Note
The output directory must not exist.
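Because the exported data now lives in HDFS, you can optionally copy it to another cluster with DistCp, as mentioned above (nn2.contoso.com is a placeholder for the backup cluster's NameNode):
# Copy the exported HBase tables to a different cluster (hosts and paths are placeholders)
hadoop distcp \
hdfs://tdp.contoso.com:8020/ddc_backup/hbase \
hdfs://nn2.contoso.com:8020/ddc_backup/hbase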
How to use Import
If you want to restore the same tables in the same schema, you can use the HBase import command.
Import the tables from the HDFS directory where the previous export is located. To do that, execute the HBase import command:
bin/hbase org.apache.hadoop.hbase.mapreduce.Import \
<tablename> <inputdir>
bin/hbase org.apache.hadoop.hbase.mapreduce.Import \
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_datastore_summary_report
bin/hbase org.apache.hadoop.hbase.mapreduce.Import \
DDC_SCHEMA1:DATA_OBJECT_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_data_object_report
bin/hbase org.apache.hadoop.hbase.mapreduce.Import \
DDC_SCHEMA1:SCAN_EXECUTION_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_scan_execution_report
Note
The import utility replaces the existing rows, but it does not clone the table: rows added after the export are kept.
There may be inconsistencies in the data, especially in the latest reports.