Backups
At some point, you may want to make a backup of everything in Hadoop that is related to Data Discovery and Classification. Such a backup includes the HBase tables and all the generated files (.tar) with the information of the scans (HDFS).
To save the DDC data and create a backup, you have to perform these two steps (separately):
Back up HDFS
Back up HBase
Note
HBase backup is only necessary if you use the Intelligent Remediation functionality or have legacy scans or reports from CipherTrust Manager 2.2. Otherwise, you can skip it.
Preparing for the Backup
Execute the df -h command on the HDFS directory to be copied and make sure that there is enough space in the destination location.
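For example (the paths below are placeholders for your own environment), you can check the free space on the destination filesystem and the size of the DDC data directory:
# Free space on the local filesystem that will hold the backup
df -h /backup
# Overall HDFS capacity and usage
hdfs dfs -df -h /
# Size of the DDC data directory (replace with your configured path)
hdfs dfs -du -s -h /ciphertrust_ddc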
Note
- It is not necessary to make the backup on each node, but it is good practice to make the backup on the name node.
- You need to have root permissions.
- To run Export/Import commands, you need to switch to the hdfs user (see the example after this note).
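For example, assuming you have root or sudo access on the node, you can switch to the hdfs user like this:
# Switch to the hdfs user (requires root or sudo privileges)
sudo su - hdfs
# Confirm the current user
whoami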
Backup Approaches
HDFS
The best way to create an HDFS backup on a different cluster is to use DistCp. Please refer to HDFS Backup/Restore.
HBase
There are different approaches to creating HBase backups. The recommended ones are:
Make and restore a snapshot: this is the recommended approach. Please refer to HBase Backup/Restore using Snapshots. For deeper details, refer to the Snapshots section in the Cloudera blog "Approaches to Backup and Disaster Recovery in HBase".
Full shutdown backup (stopping the services): it is possible to make a complete backup by stopping the services and using the distcp command. For details, see the HBase documentation on Full Shutdown Backup. A rough sketch of this approach is shown below, after this list.
Export and import tables: please refer to HBase Backup/Restore using Export. For more details, refer to the Export section in the Cloudera blog "Approaches to Backup and Disaster Recovery in HBase".
Caution
This option does not produce a clone of the table, and inconsistencies could appear.
For more information about these options, refer to the official Cloudera documentation.
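As a rough sketch of the full shutdown approach (the HBase root directory /hbase and the host name are assumptions; check the hbase.rootdir value in your configuration), you would stop HBase and then copy its root directory with distcp:
# Stop HBase through your cluster manager before copying its root directory
# Copy the HBase root directory to a backup location (host and paths are placeholders)
hadoop distcp \
hdfs://tdp.contoso.com:8020/hbase \
hdfs://tdp.contoso.com:8020/backup/hbase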
HDFS Backup/Restore
The best way to create an HDFS backup on a different cluster is to use DistCp.
Create HDFS Backup
The most common use of DistCp is an inter-cluster copy:
hadoop distcp \
hdfs://nn1:8020/source \
hdfs://nn2:8020/destination
Where:
nn1 is the name node where your data is located
the source is the folder where the .tar files are (that is, the folder indicated in the path field when HDFS is configured in DDC)
nn2 is the name node where you want to save your data (nn2 can be the same as nn1)
the destination is the folder to which the .tar files will be copied and saved
Example command:
hadoop distcp \
hdfs://tdp.contoso.com:8020/ciphertrust_ddc \
hdfs://tdp.contoso.com:8020/backup/ciphertrust_ddc
You can find more information on using DistCp on the official webpage or on the Cloudera blog.
Note
The destination folder should be created before executing the distcp command (see the example after this note). Note that DistCp requires absolute paths.
These actions are performed on the active NameNode as the hdfs user.
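For example, to create the destination folder used in the example above (run as the hdfs user):
# Create the backup destination folder before running distcp
hdfs dfs -mkdir -p /backup/ciphertrust_ddc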
Restore HDFS Backup
To restore a backup, you also use the DistCp command; the arguments are simply reversed compared to the backup command:
hadoop distcp \
hdfs://nn2:8020/destination \
hdfs://nn1:8020/source
Where:
nn1 is the name node where your data is located
the source is the folder where the .tar files are (that is, the folder indicated in the path field when HDFS is configured in DDC)
nn2 is the name node where you want to save your data (nn2 can be the same as nn1)
the destination is the folder to which the .tar files will be copied and saved
Example command:
hadoop distcp \
hdfs://tdp.contoso.com:8020/backup/ciphertrust_ddc \
hdfs://tdp.contoso.com:8020/ciphertrust_ddc
Note
If a file already exists in the destination, it is skipped. New files (not yet backed up) will still be there, and deleted files will be restored.
If you want to completely restore the folder (and only keep the files that were there when the copy was made), you have to execute a command to delete the files first; this action is left to your discretion (see the sketch after this note).
These actions must be performed on the NameNode as the hdfs user.
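If you need an exact restore, one possible sketch (using the same example paths as above) is to let DistCp mirror the backup onto the live folder: the -update option copies changed files, and -delete removes files in the target that are not present in the backup.
# Mirror the backup onto the live folder; -delete removes files not present in the backup
hadoop distcp -update -delete \
hdfs://tdp.contoso.com:8020/backup/ciphertrust_ddc \
hdfs://tdp.contoso.com:8020/ciphertrust_ddc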
HBase Backup/Restore using Snapshots
If you do not want to stop the services, you can use the HBase snapshot, list_snapshots, and restore_snapshot commands.
To run these commands, open an SSH session on the HBase Master node. Start by listing all the tables of the original HBase. Note that DDC_SCHEMA1 is the schema name defined in the PQS configuration (PQS tab of Hadoop Services in Settings).
hbase shell
$ hbase(main):001:0> list
TABLE
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT
DDC_SCHEMA1:DATA_OBJECT_REPORT
DDC_SCHEMA1:SCAN_EXECUTION_REPORT
SYSTEM:CATALOG
SYSTEM:FUNCTION
SYSTEM:LOG
SYSTEM:MUTEX
SYSTEM:SEQUENCE
SYSTEM:STATS
9 row(s)
Creating a Backup
Take a snapshot for each table. To do that, execute the HBase Take a Snapshot Shell command. Example:
snapshot 'myTable', 'myTableSnapshot-1'
Take snapshots with the commands below:
$ hbase(main):002:0> snapshot 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT', 'myTableSnapshot-datastore_summary_report'
$ hbase(main):003:0> snapshot 'DDC_SCHEMA1:DATA_OBJECT_REPORT', 'myTableSnapshot-data_object_report'
$ hbase(main):004:0> snapshot 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT', 'myTableSnapshot-scan_execution_report'
and list the snapshots:
$ hbase(main):005:0> list_snapshots
SNAPSHOT TABLE + CREATION TIME
myTableSnapshot-datastore_summary_report DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT (2020-11-24 06:36:13 -0800)
myTableSnapshot-data_object_report DDC_SCHEMA1:DATA_OBJECT_REPORT (2020-11-24 06:37:05 -0800)
myTableSnapshot-scan_execution_report DDC_SCHEMA1:SCAN_EXECUTION_REPORT (2020-11-24 06:37:12 -0800)
3 row(s)
Restoring the Backup
Restore the snapshot for the tables that you want. For details, check the official HBase documentation on restoring a snapshot.
Restore the backup by executing the commands as follows:
$ hbase(main):006:0> disable 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT'
$ hbase(main):007:0> restore_snapshot 'myTableSnapshot-datastore_summary_report'
$ hbase(main):008:0> enable 'DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT'
$ hbase(main):009:0> disable 'DDC_SCHEMA1:DATA_OBJECT_REPORT'
$ hbase(main):010:0> restore_snapshot 'myTableSnapshot-data_object_report'
$ hbase(main):011:0> enable 'DDC_SCHEMA1:DATA_OBJECT_REPORT'
$ hbase(main):012:0> disable 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT'
$ hbase(main):013:0> restore_snapshot 'myTableSnapshot-scan_execution_report'
$ hbase(main):014:0> enable 'DDC_SCHEMA1:SCAN_EXECUTION_REPORT'
HBase Backup/Restore using Export
The HBase tables can be exported to HDFS using the export command. After that, you can use DistCp to store the data somewhere else (see the sketch after the export commands below).
To export the tables related to DDC, you need to know the schema that you are using in DDC. The tables are the same for each schema: DATASTORE_SUMMARY_REPORT, DATA_OBJECT_REPORT, and SCAN_EXECUTION_REPORT.
How to use Export
First, list all the tables of the origin HBase using the shell:
hbase shell
$ hbase(main):001:0> list
TABLE
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT
DDC_SCHEMA1:DATA_OBJECT_REPORT
DDC_SCHEMA1:SCAN_EXECUTION_REPORT
SYSTEM:CATALOG
SYSTEM:FUNCTION
SYSTEM:LOG
SYSTEM:MUTEX
SYSTEM:SEQUENCE
SYSTEM:STATS
9 row(s)
$ hbase(main):001:0> quit
Export the tables to an HDFS directory by executing the HBase export command:
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
<tablename> <outputdir>
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_datastore_summary_report
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
DDC_SCHEMA1:DATA_OBJECT_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_data_object_report
bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
DDC_SCHEMA1:SCAN_EXECUTION_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_scan_execution_report
Note
The output directory must not exist.
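Because the exported data now lives in HDFS, you can optionally copy it to another cluster with DistCp, as mentioned above (nn2.contoso.com is a placeholder for the backup cluster's NameNode):
# Copy the exported HBase tables to a different cluster (hosts and paths are placeholders)
hadoop distcp \
hdfs://tdp.contoso.com:8020/ddc_backup/hbase \
hdfs://nn2.contoso.com:8020/ddc_backup/hbase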
How to use Import
If you want to restore the same tables in the same schema, you can use the HBase import command.
Import the tables from the HDFS directory where the previous export is located. To do that, execute the HBase import command:
bin/hbase org.apache.hadoop.hbase.mapreduce.Import \
<tablename> <inputdir>
bin/hbase org.apache.hadoop.hbase.mapreduce.Import \
DDC_SCHEMA1:DATASTORE_SUMMARY_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_datastore_summary_report
bin/hbase org.apache.hadoop.hbase.mapreduce.Import \
DDC_SCHEMA1:DATA_OBJECT_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_data_object_report
bin/hbase org.apache.hadoop.hbase.mapreduce.Import \
DDC_SCHEMA1:SCAN_EXECUTION_REPORT \
hdfs:///ddc_backup/hbase/ddc_schema1_scan_execution_report
Note
The import utility replaces the existing rows, but it does not clone the table: rows added after the export are kept.
There may be inconsistencies in the data, especially in the latest reports.