Hadoop
This section covers the following topics:
Prerequisites
Supported version: Hadoop 2.7.3 and above
Considerations and Requirements:
DataNode: A node that stores the data blocks distributed by the Hadoop Distributed File System (HDFS). DataNodes are treated as “slaves” in a Hadoop cluster.
NameNode: A node that maintains the index of directories and files and manages data blocks stored on DataNodes is called a NameNode. A NameNode is treated as “master” in a Hadoop cluster.
Hadoop is not supported for Debian-based Linux agents.
To scan a Hadoop cluster with HDFS, you must have:
A Target NameNode running Apache Hadoop 2.7.3, Cloudera Distribution for Hadoop (CDH), or similar.
A Proxy host running the Linux 3 Agent with database runtime components for Linux systems.
A valid Kerberos ticket if Kerberos authentication is enabled. See Generate Kerberos authentication ticket.
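Before adding the data store, it can help to verify these prerequisites from the command line. A minimal sketch, assuming a hypothetical NameNode hadoop-nn1.example.com and the default port 8020:

# On the NameNode: confirm the Hadoop version is 2.7.3 or above
hadoop version

# On the proxy host: confirm the NameNode port is reachable
nc -vz hadoop-nn1.example.com 8020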
Add Hadoop data store
To add the Hadoop data store:
Log on to the CipherTrust Manager GUI.
Open the Data Discovery and Classification application.
Click Data Stores > Data Stores > Add Data Store. The Add Data Store screen is displayed.
Complete the following steps:
Select Type & Category
Under Select Data Store Category, select Big Data.
From Select Server Type, select Hadoop Cluster.
Click Next.
General Info
Specify the following details:
Data Store Name: Name for the data store.
Description (Optional): Description for the data store.
Location Name: Location of the data store.
Add Location: Click Add Location to add new locations to the Location Name drop-down. Refer to Adding Locations for detailed steps.
Sensitivity Level (Optional): Sensitivity level for the data store. Refer to Sensitivity Levels for details.
Enable Data Store: Select the check box to enable the newly added data store.
Click Next.
Configure Connection
Specify the credentials of the Hadoop domain:
Hostname/IP: The hostname or IP address of the Hadoop cluster's active NameNode. Specify a valid hostname, IP address, or Uniform Resource Identifier (URI). The hostname must be longer than two characters. For example, if your HDFS share path is hdfs://hadoop-server-name/share-name, the hostname of the NameNode is hadoop-server-name.
Port: The port on which the NameNode is accessed. Default is 8020. This is a mandatory field.
(Optional) In the Add Label field, enter a label. You can also remove an existing label.
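If you are unsure of the NameNode hostname and port, one way to check is to print fs.defaultFS from the Hadoop configuration on a cluster node. A minimal sketch, assuming the hdfs client is installed there:

# Print the default filesystem URI, for example hdfs://hadoop-server-name:8020
hdfs getconf -confKey fs.defaultFS

The host and port in the printed URI are the values to enter in this step.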
Click Next.
Add Access Control & Tags
(Optional) Grant the All groups (default) access for reports. Alternatively, select a group.
Click Save.
The data store is added to the Data Stores page. If the Ready to Scan column shows Ready, the data store is properly configured.
For more information on Access control and Tags, expand the section below.
Access Control & Tags
The Access Control & Tags tab on the Add Data Store screen allows you to grant access rights to your data store and add tags. More details below:
ACCESS CONTROL - select user groups that can access the data store. Access to a data store provides the ability to see reports that include scans of that data store. The available options are:
All groups: All groups of users can access the data store through reports. This is the default setting.
Selected group/s: Specified user-defined groups can access the data store through reports. When this option is selected, select a group from the drop-down list. The list shows existing user-defined groups, which must already exist on CipherTrust Manager; if none exist, ask the administrator to create one. If needed, you can select multiple groups. Start typing the name of the desired group and select from the suggested groups.
TAGS - Select a tag from the Add Tag drop-down. See the list of prebuilt tags in the Predefined tags section.
Tip
New tags can also be added. Start typing a new tag, and click the New: <new_tag> link that appears below the drop-down.
Add as many tags as needed.
To remove a tag, click the close icon in the tag name.
Add Hadoop scan
To add a scan for the Hadoop data store:
Open the Data Discovery and Classification application.
Click Scans > Add Scan. The Add Scan screen is displayed.
Complete the following steps:
Refer to Scans for a description of the sections of the Add Scan screen.
General Info
Specify a Name for the scan.
(Optional) Add a Description for the scan.
Expand Advanced Configuration and specify advanced configurations such as Scan Priority, Memory Usage Limit, and Amount of Data Object Volume. Refer to Advanced Configuration for details.
Click Next.
Select Data Stores
Under Data Store Name, select the desired data store that is Ready for scanning. You can select multiple data stores, if required.
Click Next.
Add Targets
To add a scan target, do one of the following:
Add target path manually.
In the Add Target field, specify the correct target path and click Apply.
If no specific target is added, the entire data store will be scanned.
Navigate and add target paths.
Click Browse to navigate target paths from the root level.
Alternatively, provide an initial path in the Add Target Path field and click Browse to navigate targets from that point onward.
In the left pane, navigate and select the desired target path.
Click Add Path to add the target path to the right pane. Similarly, add other target paths.
Click Add.
Tip
Either navigate the target paths from the root level (without specifying any path in the Add Target Path field) or make sure you provide the correct path to navigate further locations within it.
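Target paths for a Hadoop data store are HDFS paths. To check a path before adding it, you can list it with the HDFS client on a cluster node; a minimal sketch, assuming a hypothetical /user/data directory:

# Verify that the candidate target path exists in HDFS
hdfs dfs -ls /user/data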
Click Next.
Select Profiles
Under Classification Profile Name, select the desired classification profiles to search for in the data store. You can select multiple classification profiles, if required. Refer to Classification Profiles for details on classification profiles.
Click Next.
Add Filters
This step is optional.
Select the desired filter from the Select Filter drop-down list.
To filter the locations to scan in a Hadoop data store, consider the following syntax.
Note
Exclude Path/DO by prefix, suffix, and expression filters support wildcard characters. See Using wildcard characters to learn how wildcards work.
Exclude Path/DO by prefix: Excludes paths or data objects that begin with a given string. It can be used to exclude entire directory trees. Specify <string>.
Exclude Path/DO by suffix: Excludes paths or data objects that end with a given string. Specify <string>.
Exclude Path/DO by expression: Excludes paths or data objects that match the given expression. This filter is mainly used with wildcard characters. Specify <string>. For example, to exclude locations that contain 'blob' in their path, use the expression *blob*.
Include DO modified recently: Includes data objects modified within N days from the current date, where N ranges from 1 to 99. After selecting this filter, specify Days from current date.
Exclude DO greater than size: Excludes data objects that are larger than a given file size (in MB). After selecting this filter, specify the file size in MB.
Include DO's within modification date: Includes data objects modified within a given range of dates. After selecting this filter, specify Start and End dates.
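As an illustration of how the wildcard expression filter matches, the following local shell sketch mirrors glob-style matching against a few hypothetical paths; it is an illustration only, not part of the product:

# Paths containing 'blob' match the expression *blob* and would be excluded
for path in /data/blobstore/a.csv /data/archive/b.parquet /tmp/blob.log; do
  case "$path" in
    *blob*) echo "excluded: $path" ;;
    *)      echo "scanned:  $path" ;;
  esac
done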
Click Apply.
Repeat the above steps to apply multiple filters. Click Remove to remove any applied filter.
Click Next.
Schedule Run
Specify the scan run frequency. The two options are:
Manual: This is the default option. Select this option to run the scan manually. Select the Run Now check box to start the scan run after you save the changes.
Scheduled: Select this option to configure the scan to run automatically at the specified time.
Refer to Schedule Scan for more details on scheduling scan runs.
Click Save.
Generate Kerberos authentication ticket
To generate a Kerberos authentication ticket for your HDFS cluster, run these commands in a terminal on the designated Proxy Agent host.
Check if a valid Kerberos ticket has been issued for the principal user.
klist
Generate a Kerberos ticket as a principal user.
kinit <username>@<domain>
For example:
kinit DDCuser@example.com
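Run klist again to confirm that the ticket was issued and to see its expiry time. If the principal authenticates with a keytab rather than a password, a non-interactive alternative (the keytab path below is illustrative) is:

# Obtain a ticket from a keytab, then verify it
kinit -kt /etc/security/keytabs/ddcuser.keytab DDCuser@example.com
klist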