Scans
DDC provides pattern-matching scan capabilities based on regular expressions. DDC ML, an extension of DDC, enhances DDC's scan capabilities by using semantic context-based discovery, classification of sensitive entities, and enabling similarity search. Refer to DDC ML Infotypes for infotypes supported by DDC ML.
DDC ML only supports local and network storage data stores.
You can manage scans using the Scans page. The Scans page can be accessed by clicking the Scans link in the Data Discovery sidebar on the left.
From the Scans page you can:
View all currently available scans. See Viewing scans.
Create a new scan. See Adding scans.
Run a scan manually. See Running scans.
Delete a scan. See Removing scans.
Modify an existing scan. See Editing scans.
View scan history. See Viewing scan history.
Create a copy of a scan. See Duplicating scans.
Viewing scans
The graphical view shows the following details:
Scan Age: Displays a pie chart representing the distribution of scans conducted over different periods.
Scanned Data Objects: Displays the count of scanned data objects, distinguishing between sensitive and non-sensitive types.
Click the refresh button to refresh the displayed information.
The list view of the Scans page shows the following details:
Item | Description |
---|---|
Scan Name | Name of the scan. Click the scan name to view scan configuration details. See Adding scans to understand details of the configuration section. |
Status | Status of the scan. For more information, see Scan Statuses. |
Duration | Time taken to complete the run. |
Last Scan | Time when the scan last ran. |
Schedule | Schedule of the scan. |
Profiles | Number of classification profiles. |
Tip
If you are planning to perform a CipherTrust Manager upgrade, make sure that you do not have scans in progress.
Use the Search text box to filter scans. Search results display scans that contain specified text in their names. By default, scans are listed in ascending order by name, but you can also sort them by last scan time, duration, and status.
Click the expand arrow next to the scan name to view additional scan details, including the number and list of data stores scanned, and the number of target paths. These details are available for Completed, Partially Completed, and Failed scans.
In the expanded panel, click View details to see the scan status for each target path in the data store. You can hover over the status to see the error message related to the scan failure. Click Close to return to the Scans page.
Adding scans
To add a scan, navigate to the Scans screen (Data Discovery > Scans). Click the +Add Scan button to open the Add Scan screen.
In the screen, you have to go over these configuration steps for each scan that you add:
General Info - Name the scan, provide a short description, and set the advanced configuration.
Select Data Stores - Select which data stores will be scanned.
Add Targets - Narrow down the scan scope by selecting specific scan targets.
Select Profiles - Choose which Classification Profile you want to scan for.
Apply Filters - Add a list of rules to filter some targets when the scan is launched.
Schedule Scan - Configure when you want your scan to run.
General Info
In the General Info screen, specify a unique name and short description for the scan.
Name - The name must be between 3 and 64 characters long.
Description - (optional) Description can be of up to 250 characters.
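The length limits above can be checked before a scan definition is submitted. The following is a minimal sketch, not part of DDC; the function name `validate_scan_metadata` is hypothetical, and the uniqueness of the name is not checked here.

```python
def validate_scan_metadata(name: str, description: str = "") -> list[str]:
    """Return a list of problems; an empty list means both values fit the documented limits."""
    problems = []
    # The name must be longer than two characters and at most 64 characters.
    if not 3 <= len(name) <= 64:
        problems.append("name must be 3 to 64 characters long")
    # The description is optional and limited to 250 characters.
    if len(description) > 250:
        problems.append("description must be at most 250 characters")
    return problems
```

For example, `validate_scan_metadata("ab")` reports a problem with the name, while `validate_scan_metadata("HR share weekly")` returns an empty list.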
Expand Advanced Configuration and specify the following details:
Parameter | Description |
---|---|
Scan Priority | Set the scan priority relative to other applications in terms of CPU utilization. You can select Low or Normal. The default setting is Low Priority. It applies only to local storage. If you want to increase the scan performance, set the scan priority to Normal. |
Content Supported (Not applicable for DDC ML) | Select the content type that the scan will process: • OCR: Scans images for sensitive data using Optical Character Recognition (OCR). By default, it is disabled (the scanning of images will be skipped). For more details, see Using OCR in Scans. • EBCDIC: Scans file systems that use IBM's EBCDIC encoding. By default, it is disabled. Note Use EBCDIC mode only if you are scanning IBM mainframes that use EBCDIC encoded file systems. This mode forces scanning of targets as EBCDIC encoded file systems, which means that it does not detect matches in non-EBCDIC encoded file systems. |
Trace Logs | Use the toggle switch to enable trace logs and capture detailed scan trace messages when scanning a target. By default, it is disabled. For more information, see Viewing Scan Log. Note • You need to run the scan again after enabling trace logs in order to download them. • Trace logs may take up a large amount of disk space, depending on the size and complexity of the scan, and may impact system performance. Enable this feature only for troubleshooting. • DDC supports trace logs for multiple scan statuses. Refer to Scan Statuses for the list of supported statuses and the supported trace log download formats. |
Enable Indexing | Use the toggle switch to enable or disable similarity search on the scan results. When it is enabled, similarity search can be performed on the scan results. This option becomes available after onboarding MLaaS. See Configuring MLaaS. |
Memory Usage Limit (MB) | Set the maximum memory usage (in MB) that the scanner service can use on the data store host. The default memory usage limit is 2048 MB. If you want to increase the scan performance, set the memory usage limit between 4 GB and 8 GB. |
Throughput (MBps) | Set the maximum I/O rate (in MBps) that the scanner service will use to read data from the data store. By default, it is set to 0 (for unlimited). |
Amount of Data Object Volume | Select the amount of data object volume, prioritizing either the number of data objects or the metadata per data object. Choose from: • Low - maximum metadata: Captures maximal detail per file. • Medium - core metadata: Balances quantity of files and matching detail in each file. • High - minimal metadata: Results in a more even spread of match data across a large quantity of files. This is the default option. |
Number of Rows (Only for Relational Database DS) | (Optional) Set the number of rows to scan in each relational database table. The maximum number of allowed rows for the database scan is 2147483647. The supported databases are: IBM DB2, Oracle, Microsoft SQL, PostgreSQL, SAP HANA, MySQL, and Teradata. Note • If you don't specify the number of rows to scan, the entire database will be scanned. The rows to scan in each table are selected in descending order of the primary key. • SAP HANA - Rows are selected in ascending order. • Teradata - Rows are selected randomly, if no primary key is defined. |
The Restore Defaults button resets the advanced settings to their default values. However, if you previously modified these settings for a scan and ran it with the changed configuration, the Restore Defaults button rolls back the changes to the last saved configuration. In other words, the Restore Defaults button only reverts the current modifications.
Click Next to move on to the Select Data Stores screen.
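The defaults and limits of the advanced configuration can be summarized in a small data structure. This is an illustrative sketch only: the class and field names are hypothetical and do not correspond to a DDC API, and the lower bounds (a positive memory limit, at least one row) are assumptions that the text does not state.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DB_ROWS = 2_147_483_647  # documented maximum number of rows (equals 2**31 - 1)


@dataclass
class AdvancedScanConfig:
    # Hypothetical field names; the default values follow the descriptions above.
    scan_priority: str = "Low"            # "Low" or "Normal"
    ocr: bool = False                     # disabled by default
    ebcdic: bool = False                  # disabled by default
    trace_logs: bool = False              # disabled by default
    memory_limit_mb: int = 2048           # default memory usage limit
    throughput_mbps: int = 0              # 0 means unlimited
    data_object_volume: str = "High"      # "Low", "Medium", or "High"
    rows_per_table: Optional[int] = None  # None means no row limit is set

    def problems(self) -> list[str]:
        """Return a list of settings that fall outside the documented choices."""
        found = []
        if self.scan_priority not in ("Low", "Normal"):
            found.append("scan_priority must be Low or Normal")
        if self.memory_limit_mb <= 0:  # assumed lower bound
            found.append("memory_limit_mb must be positive")
        if self.throughput_mbps < 0:
            found.append("throughput_mbps must be 0 (unlimited) or greater")
        if self.data_object_volume not in ("Low", "Medium", "High"):
            found.append("data_object_volume must be Low, Medium, or High")
        if self.rows_per_table is not None and not 1 <= self.rows_per_table <= MAX_DB_ROWS:
            found.append("rows_per_table must be between 1 and 2147483647")
        return found
```

A configuration created with no arguments, `AdvancedScanConfig()`, represents the documented defaults and reports no problems.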
Select Data Stores
The Select Data Stores screen lists all data stores in tabular form. By default, no data stores are selected. The table has three columns:
Data Store Name: Lists available data stores (with their number).
Type: The type of the data store, such as Local Storage, Network Share, etc.
Ready to Scan: Displays whether an agent is connected to the data store. In this column, you can also see whether the agent is ready (that is, whether the data store is ready).
Note
Ready/Not Ready data store: A scan cannot run unless there is an identified agent for every data store included in the scan. A data store that has an agent associated with it has the status Ready. If a scan contains a data store in the Not Ready state, the scan run will fail and display an error. If more than one data store in the Not Ready state is associated with a scan, the scan run will fail on the first data store that is in the Not Ready state and will not scan the remaining data stores.
Disabled/Enabled data store: You can manually deactivate a data store from the Data Store page. A deactivated data store has the status Disabled and is not scanned. A scan will run successfully (without an error) if it has at least one enabled data store and several disabled data stores associated with it. However, only the enabled data stores will be scanned. If all data stores in a scan are Disabled, the scan will not run at all.
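The rules in this note can be read as a simple decision procedure. The sketch below is one interpretation, not DDC's implementation: the function name and the tuple layout are hypothetical, and it assumes that Disabled data stores are skipped before readiness is checked, which the text does not state explicitly.

```python
def check_scan_can_run(data_stores):
    """Decide whether a scan can run.

    data_stores: list of (name, enabled, ready) tuples in scan order.
    Returns (outcome, blocking_store_name).
    """
    # Assumption: Disabled data stores are ignored entirely.
    enabled = [(name, ready) for name, is_enabled, ready in data_stores if is_enabled]
    if not enabled:
        return ("not run", None)  # every data store is Disabled
    for name, ready in enabled:
        if not ready:
            return ("failed", name)  # the run fails on the first Not Ready data store
    return ("ok", None)
```

For instance, a scan with one enabled, ready data store and one disabled data store yields `("ok", None)`, while a scan whose second enabled data store has no agent yields `("failed", <name of that data store>)`.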
To select a data store to scan:
Search for the desired data stores by specifying the search criteria in the Search box. The search results will be displayed in the table under it.
Select a data store for the scan by selecting the corresponding check box. Similarly, select multiple data stores, if needed.
Tip
Use the Selected only toggle switch to display only the selected data stores or all data stores (if the switch is off, all data stores are displayed).
Click Next to move on to the Add Targets screen.
Add Targets
In the Add Targets screen you can review a list of the data stores that you selected for the scan. By default, the scan will scan the entire data store, and this screen allows you to narrow down the scan scope by selecting specific targets for your selected data stores. The Add Targets screen is divided into three columns:
Data Store Name: Lists selected data stores.
Add Target: You can type in the complete target path in the field and add it to the scan parameters. Or, you can use the Browse button to navigate the target path from the root level or starting from an initial path and add it to the scan parameters. The scan will be performed only on the selected target paths.
Note
For SAP HANA, the full data store scan is not supported. You need to specify at least one path.
When adding Oracle and IBM DB2 targets, specify the table name exactly as in the database. Table names are case-sensitive.
Any scan target that you add must be valid, otherwise the scan will fail. For more information on what a valid scan target is, see Target Format Limitations.
For better performance, try running smaller scans and then generate a report that aggregates them. You can schedule a different scan per data store, per classification profile, or per subpath (such as folders and tables) of the original scan path.
You can scan emails in a Gmail label if you move the emails to the label - otherwise, they will be kept in your inbox. For the default system labels, Gmail creates some folders that do not match the label name. Please refer to the Gmail documentation to learn the right path to scan a particular system label.
In the case of a SharePoint Server data store with an API passwords file configured, you have to use an empty target path.
To perform a scan on an Office365: OneDrive for Business or Exchange Server data store you have to specify a scan target path. For details, see Target Format Limitations.
To add a scan target for a selected data store, do one of the following:
Type complete scan target path in the Add Target Path field and click Apply.
Navigate and add target paths.
Click Browse to navigate target paths from the root level. Alternatively, provide an initial path in the Add Target Path field and click Browse to navigate targets from that point onward.
Note
Paths are case sensitive. Providing incorrect initial paths may lead to unexpected results. See issues encountered while browsing target paths for more details.
Tip
Either navigate the target paths from the root level (without specifying any path in the Add Target Path field) or make sure you provide the correct path to navigate further locations within it.
In the left pane, navigate and select the desired target path.
Note
To view subfolders within the folder hierarchy of a SharePoint Online or SharePoint Server data store, select the folder name and click List.
Click Add Path to add the target path to the right pane. Similarly, add other target paths.
Click Add.
To remove a scan target for a selected data store:
Click the arrow button next to the data store name for which you want to remove a scan target.
Click the Remove link on the right of the scan target to remove it.
Once all targets are added, click Next to move to the Select Profiles screen.
Tip
Make sure that you do not have nested target paths in a scan for the same data store. This can affect the performance of the scan and you can get duplicated data in the reports.
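A list of target paths can be checked for nesting before it is entered. The following is a minimal sketch, assuming "/"-separated, case-sensitive paths; the helper name `find_nested_targets` is hypothetical, and other target formats (for example, database tables or mailboxes) would need their own parsing.

```python
from pathlib import PurePosixPath


def find_nested_targets(paths):
    """Return (outer, inner) pairs where the inner target lies inside the outer one."""
    parsed = [PurePosixPath(p) for p in paths]
    nested = []
    for outer in parsed:
        for inner in parsed:
            # inner.parents lists every ancestor directory of inner.
            if inner != outer and outer in inner.parents:
                nested.append((str(outer), str(inner)))
    return nested
```

For example, `find_nested_targets(["/data/hr", "/data/hr/2023", "/data/finance"])` flags `/data/hr/2023` as nested inside `/data/hr`.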
Select Profiles
The Select Profiles screen lists all classification profiles in tabular form. By default, no profiles are selected. The table has two columns:
Classification Profile Name: Lists available profiles. Items marked with a letter "T" are predefined classification profile templates. For more information about these templates, see Classification Profile Templates. The other items are custom classification profiles.
Sensitivity: Displays the sensitivity level assigned to this classification profile. See Sensitivity Levels for more information.
To select a classification profile for the scan:
Use the search box to specify the search criteria for the desired profiles. The search results are displayed in the table under it.
Select the check boxes corresponding to the profiles that you want to use for the scan.
Tip
Use the Selected only toggle switch to display only the selected classification profiles or all classification profiles (if the switch is 'off' all classification profiles are displayed).
Click Next.
Note
The Next button remains disabled if any of the selected classification profiles contains only the ML infotype. This button is enabled if all the selected classification profiles contain at least one built-in or custom infotype.
Apply Filters
In the Apply Filters screen you can add a list of rules to filter some targets when the scan is launched. By default, there are no filters applied, and in this step you can add specific rules which affect the data stores selected and their targets (if you specified any). You can configure as many filters as you want.
Click the Select Filter menu to expand it. The menu shows you the filters as follows:
Exclude Path/DO by prefix
This filter excludes search locations with paths that begin with a given string. It can be used to exclude entire directory trees. Example of such a filter: c:\windows\system32
Exclude Path/DO by suffix
This filter excludes search locations with paths that end with a given string. For example, entering led.jnl, excludes files and folders such as canceled.jnl, totaled.jnl.
Exclude Path/DO by expression
This filter excludes search locations by expression. Wildcards '*' and '?' can be used to form expressions for this filter. For example, *data.txt excludes files that end with "data.txt" in any path.
Include DO modified recently
Use this filter to include search locations modified within a given number of days from the current date. For example, enter 14 to display files & folders that have been modified not more than 14 days before the current date.
Exclude DO greater than size
This filter excludes files that are larger than a given file size (in MB).
Include DO's within modification date
Use this filter to include search locations modified within a given range of dates. Files and folders that fall outside of the range set by the selected start and end date are not scanned.
Note
The exclude Path/DO by prefix, suffix, and expression filters support wildcard characters. See Using wildcard characters to learn how wildcards work.
For each new filter added click Apply to save and apply its rules.
Click Next to move on to the Schedule screen.
Note
Filters are case insensitive. That is to say, if you have two directories, "TEST" and "test" and apply the filter */test both directories will be excluded as a result. The same goes for filenames.
Depending on the type of data store, some considerations should be taken into account. For more information, see individual data store pages at Discovering sensitive information.
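The behavior of the three exclude filters, including their case insensitivity, can be approximated as follows. This is an illustrative sketch rather than DDC's matching engine: the function name is hypothetical, it uses Python's `fnmatch` wildcard semantics for expressions (which may differ in detail from DDC's), and it treats prefix and suffix filters as plain strings without wildcards.

```python
from fnmatch import fnmatchcase


def is_excluded(path, prefixes=(), suffixes=(), expressions=()):
    """Return True if the path matches any exclude filter, ignoring case."""
    p = path.lower()  # filters are case insensitive, so compare lowercased text
    if any(p.startswith(x.lower()) for x in prefixes):
        return True
    if any(p.endswith(x.lower()) for x in suffixes):
        return True
    # '*' matches any run of characters and '?' matches a single character.
    return any(fnmatchcase(p, x.lower()) for x in expressions)
```

With these rules, the suffix `led.jnl` excludes `canceled.jnl`, and the expression `*/test` excludes both a `TEST` and a `test` directory, as described in the note above.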
Schedule Scan
Scans can be run either manually or automatically at a scheduled time. To configure this:
In the Schedule screen select the frequency with which you want the scan to run. The options are:
Manual: Select to run the scan manually. This is the default setting. In this case the scan will be run whenever you manually launch it from the Scans screen. For more information about running a scan manually, see Running Scans.
Automatic Scan Pause - Use this switch to schedule the time when a scan will pause. For example, you should pause all scans (by using the automatic scan pause) during working hours so they do not affect production servers. Next, use the Time Zone and Select days and time controls to set the days and time when the scan should pause.
Run Now - If you select this check box, the scan will be run just once after the scan is added successfully.
Scheduled: Select to specify a schedule for the run. The scan will be run automatically on the specified schedule. When Scheduled is selected, the following fields appear on the screen:
Increment: Select the increment pattern of the run. This is a mandatory field. The options are Daily, Weekly, and Monthly. By default, Daily is selected.
Every: Specify when the run should repeat. This is a mandatory field.
For example, if Daily is selected as Increment, enter 2 to run the scan once every two days. If Weekly is selected as Increment, enter 2 to run the scan once every two weeks. Similarly, if Monthly is selected as Increment, enter 2 to run the scan once every two months.
Time: Specify the time when the run should start. This is a mandatory field. Specify the time in 12-hour format.
Time Zone: Select a time zone from the drop-down list.
Starting: Specify the day when the schedule should start. This is a mandatory field. By default, Today is selected. To specify a particular start date, select On this date, click the calendar icon, and select the date.
Ending: Specify the day when the schedule should end. This is a mandatory field. By default, No End is selected. To specify a particular end date, select On this date, click the calendar icon, and select the date.
Automatic Scan Pause - Use this switch to schedule the time when a scan will pause. For example, you should pause all scans (by using the automatic scan pause) during working hours so they do not affect production servers. Next, use the Time Zone and Select days and time controls to set the days and time when the scan should pause.
Click Save to complete adding the scan.
As a result, the newly created scan appears on the Scans page. By default, scans are displayed in alphabetic order by name. Depending on the number of entries per page, you might need to navigate to other pages to view the newly created scan. By default, the Status of a newly created scan is Unscanned.
Note
If your CM system clock does not match the Agent's system clock, your scans will not run as scheduled, so it is highly recommended to set up an NTP server to synchronize the clocks. This can be configured in CM through Admin Settings > System > NTP. For details, refer to the Thales CipherTrust Manager Administrator Guide.
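The interaction of the Increment and Every fields of a scheduled scan can be illustrated with a short calculation. This sketch is not DDC's scheduler: the function name is hypothetical, it assumes that the first run falls on the Starting date, and it covers only the Daily and Weekly increments because the text does not say how Monthly handles differing month lengths.

```python
from datetime import date, timedelta


def upcoming_runs(start: date, increment: str, every: int, count: int = 3):
    """Return the first `count` run dates for a Daily or Weekly schedule."""
    days_per_unit = {"Daily": 1, "Weekly": 7}[increment]
    step = timedelta(days=days_per_unit * every)
    # Assumption: the first run happens on the start date itself.
    return [start + step * i for i in range(count)]
```

For example, with Increment set to Daily and Every set to 2, consecutive runs are two days apart; with Weekly and 2, they are fourteen days apart.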
Running scans
To run a scan manually:
Navigate to the Scans screen (Data Discovery > Scans).
Search for the scan to run.
Tip
Use the Search text box to filter scans. Search results display scans that contain specified text in their names.
By default, scans are listed in ascending alphabetic order of their names. Scans can be sorted by their name, last scan time, duration, and status.
Move the mouse pointer to the row that contains the scan. The Run Now button appears. This button disappears as soon as the mouse pointer is moved out of the row.
Click Run Now.
As soon as the scan is initiated, its status changes to Pending, then the status changes to Processing. If the automatic scan pause is configured for the scan and you are running it within the set time window, the status of that scan will be Autopaused throughout the duration of the time window. After that, the scan is resumed. For more details on the scan auto pause feature, refer to the information in Schedule scan.
Scan statuses
The status of the scan changes in the sequence: Unscanned > Validating > Pending > Running now / Paused / Stopped > Processing > Completed / Failed / Partially Completed.
The progress of the scan (that is, its current status) is displayed in the Status column on the Scans screen.
The table below provides information about possible statuses and their log download formats.
Status | Log Download Format | Description |
---|---|---|
Unscanned | - | By default, the Status of a newly created scan is Unscanned. |
Validating | JSON | Checking if all the data stores are ready. |
Pending | JSON | Scan is pending and the linked data stores are being contacted. Depending on factors such as the network connectivity, this stage may: • Complete in a flash. You may not see it on the Scans page. • Remain for some time in this state. |
Running now, Paused, Stopped | JSON | Scan is running, paused, or stopped. See Scan progress for more information. |
Autopaused | JSON | Scan is paused as a result of automatic scan pause. |
Uploading | - | Scan results are being uploaded to TDP. |
Indexing | JSON | (Applicable to DDC ML scans) Scan is creating entity and document indexes for similarity search. |
Processing | JSON | Scan is processing the collected data. |
Completed | JSON | Scan was successful. |
Failed | JSON | For some reason, the scan failed. Hover the mouse over the "Failed" icon to learn more about the reason why it failed. |
Syncing | JSON | DDC is communicating with the agents to sync the status of ongoing scans that were active when CipherTrust Manager was last stopped. Displays when CipherTrust Manager starts. |
Partially completed | JSON | Scan is partially completed. A scan is partially completed when at least one, but not all, of its target paths have been fully scanned. Note DDC ML currently does not support partial scanning of data stores. This functionality is only available with regex-based scans. |
Note
DDC will always select an agent for every data store when the scan execution begins. It could be the same agent as the previously assigned one, or a new one, regardless of the health status of the assigned agent.
Scan progress
The progress status of Running and Paused scans is displayed in the form of a progress bar accompanied by a numeric percentage value.
Additionally, you can click the magnifying glass on the right of the progress bar to see detailed information about the scan progress.
Scan in progress displays the following information:
Progress bars:
The Regex Process shows the percentage of regular expressions processed by the scan (to check the scan path).
The ML Process shows the percentage of DDC ML scan (to check the scan path).
The Regex Process and ML Process tabs show up to 5 unfinished scan paths per Hostname/IP ("...% completed") with detailed information displayed in a table, in these columns:
Data Store - Identifier of the data store on which scan is running. Generally, the identifier matches with the data store's Hostname/IP, but depending on the data store type, it might show different value.
Paths - Currently scanned scan paths.
% completed - Scan completion of the current scan path in percent.
Matches - Number of matches (sensitive items) found in the current scan path.
The Agents tab shows details of the standard agents running the scan in the following columns:
Data Stores - Identifier of the data store on which the scan is running.
Agent Name - Agents that are executing scan on the data store.
Last Agent Connected - Time when the agent was last connected. This value is updated every five minutes.
Status - Connectivity status of the agent (Connected or Not Connected).
Note
For disconnected agents, the Last Agent Connected column doesn't display the time of disconnection. It only displays the most recent time when the agent was connected.
The Agents tab doesn't show details of the ML agents running the scan.
Editing scans
To edit a scan:
Log on to the DDC console.
Open the Data Discovery application.
In the left pane, click Scans. The Scans page is displayed. This page lists available scans.
Search for the scan to edit.
Use the Search text box to filter scans. Search results display scans that contain specified text in their names.
By default, scans are listed in ascending alphabetic order of their names.
Scans can be sorted by their name, last scan time, duration, and status.
Click the ellipsis icon (...) corresponding to the desired scan. A shortcut menu appears.
Click View/Edit from the shortcut menu.
The selected scan is displayed, with its configuration settings distributed over these sections (which are exactly the same as the steps of the Add Scan screen):
GENERAL
DATA STORES
TARGETS
CLASSIFICATION PROFILES
APPLY FILTERS
SCHEDULE
Select Run Now to initiate the scan run after any configuration change. This check box is available for scans that are not in the running state.
For more details on these sections, refer to the Adding scans section.
Click Expand All to expand all sections or a plus button (+) in the section in which you want to edit the scan configuration to expand just that section. For information on the available settings, refer to Adding scans.
Make the desired changes and click Save Changes to save the changes.
When you edit a scheduled scan that was disabled, it gets automatically enabled.
When you edit a scan, you must run it again to see the corresponding report.
Viewing scan log
You can download and view a log of a selected scan if it has trace logs enabled in the advanced scan configuration settings. For more information, see the "Advanced Configuration" under the General Info section.
In the Scans screen, click the ellipsis icon (...) corresponding to the desired scan. A context menu is displayed, with the Download Logs option available.
Click Download Logs in the menu. A dialog box with information "Download logs? Logs for scan "Xyz" are available for downloading." is displayed.
Click the Download button in the dialog box to confirm the download.
DDC supports the JSON format for log download. Refer to Scan statuses for the supported log download formats for different scan statuses.
Note
Trace logs are available for download only when the Trace Logs toggle in the advanced configuration is enabled. Otherwise, only troubleshooting logs are available for download.
When you download logs of scans in the Running/Stopped/Paused/Autopaused/Interrupted state, the latest logs up to that point in time are downloaded.
When the Scan Trace Logs download is in progress for a scan, downloading the trace log for that scan in parallel is not allowed. You can trigger the API again only after the previous request is complete.
The Scan Trace Logs can be very large, depending on the amount of data being scanned, and can take a considerable amount of time to process before the download starts.
The information written to the log has the following format:
For JSON Format
Parameter | Data Type | Description |
---|---|---|
timestamp | number | Time stamp (Unix time format) for each action that happened on a path or location during a scan. |
action | string enum: source, opening, opened, parsing, decoding, decoded, completed, scanning, inaccess | Action performed on a path or location during a scan. |
agent_name | string | Name of the Agent that performed the scan. |
path | string | Full path where the action happened. |
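A downloaded log with these fields can be summarized with a few lines of Python. This is a sketch under stated assumptions: the file is assumed to contain a single JSON array of records, the timestamp is assumed to be in seconds, and the sample records (agent name, paths, times) are invented for illustration. The actual file layout may differ.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# Invented sample data that uses the four documented fields.
SAMPLE_LOG = """[
  {"timestamp": 1700000000, "action": "opening", "agent_name": "agent-1", "path": "/data/a.txt"},
  {"timestamp": 1700000002, "action": "completed", "agent_name": "agent-1", "path": "/data/a.txt"},
  {"timestamp": 1700000003, "action": "inaccess", "agent_name": "agent-1", "path": "/data/locked.db"}
]"""


def summarize_trace_log(log_text):
    """Count actions, list inaccessible paths, and return the earliest event time (UTC)."""
    records = json.loads(log_text)  # assumption: one top-level JSON array
    action_counts = Counter(r["action"] for r in records)
    inaccessible = [r["path"] for r in records if r["action"] == "inaccess"]
    # Assumption: Unix time in seconds.
    first_event = datetime.fromtimestamp(min(r["timestamp"] for r in records), tz=timezone.utc)
    return action_counts, inaccessible, first_event
```

Calling `summarize_trace_log(SAMPLE_LOG)` returns the per-action counts, the paths that could not be accessed, and the time of the first logged action.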
Removing scans
In the Scans screen, use the Search text box to filter scans and search for the scan that you want to remove.
Click the ellipsis icon (...) corresponding to the desired scan. An overflow menu is displayed, with the Remove option available.
Note
The Remove option is available in the menu only if the scan is in the Failed, Completed, Stopped, or Disabled state.
Click Remove in the menu. As a result, a warning message "Remove Scan? Are you sure you want to remove this scan?" is displayed.
Click the Remove button in the warning message window to confirm the removal of the selected scan.
Viewing scan history
You can view the history details of past scan executions and download their logs.
To view the history details of a scan execution:
On the Scans page, click the ellipses icon (...) corresponding to the desired scan.
Select View Executions.
The Execution History page displays the scan execution history in the following columns:
Column Name | Description |
---|---|
Scan Executions | Displays scan execution time stamp in descending order. |
Status | Displays the status of scan execution (Failed, Completed, or Stopped). Use the filter button to filter scan execution by their status. |
Duration | Displays the duration of scan execution. |
Logs | Allows you to download scan logs. Click the download button to download logs. |
Duplicating scans
You can make copies of existing scans for creating new variants and reducing manual effort of creating scans from scratch. All the configuration details, classification profiles, data stores, target locations, filters, access & tags, schedules, and all other details of the existing scan are replicated in the cloned copy.
To make a copy of a scan:
Navigate to Scans screen (CipherTrust Manager > Data Discovery and Classification > Scans).
Click the ellipses icon (...) next to the desired scan that you want to clone.
Select Clone.
Provide a unique name for the new scan.
Click Clone.
Using Optical Character Recognition in scans
DDC features Optical Character Recognition (OCR) on a number of image file formats. The formats that can be recognized are JPG / JPEG, BMP, PNG, GIF, TIFF, and PDF that contains any of these image formats.
Note
OCR scans will usually have a lower accuracy than raw text data scans. They may not always recognize all characters in an image due to multiple factors such as poor image quality, unusual fonts, and complex layouts. This may cause unexpected data object matches.
OCR caching
The DDC scanning engine caches the result of OCR on an image within a scan, which can then be reused if the same image is later found in multiple locations within the same scan, for example, when scanning data sources like email in which identical images frequently occur in different email messages.
OCR limitations
The OCR mechanism employed by DDC has the following limitations:
It cannot detect handwritten information - only typed or printed characters.
It does not find information stored in screenshots or images of lower quality. The images you scan with OCR enabled must have a minimum resolution of 150 dpi (300 dpi or higher is recommended).
At the same time, the accuracy of scans involving OCR will depend on:
The quality of the image. Noise in the image, such as scanner marks, lines, soft color tones, or dust from scanned images, reduces accuracy.
The format of the image. Some image formats will result in better detection rates (lossless vs lossy compression).
Font face, font size and context stored in the image. Fonts within scanned images must be at least 10pt in size. Fonts below that size will not be reliably detected. Abnormally styled fonts may not be clear or consistent.
Note
OCR is not supported for HP UX 11.31+ (Intel Itanium) and Solaris 9+ (Intel x86) operating systems.
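The format, resolution, and font-size limits described in this section can be collected into a simple pre-check. The sketch below is illustrative only: the function name and return strings are hypothetical, the extension list is derived from the formats named above (other spellings such as `.tif` are not covered), and DDC does not expose such a check as far as this page describes.

```python
import os
from typing import Optional

OCR_EXTENSIONS = {".jpg", ".jpeg", ".bmp", ".png", ".gif", ".tiff", ".pdf"}
MIN_DPI = 150          # documented minimum resolution
RECOMMENDED_DPI = 300  # documented recommended resolution
MIN_FONT_PT = 10       # documented minimum font size


def ocr_suitability(filename: str, dpi: int, smallest_font_pt: Optional[float] = None) -> str:
    """Classify an image against the documented OCR limits."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in OCR_EXTENSIONS:
        return "unsupported format"
    if dpi < MIN_DPI:
        return "resolution below minimum"
    if smallest_font_pt is not None and smallest_font_pt < MIN_FONT_PT:
        return "font too small for reliable detection"
    if dpi < RECOMMENDED_DPI:
        return "acceptable, below recommended resolution"
    return "suitable"
```

For example, a 200 dpi PNG passes the minimum but falls below the recommended resolution, while a 96 dpi screenshot is rejected.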