Managing Scans
You manage scans through the Scans page, which is accessed by clicking the Scans link in the Data Discovery sidebar on the left.
From the Scans page you can:
View all currently available scans. See Viewing Scans.
Create a new scan. See Adding Scans.
Run a scan manually. See Running Scans.
Delete a scan. See Removing Scans.
Modify an existing scan. See Editing Scans.
Viewing Scans
The list view of the Scans page shows the number of:
Scans, with a breakdown into executed and unexecuted scans.
Executed scans, with a breakdown into scans containing sensitive and non-sensitive data.
Scanned data objects, with a breakdown into sensitive and other data objects.
Click the refresh button to refresh the displayed information.
The list view of the Scans page shows the following details:
Item | Description |
---|---|
Scan Name | Name of the scan. |
Status | Status of the scan. For more information, see Scan Statuses. |
Duration | Time taken to complete the run. |
Last Scan | Time when the scan last ran. |
Schedule | Schedule of the scan. |
Profiles | Number of classification profiles. |
Tip
If you are planning to perform a CipherTrust Manager upgrade, make sure that you do not have scans in progress.
Use the Search text box to filter scans. Search results display scans that contain specified text in their names.
By default, scans are listed in ascending alphabetic order of their names.
Scans can be sorted by their name, last scan time, duration, and status.
Adding Scans
To add a scan, navigate to the Scans screen (Data Discovery > Scans). Click the +Add Scan button to open the Add Scan wizard.
In the wizard, you complete the following configuration steps for each scan that you add:
General Info - Name the scan and give a short description.
Select Data Stores - Select which data stores will be scanned.
Add Targets - Narrow down the scan scope by selecting specific scan targets.
Select Profiles - Choose which Classification Profiles you want to scan for.
Apply Filters - Add a list of rules to filter some targets when the scan is launched.
Schedule Scan - Configure when you want your scan to run.
General Info
In the General Info screen, the wizard asks you to specify a unique name for the scan and to give it a short description:
Name - The name must be longer than two characters and up to 64 characters.
Description - optional description of up to 250 characters.
Click the Advanced configuration menu to expand it and access the additional scan configuration parameters:
Scan Priority - Set the scan priority relative to other applications in terms of CPU utilization. You can select Low or Normal. The default setting is Low. This setting applies only to local storage.
Content Supported - Select the content types that the scan will process:
- OCR - Scans images for sensitive data using Optical Character Recognition (OCR). By default, it is disabled (the scanning of images will be skipped). See Using OCR in Scans for more details.
- Voice - Enables voice recognition when scanning WAV and MP3 files. By default, it is disabled (the scanning of voice files will be skipped).
- EBCDIC - Scans file systems that use IBM's EBCDIC encoding. By default, it is disabled.
Note
Use EBCDIC mode only if you are scanning IBM mainframes that use EBCDIC-encoded file systems. This mode forces targets to be scanned as EBCDIC-encoded file systems, which means that it does not detect matches in non-EBCDIC-encoded file systems.
Trace Logs - Use the switch to enable trace logs and capture detailed scan trace messages when scanning a target. By default, it is disabled. For more information, see Viewing Scan Log.
Note
You need to run the scan again after enabling trace logs in order to download them.
Trace Logs may take up a large amount of disk space, depending on the size and complexity of the scan, and may impact system performance. Enable this feature only for troubleshooting.
Memory Usage Limit (MB) - Set the maximum memory usage (in MB) that the scanner service can use on the data store host. The default memory usage limit is 1024 MB.
Throughput (MBps) - Set the maximum I/O rate (in MBps) that the scanner service will use to read data from the data store. By default, it is set to 0 (for unlimited).
Amount of Data Object Volume - Select how to balance the number of data objects processed against the amount of match information saved per data object. Choose from:
Low - maximum info: Captures maximal detail per file.
Medium - core info: Balances quantity of files and matching detail in each file.
High - minimal info: Results in a more even spread of match data across a large quantity of files.
High - no saved info: The scan runs processing the maximum number of objects but without saving any information of the matches found. This is the default option.
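For quick reference, the advanced settings above can be summarized as a configuration sketch. The key names below are illustrative placeholders chosen for readability, not the actual DDC API fields; the values are the documented defaults.

```python
# Illustrative summary of the advanced scan settings described above.
# Key names are placeholders, not DDC API field names; values are the defaults.
advanced_defaults = {
    "scan_priority": "Low",            # "Low" or "Normal"; applies to local storage only
    "content_supported": {
        "ocr": False,                  # image scanning via OCR is skipped by default
        "voice": False,                # WAV/MP3 voice recognition is skipped by default
        "ebcdic": False,               # only for IBM mainframes with EBCDIC file systems
    },
    "trace_logs": False,               # enable only for troubleshooting
    "memory_usage_limit_mb": 1024,     # maximum memory for the scanner service
    "throughput_mbps": 0,              # 0 means unlimited read rate
    "data_object_volume": "High - no saved info",  # default volume option
}
```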
The Restore Defaults button resets the advanced settings to their default values. However, if you previously modified these settings for a scan and ran it with the changed configuration, Restore Defaults rolls the settings back to the last saved configuration. In other words, Restore Defaults only reverts the current, unsaved modifications.
Click Next to move on to the Select Data Stores screen.
Select Data Stores
The Select Data Stores screen lists all data stores in tabular form. By default, no data stores are selected. The table has three columns:
Data Store Name: Lists available data stores (with their number).
Type: The type of the data store, such as Local Storage, Network Share, etc.
Agent: Displays the Agent that is connected to that data store. In this column, you can also see if the Agent is ready (that is, if the data store is ready).
To select a data store to scan:
Search for the desired data stores by specifying the search criteria in the Search box. The search results will be displayed in the table under it.
Select a data store for the scan by selecting the corresponding check box. Similarly, select multiple data stores, if needed.
Tip
Use the Selected only toggle switch to display only the selected data stores or all data stores (if the switch is 'off', all data stores are displayed).
Click Next to move on to the Add Targets screen.
Add Targets
In the Add Targets screen you can review a list of the data stores that you selected for the scan. By default, the entire data store is scanned, and this wizard step allows you to narrow down the scan scope by selecting specific targets for your selected data stores. The Add Targets screen is divided into three columns:
* **Data Store**: The list of selected data stores.
* **Targets**: Any selected specific target for the listed data store. "Full DS" indicates that no specific target has been selected, that is, the entire data store will be scanned. If you have added a scan target for the data store, it will be listed after you expand the data store row (by clicking the arrow button next to the data store name, on the left).
* **Add Target Path**: In this field you can type in a specific target and add it to the scan parameters. Scanning of this data store will be limited to the added target only.
* When adding Oracle and IBM DB2 targets, specify the table name exactly as in the database. Table names are case-sensitive.
* Any scan target that you add must be **valid**, otherwise the scan will fail. For more information on what a valid scan target is, see [Target Format Limitations](#target-format-limitations).
* For better performance, try running smaller scans and then generate a report that aggregates them. You may schedule a different scan per Data Store and/or per Classification Profile and/or per sub-path (such as folders and tables) of the original scan path.
* You can scan emails in a Gmail label if you move the emails to the label - otherwise, they will be kept in your inbox. For the default system labels, Gmail creates some folders that do not match the label name. Please refer to the [Gmail documentation](https://developers.google.com/gmail/api/guides/labels) to learn the right path to scan a particular system label.
* In the case of a Sharepoint Server data store with an API passwords file configured, you have to use an empty target path.
Tip
To perform a scan on an Office365: OneDrive for Business or Exchange Server data store you **have to specify a scan target path**. For details, see [Target Format Limitations](#target-format-limitations).
To add a scan target for a selected data store:
Type your scan target in the Add Target Path field.
Click the Apply button on the right to add the target.
Repeat this to add more scan targets for that data store, if needed.
To remove a scan target for a selected data store:
Click the arrow button next to the data store name for which you want to remove a scan target.
Click the Remove link on the right of the scan target to remove it.
Use the Enable Remediation toggle switch to enable remediation for the selected target.
Remediation is currently only supported on local storage and network type (NFS and SMB/CIFS share) data stores. For other types of data stores, the switch is not displayed at all.
The use of remediation requires a supported and properly configured data store, that is, one that has a GuardPoint (GP) created in CTE. For details on what a properly configured data store for remediation is, refer to the information in Remediation.
If a data store is of a supported type but is not properly configured, the Enable Remediation switch will be inactive. The message displayed on mouse over on the inactive switch will inform you about the reason for it being inactive. For all possible messages related to the remediation process, refer to the information in Remediation Messages.
When remediating a data store that does not have sub-paths, the checkbox will indicate "Remediate Root Path". This will result in the entire data store being scanned and remediated.
Note
Remediation of the root path is currently only supported for an SMB data store.
When a data store has at least one sub-path, the checkbox will indicate "Remediate these sub-paths". In this case, only the selected sub-paths will be scanned and remediated.
Warning
If you add sub-paths that could be remediated but do not have their own GuardPoint (because the GuardPoint is configured on the root path), the "Remediate Root Path" checkbox will be unchecked. Even though the root path has a GuardPoint, the sub-paths cannot be treated as remediable, since no GuardPoint is configured on them (the GuardPoint is on the root path, that is, on the entire data store). In this case, the "Enable Remediation" checkbox is also cleared.
To move on to the Select Profiles screen, click Next.
Tip
Make sure that you do not have nested target paths in a scan for the same data store. This can affect the performance of the scan and you can get duplicated data in the reports.
Select Profiles
The Select Profiles screen lists all classification profiles in tabular form. By default, no profiles are selected. The table has three columns:
* **Classification Profile Name**: Lists available profiles. Items marked with a letter "T" are predefined classification profile templates. For more information about these templates, see "Classification Profile Templates". The other items are custom classification profiles.
* **Infotypes**: Displays the number of information types associated with the profile.
* **Sensitivity**: Displays the sensitivity level assigned to this classification profile. See "Sensitivity Levels" for more information.
To select a classification profile for the scan:
Search for the desired profiles by specifying the search criteria in the search box. The search results are displayed in the table under it.
Select profiles for the scan by selecting the check boxes corresponding to desired profiles.
Tip
Use the Selected only toggle switch to display only the selected classification profiles or all classification profiles (if the switch is 'off' all classification profiles are displayed).
Click Next to move on to the Apply Filters screen.
Apply Filters
In the Apply Filters screen you can add a list of rules to filter some targets when the scan is launched. By default, there are no filters applied, and in this step you can add specific rules which affect the data stores selected and their targets (if you specified any). You can configure as many filters as you want.
Click the Select Filter menu to expand it. The menu shows you the filters as follows:
Exclude location by prefix
This filter excludes search locations with paths that begin with a given string. It can be used to exclude entire directory trees. Example of such a filter: c:\windows\system32
Exclude location by suffix
This filter excludes search locations with paths that end with a given string. For example, entering led.jnl excludes files and folders such as canceled.jnl and totaled.jnl.
Exclude locations by expression
This filter excludes search locations by expression. Wildcards '*' and '?' can be used to form expressions for this filter. For example, *data.txt excludes files that end with "data.txt" in any path.
Include locations modified recently
Use this filter to include search locations modified within a given number of days from the current date. For example, enter 14 to include files and folders that have been modified not more than 14 days before the current date.
Exclude locations greater than file size
This filter excludes files that are larger than a given file size (in MB).
Include locations within modification date
Use this filter to include search locations modified within a given range of dates. Files and folders that fall outside of the range set by the selected start and end date are not scanned.
For each new filter added click Apply to save and apply its rules.
Click Next to move on to the Schedule screen.
Note
Filters are case-insensitive. For example, if you have two directories, "TEST" and "test", and apply the filter */test, both directories are excluded. The same applies to file names.
Depending on the type of data store, some considerations should be taken into account. For more info, see Scan Filter Usage.
Schedule Scan
Scans can be run either manually or automatically at a scheduled time. To configure this:
In the Schedule screen select the frequency with which you want the scan to run. The options are:
Manual: Select to run the scan manually. This is the default setting. In this case the scan will be run whenever you manually launch it from the Scans screen. For more information about running a scan manually, see Running Scans.
Automatic Scan Pause - Use this switch to schedule the time when a scan will pause. For example, you might want to pause all scans (by using the automatic scan pause) during working hours so that they do not affect production servers. Next, use the Time Zone and Select days and time controls to set the day(s) and time when the scan should pause.
Run Now - If you select this checkbox, the scan will be run just once after the scan is added successfully.
Scheduled: Select to specify a schedule for the run. The scan will be run automatically on the specified schedule. When Scheduled is selected, the following fields appear on the screen:
Increment: Select the increment pattern of the run. This is a mandatory field. The options are Daily, Weekly, and Monthly. By default, Daily is selected.
Every: Specify when the run should repeat. This is a mandatory field.
For example, if Daily is selected as Increment, enter 2 to run the scan once every two days. If Weekly is selected as Increment, enter 2 to run the scan once every two weeks. Similarly, if Monthly is selected as Increment, enter 2 to run the scan once every two months.
Time: Specify the time when the run should start. This is a mandatory field. Specify the time in 12-hour format.
Time Zone: Select a time zone from the drop-down list.
Starting: Specify the day when the schedule should start. This is a mandatory field. By default, Today is selected. To specify a particular start date, select On this date, click the calendar icon, and select the date.
Ending: Specify the day when the schedule should end. This is a mandatory field. By default, No End is selected. To specify a particular end date, select On this date, click the calendar icon, and select the date.
Automatic Scan Pause - Use this switch to schedule the time when a scan will pause. For example, you might want to pause all scans (by using the automatic scan pause) during working hours so that they do not affect production servers. Next, use the Time Zone and Select days and time controls to set the day(s) and time when the scan should pause.
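To illustrate the Increment and Every semantics above, the sketch below computes the first few run times for a Daily or Weekly schedule. It is a worked example only, not DDC's scheduler, and it ignores Monthly increments and time zones.

```python
from datetime import datetime, timedelta

# Worked example of "Increment" and "Every"; this is not DDC's scheduler.
# Monthly increments and time zones are intentionally left out of this sketch.
def next_runs(start, increment, every, count=3):
    """Yield the first `count` run times for a Daily or Weekly schedule."""
    step = {"Daily": timedelta(days=every), "Weekly": timedelta(weeks=every)}[increment]
    run = start
    for _ in range(count):
        yield run
        run += step

# Increment=Weekly, Every=2: the scan runs once every two weeks at 09:00.
for run in next_runs(datetime(2024, 1, 1, 9, 0), "Weekly", 2):
    print(run.isoformat())
```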
Click Save to complete adding the scan.
As a result, the newly created scan appears on the Scans page. By default, scans are displayed in alphabetic order by name. Depending on the number of entries per page, you might need to navigate to other pages to view the newly created scan. By default, the Status of a newly created scan is Unscanned.
Note
If your CM system clock does not match the Agent's system clock, your scans will not run as scheduled, so it is highly recommended to set up an NTP server to synchronize the clocks. This can be configured in CM through Admin Settings > System > NTP. For details, refer to the Thales CipherTrust Manager Administrator Guide.
Target Format Limitations
What constitutes a valid scan target depends on the data store type. This section gives you a few tips to keep in mind.
Database data sources
When adding scan targets for database data sources (IBM DB, Oracle, and MS-SQL):
Note that table names are case sensitive but schema names are not case sensitive.
Oracle data stores accept only tables as scan targets.
IBM DB and MS-SQL data stores accept schemas or tables as scan targets.
For Oracle and IBM DB2 it is recommended to set the path in uppercase if the database is configured as case-insensitive.
Cloud data stores
For Hadoop and AWS S3 type data stores, you can configure a scan to use a specific file as a scan target.
For Azure Blob type data stores you can only specify containers as scan targets.
Salesforce data stores:
Filters are not supported for Salesforce data stores.
You can use the following syntax for the Salesforce target path:
Standard Object: s/<object API name>
Example: s/Account
Custom Object: c/<object API name>
Example: c/Account__c
Big Object: b/<object API name>
Example: b/Account__b
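For illustration, the documented prefixes can be wrapped in a small helper; the function and its parameter names are hypothetical and not part of the product.

```python
# Illustrative helper for the documented Salesforce target path syntax:
# s/ for standard objects, c/ for custom objects, b/ for big objects.
def salesforce_target(api_name: str, kind: str = "standard") -> str:
    prefixes = {"standard": "s/", "custom": "c/", "big": "b/"}
    return prefixes[kind] + api_name

print(salesforce_target("Account"))               # s/Account
print(salesforce_target("Account__c", "custom"))  # c/Account__c
print(salesforce_target("Account__b", "big"))     # b/Account__b
```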
Big Data stores
Due to known Teradata limitations Data Discovery cannot scan the following Teradata internal databases:
SYSJDBC
All
TD_SYSXML
DBC
TDStats
TD_SYSGPL
PUBLIC
SQLJ
SYSBAR
Default
SYSLIB
TD_SYSFNLIB
LockLogShredder
tdwm
TDPUSER
External_AP
EXTUSER
dbcmngr
SystemFe
SysAdmin
TDMaps
TDQCD
Crashdumps
Sys_Calendar
viewpoint
TD_SERVER_DB
console
SYSUDTLIB
SYSUIF
SYSSPATIAL
Office 365 OneDrive for Business
It is not possible to scan all groups from the root location. This is because accounts are located in multiple groups, and scanning all from the root would result in scanning many locations multiple times. For this reason, the user is required to at least specify groups for the scan.
Exchange Server
You have to specify a scan target path, as scanning an entire data store is not supported.
Office 365 Sharepoint Online data stores
For Office 365 Sharepoint Online type data stores, you need to understand how resources in Office 365 Sharepoint Online storage are organized and managed.
For sites, /:site gets appended whenever a root site collection, non-root site collection, or sub-site location is added. The location can be probed without explicitly adding /:site in the path field.
Every site collection has List and File folders; to access their content, use /:site/:list and /:site/:file, respectively.
Use the following formats to create your desired scan target paths:
All lists
<web_application_url>/<site_collection>/:site/:list
e.g.:
http://xxxxxx/testdata/:site/:list
A list
<web_application_url>/<site_collection>/:site/:list/<list>
e.g.:
http://xxxxxx/sites/test/:site/:list/Site Pages
All files
<web_application_url>/<site_collection>/:site/:file
e.g.:
http://xxxxxx/testdata/:site/:file
A folder
<web_application_url>/<site_collection>/:site/:file/<folder>
e.g.:
http://xxxxxx/testdata/:site/:file/SharedDocuments
A file
<web_application_url>/<site_collection>/:site/:file/<file>
e.g.:
http://xxxxxx/sites/test/subsite1/:site/:file/EHIC.rtf
A file in a folder
<web_application_url>/<site_collection>/:site/:file/<folder>/<file>
e.g.:
http://xxxxxx/testdata/:site/:file/Shared Documents/cards/Amex.odt
or
http://xxxxxx/testdata/:site/:file/Shared Documents/2001P11.pdf
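For illustration, the path formats above can be assembled with a small helper; the function and its parameter names are hypothetical and only reproduce the documented patterns.

```python
# Illustrative builder for the documented SharePoint Online target path formats.
# The function and parameter names are hypothetical.
def sharepoint_target(web_app_url, site_collection, kind, *parts):
    """kind is "list" or "file"; parts are optional list/folder/file segments."""
    path = f"{web_app_url}/{site_collection}/:site/:{kind}"
    if parts:
        path += "/" + "/".join(parts)
    return path

print(sharepoint_target("http://xxxxxx", "testdata", "list"))
# http://xxxxxx/testdata/:site/:list
print(sharepoint_target("http://xxxxxx", "testdata", "file", "Shared Documents", "cards", "Amex.odt"))
# http://xxxxxx/testdata/:site/:file/Shared Documents/cards/Amex.odt
```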
Running Scans
To run a scan manually:
Navigate to the Scans screen (Data Discovery > Scans).
Search for the scan to run.
Use the Search text box to filter scans. Search results display scans that contain specified text in their names.
By default, scans are listed in ascending alphabetic order of their names.
Tip
Scans can be sorted by their name, last scan time, duration, and status.
Move the mouse pointer to the row that contains the scan. The Run Now button appears. This button disappears as soon as the mouse pointer is moved out of the row.
Click Run Now.
As soon as the scan is initiated, its status changes to Pending, then the status changes to Processing. If the automatic scan pause is configured for the scan and you are running it within the set time window, the status of that scan will be Autopaused throughout the duration of the time window. After that, the scan is resumed. For more details on the scan auto pause feature, refer to the information in Schedule Scan.
Scan Statuses
The status of the scan changes in the sequence: Unscanned > Validating > Pending > Running now / Paused / Stopped > Processing > Completed / Failed.
The progress of the scan (i.e. its current status) is displayed in the Status column in the Scans screen. See the table below for information on the possible statuses:
Status | Description |
---|---|
Unscanned | By default, the Status of a newly created scan is Unscanned. |
Validating | Checking if all the data stores are ready. |
Pending | Scan is pending and the linked data stores are being contacted. Depending on factors such as the network connectivity, this stage may: • Complete in a flash. You may not see it on the Scans page. • Remain for some time in this state. |
Running now, Paused, Stopped | Scan is running, is paused, or is stopped. See Scan Progress below for more information. |
Autopaused | Scan is paused as a result of automatic scan pause. |
Processing | Scan is processing the collected data. |
Classifying | Sensitive data objects found during the scan are being classified and remediated in this phase (if remediation is applicable).1 |
Reclassification Failed | The reclassification process failed so no data is remediated. |
Completed | Scan was successful. |
Failed | For some reason, the scan failed. Hover the mouse over the "Failed" icon to learn more about the reason why it failed. |
Note
DDC will always select an agent for every data store when the scan execution begins. It could be the same agent as the previously assigned one, or a new one, regardless of the health status of the assigned agent.
Scan Progress
For ongoing scans, their progress in percentage is displayed in the form of a progress bar accompanied by a numeric percentage value. This information is available for the "Running now" and "Paused" scan statuses.
Additionally, when you click the magnifying glass on the right of the scan's progress bar, a pop-up window is displayed with detailed information about the scan progress.
The Scan in progress pop-up shows the percentage of regular expressions processed by the scan (to check the scan path) in the Regex Process progress bar.
The pop-up also shows up to 5 unfinished scan paths per Hostname/IP ("...% completed"), with detailed information displayed in a table with these columns:
Data Store - the identifier of the scanned data store. In most cases it coincides with the data store's Hostname/IP, but depending on the data store type it might show some other ID.
Paths - the scan path or paths currently being scanned.
% completed - the percentage of completion of the current scan path.
Matches - the number of matches (that is, sensitive items) found in the current scan path.
Potential Problems When Running Scans
Ready/Not Ready data store: A scan cannot run unless there is an identified Agent for every data store included in the scan. Such a data store has the status Ready. A scan that has at least one data store that is Not Ready will fail to run and display an error. If more than one data store associated with a scan is Not Ready, the system fails on the first Not Ready data store and does not check the remaining data stores.
Disabled/Enabled data store: You can manually deactivate a data store. Such a data store has the status Disabled and will not be scanned. A scan with several associated data stores will still run (without an error) even if one or more of them are Disabled, as long as at least one data store is enabled; only the enabled data stores are scanned. A scan whose data stores are all Disabled will not run at all.
Hadoop file access rights: You get a "data store path not accessible" error when scanning a Hadoop data store that has a Hadoop file configured as its scan target, if you do not have access rights to that file.
IBM, Oracle and MS-SQL - empty table or schema: You get a "table or schema not accessible" error when scanning an empty table or schema.
IBM, Oracle and MS-SQL - case-sensitive table names: In these data stores, database schema names are not case-sensitive, but table names are case-sensitive.
Scan results that exceed the limit on the amount of information to display may fail with the error "Too many sensitive Data Objects found": In such cases, it is recommended to split the scan into smaller scans.
Scanning a Gmail label did not find any results: You can scan emails in a Gmail label only if you move the emails to the label; otherwise, they are kept in your inbox. For the default system labels, Gmail creates some folders that do not match the label name. Refer to the Gmail documentation to learn the right path to scan a particular system label.
Text files as BLOBs in Oracle - DDC will not be able to scan any text file stored as BLOB in Oracle if the file size is greater than 4 KB.
Scanning a MongoDB with GridFS
There are several known issues specific to MongoDB while scanning GridFS database:
You cannot specify a GridFS database collection in the scan path, even though, by default, the scan path for MongoDB has the format <database>/<collection>. If you do, the scan will fail with the error: "Wrong database collection in target path". Instead, use only the <database> in the scan path.
A scan on a GridFS database only accepts two default collections with a bucket named fs: fs.files and fs.chunks. If you use another prefix/bucket name, it will not be scanned.
Scanning with multiple agents will not work. If you run a scan with multiple agents, then:
On a full data store scan - the GridFS database will be skipped, it will not be seen in the report under the data object list, and the user will not be given a hint as to what happened.
When you specify the GridFS database in the scan path - the scan will fail with the error: "Scan results could not be found".
If files with the same name and same/different content are inserted in a GridFS database and scanned, the number of matches gets added up and listed once in the report, under one filename.
In a report for a scan on a GridFS database, the list of data objects can contain both collections and files.
Editing Scans
To edit a scan:
Log on to the DDC console.
Open the Data Discovery application.
In the left pane, click Scans. The Scans page is displayed. This page lists available scans.
Search for the scan to edit.
Use the Search text box to filter scans. Search results display scans that contain specified text in their names.
By default, scans are listed in ascending alphabetic order of their names.
Scans can be sorted by their name, last scan time, duration, and status.
Click the overflow icon corresponding to the desired scan. A shortcut menu appears.
Click View/Edit from the shortcut menu.
The selected scan is displayed, with its configuration settings distributed over these sections (which are exactly the same as the steps of the Add Scan wizard):
GENERAL
DATA STORES
TARGETS
CLASSIFICATION PROFILES
APPLY FILTERS
SCHEDULE
For more details on these sections, refer to the Adding Scans section.
Click Expand All to expand all sections, or click the plus button (+) of the section in which you want to edit the scan configuration to expand just that section. For information on the available settings, refer to Adding Scans.
Make the desired changes and click Save Changes to save the changes.
When you edit a scheduled scan that was disabled, it gets automatically enabled.
When you edit a scan, you must run it again to see the corresponding report.
Viewing Scan Log
You can download and view a log of a selected scan if it has "Trace Logs" enabled in the advanced scan configuration settings. For more information, see the "Advanced Settings" section of General Info.
In the Scans screen, click the overflow icon corresponding to the desired scan. An overflow menu is displayed, with the Download Logs option available.
Click Download Logs in the menu. A dialog box with information "Download logs? Logs for scan "Xyz" are available for downloading." is displayed.2
Click the Download button in the dialog box to confirm the download.
Note
If the selected scan does not have logging enabled, you will see this information upon clicking the Enable Logs option:
"You need to enable trace logs in advanced configuration and run the scan "Xyz" again to download logs."
The information written to the log has the following format:
Parameter | Data Type | Description |
---|---|---|
timestamp | number | Time stamp (Unix time format) for each action that happened on a path or location during a scan. |
action | string enum: source, opening, opened, parsing, decoding, decoded, completed, scanning, inaccess | Action performed on a path or location during a scan. |
agent_name | string | Name of the Agent that performed the scan. |
path | string | Full path where the action happened. |
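For illustration, a single trace log record with the four documented fields could look like the sketch below. The serialization of the downloaded log file is not specified here, so the record layout and the sample values are assumptions.

```python
from datetime import datetime, timezone

# Sketch of one trace log record using the four documented fields.
# The actual log serialization is not specified here; values are hypothetical.
record = {
    "timestamp": 1700000000,            # Unix time of the action
    "action": "opened",                 # one of: source, opening, opened, parsing,
                                        # decoding, decoded, completed, scanning, inaccess
    "agent_name": "agent-01",           # hypothetical Agent name
    "path": "/data/reports/q3.xlsx",    # hypothetical path where the action happened
}

when = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
print(f"{when.isoformat()} {record['agent_name']} {record['action']} {record['path']}")
```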
Removing Scans
In the Scans screen, use the Search text box to filter scans and search for the scan that you want to remove.
Click the overflow icon corresponding to the desired scan. An overflow menu is displayed, with the Remove option available.
Note
The Remove option is only available in the menu if the scan is Failed, Completed, Stopped, or Disabled.
Click Remove in the menu. As a result, a warning message "Remove Scan? Are you sure you want to remove this scan?" is displayed.
Click the Remove button in the warning message window to confirm the removal of the selected scan.
Using Optical Character Recognition in Scans
DDC features Optical Character Recognition (OCR) for a number of image file formats. The formats that can be recognized are JPG/JPEG, BMP, PNG, GIF, TIFF, and PDF files that contain any of these image formats.
Note
OCR scans will usually have a lower accuracy than raw text data scans. They may not always recognize all characters in an image due to multiple factors such as poor image quality, unusual fonts, and complex layouts. This may cause unexpected data object matches.
OCR Caching
The DDC scanning engine caches the result of OCR on an image within a scan, which can then be reused if the same image is later found in multiple locations within the same scan, for example, when scanning data sources like email in which identical images frequently occur in different email messages.
OCR Limitations
The OCR mechanism employed by DDC has the following limitations:
It cannot detect handwritten information - only typed or printed characters.
It does not find information stored in screenshots or images of lower quality. The images you scan with OCR enabled must have a minimum resolution of 150 dpi (300 dpi or higher is recommended).
At the same time, the accuracy of scans involving OCR will depend on:
The quality of the image. Any noise in the image such as scanner marks, lines or soft color tones, dust from scanned images, etc.
The format of the image. Some image formats will result in better detection rates (lossless vs lossy compression).
Font face, font size and context stored in the image. Fonts within scanned images must be at least 10pt in size. Fonts below that size will not be reliably detected. Abnormally styled fonts may not be clear or consistent.
Note
OCR is not supported for HP UX 11.31+ (Intel Itanium) and Solaris 9+ (Intel x86) operating systems.
Remediation
The CipherTrust Intelligent Protection (CIP) solution allows customers to discover and protect their sensitive data. To use this feature, DDC needs to work with CipherTrust Transparent Encryption (CTE) and requires a CTE Agent installed alongside a DDC Agent on the data store to be monitored. You have to configure a GuardPoint on the data store with which you want to use remediation. For a detailed procedure, refer to the "Managing GuardPoints" topic on the CipherTrust Platform Documentation Portal.
With the help of CTE the security issues found during a scan are encrypted and the risk is thus remediated. The results of this remediation action can then be viewed in the report. For details about remediation information, refer to Remediation Information.
Currently, remediation is only supported on local storage and network type (NFS and SMB/CIFS share) data stores. Remediation will only work if there is a CTE Agent installed and a GuardPoint configured on the data store to scan. Remediation can be executed as part of the scan, or it can also be launched for scans that were previously run without remediation by clicking the Reclassify option in the overflow menu (see Reclassifying Scans).
Various remediation actions are possible for the Data Objects that are found containing sensitive information. For more information on the available remediation options, refer to the "CipherTrust Intelligent Protection" documentation in the Thalesdocs portal.
The diagram below illustrates how the components of the remediation solution interact with one another.
A data store properly configured for remediation should meet the following conditions:
The DDC and CTE Agents are installed on the same machine, on the same data store.
The GuardPoint (GP) path exactly matches the defined target path. This avoids the situation where a GP is configured at the root level and sub-paths are added; such sub-paths cannot be remediated. The GP must be configured on the sub-path that you want to remediate.
For more information on the CTE Agent and configuring GuardPoints, refer to Managing GuardPoints on the CipherTrust Platform Documentation Portal.
How Does Remediation Work in DDC?
When you try to enable remediation for a target in a scan, DDC has to perform some checks:
Check if a CTE Agent is available:
If there is no CTE Agent, the remediation toggle switch becomes inactive and shows the message "No CTE Agent" when you hover the mouse over it.
If a CTE Agent is available but the CTE Agent or GuardPoint is disabled, the remediation switch shows this message on mouse over: "CTE Agent or GuardPoint disabled".
If the first check passes, an additional check is performed to retrieve the GuardPoint for the scan target that you entered. If no GuardPoint is retrieved, the remediation toggle switch is blocked and the message "Outside GuardPoint" is displayed when you hover the mouse over it.
Finally, after both checks have passed, the Enable Remediation toggle switch is enabled and you can choose whether to activate remediation for the entered target path.
Note
Remediation is a process performed as part of the scan when it is run. Alternatively, remediation can be performed for scans already executed by clicking the "Reclassify" option in the overflow button. For more information, see Reclassifying Scans.
After you have saved a scan with at least one target to be remediated, you can later disable remediation. In this case, you will get a warning message indicating "Previous remediation actions will not be reverted". This means that any remediation actions already triggered by previous scan results will not be reverted; the scan simply will not launch new remediation actions, even if future runs find sensitive matches in this target.
Remediation Messages
The various messages displayed on mouse over on the inactive "Enable Remediation" switch explain why it is inactive:
"No CTE Agent installed in the host" - There is no CTE Agent installed on the data store. Install a CTE Agent and configure a GuardPoint for the target. Only applicable to local storage type data stores.
"Outside of a GuardPoint" - A CTE Agent is installed but there is no GuardPoint configured for the target. Go to the Transparent Encryption application and configure a GuardPoint for the data store or target.
"CTE Agent or GuardPoint disabled" - In this case, the CTE Agent and/or the GuardPoint exist but are disabled.
Additionally, this message appears after switching off the "Enable Remediation" toggle for a previously enabled target:
- "Previous encryption is kept. Remediation won't be updated."
Reclassifying Scans
Reclassifying scans allows you to remediate scans that were already executed and successfully completed without remediation, without the need to run them again.
Due to a known limitation, reclassifying scans is only available for local storage data store scans run in CM 2.7 and above, and for network storage data store scans run in CM 2.9 and above.
Note
Before reclassifying a scan, make sure that all the data stores included in the scan are compatible with CipherTrust Intelligent Protection (CIP).
In the Scans screen, when remediation is enabled for a target inside a scan, click the overflow button. The Reclassify option is available in the menu.
Click Reclassify in the menu. As a result, the scan runs the classification process and its status changes to "Classifying".
Tip
Reports containing a scan for which reclassification was launched must be generated again by using the Generate option in order to see the updated results.
Scan Filter Usage
This section provides more in-depth information on scan filters, with some examples of their usage. For more examples, refer to Scan Filters.
Exclude location by prefix
The Exclude location by prefix filter is used to exclude search locations with paths that begin with a given string. It can be used to exclude entire directory trees, for example, to exclude all files and folders in the c:\windows\system32 folder.
API filter name: exclude_prefix
Parameters: Expression - mandatory via UI.
Note
API: Without any expression, the default expression is "*" (that is to exclude all prefixes. In this case nothing is scanned).
Errors: "Expression field required" inline error if you don't type in any text.
Examples:
With expression "data", the filter takes into account the prefix started by "data" like "dataset.txt" or similar.
You can use the asterisk "*", a wildcard character that matches zero or more characters in a search string, and the question mark "?", a wildcard character that matches exactly one character. ??? matches 3 characters. If placed at the end of an expression, ? also matches zero characters.
- File* - Excludes all files beginning with "File"
- /home/my folder/File* - Excludes all files beginning with "/home/my folder/File"
- /home/my folder/File*2021 - Excludes all files beginning with "/home/my folder/File" + something + "2021" like "/home/my folder/File2021", "/home/my folder/File_2021", "/home/my folder/File 2021.csv"
Considerations: If the filter expression refers to a table name, the scan excludes that table or the columns whose names match the filter.
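As a hedged example, an exclude_prefix filter definition might be expressed as below. The filter name and the expression values come from this documentation; the surrounding field names are placeholders, not the actual DDC API schema.

```python
# Hedged sketch of exclude_prefix filter definitions. "filter_name" and
# "expression" are placeholder keys, not the DDC API schema.
exclude_system32 = {
    "filter_name": "exclude_prefix",
    "expression": r"c:\windows\system32",    # skip this whole directory tree
}
exclude_file_prefix = {
    "filter_name": "exclude_prefix",
    "expression": r"/home/my folder/File*",  # "File" followed by anything
}
```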
Exclude location by suffix
The Exclude location by suffix filter is used to exclude search locations with paths that end with a given string. For example, entering led.jnl excludes files and folders such as canceled.jnl and totaled.jnl.
API filter name: exclude_suffix
Parameters: Expression - mandatory via UI.
Note
API: Without any expression, the default expression is "*" (to exclude all suffixes. In this case, nothing is scanned).
Errors: "Expression field required" inline error if you do not type in any text.
Examples:
With an expression "txt", the filter takes into account the suffix ended by "txt" like "dataset.txt" or similar.
We can use the "*"
- txt - Excludes all files ending with "txt"
- *txt - Excludes all files ending with "txt"
- in*txt - Excludes all files ending with "in" + something + "txt" like "information.txt", "in.txt", "data_info.txt"
- data.??? - Excludes all files ending with "data" + 3 characters like "data.txt", "data.doc", but does not exclude "data.go" or "data.docx".
Considerations: If the filter expression refers to a table name, the scan excludes that table or the columns whose names match the filter.
Exclude locations by expression
The Exclude locations by expression filter is used to exclude search locations by expression. The syntax of the expressions you can use is as follows:
- ?: A wildcard character that matches exactly one character; ??? matches 3 characters. C:\V??? matches C:\V123, but not C:\V1234 or C:\V1.
- *: A wildcard character that matches zero or more characters in a search string. /directory-name/* matches all files in the directory. /directory-name/*.txt matches all txt files in the directory.
API filter name: exclude_expression
Parameters: Expression - mandatory via UI.
Note
API: Without any expression, the default expression is "*" (to exclude all expressions. In this case, nothing is scanned).
Errors: "Expression field required" inline error if you don't type in any text.
Examples:
- With expression data.txt, the filter excludes files that match exactly with "data.txt" (be careful with the path).
- We can use the "*"
- *data.txt - Excludes files that end by "data.txt" in any path.
- *data* - Excludes files that match anything + "data" + anything, like "/home/my dir/data", "/data.txt", "C:\my folder\data1\my sensitive file.txt".
- data.txt* - Excludes files that start with "data.txt" in any path.
- *data.??? - Excludes all files ending with anything + "data" + 3 characters, like "data.txt", "/home/data.txt", "C:\data.txt", "data.doc", but does not exclude "data.go" or "data.docx".
Considerations: If the filter expression refers to a table name, the scan excludes that table or the columns whose names match the filter.
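The '*' and '?' behaviour described above is close to shell-style globbing. The sketch below uses Python's fnmatch to reproduce the documented examples; it is an approximation for illustration, not the DDC matching engine.

```python
import fnmatch

# Approximation of the documented '*' and '?' semantics using shell-style
# globbing. Filters are case-insensitive, so both sides are lowercased.
def excluded(path: str, expression: str) -> bool:
    return fnmatch.fnmatch(path.lower(), expression.lower())

print(excluded("/backup/customer-data.txt", "*data.txt"))  # True: ends with "data.txt"
print(excluded("/home/data.doc", "*data.???"))             # True: "data." + 3 characters
print(excluded("/home/data.docx", "*data.???"))            # False: 4 characters follow "data."
```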
Include locations modified recently
The Include locations modified recently filter is used to include search locations modified within a given number of days from the current date. For example, enter 14 to include files and folders that have been modified not more than 14 days before the current date.
API filter name: include_recent
Parameters: Days from current date - integer number up to 99 - mandatory
Errors: days missing/wrong param for include_recent filter → "message": "Invalid number of days"
Examples: Filter value: 5 → The filter includes files and folders that have been modified not more than 5 days before the current date.
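As a worked example of the semantics, the sketch below checks a modification time against the cutoff implied by the filter value; it is an illustration, not DDC code.

```python
from datetime import datetime, timedelta

# include_recent with a value of 5: only locations modified within the last
# 5 days (counted from the current date) are included. Illustration only.
days = 5
cutoff = datetime.now() - timedelta(days=days)

last_modified = datetime.now() - timedelta(days=3)  # e.g. a file changed 3 days ago
print("included" if last_modified >= cutoff else "excluded")  # included
```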
Exclude locations greater than file size (MB)
The Exclude locations greater than file size (MB) filter is used to exclude files that are larger than a given file size (in MB).
API filter name: exclude_max_size
Parameters: MB: integer number equal or greater than 1 MB - mandatory
Errors: size missing/wrong param for exclude_max_size filter - "message": "Invalid max size: " / "message": "Invalid max size: 0"
Examples: Filter value: 15 - Exclude files that are larger than 15 MB
Note
In the case of AWS S3, ".zip files" are treated as folders by the scan agent. Hence, ".zip files" that are larger than the size specified in the exclude_max_size filter are not actually excluded.
Include locations within modification date
Description: Include search locations modified within a given range of dates. Prompts you to select a start date and an end date. Files and folders that fall outside of the range set by the selected start and end date are not scanned.
API filter name: include_date_range
Parameters:
- Start date - mandatory
- End date - mandatory
Errors:
- to_date and from_date missing/wrong param for include_date_range filter - "message": "Invalid start date"
- to_date missing/wrong param for include_date_range filter - "message": "Invalid start date"
- from_date missing/wrong param for include_date_range filter - Be careful! - "message": "Invalid start date"
Examples:
- If you set a date with additional text after the <YYYY-MM-DD> part, for example "2021-05-21 kjsf" or "2021-05-21 14:23", only the "2021-05-21" part is taken into account.
- If the to_date param is greater than from_date, no error is returned.
Limitations: For data stores like databases, Exchange Online, Gmail, and so on, the filter by date works for folders and files, but not for databases or email.
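As a hedged example, a date range filter and its inclusion check might look like the sketch below. The from_date/to_date parameter names appear in the error messages above; the surrounding structure is a placeholder, not the actual DDC API schema.

```python
from datetime import date

# Hedged sketch of an include_date_range filter; "filter_name", "from_date",
# and "to_date" keys mirror the documented names but not necessarily the API.
date_range_filter = {
    "filter_name": "include_date_range",
    "from_date": "2019-10-15",
    "to_date": "2021-08-17",
}

# Files and folders modified outside this range are not scanned.
modified = date(2021, 8, 20)
start = date.fromisoformat(date_range_filter["from_date"])
end = date.fromisoformat(date_range_filter["to_date"])
print("included" if start <= modified <= end else "not scanned")  # not scanned
```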
Additional Considerations With Relation to Data Store Types
Databases
Be careful with the expression when you try to exclude objects like tables or schemas. For example, if you want to exclude a specific table in MSSQL, you can use a filter like mydb:1433/myschema/mytable, taking into account the database, the schema, and the table.
In MongoDB, if you want to skip one table, you have to put a star at the beginning of the table name, for example "*contacts", or else specify the full path, "sensitive-data:27017/contacts/*" (specifying the database and the port). If you only put "sensitive-data:27017/contacts", the filter does not work. The column filter does not seem to work on MongoDB. This limitation only applies to the exclude_expression filter.
Filter Columns in Databases
You can filter out columns in databases by using the "Exclude location by suffix" filter to specify the columns or tables to exclude from the scan.
Description | Syntax |
---|---|
Exclude specific column across all tables in a database. | <column name> Example: To filter out "columnB" for all tables in a database, enter columnB. |
Exclude specific column from a particular table. | <table name>/<column name> Example: To filter out "columnB" only for "tableA" in a database, enter tableA/columnB. |
Note
Filtering locations for all Target types uses the same syntax. For example, an "Exclude location by suffix" filter for columnB, when applied to a database, excludes columns named columnB from the scan. If the same filter is applied to a Linux file system, it excludes all file paths that end with columnB (e.g. /usr/share/columnB). Use the Apply to field if the global filter only needs to be applied to a specific Target Group or Target.
Database Index or Primary Keys
Certain tables or columns, such as a database index or primary key, cannot be excluded from a scan. If a filter applied to the scan excludes these tables or columns, the scan will ignore the filter.
File Systems
Regarding the "Include locations modified recently" and "Include locations within modification date" filters, both ranges are taken into account. For example:
Suppose we have one file edited on 20 August 2021 and another one edited in November 2019, and on 20 August 2021 we add the following filters:
- "Include locations modified recently" - 4 Days from the current date,
- "Include locations within modification date" - Start: 15th Oct 2019 - End: 17th Aug 2021.
Then both files are taken into account, because the results of these two filters are combined.
Scans with remediation can be launched in parallel, but only one scan at a time can perform remediation. This means that all parallel scans with remediation will wait in the Classifying phase and be remediated one after another. After the first scan that started to remediate finishes, the next scan waiting in the queue will start, and so on. ↩
There are four possibilities when it comes to downloading logs for a scan:
Case 1: If you created the scan with "Trace Logs" enabled and it has a "Completed" status, you will see the "Download Logs" option. When you click the option, this popup appears: "Logs for scan "Xyz" are available for downloading."
Case 2: If you created the scan with "Trace Logs" disabled and it has a "Completed" status, you will see the "Enable Logs" option. When you click the option, the following message appears: "You need to enable trace logs in advanced configuration and run the scan "Xyz" again to download logs."
Case 3: If you first create the scan with "Trace Logs" disabled and it has a "Completed" status, the "Enable Logs" option appears. When you select the option, the popup from Case 2 appears. After that, if you enable "Trace Logs" in the scan View/Edit section but do not execute the scan again, you will see the "Download Logs" option with this popup: "Logs are not available because scan "Xyz" has never been executed with trace logs option enabled. Run the scan to generate them."
Case 4: If you first create the scan with "Trace Logs" disabled and it has a "Completed" status, the "Enable Logs" option appears. When you click the option, you will see the popup from Case 2. After that, if you enable "Trace Logs" in the scan View/Edit section and execute the scan to a "Completed" status, you will see the "Download Logs" option with this popup: "Logs for scan "Xyz" are available for downloading." ↩