Similarity search
Similarity search retrieves sensitive information that matches DDC ML-specific NER infotypes, such as email addresses and GPS coordinates. See DDC ML infotypes for the complete list of scannable NER infotypes.
Similarity search finds specific entities in a database based on user queries, for example sensitive information such as names, addresses, and social security numbers. It also supports searches in multiple languages by using embedding models that are trained on data from many different languages. These models capture the meaning of terms and concepts across languages and represent them in a common vector space. As a result, when a user searches in one language, the system can still find relevant content written in another language, as long as the information has clear linguistic meaning.
Similarity search uses advanced text vectorization models and indexing algorithms for fast and accurate information retrieval. To perform similarity search, DDC scans the data store to discover and vectorize sensitive entities. It then uses these embeddings to build a vector index to facilitate querying based on nearest neighbor algorithms. Due to the semantic nature of the search model, multilingual search capabilities are also supported.
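The vectorize, index, and query flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: embed() is a toy character n-gram stand-in for a real multilingual embedding model, the index is a plain list scanned exhaustively rather than an optimized nearest neighbor structure, and none of the function names come from DDC.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy stand-in for an embedding model: character n-gram counts (assumption)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(entities):
    """Vectorize every discovered entity once, at scan time."""
    return [(entity, embed(entity)) for entity in entities]

def nearest(index, query, k=2):
    """Brute-force nearest neighbor lookup: the k entities closest to the query."""
    q = embed(query)
    scored = [(cosine(q, vec), entity) for entity, vec in index]
    return sorted(scored, reverse=True)[:k]

# Hypothetical entities found by a scan.
index = build_index([
    "jane.doe@example.com",
    "john.smith@example.com",
    "40.7128, -74.0060",
])
top = nearest(index, "jane.doe@example.org", k=1)
```

A production index would replace the linear scan with an approximate nearest neighbor structure so that queries stay fast as the number of indexed entities grows.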
Each infotype found in the indexed scan is compared to the user's query using a similarity metric like cosine similarity. Filters for the selected infotype, minimum similarity score, and data stores are also applied. Data objects that don't have the required infotypes (as defined in the search setup) are excluded.
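The comparison and filtering step can be sketched as follows. The record layout, the function names, and the 0 to 100 scaling of cosine similarity are assumptions made for illustration; the sketch only mirrors the filters named above (infotype, minimum similarity score, data store, and required infotypes).

```python
import math

def cosine_score(a, b):
    """Cosine similarity between two dense vectors, scaled to 0-100."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return 100.0 * dot / norm if norm else 0.0

def search(entities, queries, thresholds, data_stores, mandatory):
    """queries and thresholds are dicts keyed by infotype; mandatory is a set."""
    matches = []
    for e in entities:
        # Filter by selected data stores and selected infotypes.
        if e["data_store"] not in data_stores or e["infotype"] not in queries:
            continue
        score = cosine_score(e["vector"], queries[e["infotype"]])
        # Filter by the minimum similarity score for this infotype.
        if score >= thresholds[e["infotype"]]:
            matches.append({**e, "score": score})
    # Exclude data objects that did not match every mandatory infotype.
    found = {}
    for m in matches:
        found.setdefault(m["data_object"], set()).add(m["infotype"])
    return [m for m in matches if mandatory <= found[m["data_object"]]]

# Hypothetical indexed entities with hand-made 2D vectors.
entities = [
    {"data_store": "fs1", "data_object": "a.txt", "infotype": "full_name", "vector": [1.0, 0.0]},
    {"data_store": "fs1", "data_object": "a.txt", "infotype": "email", "vector": [0.0, 1.0]},
    {"data_store": "fs1", "data_object": "b.txt", "infotype": "email", "vector": [0.0, 1.0]},
    {"data_store": "fs2", "data_object": "c.txt", "infotype": "full_name", "vector": [1.0, 0.0]},
]
results = search(
    entities,
    queries={"full_name": [1.0, 0.1], "email": [0.0, 1.0]},
    thresholds={"full_name": 90, "email": 90},
    data_stores={"fs1"},
    mandatory={"full_name"},
)
```

In this example b.txt is dropped because it lacks the mandatory full_name infotype, and c.txt is dropped because its data store was not selected.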
The results are grouped first by data object, then by matched infotype within each object. For each group, the system reports the number of matches and the median similarity score, which are used to rank the results.
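A minimal sketch of this grouping and ranking, assuming each match is a record with data_object, infotype, and score fields (hypothetical names). The exact ranking rule is not documented here, so the sketch assumes ordering by median score first and match count second.

```python
from collections import defaultdict
from statistics import median

def group_and_rank(matches):
    """Group matches by (data object, infotype) and rank the groups."""
    groups = defaultdict(list)
    for m in matches:
        groups[(m["data_object"], m["infotype"])].append(m["score"])
    rows = [
        {
            "data_object": obj,
            "infotype": infotype,
            "matches": len(scores),
            "median_score": median(scores),
        }
        for (obj, infotype), scores in groups.items()
    ]
    # Assumed ordering: highest median first, match count as tie-breaker.
    rows.sort(key=lambda r: (r["median_score"], r["matches"]), reverse=True)
    return rows

# Illustrative scores only, not real product output.
rows = group_and_rank([
    {"data_object": "a.txt", "infotype": "email", "score": 96.0},
    {"data_object": "a.txt", "infotype": "email", "score": 92.0},
    {"data_object": "a.txt", "infotype": "email", "score": 100.0},
    {"data_object": "b.txt", "infotype": "full_name", "score": 91.0},
])
```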
Prerequisites
Ensure that the DDC ML agent is installed and configured. See DDC ML agent for more details.
Ensure you are using CipherTrust Manager version 2.20 or later.
Note
The Google Cloud (GCP) region associated with TDPaaS is also used for onboarding MLaaS.
Add Local or Network data stores where you want to perform the search.
See adding Local storages and Network storages for adding local and network data stores, respectively.
Configure and execute a scan on the added data store with indexing enabled for an ML classification profile.
See Classification profile to learn how to create an ML classification profile using ML infotypes.
See Scans to learn how to execute a scan with indexing enabled.
Perform similarity search
Log on to the CipherTrust Manager console.
Click Search > Add Search.
On the General Info screen, provide the name for the search and click Next.
On the Configure Search screen, perform the following tasks:
In Subject Search, select either Full Name or Primary Email as the mandatory infotype.
Enter the corresponding name or email address.
Select additional infotypes from the Add a new optional field drop-down.
From the This field is drop-down, select whether the infotype is Mandatory or Optional.
Enter the value for the infotype to search for.
Click Add.
Repeat the above steps to add more infotypes.
Click Next.
On the Configure Accuracy screen, provide the search accuracy threshold for different infotypes.
See Refine search result for more details.
Click Restore Defaults to restore the accuracy thresholds to the default value of 90.
Click Next.
On the Select Data Stores screen, select the target data stores where you want to search for the infotype.
Note
Currently, only Local and Network data stores are supported.
Data store availability depends on which scans were previously indexed.
Click Save.
DDC performs the similarity search for the selected infotypes and displays the results on the Search page.
Note
Similarity search depends on the named entities extracted through NER scans, so search effectiveness is limited by the accuracy of the NER process. Only the entities detected in the scan (as visible in the scan report) are indexed as potential matches for future search queries.
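The search configured in the steps above can be thought of as a small data structure with a few validation rules. The sketch below is purely illustrative: the class and field names are hypothetical and do not represent a DDC API. It encodes the rule that Subject Search needs a mandatory Full Name or Primary Email and the default accuracy threshold of 90; the requirement of at least one data store is an assumption.

```python
from dataclasses import dataclass, field

SUBJECT_INFOTYPES = {"Full Name", "Primary Email"}
DEFAULT_THRESHOLD = 90  # default accuracy threshold per the Configure Accuracy step

@dataclass
class SearchField:
    infotype: str
    value: str
    mandatory: bool = False
    threshold: int = DEFAULT_THRESHOLD

@dataclass
class SimilaritySearch:
    name: str
    fields: list
    data_stores: list = field(default_factory=list)

    def validate(self):
        """Return a list of problems; an empty list means the setup looks valid."""
        errors = []
        if not any(f.mandatory and f.infotype in SUBJECT_INFOTYPES for f in self.fields):
            errors.append("a mandatory Full Name or Primary Email field is required")
        for f in self.fields:
            if not 0 <= f.threshold <= 100:
                errors.append(f"{f.infotype}: threshold must be between 0 and 100")
        if not self.data_stores:  # assumption: at least one data store is needed
            errors.append("select at least one data store")
        return errors

ok = SimilaritySearch(
    name="subject-request",
    fields=[
        SearchField("Full Name", "Jane Doe", mandatory=True),
        SearchField("Primary Email", "jane.doe@example.com"),
    ],
    data_stores=["fs1"],
)
missing_subject = SimilaritySearch(
    name="no-subject",
    fields=[SearchField("Primary Email", "jane.doe@example.com")],
    data_stores=["fs1"],
)
```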
View search results
To view search results, click the corresponding search name. Alternatively, click the ellipsis icon (...) corresponding to the search name and select Show Results.
Edit search
You can view, edit, or rerun the search with modified parameters.
On the Search screen, click the ellipsis icon (...) corresponding to the search name.
Select View/Edit.
Expand the GENERAL section to edit the search name.
Expand the CONFIGURE SEARCH section to add or remove infotypes for searching.
Expand the DATA STORES section to add or remove data stores.
Select the Run Now check box to rerun the search after making changes.
Click Save Changes.
Delete search
To delete a previously created similarity search, click the ellipsis icon (...) corresponding to the search name and select Remove.
Refine search result
You can refine the search results by adjusting the similarity threshold and the number of mandatory infotypes:
Similarity Threshold: This is a score from 0 to 100 that sets how similar a result must be to your query to be returned. A higher threshold returns only results that are very similar to the search query. Note that the score measures how closely the meanings are related, not just how many words match. The default settings are usually helpful, but you can adjust them to fit your needs. For example, if you want only very close matches, set the threshold between 95 and 100. Cross-lingual queries, on the other hand, need a relaxed threshold, for example around 40 to 50, to return appropriate matches. The following table summarizes the recommended settings for the similarity threshold.
Aspect | Default Threshold | Recommended Adjustments
General Use | 90% | Standard starting point; adjust for finer control.
Precision Matches | 95% - 100% | Higher thresholds for queries like email addresses, URLs, access keys to minimize false positives.
Ambiguous Infotypes | 70% - 90% | Lower thresholds for personal names, mailing addresses, etc., due to variations in representation.
Cross-lingual queries | 40% and above | Adjust based on vocabulary (lexicon) and word structure variations across languages.
Mandatory vs. Optional Infotypes: This means that some types of data must be included in your results (mandatory) while others can be included if they are available (optional). According to DDC requirements, either a full name or an email must always be included. If you require too many mandatory types, it could limit your results because fewer items will meet all those criteria.
Apart from the median similarity score, getting multiple matches for a subject search across a variety of infotypes is a positive indicator of relevance and accuracy.
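To illustrate how the threshold guidance in this section changes what is returned, the sketch below stores the documented ranges in a lookup table and filters a list of scored candidates. The names are hypothetical, the scores are made-up inputs rather than real output, and using the low end of each range as the threshold is an assumption made for the example.

```python
# Threshold ranges taken from the guidance in this section (low, high).
THRESHOLD_GUIDANCE = {
    "general": (90, 90),
    "precision": (95, 100),
    "ambiguous": (70, 90),
    "cross_lingual": (40, 100),
}

def filter_by_guidance(scored, kind, threshold=None):
    """Keep candidates whose score meets the threshold for the given kind.

    If no threshold is passed, the low end of the documented range is used
    (an assumption for this sketch).
    """
    low, _high = THRESHOLD_GUIDANCE[kind]
    if threshold is None:
        threshold = low
    return [(value, score) for value, score in scored if score >= threshold]

# Illustrative candidates for a full-name query.
candidates = [("John Smith", 97.5), ("Jon Smith", 84.0), ("J. Smyth", 72.0)]

strict = filter_by_guidance(candidates, "general")     # threshold 90
relaxed = filter_by_guidance(candidates, "ambiguous")  # threshold 70
```

With the general default only the closest spelling passes, while the relaxed setting for ambiguous infotypes also admits the name variants, at the cost of more potential false positives.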