About Exact Data Match (EDM) Fingerprints

Exact Data Match (EDM) Enhanced fingerprints allow Skyhigh Security Service Edge users to protect sensitive user records in row-and-column format, typically extracted from a database as a CSV file. You build structured indexes of that data on-premises and use DLP policies to prevent sensitive information from leaving the organization. EDM is available for both Skyhigh CASB and Secure Web Gateway web apps.

Employee records, customer records, and patient medical records are typical examples of large and sensitive information groups that need protection. Although you can protect such records by matching patterns and dictionary terms, these methods of data matching require complex conditions and rule logic, which are prone to false matching.

Matching individual fields of a sensitive record, such as name, date of birth, or telephone number, might not be useful and can easily result in a false match. But matching two or more fields of the same sensitive record (for example, both name and social security number) within the same text (such as an email or a document) indicates that meaningful related information is present.

EDM (Enhanced) fingerprints match the actual words in the defined columns within a certain proximity in a record (line or row) against the original user record, which prevents false matches.

To train your data and generate an EDM (Enhanced) fingerprint file, you'll first need to install the DLP Integrator software (v6.1.2 or later), which includes the EDMTrain tool. The tool normalizes and hashes the data source file and generates the required .props file (metadata file) and .dis file (fingerprint file).

Once your data is trained, generate the EDM (Enhanced) fingerprint file to the shared storage location and use it as classification criteria in a DLP Policy rule to match data for any violations. DLP supports scanning emails, web posts, and network traffic with EDM fingerprints.

EDM (Enhanced) supports multibyte characters using Unicode to process data in all languages including Chinese, Japanese, and Korean. EDM (Enhanced) offers improved scanning and fingerprinting performance compared to the existing EDM (Legacy) fingerprints.

NOTE: The existing Structured (EDM) and Unstructured (IDM) fingerprints are still supported, but they are indicated as Legacy in the user interface.

You can also use tools provided with the DLP Integrator software to automate EDM (Enhanced) fingerprint updates.

EDM (Enhanced) Workflow

Export user records from a database to a .csv file or .tsv file. (Alternatively, you can use the pipe command to send user records from the database to the EDMTrain tool (fingerprint tool) for processing.)
Create the EDM (Enhanced) fingerprint file and generate an index.
Define content classification criteria using the exact data matching criteria.
Create a rule set with the EDM (Enhanced) classification criteria and apply it to a DLP policy.
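
The export step of the workflow above can be sketched as follows. This is a minimal illustration, not part of the DLP Integrator software: the export_tsv function, the customers table, and its column names are hypothetical, and SQLite stands in for whatever database holds your user records. The sketch writes a header line followed by tab-separated rows, and rejects any cell that contains the delimiter, because the delimiter cannot be escaped in the data source file.

```python
import io
import sqlite3


def export_tsv(conn, query, out, delimiter="\t"):
    """Write the query results to `out` as delimited text with a header line.

    Returns the number of data rows written (the header is not counted).
    """
    cur = conn.execute(query)
    header = [col[0] for col in cur.description]
    out.write(delimiter.join(header) + "\n")
    count = 0
    for row in cur:
        cells = ["" if value is None else str(value) for value in row]
        for cell in cells:
            # The delimiter cannot be escaped, so refuse to write it.
            if delimiter in cell or "\n" in cell:
                raise ValueError(f"cell {cell!r} contains a delimiter or line break")
        out.write(delimiter.join(cells) + "\n")
        count += 1
    return count


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (full_name TEXT, account_number TEXT, phone TEXT)")
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?)",
    ("John Doe", "123-456-789", "+1-555-1234"),
)
buf = io.StringIO()
rows_written = export_tsv(conn, "SELECT * FROM customers", buf)
print(buf.getvalue(), end="")
```

In practice you would pass an open file instead of the in-memory buffer, and then train the resulting file with the EDMTrain tool.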

How EDM Works

Exact Data Match (EDM) Enhanced allows associative matching of words across multiple fields (cells) of a user record. The match rules are based on:

Number of field (cells) matches — The number of field matches within the specified proximity constitute a record match (can appear in any order).
Proximity — The proximity is measured in terms of words. Proximity is the maximum permitted distance between adjacent fields that you can specify for field matches. You can set the proximity between 0–40 words.
Column values — Allows you to select the column values that must be found in a match and within the specified proximity. You can specify the Columns to scan for from the list of column values included in the fingerprint file. Optionally, you can specify the Mandatory columns that must be present for a record to match. You can also set Exceptions to exclude a combination of columns, which means that a match consisting of any combination of only the excluded columns isn't counted as a record match.

Select columns for content match criteria

Consider the following record from a dataset protected with EDM (Enhanced):

Full Name	Account Number	Customer number	Phone number	Email Address
John Doe	123-456-789	234567	+1-555-1234	john_doe@mycompany.example

You might decide that the data you don't want to leak is the combination of the account number with any information that can be used to identify the account owner.

If you specify the record match criteria as "at least 2 out of the 5 cell values with not more than 15 words between matched cell values", then add "Account number" as a mandatory column, any document containing the account number within the specified proximity of one or more other column values triggers a record match.

Consider:

Example A: John Doe has a telephone number +1-555-1234. He lives in New York and his account number is 123-456-789 with customer number 234567.
Example B: 123-456-789 234567

You might decide that the combination of "Account number" and "Customer number" does not constitute a leak, and is causing false positives in numeric spreadsheets, so you can create an exception for the match "Account number", "Customer number".

This means that Example B, 123-456-789 234567, is not a match.

However, those fields combined with any other cell value are still a match. Example A is still a match because "Full name" is combined with "Phone number", "Account number", and "Customer number":

John Doe 123-456-789 234567
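
The record match rules in this example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the product's matching engine: the function names are hypothetical, the tokenizer only lowercases and splits on whitespace, only the first occurrence of each cell value is considered, and proximity is taken to be the number of words between adjacent matched cell values.

```python
def tokens(text):
    """Rough tokenizer: lowercase, split on whitespace, trim edge punctuation."""
    return [t.strip(".,;:") for t in text.lower().split()]


def find_span(words, value_words):
    """Return (start, end) of the first occurrence of value_words, or None."""
    n = len(value_words)
    if n == 0:
        return None
    for i in range(len(words) - n + 1):
        if words[i:i + n] == value_words:
            return (i, i + n - 1)
    return None


def record_matches(text, record, min_fields, proximity, mandatory=(), exceptions=()):
    """Decide whether `text` contains a record match for `record`."""
    words = tokens(text)
    hits = []
    for column, value in record.items():
        span = find_span(words, tokens(value))
        if span is not None:
            hits.append((span, column))
    hits.sort()

    # Group hits into clusters where adjacent hits are within the proximity.
    clusters, current = [], []
    for span, column in hits:
        if current and span[0] - current[-1][0][1] - 1 > proximity:
            clusters.append(current)
            current = []
        current.append((span, column))
    if current:
        clusters.append(current)

    for cluster in clusters:
        columns = {column for _, column in cluster}
        if len(columns) < min_fields:
            continue
        if not set(mandatory) <= columns:
            continue
        # A match made up only of excepted columns is not counted.
        if any(columns <= set(exc) for exc in exceptions):
            continue
        return True
    return False


record = {
    "Full Name": "John Doe",
    "Account Number": "123-456-789",
    "Customer number": "234567",
    "Phone number": "+1-555-1234",
    "Email Address": "john_doe@mycompany.example",
}
rules = dict(
    min_fields=2,
    proximity=15,
    mandatory=["Account Number"],
    exceptions=[["Account Number", "Customer number"]],
)
example_a = ("John Doe has a telephone number +1-555-1234. He lives in New York "
             "and his account number is 123-456-789 with customer number 234567.")
example_b = "123-456-789 234567"

print(record_matches(example_a, record, **rules))
print(record_matches(example_b, record, **rules))
```

With these rules the sketch reports Example A as a match and Example B as not a match, which mirrors the behavior described above.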

EDM boundaries

The data source file can be a .csv (comma-separated value) file or .tsv (tab-separated value) file. It isn't possible to escape the delimiter character, so the delimiter character must not appear in the data.
Automatic fingerprint creation is supported for data source files with up to 6 billion cells, subject to a maximum of 4,294,967,295 rows (about 4 billion user records) and 32 columns.
The first line of the data source file can be a header of column names, as the headers are extracted into the .props file. The -h (--header) option used in the edmtrain command indicates that the first line is a header line. If the header option isn't specified, the tool generates the output file with the column names, such as Column 1, Column 2, and Column 3.
A column name can't contain more than 100 characters.
Phone numbers in the fingerprint file only match identical numbers in text files. For example, a phone number entered in the CSV without a country code doesn't trigger a match with the same number in the text file if it includes the country code.
Exact data matching supports multibyte characters using Unicode to process data in all supported languages including Chinese, Japanese, and Korean.
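
Before training, you might check a data source file against the limits listed above. The following is a minimal sketch, not a tool shipped with the DLP Integrator software: the check_data_source function is hypothetical, and it assumes that the header line does not count toward the row and cell limits. A line with an unexpected number of fields usually means that the delimiter character appears inside the data.

```python
MAX_CELLS = 6_000_000_000
MAX_ROWS = 4_294_967_295
MAX_COLUMNS = 32
MAX_COLUMN_NAME_LENGTH = 100


def check_data_source(lines, delimiter=",", has_header=True):
    """Return a list of problems found in the data source; empty means none."""
    problems = []
    expected = 0
    rows = 0
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        if number == 1:
            expected = len(fields)
            if expected > MAX_COLUMNS:
                problems.append(
                    f"{expected} columns exceeds the limit of {MAX_COLUMNS}")
            if has_header:
                for name in fields:
                    if len(name) > MAX_COLUMN_NAME_LENGTH:
                        problems.append(
                            f"column name starting {name[:20]!r} is longer "
                            f"than {MAX_COLUMN_NAME_LENGTH} characters")
                continue  # assumption: the header is not a user record
        elif len(fields) != expected:
            problems.append(
                f"line {number}: expected {expected} fields, found {len(fields)}")
        rows += 1
    if rows > MAX_ROWS:
        problems.append(f"{rows} rows exceeds the limit of {MAX_ROWS}")
    if rows * expected > MAX_CELLS:
        problems.append(f"{rows * expected} cells exceeds the limit of {MAX_CELLS}")
    return problems


sample = [
    "Full Name,Account Number\n",
    "Doe, John,123-456-789\n",  # stray comma inside the name
]
print(check_data_source(sample))
```

Because the function accepts any iterable of lines, you can pass an open file object to check a large export without loading it into memory.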

Tokenize and Normalize Data

The content of the data source file is tokenized and normalized before being hashed into fingerprints. The same rules are applied to the data that is scanned for exact data matching:

User data is considered equal regardless of the case. For example, DATA, Data, and data are considered the same.
Non-word characters aren't considered for matching. Unlike \W in PCRE, this includes connector punctuation (for example, underscore), so the equivalent regular expression is [\W\p{Pc}].
Japanese address area codes (chōme-ban-gō) are normalized such that 二丁目２１番９号, ２の２１の９, and 2-21-9 are considered the same.
Applies ICU NFKC normalization (Unicode® Standard, Annex #15). For example, half-width and full-width Japanese characters are converted to their normal equivalent forms.
Applies ICU rules for text segmentation (Unicode® Standard, Annex #29) with these minor deviations, so that the text is broken into more meaningful tokens for exact data matching:
- Comma and semicolon variants are considered word breakers when located between numbers. In contrast, '/' is not considered a word breaker when located between numbers, so that dates are treated as a single word.
- Connector punctuation (for example, underscore) and the NARROW NO-BREAK SPACE character are considered word breakers.
- '-', '+', and '@' variants aren't considered as word breakers when located between numbers and/or letters.
ICU rules for text segmentation with ASCII single word identifiers permit these characters and separators:
- Numeric: [0-9] separated by [./@+-]. For example, see Enhanced EDM to Support Additional Date Formats.
- Alphabetical: [a-zA-Z] separated by [.@+-]
- Alphanumeric: [a-zA-Z0-9] separated by [@+-] ('.' and '/' are also permitted when both sides are numeric, and '.' is also permitted when both sides are alphabetical).
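
The ASCII rules above can be approximated with the Python standard library. This is a rough sketch for illustration only: it does not use ICU, it does not perform dictionary-based segmentation or the Japanese address normalization, and the normalize and tokenize function names are hypothetical. It applies NFKC normalization and lowercasing, treats underscore as a word breaker, keeps '@', '+', and '-' inside a token when they sit between letters or digits, keeps '.' and '/' between digits, and keeps '.' between letters.

```python
import re
import unicodedata

# A run of letters or digits; underscore is excluded so that it breaks words.
WORD = r"[^\W_]+"
# Separators that may join two runs into a single token.
JOIN = (
    r"(?:[@+\-]"                      # always allowed between runs
    r"|(?<=[0-9])[./](?=[0-9])"       # '.' or '/' with digits on both sides
    r"|(?<=[a-z])\.(?=[a-z]))"        # '.' with letters on both sides
)
TOKEN = re.compile(rf"{WORD}(?:{JOIN}{WORD})*")


def normalize(text):
    """Apply NFKC normalization and ignore case."""
    return unicodedata.normalize("NFKC", text).lower()


def tokenize(text):
    """Split normalized text into tokens using the simplified rules above."""
    return TOKEN.findall(normalize(text))


for sample in [
    "DATA Data data",
    "john_doe@mycompany.example",
    "1,234;5",
    "Born 01/02/1990.",
    "ＡＢＣ１２３",
]:
    print(sample, "->", tokenize(sample))
```

In this sketch the email address splits into two tokens at the underscore, and a leading '+' on a phone number is dropped because it is not located between letters or digits. How the product treats such edge cases is not stated here, so verify them against your own test data.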