Prepare the IDM (Enhanced) Fingerprint File

Fingerprint files are created when you train the data source file with the IDMTrain tool, which is included in the DLP Integrator. As a prerequisite for Index Document Matching (IDM), you must train the data source file with the IDMTrain tool to generate the fingerprint database (.db) file.

All values in the data source file are normalized and hashed in the fingerprint file, regardless of the definition you use in classifications.

Install the DLP Integrator

To use IDM, install DLP Integrator v.6.4.0, which includes the IDMTrain tool and is supported on both Windows and Linux platforms.

For more information, see:

Generate the database (.db) files using the IDMTrain Tool

To train the data source file, run the idmtrain command from the command line interface (CLI) with the following options.

Command Line Options

CLI Option - Short form	CLI Option - Full form	Description
`-?`	`[ --help ]`	`Shows the IDMTrain tool help and exits`
`-v`	`[ --verbose ]`	`Shows verbose output`
`-V`	`[ --version ]`	`Displays version information and exits`
`-q`	`[ --quantity ]`	`Runs an extra initial dummy training pass to count the files`
`-A`	`[ --all-files ]`	`Processes all files (including hidden files)`
`-E`	`[ --no-errors ]`	`Specifies not to generate error messages or enforce thresholds`
`-W`	`[ --no-warnings ]`	`Specifies not to generate warning messages`
`-j`	`[ --json ] file`	`Outputs progress and exit status to file as JSON`
`-r`	`[ --report ] file`	`Outputs training information to file as JSON`
`-o`	`[ --output ] file`	`Specifies the resultant database file to create`
`-e`	`[ --errors ] % (=5)`	`Specifies the error threshold as a percentage (default 5)`
`-D`	`[ --db-name ] name`	`Specifies the database name (default based on output file name)`
`-x`	`[ --exclude ] pat ...`	`Excludes files that match a case-insensitive MS-DOS pattern`
`-p`	`[ --progress ] [=secs(=2)]`	`Shows the progress at the specified interval in seconds (default 2)`
`-@`	`[ --options ] file ...`	`Reads positional options from file (see the Options File section below)`
`-a`	`[ --age-whole ] days`	`Uses whole file matching only for files that are older than this many days`
`-s`	`[ --size-whole ] sigs`	`Uses whole file matching only for files with this many signatures`
`-t`	`[ --type-whole ] pat ...`	`Uses whole file matching only for files that match an MS-DOS pattern`

Options File

For example, running the training tool with the command:

idmtrain -p -q -r c:\idm\out\traininfo.json -o c:\idm\out\fingerprint.db -C ee9a4c72-81ff-479d-a493-1104d37100ea -R c:\idm\small_samples\

is equivalent to:

idmtrain -p -q -r c:\idm\out\traininfo.json -o c:\idm\out\fingerprint.db -@ c:\idm\folders.txt

where c:\idm\folders.txt contains:

-C ee9a4c72-81ff-479d-a493-1104d37100ea -R c:\idm\small_samples\

The options file can contain many lines, up to a total size of 16 MB. Wildcard expansion that the shell would normally perform is not applied to the contents of the options file. To use UTF-8 rather than the native character set, save the file with a UTF-8 BOM (byte order mark).
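
The options file rules above (UTF-8 BOM, 16 MB limit, no shell wildcard expansion) can be sketched in a small Python helper. This is an illustrative sketch only, not part of the IDMTrain tool: the function names are hypothetical, "16 MB" is assumed to mean 16 * 1024 * 1024 bytes, and any quoting needed for paths that contain spaces is not covered by this page and is not handled here.

```python
import codecs
import glob
from pathlib import Path

# Assumption: "16 MB" is read here as 16 * 1024 * 1024 bytes.
MAX_OPTIONS_FILE_BYTES = 16 * 1024 * 1024


def expand_patterns(patterns):
    """Expand wildcards up front, because no shell expansion happens in the options file."""
    paths = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the pattern unchanged when nothing matches it.
        paths.extend(matches if matches else [pattern])
    return paths


def write_options_file(path, option_lines):
    """Write option lines to `path` as UTF-8 with a leading BOM; return the byte count."""
    text = "\n".join(option_lines) + "\n"
    data = codecs.BOM_UTF8 + text.encode("utf-8")
    if len(data) > MAX_OPTIONS_FILE_BYTES:
        raise ValueError("options file would exceed the 16 MB limit")
    Path(path).write_bytes(data)
    return len(data)
```

For instance, calling write_options_file with the single line shown in c:\idm\folders.txt above would produce a BOM-prefixed file that can then be passed to idmtrain with -@.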

Whole File Matching

Whole file matching is a technique used in the Skyhigh fingerprinting process to match files exactly, based on their content or metadata. This method allows you to match both files with fewer than 300 characters and files with more than 300 characters within your fingerprint directories. Use the CLI options -a, -s, and -t to apply whole file matching based on file age, number of signatures, and file type, respectively. For example, the following command uses whole file matching for files older than 30 days in your fingerprint directories during the IDM training process:

idmtrain -p -q -r c:\idm\out\traininfo.json -o c:\idm\out\fingerprint.db -C ee9a4c72-81ff-479d-a493-1104d37100ea -R c:\idm\small_samples\ -a 30
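
When training is scripted, the command above can be assembled as an argument list rather than a single shell string. The sketch below is a hedged illustration: the function and parameter names are hypothetical, it uses only the options documented on this page, it mirrors the option order of the example command, and it assumes idmtrain is available on the PATH.

```python
import subprocess


def build_idmtrain_command(output_db, report_json, class_guid, rpaths,
                           age_days=None, executable="idmtrain"):
    """Build an argument list in the same order as the example command."""
    cmd = [executable, "-p", "-q", "-r", report_json, "-o", output_db]
    # Position-dependent options: the classification comes before its paths.
    cmd += ["-C", class_guid, "-R", *rpaths]
    if age_days is not None:
        # Whole file matching only for files older than this many days.
        cmd += ["-a", str(age_days)]
    return cmd


def run_idmtrain(cmd):
    """Run the command and return its exit status without interpreting it."""
    return subprocess.run(cmd, check=False).returncode
```

Passing a list to subprocess.run avoids shell quoting issues with Windows paths; the meaning of individual exit codes is not documented here, so the sketch returns the status unchanged.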

Position Dependent Options

CLI Option - Short form	CLI Option - Full form	Description
`-I`	`[ --ignore ]`	`Trains the paths that follow to be ignored rather than classified`
`-C`	`[ --class ] guid ...`	`Trains the paths that follow with these classifications`
`-P`	`[ --path ] path ...`	`Trains files directly under these paths`
`-R`	`[ --rpath ] path ...`	`Trains files recursively under these paths`

In a standard run, note the following points:

Hidden files are not trained. On Windows, hidden files are those that have the hidden attribute set or whose names start with "."; on Linux, files whose names start with "." are hidden. In addition, __MACOSX folders are not trained on any operating system. Use the -A or --all-files command line option to train hidden files.
Directory symbolic links are not followed.
Whole file matching does not trigger matches for Microsoft files with additional metadata. If you use the -a, -s, or -t CLI options for whole file matching during IDM training, matches are not triggered for some Microsoft file formats stored on Microsoft SharePoint. This occurs because Microsoft modifies the file's metadata on upload to SharePoint, which alters the file's signature.
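
To preview which files a standard run is likely to pick up, the hidden-file, __MACOSX, and symbolic link rules above can be approximated in Python. This is a rough sketch, not the tool's actual logic. It rests on several assumptions the page does not confirm: the hidden check is applied to files only (not to their parent directories), __MACOSX folders are skipped even when all_files is set, and the function names are hypothetical.

```python
import os
import stat


def is_hidden(path):
    """Return True for dot-files and, on Windows, files with the hidden attribute."""
    if os.path.basename(path).startswith("."):
        return True
    if os.name == "nt":
        # st_file_attributes is only available on Windows.
        attrs = getattr(os.lstat(path), "st_file_attributes", 0)
        return bool(attrs & stat.FILE_ATTRIBUTE_HIDDEN)
    return False


def files_for_standard_run(root, all_files=False):
    """List files under root, approximating the standard-run rules described above."""
    selected = []
    # followlinks=False: directory symbolic links are not followed.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Assumption: __MACOSX folders are skipped regardless of all_files.
        dirnames[:] = [d for d in dirnames if d != "__MACOSX"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if all_files or not is_hidden(full):
                selected.append(full)
    return sorted(selected)
```

Setting all_files=True corresponds to the -A or --all-files option; comparing the two listings shows which hidden files that option would add.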

NOTE: If -r is not specified, warnings and errors go to stderr, followed by a completion message. If -p is specified, progress is output; if -j is also used, the progress output goes to the JSON file. The -q option is required for percentage progress.