About ML-driven Auto Classifiers

Advanced DLP: ML Auto Classifiers are an Advanced DLP capability and require an additional entitlement. Contact Skyhigh Support or your account manager for assistance.

Auto Classifiers, powered by Artificial Intelligence (AI) and Machine Learning (ML), are an Advanced DLP feature. These data classifiers automatically detect and categorize sensitive files using Skyhigh pre-trained AI and ML models. Skyhigh uses two types of ML Auto Classifiers, text classifiers and image classifiers, to identify various types of sensitive files in sanctioned and shadow/web services. With ML Auto Classifiers, you can detect sensitive files such as financial reports and statements, patient records, patents, source code, and ID files in various formats. For details on the supported file formats, see Supported File Formats.

NOTE: ML Auto Classifiers are supported for files uploaded to sanctioned or shadow/web services and are not supported for certain content types, such as email bodies and email headers. For the full lists of supported and unsupported content types, see FAQs.

ML Auto Classifiers provide a quick and effective method to discover sensitive data in real time, giving organizations granular DLP policy controls. They simplify identifying and classifying sensitive data and are helpful for administrators unfamiliar with complex DLP rules such as regular expressions and dictionaries. This automated approach to data classification can reduce inaccurate matches, that is, false positives and false negatives.

Security Operations Center (SOC) analysts can gain insights into the matches for ML Auto Classifiers, which helps them reduce investigation time and respond efficiently to data loss incidents. These capabilities allow SOC analysts to proactively identify and mitigate potential security threats within their organization.

► Advantages of ML Auto Classifiers

AI-ML Powered Automatic Data Classification. Automatically discovers and classifies files with sensitive data such as PII, financial records, healthcare records, and intellectual property using AI and ML models.
Comprehensive Categorization. Utilizes AI and ML to automatically categorize data across all exfiltration vectors, enhancing data governance.
Robust Policy Framework. Leverages the categories and subcategories for ML Auto Classifiers within the policy framework to build robust DLP policies.
Simplified DLP Administration. Streamlines DLP management by eliminating the need for manual data classification.
Enhanced Operational Efficiency. Significantly boosts operational efficiency in incident management.
Scalability. Provides flexible scalability to support large data volumes across standard file formats.
Confidence. Offers clear insights into the confidence percentage in data classification, reducing the risk of data breaches.
Risk Reduction. Minimizes the risk of inaccurate matches, reducing false positives and false negatives.

IMPORTANT: Skyhigh Security does not use your confidential data to train its AI and ML models for ML Auto Classifiers.

For example, a Security Operations Center (SOC) may want to restrict the upload of sensitive ID files such as passports to Google Drive. To do this, the SOC must first create a classification on the Classifications page using the ML Auto Classifier condition, selecting the PII category and its ID Documents (Image) subcategory. The SOC can then use the newly created ML Auto Classifier classification in their sanctioned DLP policy for Google Drive and apply the block response action. This enables admins to identify and secure ID files, such as passports containing PII data, that are uploaded to Google Drive.

Getting Started

To begin using the ML Auto Classifiers capability, follow these steps:

Create a Classification. Use the ML Auto Classifier condition to create a classification. For details, see Create a Classification using ML Auto Classifier.
Establish a DLP Policy. Create a Sanctioned or Shadow/Web DLP policy using the newly created ML Auto Classifier classification. For details, see Create a Sanctioned or Shadow/Web DLP policy.
Review Matches and Confidence Levels. View the matches triggered for ML Auto Classifiers and their confidence percentages in the Sanctioned or Shadow/Web DLP Incident cloud card. For details, see Sanctioned or Shadow/Web DLP Incident cloud card.

Supported File Formats

Skyhigh Security supports the following text-based and image-based file formats for detecting sensitive files in sanctioned and shadow/web services using ML Auto Classifiers.

► Supported File Formats

File Category	File Format	MIME Type	Extension(s)
Text Files
Spreadsheet Files	Microsoft Excel	VND.MS-EXCEL	XLS, XLT, and XLA
		VND.OPENXMLFORMATS-OFFICEDOCUMENT.SPREADSHEETML.SHEET	XLSX
		VND.OPENXMLFORMATS-OFFICEDOCUMENT.SPREADSHEETML.TEMPLATE	XLTX
		VND.MS-EXCEL.ADDIN.MACROENABLED.12	XLAM
		VND.MS-EXCEL.SHEET.BINARY.MACROENABLED.12	XLSB
	OpenDocument Spreadsheet Document	VND.OASIS.OPENDOCUMENT.SPREADSHEET	ODS
	Comma-separated values (CSV)	Text/CSV	CSV
Presentation Files	Microsoft PowerPoint	VND.MS-POWERPOINT	PPT, PPS, POT, and PPA
		VND.OPENXMLFORMATS-OFFICEDOCUMENT.PRESENTATIONML.PRESENTATION	PPTX
		VND.OPENXMLFORMATS-OFFICEDOCUMENT.PRESENTATIONML.TEMPLATE	POTX
		VND.OPENXMLFORMATS-OFFICEDOCUMENT.PRESENTATIONML.SLIDESHOW	PPSX
		VND.MS-POWERPOINT.TEMPLATE.MACROENABLED.12	POTM
		VND.MS-POWERPOINT.PRESENTATION.MACROENABLED.12	PPTM
		VND.MS-POWERPOINT.SLIDESHOW.MACROENABLED.12	PPSM
	OpenDocument Presentation Document	VND.OASIS.OPENDOCUMENT.PRESENTATION	ODP
Word Files	Microsoft Word	MSWORD	DOC, DOT
		VND.OPENXMLFORMATS-OFFICEDOCUMENT.WORDPROCESSINGML.DOCUMENT	DOCX
		VND.OPENXMLFORMATS-OFFICEDOCUMENT.WORDPROCESSINGML.TEMPLATE	DOTX
		VND.MS-WORD.DOCUMENT.MACROENABLED.12	DOCM
		VND.MS-WORD.TEMPLATE.MACROENABLED.12	DOTM
	OpenDocument Text Document	VND.OASIS.OPENDOCUMENT.TEXT	ODT
	Rich Text Format	RTF	RTF
PDF Files	Adobe Portable Document Format	PDF	PDF
Image Files
	Standard Image Files	N/A	JPEG, PNG, TIFF, BMP, FITS, GIF, JP2, WEBP, X-DCX, X-PCX, X-PHOTO-CD, X-PORTABLE-BITMAP, X-RGB, and X-TARGA
	Adobe Photoshop Files	VND.ADOBE.PHOTOSHOP	PSD
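As a rough illustration of how the table above could be used to pre-sort files before upload, here is a minimal Python sketch that maps a file name to the classifier type (text or image) that would apply. The function name `classifier_type` and the two sets are hypothetical and not part of any Skyhigh API. The image entries in the table read like format names rather than literal extensions, so the sketch includes only those that are also common file extensions; aliases such as `.jpg` are left out because the table does not list them.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported File Formats table (text-based rows).
TEXT_EXTENSIONS = {
    "xls", "xlt", "xla", "xlsx", "xltx", "xlam", "xlsb", "ods", "csv",
    "ppt", "pps", "pot", "ppa", "pptx", "potx", "ppsx", "potm", "pptm", "ppsm", "odp",
    "doc", "dot", "docx", "dotx", "docm", "dotm", "odt", "rtf",
    "pdf",
}

# Image-based rows. Assumption: only table entries that double as common
# file extensions are included (for example, X-TARGA is omitted).
IMAGE_EXTENSIONS = {"jpeg", "png", "tiff", "bmp", "fits", "gif", "jp2", "webp", "psd"}


def classifier_type(filename: str) -> Optional[str]:
    """Return "text", "image", or None if the extension is not in the table."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext in TEXT_EXTENSIONS:
        return "text"
    if ext in IMAGE_EXTENSIONS:
        return "image"
    return None
```

A lookup like this only reflects the documented extensions; the service itself may identify files by MIME type rather than by name.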

FAQs

► Does Skyhigh Security use your confidential data to train or build its AI and ML models for ML Auto Classifiers?: Skyhigh Security does not use your confidential data to train or build its AI and ML models for ML Auto Classifiers.

► What is the character limit for scanning text-based files using ML Auto Classifiers?: ML Auto Classifiers need at least 250 characters and can scan up to 50 million characters in text-based files.

► What is the image size and image resolution required for scanning image-based files using ML Auto Classifiers?: ML Auto Classifiers need at least 1024 bytes (1 kilobyte) and can scan up to 50 megabytes (MB) in image-based files. The image resolution should be at least 200 pixels in width and height.
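The text and image limits in the two answers above can be expressed as a simple pre-check. This is a minimal Python sketch, not Skyhigh code: the function and constant names are hypothetical, and it assumes that 50 MB means 50 × 1024 × 1024 bytes, which the documentation does not specify.

```python
# Limits taken from the FAQ answers above.
MIN_TEXT_CHARS = 250
MAX_TEXT_CHARS = 50_000_000
MIN_IMAGE_BYTES = 1024
# Assumption: 1 MB is treated as 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 50 * 1024 * 1024
MIN_IMAGE_SIDE_PX = 200


def text_in_range(char_count: int) -> bool:
    """True if a text-based file has between 250 and 50 million characters."""
    return MIN_TEXT_CHARS <= char_count <= MAX_TEXT_CHARS


def image_in_range(size_bytes: int, width_px: int, height_px: int) -> bool:
    """True if an image meets the size limits and is at least 200 px on each side."""
    size_ok = MIN_IMAGE_BYTES <= size_bytes <= MAX_IMAGE_BYTES
    resolution_ok = width_px >= MIN_IMAGE_SIDE_PX and height_px >= MIN_IMAGE_SIDE_PX
    return size_ok and resolution_ok
```

For instance, `image_in_range(2048, 199, 400)` is `False` because the width falls below 200 pixels, even though the file size is within range.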

► What are the content types supported by ML Auto Classifiers?

ML Auto Classifiers support the following content types:

Files uploaded to sanctioned or shadow/web services
Web service submissions
Web POST bodies
Email attachments

► What are the content types that are not supported by ML Auto Classifiers?

ML Auto Classifiers do not support the following content types:

Email bodies
Email headers
Subject lines
Web headers
Images embedded in PDF when OCR is enabled
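The two lists above can be captured in a small lookup, for example when documenting which DLP rules can rely on ML Auto Classifiers. The following Python sketch is illustrative only; the names `SUPPORTED`, `UNSUPPORTED`, and `is_supported` are hypothetical, and the strings are simply the list entries lowercased.

```python
from typing import Optional

# Entries copied from the supported and unsupported lists above.
SUPPORTED = {
    "files uploaded to sanctioned or shadow/web services",
    "web service submissions",
    "web post bodies",
    "email attachments",
}
UNSUPPORTED = {
    "email bodies",
    "email headers",
    "subject lines",
    "web headers",
    "images embedded in pdf when ocr is enabled",
}


def is_supported(content_type: str) -> Optional[bool]:
    """Return True or False for listed content types, None if not listed."""
    key = content_type.strip().lower()
    if key in SUPPORTED:
        return True
    if key in UNSUPPORTED:
        return False
    return None
```

Returning `None` for anything not on either list avoids guessing about content types that the documentation does not mention.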

► What are the file formats supported by ML Auto Classifiers?: ML Auto Classifiers support a range of text-based and image-based file formats for detecting sensitive files in sanctioned and shadow/web services. For details, see Supported File Formats.

► Do ML Auto Classifiers send your confidential data to an external LLM (Large Language Model) based service for processing?: No, ML Auto Classifiers do not send your confidential data to any external LLM based service for processing.

► What are the AI and ML techniques used in Skyhigh pre-trained models for ML Auto Classifiers?: Skyhigh Data Scientists evaluate various ML techniques and select the one that provides the highest accuracy for each ML Auto Classifier. These methods include both multi-class classifiers and binary classifiers using various supervised and unsupervised learning techniques. Skyhigh trains and validates its models for ML Auto Classifiers using diverse datasets from multiple sources.

► Are ML Auto Classifiers supported for Secure Web Gateway (Cloud and On-Prem) and CASB DLP policies?: ML Auto Classifiers are supported for SWG (Cloud) and CASB DLP policies. They are not supported for SWG (On-Prem) DLP policies, because data classifications are not supported in SWG (On-Prem) appliances.

► Are ML Auto Classifiers supported for a Trellix ePolicy Orchestrator (ePO) integration use case by syncing your existing DLP classifications from Trellix ePO to Skyhigh?: No, ML Auto Classifiers are supported only for classifications defined in the Skyhigh console (Classifications).

► Are ML Auto Classifiers supported with Data Identifiers?: No, ML Auto Classifiers are not supported with Data Identifiers. Data Identifiers are a legacy DLP feature that will no longer be supported after June 2025. Skyhigh recommends using a classification-based approach for all your DLP use cases. For details, see Migration Guide for Legacy Data Identifiers.

► Can ML Auto Classifiers process large files?: Yes, ML Auto Classifiers can process large text-based and image-based files up to 50 MB for classification.

► Can you create custom ML Auto Classifiers currently?: No, you cannot create custom ML Auto Classifiers currently.

► Do ML Auto Classifiers detect classified data in all languages or only in English?: Text-based ML Auto Classifiers, such as the Financial Reports/Statements, Patient Records, Patents, and Source Code classifiers, can detect classified data only in English. Image-based ML Auto Classifiers, such as the ID Documents classifier, can detect classified data in all languages.

► Does the Classification Tester support sample data in both plain text and file formats for ML Auto Classifiers?: For ML Auto Classifiers, the Classification Tester supports sample data uploaded as files but does not support sample data entered as plain text.