Classify
Define a policy to discover and label sensitive data including personally identifiable information, credentials, and even custom regular expressions inside text, logs, and other structured data.
The classify API policy structure has two notable sections. First, the models array will have one item that is keyed by classify.
Within the classify object:
- A data_source is required
  - This parameter can be overridden via the command line interface (CLI)
  - At this time, csv and plain-text data formats are supported.
- A labels array is required to specify named entities to search for, including:
  - Namespaces for custom regular expressions (optional)
schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
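The structural rules above can be sketched as a small check. This is only an illustration: `check_classify_policy` is a hypothetical helper, not part of any Gretel tooling, and it assumes the policy has already been parsed from YAML into a Python dict.

```python
# Hypothetical helper (not a Gretel API): checks the shape of a policy
# that has already been parsed into a Python dict.
def check_classify_policy(policy: dict) -> list:
    """Return a list of problems found; an empty list means the shape looks right."""
    models = policy.get("models")
    if not isinstance(models, list) or len(models) != 1:
        return ["models must be an array with exactly one item"]
    first = models[0]
    classify = first.get("classify") if isinstance(first, dict) else None
    if not isinstance(classify, dict):
        return ["the single models item must be keyed by classify"]

    problems = []
    if not classify.get("data_source"):
        problems.append("classify.data_source is required")
    if not isinstance(classify.get("labels"), list):
        problems.append("classify.labels must be an array")
    return problems


# The same policy as the YAML example, written as a dict.
policy = {
    "schema_version": "1.0",
    "name": "my-awesome-model",
    "models": [
        {
            "classify": {
                "data_source": "_",
                "labels": [
                    "person_name",
                    "credit_card_number",
                    "phone_number",
                    "us_social_security_number",
                    "email_address",
                ],
            }
        }
    ],
}

print(check_classify_policy(policy))  # prints []
```

A check like this only mirrors the requirements listed on this page; the service may enforce additional rules.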
Within the config, you may optionally specify a label_predictors object where you can define custom predictors that will create custom entity labels.
This example creates a custom regular expression for a custom user id format:
schema_version: "1.0"
name: "classify-my-data"
# ... classify model defined here ...
label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
If you wish to create custom predictors, you must provide a namespace, which will be used when constructing the labels.
regex: Create your own regular expressions to match and yield custom labels. The value for this property should be an object that is keyed by the labels you wish to create. For each label you wish to create, you should provide an array of patterns. Patterns are objects consisting of:
- score: One of high, med, low. These map to floating point values of 0.8, 0.5, and 0.2 respectively. If omitted, the default is high.
- regex: The actual regex that will be used to match. When crafting and testing your regex, ensure that it is compatible with Python 3.
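Because patterns must be compatible with Python 3, one way to try a pattern before adding it to a policy is to compile it with the standard `re` module. The sketch below also shows the score mapping described above. The names `SCORE_VALUES` and `resolve_score` and the sample string are illustrative assumptions, not part of the Classify API.

```python
import re

# Score names and their floating point values, as described above.
SCORE_VALUES = {"high": 0.8, "med": 0.5, "low": 0.2}


def resolve_score(pattern: dict) -> float:
    """Map a pattern's score name to its float value; high is used when omitted."""
    return SCORE_VALUES[pattern.get("score", "high")]


# The double-quoted YAML string "user_[\\d]{5}" decodes to the regex user_[\d]{5}.
pattern = {"score": "high", "regex": r"user_[\d]{5}"}

# re.compile raises re.error if the pattern is not valid Python 3 regex syntax.
compiled = re.compile(pattern["regex"])

sample = "login ok for user_12345 from 10.0.0.1"  # made-up test input
match = compiled.search(sample)
print(match.group(0), resolve_score(pattern))  # prints: user_12345 0.8
```

Testing against a few strings that should not match (for example `user_12`) is a cheap way to catch an overly broad pattern.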
In the label_predictors example above, the namespace and the keys of the regex object are combined to create your custom labels. There, the label acme/user_id will be created when a match occurs.
You can now combine the label_predictors with your classify policy. For example:
schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - acme/*
label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
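To make the naming rule concrete, the sketch below builds labels by joining the namespace with each key of the regex object. It then selects them with the `acme/*` entry, under the assumption that the entry behaves like a glob over label names. That matching behavior is an assumption for illustration and is not confirmed by this page. `custom_labels` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def custom_labels(label_predictors: dict) -> list:
    """Combine the namespace with each key of the regex object: namespace/key."""
    namespace = label_predictors["namespace"]
    return [f"{namespace}/{key}" for key in label_predictors.get("regex", {})]


# The label_predictors object from the policy above, written as a dict.
label_predictors = {
    "namespace": "acme",
    "regex": {
        "user_id": {"patterns": [{"score": "high", "regex": r"user_[\d]{5}"}]},
    },
}

labels = custom_labels(label_predictors)
print(labels)  # prints ['acme/user_id']

# ASSUMPTION: the acme/* entry is treated here as a glob over label names.
selected = [label for label in labels if fnmatchcase(label, "acme/*")]
print(selected)  # prints ['acme/user_id']
```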
Adding use_nlp: true to a classification model will enable entity predictions using natural language models.
schema_version: "1.0"
name: "nlp-model"
models:
  - classify:
      data_source: "_"
      use_nlp: true
      labels:
        - person_name
        - location
Enabling this feature may be useful if you work with unstructured data and need to label names or locations such as addresses, states, or countries.
Enabling NLP predictions may decrease model prediction throughput by up to 70%.
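As a rough worked example of what "up to 70%" means for capacity planning, the sketch below computes the worst-case throughput that remains. The baseline figure is hypothetical and chosen only for the arithmetic; actual throughput depends on your data and environment.

```python
def worst_case_throughput(baseline: float, max_decrease: float = 0.70) -> float:
    """Throughput that remains after the largest stated decrease."""
    return baseline * (1.0 - max_decrease)


baseline = 1000.0  # hypothetical records per second with use_nlp disabled
print(round(worst_case_throughput(baseline)))  # prints 300
```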
Gretel currently uses spaCy for making NLP predictions. The following entity types are supported from the model:
person_name
Predictions produced by the spaCy model will be tagged with the source gretel/spacy.
What kinds of entities can the Classify API detect?
The Classify API can detect 40+ entity types including names, addresses, credentials, and other identifiable information. Check out the full list.