Model Configuration
Define a policy to discover and label sensitive data, including personally identifiable information, credentials, and matches for custom regular expressions, inside text, logs, and other structured data.
The classify API policy structure has two notable sections: the models array, described here, and the optional label_predictors object, described under Custom Predictors and Data Labeling. The models array has one item, keyed by classify.
Within the classify object:
  • A data_source is required
    • This parameter can be overridden via the command line interface (CLI)
    • At this time, CSV and plain-text data formats are supported.
  • A labels array is required to specify named entities to search for, including:
    • Supported entities. See the full list.
    • Namespaces for custom regular expressions (optional)
schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
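The structural rules above can be sketched as a small pre-flight check. This is only an illustration: check_classify_policy is a hypothetical helper, not part of any Gretel client library, and it assumes the policy has already been loaded into a Python dict.

```python
from typing import Any


def check_classify_policy(policy: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a classify policy (empty if none).

    Hypothetical helper: it only mirrors the rules stated in this section.
    """
    models = policy.get("models")
    if not isinstance(models, list) or len(models) != 1:
        return ["models must be an array with exactly one item"]
    classify = models[0].get("classify") if isinstance(models[0], dict) else None
    if not isinstance(classify, dict):
        return ["the single models item must be keyed by classify"]

    problems: list[str] = []
    if not classify.get("data_source"):
        problems.append("classify.data_source is required")
    labels = classify.get("labels")
    if not isinstance(labels, list) or not labels:
        problems.append("classify.labels must be a non-empty array")
    return problems


# The example policy above, written as the equivalent Python dict.
policy = {
    "schema_version": "1.0",
    "name": "my-awesome-model",
    "models": [
        {
            "classify": {
                "data_source": "_",
                "labels": ["person_name", "credit_card_number", "phone_number"],
            }
        }
    ],
}

print(check_classify_policy(policy))  # an empty list means no problems were found
```

A check like this only catches missing required keys; it does not verify that each label is a supported entity.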

Custom Predictors and Data Labeling

Within the config, you may optionally specify a label_predictors object where you can define custom predictors that will create custom entity labels.
This example creates a custom regular expression for a custom user ID format:
schema_version: "1.0"
name: "classify-my-data"

# ... classify model defined here ...

label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
If you wish to create custom predictors, you must provide a namespace, which is used when constructing the labels.
  • regex: Create your own regular expressions to match and yield custom labels. The value of this property should be an object keyed by the labels you wish to create. For each label, provide an array of patterns. Patterns are objects consisting of:
    • score: One of high, med, or low. These map to floating point values of 0.8, 0.5, and 0.2 respectively. If omitted, the default is high.
    • regex: The regex used for matching. When crafting and testing your regex, ensure that it is compatible with Python 3.
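Since patterns must be compatible with Python 3, one way to try a pattern before adding it to a policy is with the standard re module. The sketch below is an assumption about how the pieces fit together, not Gretel's implementation: find_custom_labels and SCORE_VALUES are hypothetical names, and the sample text is made up.

```python
import re

# Score names and their float values, as described in the list above.
SCORE_VALUES = {"high": 0.8, "med": 0.5, "low": 0.2}

# Mirrors the label_predictors example. After YAML unescaping, the
# double-quoted string "user_[\\d]{5}" becomes the pattern user_[\d]{5}.
label_predictors = {
    "namespace": "acme",
    "regex": {
        "user_id": {"patterns": [{"score": "high", "regex": r"user_[\d]{5}"}]},
    },
}


def find_custom_labels(text, predictors):
    """Return (label, matched_text, score) tuples for every pattern match."""
    results = []
    namespace = predictors["namespace"]
    for key, spec in predictors["regex"].items():
        label = f"{namespace}/{key}"  # namespace and key joined with a slash
        for pattern in spec["patterns"]:
            # A missing score falls back to the documented default, high.
            score = SCORE_VALUES[pattern.get("score", "high")]
            for match in re.finditer(pattern["regex"], text):
                results.append((label, match.group(0), score))
    return results


matches = find_custom_labels("login by user_12345 from 10.0.0.1", label_predictors)
print(matches)  # [('acme/user_id', 'user_12345', 0.8)]
```

Running candidate patterns against a few positive and negative samples this way helps catch escaping mistakes before the policy is submitted.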
In the example above, the namespace and the keys of the regex object are combined to create your custom labels: the label acme/user_id will be created when a match occurs.
You can now combine the label_predictors with your classify policy. For example:
schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - acme/*

label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
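The source does not spell out how the acme/* entry is resolved. The sketch below assumes glob-style matching, where the wildcard selects every label in the namespace. select_labels is a hypothetical helper, and acme/order_id is a made-up second label added only to show the effect.

```python
from fnmatch import fnmatchcase


def select_labels(requested, available):
    """Return the available labels matched by any requested entry, in order."""
    return [
        label
        for label in available
        if any(fnmatchcase(label, pattern) for pattern in requested)
    ]


# Assumed set of labels that could be produced; acme/order_id is hypothetical.
available = ["person_name", "email_address", "acme/user_id", "acme/order_id"]

print(select_labels(["acme/*"], available))  # ['acme/user_id', 'acme/order_id']
```

Under this assumption, a single wildcard entry keeps the labels array short as more patterns are added to the namespace.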

Classifying Data using NLP

Adding use_nlp: true to a classification model will enable entity predictions using natural language models.
schema_version: "1.0"
name: "nlp-model"
models:
  - classify:
      data_source: "_"
      use_nlp: true
      labels:
        - person_name
        - location
Enabling this feature may be useful if you work with unstructured data and need to label names or locations such as addresses, states, or countries.
Enabling NLP predictions may decrease model prediction throughput by up to 70%.
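To get a feel for what a decrease of up to 70% means in practice, the arithmetic can be sketched as follows. The baseline rate and record count are hypothetical figures chosen for illustration, not measurements.

```python
def worst_case_throughput(baseline_records_per_sec, max_decrease=0.70):
    """Throughput remaining if the full stated decrease applies."""
    return baseline_records_per_sec * (1.0 - max_decrease)


baseline = 1000.0    # hypothetical records per second without NLP
records = 1_000_000  # hypothetical dataset size

slowest = worst_case_throughput(baseline)
print(round(slowest, 1))                 # worst-case records per second
print(round(records / baseline))         # seconds without NLP
print(round(records / slowest))          # seconds in the worst case with NLP
```

With these figures, a job that takes about 17 minutes without NLP could take closer to 56 minutes in the worst case.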

Supported NLP Models

Gretel currently uses spaCy for making NLP predictions. The model supports the following entity types:
  • person_name
  • location - For a list of all locations that spaCy can detect, see this list.
Predictions produced by the spaCy model will be tagged with the source gretel/spacy.