Classify

Define a policy to discover and label sensitive data, including personally identifiable information, credentials, and matches for custom regular expressions, inside text, logs, and other structured data.

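The classify API policy structure has two notable sections. First, the models array will have one item that is keyed by classify. Second, an optional label_predictors object defines custom predictors (see Custom Predictors and Data Labeling).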

Within the classify object:

  • A data_source is required

    • This parameter can be overridden from the command line interface (CLI)

    • At this time, CSV and plain-text data formats are supported.

  • A labels array is required to specify named entities to search for, including:

    • Supported entities. See the full list.

    • Namespaces for custom regular expressions (optional)

schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - us_social_security_number
        - email_address
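
The two requirements above (one models item keyed by classify, with a data_source and a labels array) can be checked before submitting a policy. The following is a minimal sketch, not part of any Gretel client: check_classify_policy is a hypothetical helper, and the policy is written as a plain Python dict that mirrors the YAML.

```python
# Hypothetical helper (an assumption for illustration, not a Gretel API):
# checks only the structural requirements described in the text above.
def check_classify_policy(policy: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    models = policy.get("models", [])
    if len(models) != 1 or "classify" not in models[0]:
        return ["models must contain exactly one item keyed by 'classify'"]

    problems = []
    classify = models[0]["classify"]
    if not classify.get("data_source"):
        problems.append("data_source is required")
    if not isinstance(classify.get("labels"), list):
        problems.append("labels must be an array")
    return problems


# The example policy, expressed as a dict.
policy = {
    "schema_version": "1.0",
    "name": "my-awesome-model",
    "models": [
        {
            "classify": {
                "data_source": "_",
                "labels": ["person_name", "email_address"],
            }
        }
    ],
}

print(check_classify_policy(policy))  # prints []
```

The service may enforce further rules that this sketch does not cover.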

Custom Predictors and Data Labeling

Within the config, you may optionally specify a label_predictors object where you can define custom predictors that will create custom entity labels.

This example defines a regular expression for a custom user ID format:

schema_version: "1.0"
name: "classify-my-data"

# ... classify model defined here ...

label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"

If you wish to create custom predictors, you must provide a namespace, which is used as a prefix when constructing the resulting labels.

  • regex: Create your own regular expressions to match and yield custom labels. The value for this property should be an object that is keyed by the labels you wish to create. For each label you wish to create, you should provide an array of patterns. Patterns are objects consisting of:

    • score: One of high, med, or low. These map to floating point values of 0.8, 0.5, and 0.2 respectively. If omitted, the default is high.

    • regex: The actual regex that will be used to match. When crafting your regex and testing it, ensure that it is compatible with Python 3.
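
Because patterns need to be compatible with Python 3, one way to try a pattern out before adding it to a policy is with Python's built-in re module. The sketch below is only an illustration: the sample strings are made up, the SCORES table restates the score values listed above, and the use of search is an assumption, since how the service applies the pattern is not specified here.

```python
import re

# Score names and their floating point values, as listed above.
SCORES = {"high": 0.8, "med": 0.5, "low": 0.2}

# The pattern from the example policy. Inside YAML double quotes,
# "user_[\\d]{5}" unescapes to user_[\d]{5}, which is what is compiled here.
pattern = re.compile(r"user_[\d]{5}")

# Made-up sample strings for local testing.
samples = ["login by user_12345", "user_12", "no id here"]
for text in samples:
    match = pattern.search(text)
    print(text, "->", match.group(0) if match else None)

# A pattern entry with no score falls back to high.
entry = {"regex": r"user_[\d]{5}"}
print(SCORES[entry.get("score", "high")])  # prints 0.8
```

Only the first sample matches, since the pattern requires exactly five digits after the user_ prefix.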

In the example above, the namespace and the keys of the regex object are combined to create your custom labels. In this case, the label acme/user_id will be created when a match occurs.
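
The naming rule can be sketched in a few lines of Python. The custom_labels function is a hypothetical helper written for this illustration, and label_predictors is the example object written as a plain dict.

```python
# Hypothetical helper: joins the namespace with each key of the regex object.
def custom_labels(label_predictors: dict) -> list[str]:
    namespace = label_predictors["namespace"]
    return [f"{namespace}/{key}" for key in label_predictors.get("regex", {})]


# The example label_predictors object, expressed as a dict.
label_predictors = {
    "namespace": "acme",
    "regex": {
        "user_id": {
            "patterns": [{"score": "high", "regex": r"user_[\d]{5}"}],
        }
    },
}

print(custom_labels(label_predictors))  # prints ['acme/user_id']
```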

You can now combine the label_predictors with your classify policy. For example:

schema_version: "1.0"
name: "my-awesome-model"
models:
  - classify:
      data_source: "_"
      labels:
        - acme/*

label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
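
The acme/* entry in labels reads as a wildcard covering every label in the acme namespace. The sketch below illustrates that reading with shell-style matching from Python's fnmatch module. This is an assumption about the wildcard's behavior, not a description of the service's implementation, and acme/order_id is a made-up second label.

```python
from fnmatch import fnmatchcase


# Hypothetical helper: keeps the available labels that match any requested
# entry, treating "*" as a shell-style wildcard (an assumption).
def select_labels(requested: list[str], available: list[str]) -> list[str]:
    return [
        label
        for label in available
        if any(fnmatchcase(label, wanted) for wanted in requested)
    ]


# acme/order_id is invented for this example.
available = ["person_name", "email_address", "acme/user_id", "acme/order_id"]

print(select_labels(["acme/*"], available))
# prints ['acme/user_id', 'acme/order_id']
```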

Classifying Data using NLP

Adding use_nlp: true to a classification model will enable entity predictions using natural language models.

schema_version: "1.0"
name: "nlp-model"
models:
  - classify:
      data_source: "_"
      use_nlp: true
      labels:
        - person_name
        - location

Enabling this feature may be useful if you work with unstructured data and need to label names or locations such as addresses, states, or countries.

Enabling NLP predictions may decrease model prediction throughput by up to 70%.
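
To put the 70% figure in concrete terms, the arithmetic below estimates a worst-case throughput from a baseline. The baseline of 1,000 records per second is an invented number for illustration, not a measured value.

```python
# Worst-case throughput after enabling NLP, given the "up to 70%" decrease.
def worst_case_throughput(baseline: float, max_decrease: float = 0.70) -> float:
    return baseline * (1.0 - max_decrease)


# Illustrative baseline only; measure your own workload for real numbers.
baseline_records_per_second = 1000.0
print(round(worst_case_throughput(baseline_records_per_second)))  # prints 300
```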

Supported NLP Models

Gretel currently uses spaCy for making NLP predictions. The following entity types are supported from the model:

  • person_name

  • location - For a list of all locations that spaCy can detect, see this list.

Predictions produced by the spaCy model will be tagged with the source gretel/spacy.

FAQs

What kinds of entities can the Classify API detect? The Classify API can detect 40+ entity types including names, addresses, credentials, and other identifiable information. Check out the full list.
