I want to search records for email addresses and replace them with fake ones.
models array will have one item that is keyed by
policies array, each item in this array has 2 keys:
name: The name of the policy
rules: A list of specific data matching conditions and the actual transforms that will be done. Each rule contains the following keys:
name: The name of the rule
conditions: The various ways to match data (more on this below)
transforms: The actual mutations that will take place (more on this below).
transforms objects. In this particular example, we have created the following rule contents:
value_label), will have a different value depending on the condition’s matching behavior. The details on each condition can be found below.
value_label, which will inspect every record for a particular entity. In this case, we are searching every record for an email address.
transforms object defines what actions will happen to the data. The transforms value is an array of objects that are keyed like:
type: The name of the transform (required)
attrs: Depending on the type, there may be specific attributes that are required or optional. These attributes are covered in the Transforms section below.
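The nesting described above (a policy holding rules, each rule holding conditions and transforms) can be sketched as plain Python data. This is only an illustration of the shape: the names `fake-emails` and `email-policy`, the label `email_address`, the transform type `fake`, and the helper `validate_policy` are assumptions made for this example, not confirmed identifiers.

```python
# Hypothetical rule mirroring the keys described above. The label
# "email_address" and the transform type "fake" are assumed names.
rule = {
    "name": "fake-emails",
    "conditions": {"value_label": ["email_address"]},
    "transforms": [
        {"type": "fake", "attrs": {"seed": 8675309}},
    ],
}

policy = {"name": "email-policy", "rules": [rule]}


def validate_policy(policy: dict) -> list:
    """Return a list of structural problems; an empty list means the shape looks right."""
    problems = []
    for key in ("name", "rules"):
        if key not in policy:
            problems.append(f"policy missing '{key}'")
    for rule in policy.get("rules", []):
        for key in ("name", "conditions", "transforms"):
            if key not in rule:
                problems.append(f"rule missing '{key}'")
        for transform in rule.get("transforms", []):
            # "type" is required on every transform; "attrs" is optional.
            if "type" not in transform:
                problems.append("transform missing 'type'")
    return problems
```

Calling `validate_policy(policy)` on the example returns an empty list; a transform object without a `type` key is reported as a problem.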
conditions that are keyed by the functionality of the condition. The configuration below shows all possible conditions with their respective matching values.
value_label: This condition will scan every record for a matching entity in the list. The value for this condition is a list of entity labels. These labels may be Gretel built-in labels or custom defined labels. When labels are found with this condition, Gretel tracks the start and end indices for where the entity exists in the value. Transforms are applied specifically to these offsets.
field_label: This condition will match on every value for a field if the entire field has been labeled as containing the particular entity. The value for this condition is an array of labels. During training, a field_label may be applied to an entire field in a couple of ways:
value_label entities exist for a field, that entity label will be applied to the field. By default, this cutoff is 90%. This cutoff can be defined in the label predictors object in the configuration. For example, if the training data has 8000 records, and at least 7200 values in the firstname field are detected as person_name, then the entire firstname field will be classified with the person_name label and every value in that field will go through transforms.
field_name: This will match on the exact value for a field name. It is case insensitive.
field_name_regex: This allows a regular expression to be used as a matcher for a field name. For example using ^regex will match on fields like
field_attributes: This condition can be set to an object of boolean flags. Currently supported flags are:
is_id: If the field values are unique and the field name contains an ID-like phrase like “id”, “key”, etc.
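Two of the field-level conditions above lend themselves to a short sketch: the 90% cutoff that promotes a value label to a field label, and the case-insensitive field_name match. The function names below are hypothetical and the code only restates the rules as described, not how the service implements them.

```python
def field_label_applies(matched_values: int, total_values: int, cutoff: float = 0.9) -> bool:
    """True when enough values in a field carry the entity label to label the whole field."""
    if total_values == 0:
        return False
    return matched_values / total_values >= cutoff


def field_name_matches(field: str, name: str) -> bool:
    """Exact field-name comparison that ignores case."""
    return field.lower() == name.lower()
```

With the numbers from the example, `field_label_applies(7200, 8000)` is true, while 7199 matches out of 8000 falls just under the default cutoff.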
transforms object takes an array of objects that are structured with a
type will always define the specific transform to apply. If that transform is configurable, the attrs key will be an object of those particular settings for the transform.
attrs for each transform, and will be noted as being optional.
field_label conditions since a specific entity needs to be detected. If this transform is applied to other conditions, no transform will take place.
seed: An optional integer that is used to seed the underlying entity faker. This seed will provide the determinism when faking. This helps ensure that a given value is faked to the same fake value consistently across all transforms. If this is omitted, a random seed will be created and stored with the model.
field_ref: If provided, the fake value will be the same for every value of the user_id field in this example. The first fake value for the first instance of user_id will be cached, and used for every subsequent transform. This field may also be an array of field names.
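The interaction between seed and field_ref can be illustrated with a small sketch. It assumes a seeded random generator stands in for the entity faker and that the cache is keyed by the referenced field's value; the class `ConsistentFaker` and the `example.com` addresses are made up for this illustration.

```python
import random
import string


class ConsistentFaker:
    """Toy faker: seeded for determinism, cached per field_ref value."""

    def __init__(self, seed: int):
        self._rng = random.Random(seed)  # the seed makes the fake sequence repeatable
        self._cache = {}                 # field_ref value -> first fake generated for it

    def _new_fake_email(self) -> str:
        user = "".join(self._rng.choice(string.ascii_lowercase) for _ in range(8))
        return f"{user}@example.com"

    def fake_email(self, field_ref_value: str) -> str:
        # The first fake produced for a given reference value is reused afterwards.
        if field_ref_value not in self._cache:
            self._cache[field_ref_value] = self._new_fake_email()
        return self._cache[field_ref_value]
```

Every record sharing the same user_id value receives the same fake email, and two fakers built with the same seed produce the same first fake.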
secret: An optional string that will be used as input to the HMAC.
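As a rough sketch of how a secret feeds an HMAC, the snippet below uses Python's standard `hmac` module. The choice of SHA-256, UTF-8 encoding, and hex output are assumptions for illustration; the source does not state which digest or output format is used, and `secure_hash` is a hypothetical helper name.

```python
import hashlib
import hmac


def secure_hash(value: str, secret: str) -> str:
    """Keyed hash of a value; the same value and secret always give the same digest."""
    return hmac.new(
        secret.encode("utf-8"),   # the secret is the HMAC key
        value.encode("utf-8"),    # the original value is the message
        hashlib.sha256,           # assumed digest algorithm
    ).hexdigest()
```

Changing the secret changes every digest, which is why the same secret has to be reused if hashed values must line up across runs.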
mysecret would get redacted to XXXXXXX.
char: The character to be used when redacting.
[min, max], where min and max are specified in the config. The shift amount is an integer, but that can be controlled with the precision config. The number after the shift is not rounded (i.e. floating point values are preserved).
field_ref config is provided.
min is set to -1000, and max to 1000, which means that the shifted value will be in a range
min: The minimum amount to add to a number. Default is -1000.
max: The maximum amount to add to a number. Default is 1000.
precision: Number of digits after decimal point to be used when generating the shift amount. Default is 0 (generated amount will be an int).
field_ref: If provided, will shift consistently for every value of the start_date field in this example. The first random shift amount for the first instance of start_date will be cached, and used for every subsequent transform. This field may also be an array of field names.
field_name (deprecated): has been renamed to field_ref and will be removed in a future version of the config.
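The attributes above can be read as the following sketch: draw a shift in [min, max], round the shift (not the result) to precision digits, and reuse the first shift drawn for each field_ref value. The class `NumberShifter`, its `seed` argument, and the uniform distribution are assumptions made so the example is reproducible.

```python
import random


class NumberShifter:
    """Toy number shift with the documented defaults for min, max and precision."""

    def __init__(self, min_shift=-1000, max_shift=1000, precision=0, seed=None):
        self.min_shift = min_shift
        self.max_shift = max_shift
        self.precision = precision
        self._rng = random.Random(seed)
        self._cache = {}  # field_ref value -> first shift amount drawn for it

    def _draw(self):
        amount = round(self._rng.uniform(self.min_shift, self.max_shift), self.precision)
        # With precision 0 the shift amount is an integer.
        return int(amount) if self.precision == 0 else amount

    def shift(self, value, field_ref_value=None):
        if field_ref_value is None:
            return value + self._draw()
        if field_ref_value not in self._cache:
            self._cache[field_ref_value] = self._draw()
        # Only the shift is rounded; the original value keeps its decimals.
        return value + self._cache[field_ref_value]
```

Two values that share the same start_date are moved by the same amount, so the difference between them is preserved after shifting.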
numbershift. With the addition of a formats key.
numbershift. The min and max values refer to the number of days to shift.
formats: A date time format that should be supported. An example would be %Y-%m-%d. The default value is infer, which will attempt to discover the correct format.
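A minimal sketch of shifting a date by a number of days while keeping its format, assuming an explicit format string is supplied. The infer behavior is not modeled here, and `shift_date` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def shift_date(value: str, days: int, fmt: str = "%Y-%m-%d") -> str:
    """Parse a date with fmt, move it by a whole number of days, and re-emit it in fmt."""
    parsed = datetime.strptime(value, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)
```

For example, `shift_date("2021-03-01", -1)` gives `"2021-02-28"`, in the same format as the input.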
min: The lowest number to consider when bucketing
max: The highest number to consider when bucketing
nearest: The nearest multiple to adjust the original value to
method: One of min, max or avg. The default is min. This controls how to set the bucketed value. Consider the original value of 103 with a nearest value of 5:
min: Would bucket to 100
max: Would bucket to 105
avg: Would bucket to 102.5 (since that is the average between the min and max values).
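The three methods can be reproduced with a few lines of arithmetic. This sketch covers only nearest and method; it assumes a value already on a multiple is treated as the lower edge of its bucket, and it does not model the min and max range attributes. The function name `bucket` is hypothetical.

```python
import math


def bucket(value, nearest, method="min"):
    """Snap a value to the lower edge, upper edge, or midpoint of its bucket."""
    lower = math.floor(value / nearest) * nearest  # largest multiple <= value
    upper = lower + nearest                        # next multiple above it
    if method == "min":
        return lower
    if method == "max":
        return upper
    if method == "avg":
        return (lower + upper) / 2
    raise ValueError(f"unknown method: {method}")
```

For the value 103 with nearest set to 5, this returns 100, 105, and 102.5 for min, max, and avg respectively.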
label_predictors object where you can define custom predictors that will create custom entity labels.
regex: Create your own regular expressions to match and yield custom labels. The value for this property should be an object that is keyed by the labels you wish to create. For each label you wish to create, you should provide an array of patterns. Patterns are objects consisting of:
score: One of high, med, low. These map to floating point values of .8, .5 and .2 respectively. If omitted the default is
regex: The actual regex that will be used to match. When crafting your regex and testing it, ensure that it is compatible with Python 3.
acme/user_id will be created when a match occurs.
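Since the patterns need to be compatible with Python 3, it can help to try them with the standard `re` module before putting them in a config. The sketch below mirrors the structure described above (labels mapping to a list of pattern objects); the pattern `user_[0-9]{5}` and the helper `find_labels` are hypothetical, and each pattern sets score explicitly rather than relying on a default.

```python
import re

# Score names and the floating point values they map to.
SCORES = {"high": 0.8, "med": 0.5, "low": 0.2}

# Hypothetical predictor: one label with one pattern.
predictors = {
    "acme/user_id": [
        {"score": "high", "regex": r"user_[0-9]{5}"},
    ],
}


def find_labels(text: str, predictors: dict) -> list:
    """Return (label, start, end, score) for every regex match in text."""
    hits = []
    for label, patterns in predictors.items():
        for pattern in patterns:
            for match in re.finditer(pattern["regex"], text):
                hits.append((label, match.start(), match.end(), SCORES[pattern["score"]]))
    return hits
```

The start and end offsets in each hit correspond to where the entity sits in the value, which is the span a transform would act on.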
true. For example