
Synthetics

This section covers the model training and generation APIs shared across all Gretel models.

Model Configuration

All Gretel Synthetics models follow a similar configuration file format. Here is an example model-config.yaml:
schema_version: "1.0"
name: "my-model"
models:
  - [model_id]:
      data_source: __tmp__
      params:
        [param_name]: [param_value]
  • [model_id] is replaced with the type of model you wish to train (e.g. synthetics, gpt_x, actgan, timeseries_dgan, amplify, tabular_dp).
  • data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.
    • Supported storage backends include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, and the local filesystem.
      • Some models have specific data source format requirements
    • data_source: __tmp__ can be used when the source file is specified elsewhere using:
      • the --in-data parameter via the CLI,
      • the data_source parameter via the SDK,
      • the dataset button via the Console.
    • The params object contains key-value pairs that represent the available parameters that will be used to train a synthetic data model on the data_source.
      • Parameters are specific to each model type. See a full list of supported parameters in each of the models pages.
Gretel has configuration templates that may be helpful as starting points for creating your model.
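As a concrete illustration, here is a hypothetical filled-in configuration for the synthetics model type. The parameter names and values below are illustrative only; consult each model's page for the parameters it actually supports.

```yaml
# Hypothetical example config; parameter names and values are illustrative only.
schema_version: "1.0"
name: "my-synthetics-model"
models:
  - synthetics:
      # Any supported storage backend works here (S3, HTTPS, local file, ...).
      data_source: "https://example.com/training-data.csv"
      params:
        epochs: 100  # illustrative; see the model's page for supported params
```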

Create and Train a Model

CLI
SDK
Use the following CLI command to create and train a synthetic model.
gretel models create \
--config [config_file_path] \
--name [model_name] \
--runner cloud \
--in-data [data_source] > my-model.json
  • --in-data is optional if data_source is specified in the config, and can be used to override the value in the config.
  • --in-data is required if data_source: __tmp__ is used in the config.
  • --name is optional, and can be used to override the name specified in the config.
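The command above redirects the JSON emitted by gretel models create into my-model.json, so the model uid can be recovered with standard JSON tooling. A minimal sketch, in which the file contents are a stand-in (the real output contains many more fields):

```python
import json

# Write a stand-in for the JSON that `gretel models create` emits;
# the real file contains many more fields than shown here.
with open("my-model.json", "w") as f:
    json.dump({"uid": "example-model-uid"}, f)

# Read the file back and pull out the model uid, which later commands
# such as `gretel models run --model-id ...` can reference.
with open("my-model.json") as f:
    model_uid = json.load(f)["uid"]

print(model_uid)
```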

Designate project

from gretel_client.projects import create_or_get_unique_project
project = create_or_get_unique_project(name="project-name")

Create model object and submit for training

from gretel_client.helpers import poll
model = project.create_model_obj(model_config=[config], data_source=[training_data])
model.submit_cloud()
poll(model)
During training, the following model artifacts are created:

| Filename | Description |
| --- | --- |
| data_preview.gz | A preview of your synthetic dataset in CSV format. |
| logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
| report.html.gz* | HTML report that offers deep insight into the quality of the synthetic model. |
| report-json.json.gz* | A JSON version of the synthetic quality report that is useful for validating synthetic model quality programmatically. |

*Not all models produce a Synthetic Data Quality Report. See the models page for more details.
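The artifacts above are gzip-compressed; once downloaded, they can be inspected with standard tooling. A minimal sketch using Python's gzip module, where we write a stand-in data_preview.gz so the example is self-contained (a real artifact would come from model training):

```python
import gzip

# Create a stand-in for a downloaded data_preview.gz artifact so this
# sketch is self-contained; real artifacts are produced during training.
with gzip.open("data_preview.gz", "wt") as f:
    f.write("id,name\n1,alice\n2,bob\n")

# Decompress and read the CSV preview as plain text.
with gzip.open("data_preview.gz", "rt") as f:
    preview = f.read()

print(preview.splitlines()[0])
```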

Generate data from a model

CLI
SDK
Use the gretel models run command to generate data from a synthetic model.
gretel models run --model-id my-model.json \
--runner cloud \
--param num_records [num] \
--in-data [prompts.csv] \
--output .
  • --model-id accepts either a model uid or the JSON output file written by gretel models create.
  • There are many different --param options, depending on the model.
    • num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.
  • --in-data is optional and is used for conditional data generation when supported by the model.

Create and submit record handler

# Generate more records from the model
record_handler = model.create_record_handler_obj(
params={"num_records": 100}
)
record_handler.submit_cloud()
poll(record_handler)
There are many different params options, depending on the model.
  • num_records param is supported by all synthetic models and is used to tell the model how many new rows to generate.

View results

import pandas as pd

synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
synthetic_df
GitHub: gretelai/gretel-synthetics, synthetic data generators for structured and unstructured text, featuring differentially private learning.
Check out our GitHub for research, source code, and examples, including our core synthetic data generation library.