Synthetics
This section covers the model training and generation APIs shared across all Gretel models.
All Gretel Synthetics models share a similar configuration file structure. Here is an example model-config.yaml:
schema_version: "1.0"
name: "my-model"
models:
  - [model_id]:
      data_source: __tmp__
      params:
        [param_name]: [param_value]
- [model_id] is replaced with the type of model you wish to train (e.g. synthetics, gpt_x, actgan, timeseries_dgan, amplify, tabular_dp).
- data_source must point to a valid and accessible file in CSV, JSON, or JSONL format.
  - Supported storage formats include S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.
  - data_source: __tmp__ can be used when the source file is specified elsewhere using:
    - the --in_data parameter via CLI,
    - the data_source parameter via SDK,
    - the dataset button via Console.
- The params object contains key-value pairs that represent the available parameters used to train a synthetic data model on the data_source.
  - Parameters are specific to each model type. See the full list of supported parameters on each model's page.
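The configuration structure described above can also be expressed as a plain Python dictionary. The following is a minimal sketch, not part of the Gretel SDK: the helper name build_model_config and the parameter example_param are hypothetical and exist only to illustrate the shape of the file.

```python
import json


def build_model_config(model_id, params, name="my-model", data_source="__tmp__"):
    """Return a dict with the same shape as the model-config.yaml example.

    This is a hypothetical helper for illustration; it is not a Gretel API.
    """
    return {
        "schema_version": "1.0",
        "name": name,
        "models": [
            # One entry keyed by the model type, e.g. "actgan".
            {model_id: {"data_source": data_source, "params": dict(params)}}
        ],
    }


# "example_param" is a placeholder, not a real model parameter.
config = build_model_config("actgan", {"example_param": "example_value"})
print(json.dumps(config, indent=2))
```

Leaving data_source at its __tmp__ default mirrors the example file, so the training data would still need to be supplied elsewhere, as described in the list above.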
CLI
Use the following CLI command to create and train a synthetic model.
gretel models create \
  --config [config_file_path] \
  --name [model_name] \
  --runner cloud \
  --in-data [data_source] > my-model.json
- --in_data is optional if data_source is specified in the config, and can be used to override the value in the config.
- --in_data is required if data_source: __tmp__ is used in the config.
- --name is optional, and can be used to override the name specified in the config.
SDK
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

# Create the project, or reuse an existing one with this name.
project = create_or_get_unique_project(name="project-name")
# [config] and [training_data] are placeholders for your model config and training data.
model = project.create_model_obj(model_config=[config], data_source=[training_data])
# Submit the training job to Gretel Cloud and wait for it to finish.
model.submit_cloud()
poll(model)
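The rule for when a data source must be supplied separately (via --in_data on the CLI or data_source in the SDK) can be checked before submitting a job. The sketch below assumes the config has already been loaded into a Python dict with the structure shown earlier; requires_in_data is a hypothetical helper, and the S3 path is an invented example value.

```python
TMP_PLACEHOLDER = "__tmp__"


def requires_in_data(config):
    """Return True if any model entry uses the __tmp__ placeholder.

    Only the documented case is handled: data_source set to __tmp__.
    A config with no data_source key at all is not covered by the text
    above, so this sketch does not treat it specially.
    """
    for model_entry in config.get("models", []):
        for settings in model_entry.values():
            if (settings or {}).get("data_source") == TMP_PLACEHOLDER:
                return True
    return False


placeholder_config = {
    "models": [{"actgan": {"data_source": "__tmp__", "params": {}}}]
}
explicit_config = {
    # Hypothetical path, used only for illustration.
    "models": [{"actgan": {"data_source": "s3://example-bucket/train.csv", "params": {}}}]
}

print(requires_in_data(placeholder_config))
print(requires_in_data(explicit_config))
```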
During training, the following model artifacts are created:
Filename | Description |
---|---|
data_preview.gz | A preview of your synthetic dataset in CSV format. |
logs.json.gz | Log output from the synthetic worker that is helpful for debugging. |
report.html.gz* | HTML report that offers deep insight into the quality of the synthetic model. |
report-json.json.gz* | A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically. |
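Because the artifacts are gzip-compressed, they need to be decompressed before inspection. As a rough sketch, assuming data_preview.gz has already been downloaded to a local path (the path and the helper name read_preview_rows are assumptions, not part of the Gretel tooling), the CSV preview could be read with the Python standard library:

```python
import csv
import gzip
from itertools import islice


def read_preview_rows(path, limit=5):
    """Read up to `limit` data rows from a gzip-compressed CSV file.

    Each row is returned as a dict keyed by the CSV header.
    """
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return list(islice(reader, limit))


if __name__ == "__main__":
    import os

    # Assumed local path; adjust to wherever the artifact was saved.
    preview_path = "data_preview.gz"
    if os.path.exists(preview_path):
        for row in read_preview_rows(preview_path):
            print(row)
```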
CLI
Use the gretel models run command to generate data from a synthetic model.
gretel models run --model-id my-model.json \
  --runner cloud \
  --param num_records [num] \
  --in-data [prompts.csv] \
  --output .
- --model-id supports both a model uid and the JSON that models create outputs.
- There are many different --param options, depending on the model. The num_records param is supported by all synthetic models and tells the model how many new rows to generate.
- --in_data is optional and is used for conditional data generation when supported by the model.
SDK
# Generate more records from the model
record_handler = model.create_record_handler_obj(
    params={"num_records": 100}
)
record_handler.submit_cloud()
poll(record_handler)
There are many different params options, depending on the model. The num_records param is supported by all synthetic models and tells the model how many new rows to generate.
import pandas as pd

# Load the generated records (a gzip-compressed CSV) into a DataFrame.
synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
synthetic_df
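When generation is scripted, it can help to assemble the gretel models run arguments programmatically. The sketch below only builds the argument list from the flags shown in the CLI example above and does not invoke the CLI; build_run_command is a hypothetical helper, not something the Gretel client provides.

```python
import shlex


def build_run_command(model_id, num_records, in_data=None, output="."):
    """Build the argument list for `gretel models run` (hypothetical helper).

    Only flags that appear in the CLI example are used.
    """
    if int(num_records) <= 0:
        raise ValueError("num_records must be a positive integer")
    command = [
        "gretel", "models", "run",
        "--model-id", str(model_id),
        "--runner", "cloud",
        "--param", "num_records", str(num_records),
    ]
    # --in-data is only added for conditional generation.
    if in_data is not None:
        command += ["--in-data", str(in_data)]
    command += ["--output", str(output)]
    return command


# Print a shell-safe version of the command for inspection.
print(shlex.join(build_run_command("my-model.json", 100)))
```

The resulting list could be passed to subprocess.run if the gretel CLI is installed and configured, which this sketch does not verify.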

Check out our GitHub repository, gretelai/gretel-synthetics (synthetic data generators for structured and unstructured text, featuring differentially private learning), for research, source code, and examples, including our core synthetic data generation library.