If you haven’t already, install the Gretel CLI and SDK. Next, we will create a project to host your model and artifacts.
gretel projects create --display-name healthcare --set-default
Download and preview the dataset that we will be training a synthetic model on.
wget https://gretel-public-website.s3.amazonaws.com/datasets/healthcare-analytics/hospital_ehr_data.csv
head -n 10 hospital_ehr_data.csv
The above commands download the dataset we will synthesize and print its first 10 lines:
case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50
6,23,a,6,X,2,anesthesia,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,4449.0,11-20
7,32,f,9,Y,1,radiotherapy,S,B,3.0,31397,7.0,Emergency,Extreme,2,51-60,6167.0,0-10
8,23,a,6,X,4,radiotherapy,Q,F,3.0,31397,7.0,Trauma,Extreme,2,51-60,5571.0,41-50
9,1,d,10,Y,2,gynecology,R,B,4.0,31397,7.0,Trauma,Extreme,2,51-60,7223.0,51-60
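Before training, it can help to look at how values are distributed in a categorical column. The following is a minimal sketch using standard Unix tools. The helper name column_counts is hypothetical, and it assumes the CSV has a single header row and no quoted fields containing commas, which holds for the preview above but is not guaranteed for the whole file.

```shell
# Count how often each value appears in one column of a simple CSV file.
# Assumption: one header row, and no quoted fields that contain commas.
column_counts() {
  file="$1"   # path to the CSV file
  col="$2"    # 1-based column index
  # Skip the header row, extract the column, then tally distinct values
  # and list the most frequent values first.
  tail -n +2 "$file" | cut -d, -f"$col" | sort | uniq -c | sort -rn
}

# Example: tally the Department column (column 7 in the header shown above).
if [ -f hospital_ehr_data.csv ]; then
  column_counts hospital_ehr_data.csv 7
fi
```

A heavily skewed column here is worth remembering when you later review the synthetic quality report.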
Select a configuration template, or download a template from our GitHub and modify it for your use case. We recommend the default template for most datasets. You will need the model ID that is output after training completes, so make sure to copy it when the worker is finished. If you need to get the model ID again, you can use the following command:
gretel models search --all
To start training, run:
gretel models create --runner cloud --config synthetics/high-field-count \
  --in-data hospital_ehr_data.csv --output . > model-data.json
The gretel models create command writes a JSON object to standard output that can be used by downstream commands in place of the model ID. In the example above, that output is redirected to model-data.json.
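Because later commands read model-data.json in place of a model ID, it is worth confirming that the file was written correctly before moving on. The following is a minimal sketch, not part of the Gretel CLI: the helper name is_valid_json is hypothetical, and it assumes python3 is on your PATH. It only checks that the file is non-empty and parses as JSON; it makes no assumption about which keys the object contains.

```shell
# Return success if the given file exists, is non-empty, and parses as JSON.
# Assumption: python3 is available; json.tool is part of its standard library.
is_valid_json() {
  [ -s "$1" ] && python3 -m json.tool "$1" > /dev/null 2>&1
}

if is_valid_json model-data.json; then
  echo "model-data.json looks usable"
else
  echo "model-data.json is missing, empty, or not valid JSON" >&2
fi
```

If the check fails, you can still find the model ID with gretel models search --all and pass it directly.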
If the --output parameter is specified, the gretel models create command also writes several files to your local directory. For models trained in the Gretel Cloud, model artifacts can be downloaded at any time with the following command:
gretel models get --model-id [model id] --output .
A preview of your synthetic dataset in CSV format.
An HTML report that offers deep insight into the quality of the synthetic model.
A JSON version of the synthetic quality report that is useful to validate synthetic data model quality programmatically.
Log output from the synthetic worker that is helpful for debugging.
Now we will use our synthetic model to create a synthetic dataset. Copy the model ID returned by your gretel models create command, or pass the saved model-data.json file in its place, as in the following command.
gretel records generate --model-id model-data.json --runner cloud \
  --num-records 5000 --max-invalid 5000 --output .
If the --output parameter is specified, the gretel records generate command writes several files to your local directory:
data.gz - Your synthetic dataset in CSV format.
logs.json.gz - Log output from the synthetic worker that is helpful for debugging.
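Since data.gz is gzip-compressed, you can inspect it without unpacking it to disk. The following is a minimal sketch using standard tools. The helper names preview_gz and gz_line_count are hypothetical and not part of the Gretel CLI.

```shell
# Print the first N lines (default 10) of a gzip-compressed text file.
preview_gz() {
  gzip -dc "$1" | head -n "${2:-10}"
}

# Count the total number of lines in a gzip-compressed text file.
gz_line_count() {
  gzip -dc "$1" | wc -l | tr -d ' '
}

# Preview the generated dataset and report how many lines it contains.
if [ -f data.gz ]; then
  preview_gz data.gz 5
  echo "total lines: $(gz_line_count data.gz)"
fi
```

Assuming the file starts with a header row, a run that produced all 5000 requested records would report 5001 lines. A lower count may mean some generated records were rejected as invalid, which the worker log in logs.json.gz can help confirm.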