Can White Rabbit be used to generate fake data?

mccullen_j · November 14, 2023, 3:12am

I need to generate (somewhat realistic and connected) synthetic data using the source schema and noticed White Rabbit has the ability to generate fake data from your source scan.

My concern is that this is not truly “fake” and it would be easy for PII to leak in. How valid is this concern and is there a recommended approach for a task like this using White Rabbit? Should it not be done at all? I’m wondering if I should do this purely manually or if White Rabbit can help and I just need to comb through it to make sure it is truly fake.

schuemie · November 14, 2023, 6:01am

WhiteRabbit’s fake data generator uses the scan report to sample data records. It is a really simple procedure: for every field in every row, it independently samples from the values reported in the scan report. (I don’t remember what it does for values that are completely unique and are therefore not in the scan report, such as person IDs. I think it just generates numbers that fit the field’s data type.)

Since the scan report does not contain patient-identifiable information, the fake data also does not.

Be aware that the generated fake data is absolute nonsense. It just contains values observed in the source data, but with no regard to their relationship. It is mainly intended to make sure your ETL code does not contain syntax errors.
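
To illustrate the procedure described above, here is a minimal Python sketch, not WhiteRabbit's actual code. The `scan_report` dictionary, its field names, and the `sample_fake_rows` helper are hypothetical, and weighting the draw by observed frequency is an assumption. The point is that each field is drawn on its own, so any relationship between columns is lost.

```python
import random

# Hypothetical stand-in for one table of a scan report:
# field name -> {observed value: frequency}. Not the real file format.
scan_report = {
    "gender": {"M": 480, "F": 520},
    "pregnant": {"Y": 30, "N": 970},
}


def sample_fake_rows(field_counts, n_rows, seed=0):
    """Draw every field of every row independently of the other fields."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for field, counts in field_counts.items():
            values = list(counts)
            weights = list(counts.values())
            # Assumption: values are weighted by their observed frequency.
            row[field] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


fake_rows = sample_fake_rows(scan_report, 1000)
```

Because `gender` and `pregnant` are sampled separately, nothing stops a row from combining `"M"` with `"Y"`, which is the kind of nonsense to expect. Such rows are still fine for checking that ETL code runs.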

mccullen_j · November 14, 2023, 5:25pm

Thanks! This is good to know.

I’m not sure I trust that the scan report is actually free of PII, though. I need to double-check how well it actually removed it. That was my concern.
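
One way to start that double check is a rough pattern scan over the values listed in the scan report. The sketch below is only an assumption about how this could be done: the `example` dictionary stands in for values read from the report, and the `PII_PATTERNS` names and regexes and the `flag_suspicious_values` helper are hypothetical. The two patterns are far from exhaustive, so a clean result does not prove the report is free of PII.

```python
import re

# Hypothetical, deliberately small set of patterns; extend for your data.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn_like": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def flag_suspicious_values(scan_values):
    """Return (field, value, pattern name) for values that look like PII."""
    hits = []
    for field, values in scan_values.items():
        for value in values:
            text = str(value)
            for label, pattern in PII_PATTERNS.items():
                if pattern.fullmatch(text):
                    hits.append((field, text, label))
    return hits


# Made-up values standing in for what a scan report might list per field.
example = {
    "gender": ["M", "F"],
    "notes": ["follow-up in 2 weeks", "jane.doe@example.org"],
    "contact": ["123-45-6789"],
}
hits = flag_suspicious_values(example)
```

Free-text fields and anything resembling names, dates of birth, or addresses would still need a manual look, since simple patterns like these will miss them.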