Skewing of Timestamps on a per-patient basis

mbockhacker · March 17, 2020, 2:40pm

Hi all,

First of all, I’d like to say that we have not yet implemented the whole CDM. However, to be able to work with real data down the line, I know that we will need a detailed evaluation with the data privacy officers at our institution. While the CDM does pretty well in terms of minimizing identifiable information, I’m pretty sure we’ll need to further pseudonymize datasets during the ETL process.

One concept that we have used in other (smaller) projects is shifting all timestamps of a patient by the same random offset, which preserves the intervals between them (see pseudocode below).

for each patient:
    factor_days = random(-180, 180)
    for each timestamp of patient:
        timestamp = timestamp + factor_days

My question is this: Are there any best practices for doing this kind of transformation within the ETL for the OMOP CDM? Possibly with the addition of a lookup table to store the randomized value for each patient ID?
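To make the idea concrete, here is a minimal Python sketch of the per-patient shift combined with a lookup table. It is only an illustration under assumptions: the function name `shift_timestamps`, the in-memory dict used as the lookup table, and the input layout (patient ID mapped to a list of datetimes) are hypothetical and not part of any OMOP tooling. In a real ETL the lookup table would have to be stored securely and separately from the shifted data, and `random` would likely need to be replaced by a cryptographically stronger source.

```python
import random
from datetime import datetime, timedelta


def shift_timestamps(records, offsets=None, max_days=180, rng=None):
    """Shift all timestamps of each patient by one per-patient offset in days.

    records: dict mapping patient ID -> list of datetime objects (assumed layout).
    offsets: optional lookup table (patient ID -> offset in days). Patients that
             are missing from it get a new random offset, which is added to it.
    Returns the shifted records and the (possibly extended) lookup table.
    """
    if rng is None:
        rng = random.Random()
    if offsets is None:
        offsets = {}

    shifted = {}
    for patient_id, timestamps in records.items():
        # Reuse a stored offset so repeated ETL runs shift a patient consistently.
        if patient_id not in offsets:
            offsets[patient_id] = rng.randint(-max_days, max_days)
        delta = timedelta(days=offsets[patient_id])
        # Same offset for every timestamp: intervals within a patient are unchanged.
        shifted[patient_id] = [ts + delta for ts in timestamps]
    return shifted, offsets


if __name__ == "__main__":
    records = {
        "patient_a": [datetime(2020, 1, 1, 8, 0), datetime(2020, 1, 11, 9, 30)],
        "patient_b": [datetime(2019, 6, 1, 12, 0)],
    }
    shifted, lookup = shift_timestamps(records)
    for patient_id, timestamps in shifted.items():
        print(patient_id, lookup[patient_id], timestamps)
```

Passing the returned lookup table back in on the next run keeps the shift stable for known patients, which is what the lookup table is meant to achieve.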

Thanks for your time!

Best Regards,
Markus

MaximMoinat · March 23, 2020, 1:18pm

Hi Markus. That is a very interesting point, and I can see that this might be needed in some cases. However, in our years of experience with OMOP ETL, we have not come across any general conventions for this. It is not common practice to change the actual dates during the transformation process, mainly because the OMOP’ed data will stay behind a firewall.

We do sometimes choose to exclude identifiable information like the original identifier, addresses, birth month/day, names, exact time of visit, etc., or to derive approximate dates from a ‘days since …’ variable. In other projects we also have experience with pseudonymising GPS coordinates while keeping spatial relations, in a similar way as you describe for dates.
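As a rough illustration of deriving an approximate date from a ‘days since …’ variable, a sketch could look like the following. The function name `approximate_date` and the choice of reference date are assumptions made for this example only. What the reference event is and how it is chosen depends on the source data.

```python
from datetime import date, timedelta


def approximate_date(reference_date, days_since):
    """Turn a 'days since ...' value into a calendar date.

    reference_date: the date standing in for the reference event (an assumption,
                    for example a nominal study start date).
    days_since:     number of days elapsed since that reference event.
    """
    return reference_date + timedelta(days=days_since)


if __name__ == "__main__":
    # Hypothetical reference date; only the day offsets come from the source data.
    print(approximate_date(date(2020, 1, 1), 45))
```

Because only a date is produced, the exact time of day is not carried over, in line with excluding the exact time of visit.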

Please allow me to ask you one question: why do you need to put more stringent pseudonymisation on the OMOP’ed data than on the source data? The OHDSI data network is a federated network, meaning that the data can stay behind the original firewall and only aggregated results are shared for publication.

mbockhacker · March 24, 2020, 10:21am

Thanks for your reply. While it is true that the data stays at the institution, our local data-privacy regulations are more stringent on what constitutes “processing” of patient data, even inside institutions. Even more so if the analysis is an observational, retrospective trial for which obtaining informed consent is not a realistic option.