
De-Identify the CDM Data


(Umesh Yadav) #1

I want to de-identify our custom data, which we have converted into the CDM. Is there a way to do that? If anybody can help with this, that would be great.

(Tim Quinn) #2

There is a two-part answer to your question. First, you need a de-identification method. The OHDSI community’s very own Dr. Hripcsak (@hripcsa) has an excellent paper on this topic: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5070517/

Second, you need to implement that method on your OMOP instance. Your implementation approach will depend on your requirements, of course. I’m currently working on an approach for a de-identified OMOP CDM that also supports incremental updates (rather than the usual truncate-and-reload approach). I hope to present this approach in an OHDSI forum in the future (after we confirm it works).

The general idea: Build de-identification views on top of the physical OMOP tables, omitting or replacing the PHI-containing columns with de-identified substitutes. This approach has some nuances, because you don’t want to inadvertently undermine your de-identification by including person_id or location_id or visit_occurrence_id columns unless they are truly masked in “honest broker” fashion.
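A minimal sketch of the view idea described above, using an in-memory SQLite database so it runs anywhere. The `person` columns follow the OMOP CDM, but the `person_crosswalk` table, the `masked_person_id` column, and the `deid_person` view name are hypothetical names chosen for illustration, not part of the CDM or of the approach Tim is building.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id           INTEGER PRIMARY KEY,
    gender_concept_id   INTEGER,
    year_of_birth       INTEGER,
    month_of_birth      INTEGER,
    day_of_birth        INTEGER,
    location_id         INTEGER,
    person_source_value TEXT
);

-- Hypothetical honest-broker crosswalk: maps real ids to masked ids.
-- In practice analysts would have no access to this table.
CREATE TABLE person_crosswalk (
    person_id        INTEGER PRIMARY KEY,
    masked_person_id INTEGER UNIQUE
);

-- The view exposes only the masked id and low-risk columns.
-- location_id, day/month of birth, and source values are omitted.
CREATE VIEW deid_person AS
SELECT x.masked_person_id AS person_id,
       p.gender_concept_id,
       p.year_of_birth
FROM person p
JOIN person_crosswalk x ON x.person_id = p.person_id;
""")

conn.execute(
    "INSERT INTO person VALUES (1001, 8532, 1970, 5, 17, 42, 'MRN-000123')"
)
conn.execute("INSERT INTO person_crosswalk VALUES (1001, 1)")

cursor = conn.execute("SELECT * FROM deid_person")
view_columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
```

Because the view is computed from the physical tables, rows added later show up as soon as they have a crosswalk entry, which is what makes this shape compatible with incremental updates.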

(Kim Ela) #3

I agree with Tim regarding Dr. Hripcsak. I am currently working on de-identifying our data so that our vendors and sales teams can use it to demo our products. I am using Dr. Hripcsak's method for all dates. I've created a procedure that identifies around 20k patients to be part of our demo oncology practice. Using the SANT (shift and truncate) method, along with changing patient names and addresses, I have been able to completely remove all PHI. I downloaded lists of female and male first names and a list of surnames. I found these lists online at the Census Bureau: https://www.census.gov/topics/population/genealogy/data/1990_census/1990_census_namefiles.html
I also downloaded a list of street addresses to use as new patient addresses, and I found a great Wikipedia list of fictional doctors that I used to generate new doctors and practices. We wanted to be sure nothing could be used to identify a real person. https://en.wikipedia.org/wiki/List_of_fictional_doctors
Street addresses I used: https://catalog.data.gov/dataset/street-address-listing
Zip codes with state and town: https://simplemaps.com/data/us-zips

For the patients that were identified, I pulled the data out of OMOP and created completely new OMOP tables containing the de-identified data. There is nothing in the new tables that can identify the patients in the original OMOP tables. We just got approval from our compliance team that the data can be used.
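A rough sketch of the name-and-address substitution step described above. The function name `build_fake_identities`, the field names, and the tiny inline lists are all hypothetical; in practice the lists would be loaded from the downloaded Census, street, and zip code files, and the actual procedure used here may differ.

```python
import random

def build_fake_identities(person_ids, first_names, surnames, streets, seed=None):
    """Assign each original person a new id plus a randomly drawn name and street."""
    rng = random.Random(seed)

    # Shuffle first so the new ids do not follow the original id order.
    shuffled = list(person_ids)
    rng.shuffle(shuffled)

    identities = {}
    for new_id, original_id in enumerate(shuffled, start=1):
        identities[original_id] = {
            "new_person_id": new_id,
            "first_name": rng.choice(first_names),
            "surname": rng.choice(surnames),
            "street": f"{rng.randint(1, 9999)} {rng.choice(streets)}",
        }
    return identities

# Placeholder lists standing in for the downloaded files.
first_names = ["MARY", "JAMES", "LINDA"]
surnames = ["SMITH", "JOHNSON", "WILLIAMS"]
streets = ["MAIN ST", "OAK AVE", "PARK RD"]

fake = build_fake_identities([501, 502, 503, 504], first_names, surnames, streets, seed=7)
```

Note that the returned dictionary links original ids to new ones, so it is itself a re-identification key. It would need to be discarded after the new tables are written, or held only by an honest broker.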

(Katy Sadowski) #4

Hopping in to say that my team is also working on de-identification of an OMOP CDM. We’re currently in the methodology research & development phase (have consulted Dr. Hripcsak’s paper as well!).

Like @quinnt, our current plan is to transform an identified OMOP CDM (in our “production” database) into a de-identified OMOP CDM (in our analytics database).

I’ll be happy to share learnings here as we move through the process :slight_smile:

Here are a few more resources I’ve found helpful:

(George Hripcsak) #5

Hi, all. The temporal de-identification paper that we wrote is probably the safest way to preserve some temporal information beyond literally deleting everything but year, so that time between events is preserved. Having Brad Malin, who guided the US OCR on their interpretation of HIPAA, as a co-author was helpful. The main cost to the method is losing about the last 6 months of data on average.
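For readers who want to see the shape of the idea, here is a minimal sketch of a shift-then-truncate step. It is one reading of the general technique, not the paper's reference implementation: the function name, the 365-day maximum shift, the uniform offset, and the exact truncation window are assumptions, chosen only because they are consistent with the roughly 6-month average loss mentioned above. Please consult the paper for the actual method and its guarantees.

```python
import random
from datetime import date, timedelta

def shift_and_truncate(events_by_person, data_start, data_end,
                       max_shift_days=365, seed=None):
    """Shift each person's dates back by one per-person random offset,
    then drop shifted dates that fall outside a fixed window."""
    rng = random.Random(seed)
    # Assumed rule: nothing later than (data_end - max shift) is released,
    # so the last visible date does not reveal a person's offset.
    cutoff = data_end - timedelta(days=max_shift_days)

    result = {}
    for person_id, dates in events_by_person.items():
        # One offset per person keeps the time between that person's events.
        offset = timedelta(days=rng.randint(0, max_shift_days))
        shifted = [d - offset for d in dates]
        result[person_id] = [d for d in shifted if data_start <= d <= cutoff]
    return result

# Hypothetical example data.
events = {
    1: [date(2015, 3, 1), date(2015, 3, 31)],    # well inside the window
    2: [date(2020, 12, 1), date(2020, 12, 31)],  # close to the end of the data
}
out = shift_and_truncate(events, date(2010, 1, 1), date(2020, 12, 31), seed=1)
```

Under these assumptions a person shifted by `s` days loses their most recent `365 - s` days of data, which averages out to about half a year.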

Keep in mind that initiatives like N3C have taken a slightly different and perhaps more liberal approach.