De-Identify the CDM Data

I want to de-identify our custom data, which we have converted into the CDM. Is there a way to do that? If anybody can help with this, that would be great.

There is a two-part answer to your question. First, you need a de-identification method. The OHDSI community’s very own Dr. Hripcsak (@hripcsa) has an excellent paper on this topic: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5070517/

Second, you need to implement that method on your OMOP instance. Your implementation approach will depend on your requirements, of course. I’m currently working on an approach for a de-identified OMOP CDM that also supports incremental updates (rather than the usual truncate-and-reload approach). I hope to present this approach in an OHDSI forum in the future (after we confirm it works).

The general idea: Build de-identification views on top of the physical OMOP tables, omitting or replacing the PHI-containing columns with de-identified substitutes. This approach has some nuances, because you don’t want to inadvertently undermine your de-identification by including person_id or location_id or visit_occurrence_id columns unless they are truly masked in “honest broker” fashion.
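A minimal sketch of the view idea described above, using Python's built-in `sqlite3` so it runs anywhere. The `person` columns follow the OMOP CDM, but the `person_id_map` crosswalk table, the `deid_person` view name, and the choice of which columns to blank are assumptions for illustration, not the approach the poster is actually building.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny stand-in for the physical OMOP person table (only a few columns shown).
cur.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        gender_concept_id INTEGER,
        year_of_birth INTEGER,
        month_of_birth INTEGER,
        day_of_birth INTEGER,
        location_id INTEGER,
        person_source_value TEXT
    )
""")
cur.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1001, 8532, 1950, 3, 14, 77, "MRN-123"),
     (1002, 8507, 1984, 11, 2, 78, "MRN-456")],
)

# Hypothetical "honest broker" crosswalk from real IDs to surrogate IDs.
# In practice it would sit in a restricted schema that analysts cannot read.
cur.execute("""
    CREATE TABLE person_id_map (
        person_id INTEGER PRIMARY KEY,
        masked_person_id INTEGER UNIQUE
    )
""")
cur.executemany("INSERT INTO person_id_map VALUES (?, ?)", [(1001, 2), (1002, 1)])

# The view exposes the surrogate ID under the usual column name and blanks
# the columns assumed here to carry PHI.
cur.execute("""
    CREATE VIEW deid_person AS
    SELECT m.masked_person_id AS person_id,
           p.gender_concept_id,
           p.year_of_birth,
           NULL AS month_of_birth,
           NULL AS day_of_birth,
           NULL AS location_id,
           NULL AS person_source_value
    FROM person AS p
    JOIN person_id_map AS m ON m.person_id = p.person_id
""")

rows = cur.execute("SELECT * FROM deid_person ORDER BY person_id").fetchall()
for row in rows:
    print(row)
```

Because the view keeps the standard column names, downstream tools can be pointed at it without changes. The same surrogate `person_id` would have to be used consistently in every other de-identified view (visits, conditions, and so on) for joins to keep working.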


I agree with Tim regarding Dr. Hripcsak. I am currently working on de-identifying our data for use with our vendors and sales teams in order to demo our products. I am using Dr. Hripcsak’s method for all dates. I’ve created a procedure that identifies around 20k patients to be part of our demo oncology practice. Using the SANT method, along with changing patient names and addresses, I have been able to completely remove all PHI. I downloaded lists of female and male first names and a list of surnames. I found these lists online at the Census Bureau: https://www.census.gov/topics/population/genealogy/data/1990_census/1990_census_namefiles.html
I downloaded a list of street addresses to use as new patient addresses. I found a great Wiki page with a list of fictional doctors, which I used to generate new doctors and practices. We wanted to be sure nothing could be used to identify anyone. https://en.wikipedia.org/wiki/List_of_fictional_doctors
Street addresses I used: https://catalog.data.gov/dataset/street-address-listing
Zip codes with state and town: https://simplemaps.com/data/us-zips

Using the patients that have been identified, I pulled the data out of OMOP and created completely new OMOP tables with the de-identified data. There is nothing in the new tables that can identify the patients in the original OMOP tables. We just got approval from our compliance team that the data can be used.

Hopping in to say that my team is also working on de-identification of an OMOP CDM. We’re currently in the methodology research & development phase (have consulted Dr. Hripcsak’s paper as well!).

Like @quinnt our current plan is to transform an identified OMOP CDM (in our “production” database) into a de-identified OMOP CDM (in our analytics database).

I’ll be happy to share learnings here as we move through the process.

Here are a few more resources I’ve found helpful:

Hi, all. The temporal de-identification paper that we wrote is probably the safest way to preserve some temporal information beyond literally deleting everything but year, so that time between events is preserved. Having Brad Malin, who guided the US OCR on their interpretation of HIPAA, as a co-author was helpful. The main cost to the method is losing about the last 6 months of data on average.

Keep in mind that initiatives like N3C have taken a slightly different and perhaps more liberal approach.

I am new to OHDSI and have a basic question. Let’s assume that I build de-identification views, as you mention, that successfully preserve patient privacy. And let’s say I want to run an on-premise pilot project to install ATLAS on top of our university CDM data. Would I need to point our local ATLAS instance to the de-identified views, or does ATLAS have application-level controls that ensure the underlying PHI will NOT be exposed during hypothesis exploration by researchers? I have played with ATLAS and I don’t think I can get to any PHI. Thank you.

hi Susan! Yes, very good questions.

On PHI: good advice is to always follow a proper de-identification process. Some folks forget that it is not just about stripping out names and identifiers, but also any low-count events or attributes that could be used to identify a person. For example, someone aged 102, a very rare condition, etc. If de-identification is done properly, it should not be possible to re-identify anyone, no matter how the information is accessed.
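To make the low-count point concrete, here is a small sketch of two common safeguards: top-coding extreme ages and suppressing small cells in summary counts. The threshold values and function names are assumptions for illustration only; the right values depend on local policy and on whichever de-identification standard applies.

```python
from collections import Counter

MIN_CELL_COUNT = 11  # assumed threshold; set according to local policy
MAX_AGE = 90         # assumed top-code; older ages are reported as MAX_AGE


def top_code_age(age, max_age=MAX_AGE):
    """Cap an age so that very old (and therefore rare) patients do not stand out."""
    return min(age, max_age)


def suppress_small_cells(values, min_count=MIN_CELL_COUNT):
    """Count each distinct value, replacing counts below min_count with None."""
    counts = Counter(values)
    return {value: (n if n >= min_count else None) for value, n in counts.items()}


ages = [top_code_age(a) for a in (45, 67, 102)]
condition_counts = suppress_small_cells(["hypertension"] * 12 + ["rare disease"] * 2)
print(ages)
print(condition_counts)
```

Neither step is sufficient by itself; combinations of otherwise common attributes can still single out a person, which is why a full de-identification process is recommended above.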

ATLAS was built to conduct data analytics and return summary-level results. It was not designed to access and view patient records. That said, it is possible to see some patient record data through the Patient profile page or Patient sampling. However, if proper de-identification is followed, you should be safe. You can also easily disable access to that page through the ATLAS admin access control tools.

Hi Dr. @hripcsa, I’m new to de-identification, and I recently read your paper where you propose the SANT method. I did a proof of concept where I transformed data from 100 intensive care patients to OMOP. The length of stay has a median of 7 days, so I was wondering whether the SANT method is appropriate for this type of data, where the time series cover a short period. My concern is: what about a patient who had an inpatient visit of only a couple of days? Since her data points are already few, truncating her data would leave valuable information from this patient out of the analysis. What could be done in this case?

Hi, @Alonso. The method in that paper truncates 1 to 366 days. If the ICU stay happens to have occurred in the truncation window, then yes, you would usually lose the whole stay. If the ICU stay occurred before the truncation window, then you would not lose it. The truncation time is set by the date of the ICU stay. It’s not that the ICU stay is lined up with the end of the truncation window. You may have ICU stays from 10 years ago. None of those would be truncated. So on average, you would lose 6 months’ worth of ICU stays.
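For readers trying to picture this, the following is a rough sketch of a shift-and-truncate scheme based only on the description in this thread. It is not taken from the paper, so check the details there before relying on it. The assumptions: each patient gets one random backward shift of 1 to 366 days, and any event whose shifted date falls after a common cutoff (database end minus 366 days) is dropped. All names are hypothetical.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 366  # range mentioned in the thread


def assign_shifts(person_ids, rng):
    """Draw one shift per patient; every date of that patient uses the same shift."""
    return {pid: rng.randint(1, MAX_SHIFT_DAYS) for pid in person_ids}


def shift_and_truncate(events, shifts, database_end):
    """Shift each (person_id, date) event back, then drop events past the cutoff.

    The cutoff is the same for everyone, so every shifted record appears to end
    at the same database end date, regardless of the patient's individual shift.
    """
    cutoff = database_end - timedelta(days=MAX_SHIFT_DAYS)
    kept = []
    for pid, event_date in events:
        shifted = event_date - timedelta(days=shifts[pid])
        if shifted <= cutoff:
            kept.append((pid, shifted))
    return kept


# Fixed shifts here so the example is reproducible; real shifts would come
# from assign_shifts and be stored where analysts cannot see them.
shifts = {1: 100, 2: 300}
events = [
    (1, date(2010, 5, 1)),   # old ICU stay: shifted but kept
    (1, date(2020, 12, 1)),  # falls in the truncated tail: dropped
    (2, date(2020, 3, 1)),   # larger shift, so this one survives
    (2, date(2020, 12, 1)),  # dropped
]
kept = shift_and_truncate(events, shifts, database_end=date(2020, 12, 31))
print(kept)
```

Under these assumptions, a patient with shift `s` loses roughly the last `366 - s` days of real time, which averages out to about six months, while stays from years earlier are only shifted, never removed.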

Thanks @hripcsa for the clarification. I’ll read your paper carefully again, and I’ll detail in this thread a practical example to confirm what I understood about the SANT method.

Does anyone know of an IRB that has approved SANT as being de-identified? What effect does the truncation have on results? How do you manage hospitalizations that span a truncation date, as it’d be inappropriate to slice the hospitalization right in the middle?

I know that SANT is in use, but I don’t know how it is presented to IRBs. I don’t know of the Office for Civil Rights commenting on it (that would be the most useful). But a co-author of the paper is OCR’s primary privacy consultant.

You slice hospitalizations in the middle, no different than the fact that the database has an end date, and anyone still in the hospital at that date also has their hospitalization sliced in the middle. In fact, that’s the point. It should look as if you reached the end of the database. If you normally throw away hospitalizations that are still in progress, then I guess you should do that. But we don’t throw them away.
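A small sketch of what "slicing" a hospitalization at the truncation date could look like. The function name is hypothetical, and recording the cutoff as the visit end date is just one possible choice; the post above does not say how the end date should be stored.

```python
from datetime import date


def censor_visit(start, end, cutoff):
    """Clip a visit at the cutoff, as if the database simply ended there.

    Returns None when the whole visit lies after the cutoff, otherwise a
    (start, end) pair whose end is never later than the cutoff.
    """
    if start > cutoff:
        return None  # the visit begins after the apparent end of the database
    if end is None or end > cutoff:
        return (start, cutoff)  # still in progress at the apparent end
    return (start, end)  # completed before the cutoff: unchanged


cutoff = date(2019, 12, 31)
spanning = censor_visit(date(2019, 12, 28), date(2020, 1, 5), cutoff)
earlier = censor_visit(date(2019, 6, 1), date(2019, 6, 4), cutoff)
later = censor_visit(date(2020, 2, 1), date(2020, 2, 3), cutoff)
print(spanning, earlier, later)
```

If an analysis normally excludes hospitalizations that are still in progress at the end of the database, the clipped visits would be the ones to exclude, consistent with the advice above.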