OHDSI Home | Forums | Wiki | Github

ETL Mapping specs for CDM v5.0?

Are there plans to publish any CDM v5.0 mapping specs on this site, as was done with v4.0? Specifically, is anyone working on v5.0 specs for Truven and/or CPRD, and plans on publishing them here?



We have been posting our stuff here but it wasn’t up to date so I fixed that.

@amatcho is working on CPRD. We just finished running it here so I’m assuming it will be available soon. I’ll follow-up once we have it online.

Thanks @ericaVoss:

We are working on an implementation that does person-at-a-time processing, and were thinking of various ways to approach the files distributed by Truven (i.e. each year of data is in a separate file). In the past, we’ve loaded all the data for each file type into a single table, then processed all years of data for that person at the same time. We’d like to shard the files and run multiple ETL processes, each process handling a set of persons.

Do you know if there are any problems processing each year independently? Some of the issues I’ve identified from working with this data (and other datasources) in the past are:

  • If enrollment coverage spans years, we may end up with multiple contiguous observation periods instead of one continuous record.

  • The current spec allows a gap of 32 days or less between enrollment periods; a gap may occur at the end / beginning of the year, and would not be detected.

  • Fields related to the Person table have derivation rules such as “use the last valid value encountered (sequenced by record date )” (eg. birth year, gender); we need to view all enrollment-detail records before building a person record.

  • Some systems have the concept of adjustment records, where negative cost numbers are encountered. These are subtracted from corresponding records from a previous claim (linked by encounter id, or some other indicator). These adjustments can occur in years different than the original claim.

  • Depending on the data aggregation time frame, claim data for a previous year may appear in subsequent years.

In general, I understand that there are a lot of unknowns when it comes to processing the data by year independently, but want to get the group’s opinion on things.



Your process is very similar to ours. We load all data over all years into a database, then our CDM_BUILDER chunks up the people and runs a person at a time. We can then distribute the work across machines.

We do not rebuild the SOURCE each time, we just append our new data to the end of it and reprocess indexes/statistics and such. The CDM is completely rebuilt each time.

Actually, giving it a little thought, you might be able to do it by year. It is more work and will increase your run time (you’ll have a lot of post processing that wouldn’t have to be done running off the full DB).

  • You would have the problem that OBSERVATION_PERIODS would be year bound, but maybe you just create a job at the end that stitches them all together at the end.

  • ERAS would be fine, you would just run them after all the years are done running.

  • For the PERSON data we take the last person’s data - so you could have a situation that a person changed gender/birth year - but you could write each year’s information out and then post processing collapse the data.

  • For collapsing adjustment claims, you could just do the adjustment in the year and just accept you might be missing opportunities to collapse some records. Truven does a fairly good job of cleaning up the data, it isn’t like collapsing is a major component anyway.

I’ll noodle on it a bit more and see if I can come up with a reason why it would be a bad idea. Also, I am assuming you’ll have to do a full CDM build each time, you’ll need to get all data on the same VOCAB. I would not recommend appending to a CDM due to needing to keep up with the VOCAB.

@donohara - made the slightest update to the document (NDCs are now date centric so anywhere we are mapping NDCs we needed to look up the date prescribed to the VALID_START_DATE and VALID_END_DATE in the VOCAB).

@amatcho is finishing up her CPRD document and I will update once we have it.

Hi Erica - I have a few questions about the CCAE/MDCR mapping spec:

  1. On page 12, (Provider table) you reference the JNJ_TRU_P_SPCLTY vocabulary. Is that table proprietary, or will you be putting it into the public domain (and uploading to OHDSI?)

  2. On page 14, (Visit Occurrence) the spec says "Extract records within observation periods where a person has both prescription benefits and medical benefits from Truven INPATIENT_SERVICES and OUTPATIENT_SERVICES tables. "

  • I am using RX=1 to indicate the person has prescription benefits
  • What fields/values are used to determine whether the person has medical benefits?
  1. On page 15, (Visit Occurrence), the spec says “Some codes will map to more than one standard concept so they will have more than one DOMAIN_ID. For instance, the ICD9 code ‘250.02’ maps to two standard concepts; One with a DOMAIN_ID of ‘Condition’ and one with a DOMAIN_ID of ‘Measurement’”
  • Should the program emit 2 records for one input code in this event, one to each output table?
  1. On page 32, (Death), the spec says: “If there are health care visits after one month (32 days) of death date, delete this record.
    • What does “this record” refer to, the death record, or the health care visit record(s)?


Also, has there been any discussion of putting sample test input data, along with the validated output CDM data, into the public domain? I.E. CDM builders can implement their own version of the specs, and then use the standard input data and verify their results against the correct output data.


  1. Yes, you are correct, I need to make JNJ_TRU_P_SPCLTY public. I will let you know when I do.
  2. Yes, for Truven all you need to do is RX=1 and you’ll have people both with RX and medical benefits. I think Optum we get some people who have RX only.
  3. We write both records, one to CONDITION and additional one to MEASUREMENT.
  4. The death record is deleted if we still see activity post 32 days of death.

Oh man @donohara, you are exposing all the things I still need to do!! :slight_smile: We have created a test data in the form of the source and we have generated a CDM from that test data. We plan to share both the test source data and generated CDM.

Test frameworks sound great. My question is how there can be a standard if there is no agreement on what a visit is? Or is there agreement that I don’t know about? The reason I ask is that a test suite is supposed to address the edge cases to ensure that the code handles the data properly.

I should probably clarify that there is, of course, agreement on what a visit is. But there are a lot of gray areas where there isn’t agreement.
One example would be whether inpatient hospitalizations that were admitted from the ED should be reported as visits. Another might be the type of visit reported in an outpatient file for observation when the person was, in fact, in the hospital (applies to Medicare). There are other nuances, including whether people want to link all physician records to a hospitalization (some data sources to this already – MarketScan or Optum I believe). How does one distinguish visits to different providers on the same day?


Well, yes, it can be argued that there is some wiggle room on what people think should define a VISIT. But we did write up a paper on our approach of how to defined a VISIT in claims data sets:
Voss EA, Ma Q, Ryan PB. The impact of standardizing the definition of visits on the consistency of multi-database observational health research. BMC Med Res Methodol. 2015 Mar 8;15:13. doi: 10.1186/s12874-015-0001-6. PubMed PMID: 25887092; PubMed Central PMCID: PMC4369827.

But back to our test cases, like you said, they are designed to test that our CDM_BUILDER implements the logic laid out in our Truven documentation. Obviously if we disagree on an implementation then the test cases are less useful. Either way, the test cases solidify how we were thinking about some of the logic - then anyone could see given a specific input what we thought the output should be, per our logic.

Thanks for the reference! I have not read it yet, but I am guessing it would be good to reference it in the CDM specs.

We also put together a test framework for the Medicare ETL. Don’t think anyone actually tested their ETL code against it, but having that Truven one available will help out a lot of people. No rush, of course. :smile:


Want to take it to the CDM WG and create a final binding recommendation for all cases?

1 Like

@Christian_Reich - of course!

@donohara - I got the Vocabulary additions in GitHub:

We put these into SOURCE_TO_CONCEPT_MAP.

I will get the test cases together and @amatcho CPRD document soon.

Hi Amy (@amatcho): Do you know when the CPRD mapping spec for CDM v5 will be posted here? Can you upload what you have, even if it’s not finalized?


Amy sent it to me and it is on my to do list, soon!

@donohara, here is @amatcho’s CPRD document and Vocabulary additions.

I’ll get out test data and CDM output later this week for both Truven/CPRD.

Dear Erica,
where could I find an empty/template version of this document? I could only find one on the OMOP.org site for V4.
Thanks in advance.


I don’t think there is one. :sweat:

The best I could suggest is take one of the CDM documents, like the TRUVEN one and strip out the text. If you do that please send it to me and we’ll find somewhere to put it so other people can grab it too.

Your best bet is to use WhiteRabbit, which will produce a word document
that’ll serve as a useful template to get you going.


Haha, @Patrick_Ryan is always right.

How about thinking about it like this: