OHDSI Home | Forums | Wiki | Github

Vexing issues with CDM v5 vocabulary files

In our recent work on finalizing the ETL for the ETL-CMS project we found numerous problems with the CDM v5 vocabulary files that have us scratching our head. I started to create a list, but realized it might be best to tackle these issues one at a time in a series of posts. I’m hoping to get clarity on why we see the following issues and how people have been accommodating these problems with their existing ETLs.

The vocabulary specification allows for a source concept to be mapped to more than one target concept. The type of use cases where this make the most sense are, for example, when a drug also requires a procedure to administer, so the source concept might have a “Maps to” to a target vocabulary RxNorm term as well as a CPT4 term. Unfortunately, most of the multiple mappings we see are not of this type. A drug term will map to two different drug terms, or an ICD9 code will map to two different places in the SNOMED vocabulary. This leads to at least two problems. The first problem is that if we create records for both terms (sometimes 3), it will look like multiple prescriptions were given, or multiple Conditions were diagnosed at different levels of specificity when there was only one. A more troubling issue is that some of these terms appear to be wrong. The source dosage matches the target dosage for one term, but not the other, or the source drug name matches one target correctly, but not the second target. For example:

The OHDSI concept_id 44844261, “Metoprolol Tartrate 25MG Oral Tablet”, from the NDC vocabulary, maps to the following two RxNorm terms:

  • 40167213 Metoprolol Tartrate 25 MG Oral Tablet Drug
  • 40166828 24 HR metoprolol succinate 25 MG Extended Release Oral Tablet Drug

The second one does not appear to be correct: succinate != Tartrate.

Another example:
44843559 Oxygen 99 % Gas for Inhalation Drug NDC 11-digit NDC 10109123403, maps to:

  • 19025280 Oxygen 99 % Gas for Inhalation Drug RxNorm
  • 19025301 Oxygen 99.5 % Gas for Inhalation Drug RxNorm

Why not only map to the first one, since the source concept is 99% oxygen, not 99.5%?

We see 20372 NDC codes with 2 or more mappings, and every one of those mappings goes to a Drug vocabulary term. Thus I think we may be mapping to problematic synonyms in some or all of these cases.

In the case of ICD9CM codes there are instances of conditions mapping to both a SNOMED Condition and a SNOMED observation, which seems reasonable. However, consider the ICD9CM example of 44820348 (E954), Suicide and self-inflicted injury by submersion [drowning]. It maps to 3 SNOMED codes:

  • 435174 Suicide - drowning Observation (SNOMED:287192004)
  • 440925 Suicide Observation (SNOMED: 44301001)
  • 40421994 Suicide and selfinflicted injury by drowning Condition (SNOMED:219141008)

The first code (435174) is a more specific code that seems to be the appropriate mapping for the observation, whereas, the second code is the parent code for the first in the SNOMED hierarchy. It does not seem appropriate to include both the daughter and the parent code, as that is what the CONCEPT_ANCESTOR hierarchy is for.

More issues to follow, but I’ll stop here for tonight.



Thanks for reporting these.

However, you need to exclude the deprecated mappings (where invalid_reason is null). Then, all your examples will produce the correct result.

I am thinking of taking away the deprecated CONCEPT_RELATIONSHIP records altogether, because they are just confusing people, and old deprecated concepts are not useful to folks. Only to the guys who develop the vocabulary. It would also reduce the size of the download. I’ll bring it up in the CDM Workgroup.

Thanks, @Christian_Reich, this is a huge help!

Dropping the CONCEPT_RELATIONSHIP records that have non-null invalid_reason cleans up most of my issues, and restores my faith in the CDM (yes it was shaken). With that exclusion, there are no NDC->RxNorm 1-to-many mappings. There are 343 IDC9CM codes with 1-to-many, but spot checking them suggests they make sense. There are no CPT4 1-to-many mappings.

There are, however, 20 HCPCS codes with 2 map entries. They all have “U” in the invalid_reason field of the CONCEPT record, but have valid map entries. These 20 are almost all injections where an RxNorm Drug code is generated for the specific drug, as well as an HCPCS Drug code for the specific drug. Is there a reason for this? It doesn’t seem to be a matter of listing a drug and a separate procedure. See for example, CONCEPT_ID 44786378 which maps to both 42901271 and 44786566.

I’m less worried about these 20 codes, than the numerous HCPCS codes that we see in our CMS source data, that are marked invalid_reason “U” in the CONCEPT table, and have no “Maps to” record in the CONCEPT_RELATIONSHIP field. What do you recommend we do with concepts with “U” or “D” in their invalid_reason column? They are actually used in the source data for the CMS data, and thus should be mapped to something, I would think.

Also, since HCPCS is a destination vocabulary for the CDM, why are there numerous HCPCS codes with an empty invalid_reason, but no “Maps to” record to itself in the CONCEPT_RELATIONSHIP table? There appear to be 170 of these. Conversely, there are 309 HCPCS codes with invalid_reason “D” or “U” that have valid “Maps to” relations.


How could you!!! :smile:

We did a lot of clean up in the past months.

Working on it as we speak, actually. Should be clean in upcoming release, and stay that way going forward. Since we don’t always know what the exact product is we sometimes have to map to several Ingredients or Clinical Drug Components, if they are combination drugs.

These are all procedure drugs. They have no mapping today. Some of them will get one (like 45890798 “Injection, epoetin beta, 1 microgram, (for non esrd use)”), others will not and therefore end up in a record with concept_id=0 (e.g. 2718685 “Contraceptive supply, hormone containing patch, each”).

Te next big issue for us along similar lines, is that within the CMS source data we are seeing a very large number of NDC codes that are marked “D” for INVALID_REASON in the CONCEPT table, Yet these codes have valid CONCEPT_RELATIONSHIP records that map to RxNorm codes that are not marked “D” or “U”.

For instance, NDC 58016005300 with CONCEPT_ID=44986921, Olmesartan medoxomil 40 MG Oral Tablet [Benicar] has INVALID_REASON “D”. It has a valid “Maps to” record in CONCEPT_RELATIONSHIP to 19096752, which is the RxNorm concept for the same drug, which in term has a self-map.

Can we safely follow maps from CONCEPT records that have non-null INVALID_REASON entries? Why are these marked “D”? I’m really confused about how to handle non-null INVALID_REASON entries in the CONCEPT and CONCEPT_RELATIONSHIP tables. Is there some documentation on this somewhere?

Thanks again!

Yes, they are fine.

Here is the deal: The validity flags indicate whether the concept is correct and can be used. Not that the product is on the market or something like that. With NDC, howevr, codes can be reused. Therefore, the dates indicate when that NDC code was representing a particular drug product. But time has moved on since. So, if your data containing an NDC code is recorded at a time within the validity you are ok. If it is after the validity has expired, you should be ok as well. If it is before, there is a theoretical possibility that the code had been used for something else. Your ETL can now just drop it (map to concept_id=0) or use it anyway, knowing that the NDC data are pretty unreliable. That’s your decision. In reality, since those re-uses are very rare, most people just ignore the NDC valid dates.

In the documentation about the CONCEPT table. The CONCEPT_RELATIONSHIP table doesn’t have an explicit text. And again, we may just kick the deprecated ones out, to stop the confusion.

That seems to work for us. Would you say we can use deprecated codes in the same way for HCPCS, CPT4, and ICD9CM vocabularies? That is, if the code was valid during the dates our data was generated, we can follow these maps?

One big issue, we still see that over 1/3 of our NDC codes do not have an undeprecated CONCEPT_RELATIONSHIP map, but in most cases they appear to have a deprecated map – which you say we should ignore.

For example, NDC code 11523129202, CONCEPT_ID 45014164 has two deprecated maps with different VALID_START_DATE entries, one to 42799122, and one to 19130246, which both have “U” in the INVALID_REASON in their CONCEPT records.

The CMS source records were generated from 2008-2010, so are not new NDC concepts. Can we follow these deprecated CONCEPT_RELATIONSHIP links that you were advocating removing in your earlier posts, and pass through some other level of indirection to get RxNorm concepts?


The next thing we have run into is that consistency checks on the vocabulary files error out when it is discovered that there are CONCEPT_RELATIONSHIP, records that refer to concepts not in the CONCEPT file. Similarly, with CONCEPT_SYNONYM and CONCEPT_ANCESTOR.

For CONCEPT_RELATIONSHIP, there are 118,116 relations with blank INVALID_REASON, where CONCEPT_ID_1 or CONCEPT_ID_2 do not exist in the CONCEPT table.

For CONCEPT_SYNONYM, there are 62,074 records where the CONCEPT_ID does not exist in the CONCEPT table.

For CONCEPT_ANCESTOR, there are 414,886 records whose ANCESTOR_CONCEPT_ID or DESCENDANT_CONCEPT_ID do not exist in the CONCEPT table.

The OMOP CDM v5 constraints found here, finds these problems on these lines of code:

ALTER TABLE new_dataset.concept_relationship ADD CONSTRAINT fpk_concept_relationship_c_1 FOREIGN KEY (concept_id_1)  REFERENCES new_dataset.concept (concept_id);

ALTER TABLE new_dataset.concept_relationship ADD CONSTRAINT fpk_concept_relationship_c_2 FOREIGN KEY (concept_id_2)  REFERENCES new_dataset.concept (concept_id);

ALTER TABLE new_dataset.concept_synonym ADD CONSTRAINT fpk_concept_synonym_concept FOREIGN KEY (concept_id)  REFERENCES new_dataset.concept (concept_id);

ALTER TABLE new_dataset.concept_ancestor ADD CONSTRAINT fpk_concept_ancestor_concept_1 FOREIGN KEY (ancestor_concept_id)  REFERENCES new_dataset.concept (concept_id);

ALTER TABLE new_dataset.concept_ancestor ADD CONSTRAINT fpk_concept_ancestor_concept_2 FOREIGN KEY (descendant_concept_id)  REFERENCES new_dataset.concept (concept_id);

Once we filtered out these bad records we were able to properly load our data into the v5 CDM. We checked every box that we could on the vocabulary download page (downloaded today), so it is a mystery what these concepts are that are referred to in these 3 files.

I’d like to +1 vote the removal of invalid relationships from the released CDM. I can’t count the number of times that we forgot a ‘where invalid_reason is null’ in our queries that threw off an analysis.

You follow the Maps to no matter what. If the concept is a Standard Concept it will map to itself. If not, it will pass it on to an equivalent concept.


NDCs you actually have in the data? Let me check it out those cases.

That’s strange. Those constraints are always on when we build the thing. So, this must be something in the download process. Do you still have the zip file?

Thanks for checking all these things out. It’s very helpful.

Which probably means we get rid of the validity information altogether.

I don’t have the zip file, but I’ve seen these errors just using the default vocabularies that are checked in the download process as well as when checking all of them.

Understood. So, here is how it works. When we build the whole system with the vocabularies, all fields are under constraints. Which means it’s clean. When you run Athena to get your extract it should make sure it does so with referential integrity. However, that is not being checked separately. There is no interim database with the constraints on that the files are written from. Therefore, I need to get the constellation you had to have it debugged.

Let me try the checked vocabularies. Problem is we changed the checks. :frowning: So, hopefully we can reproduce it.

I was intentionally careful to say ‘invalid relationships’ because there seems to be some cases where we ignore the INVALID_REASON on concepts when doing the mapping (such as NDC -> rxNorm, you look for the valid relationship, but ignore the ndc might be invalid…I think I’m remembering that properly).

I would like to drop validity information and consider all information in the vocabulary ‘valid’. In the same context, I’d like to get clarification on the valid_from -> valid_to. It seems to me that when doing the ETL mapping, we look for the NDC code that was the valid source concept on the date of the exposure (it could be many years int he past) and then look for the corresponding relationship that is valid for that same period of time that maps to the standard concept that is valid for that period. That sounds like a lot of ‘date validity’ checks but I kinda like it. We were doing a mapping exercise for a study and there was some debate about the purpose of one of the ICD9s that was involved in the study. Some people thoguht it was appropriate, others thought it was not, but as it turns out, they were both right: there was an ammendum in somewhere around 2010 that chnaged the intent of the ICD9 code, meaning that before 2010, it would have mapped to one standard concept, and afterwards, a different one…

I might have overstated how many NDC codes are a problem. I found 2440 in the CMS SynPUF data (years 2008-2010) that have a deprecated CONCEPT_RELATIONSHIP record, but no undeprecated one. I will email this list to you.

Thanks again!

For our data network we are seeing similar issues in the condition occurrence domain, where we have ICD9 and ICD10 codes that have a deprecated mapping to a snomed code.

In some cases the snomed code appears to have been updated by another snomed code, but there exists no mapping from the ICD9 code to the newly updated SNOMED Code.

Examples below:
44821263 4209083 D Maps to 2014-09-30 00:00:00 2013-10-10 00:00:00
44832477 444246 D Maps to 2014-09-30 00:00:00 2012-08-31 00:00:00

concept_class_id | concept_code | concept_id | concept_level | concept_name | domain_id | invalid_reason | standard_concept | valid_end_date | valid_start_date | vocabulary_id
Clinical Finding | 440181000 | 4209083 | | Apparent life-threatening event | Condition | U | | 2015-07-30 | 2009-01-31 | SNOMED

Reading the thread, the advice given is that we should map these codes to 0? Will these codes receive a mapping going forward? How should we proceed?


Is it possible you never ran the CPT4 utility? Without it, you won’t have any CPT4 in the CONCEPT table (but you would in the CONCEPT_RELATIONSHIP and CONCEPT_ANCESTOR tables, because they don’t contain concept_name fields).


Well, the first is really a dumb code: “Apparent life threatening event in infant”. How do you do analytics on that? The second one (“Cirrhosis of liver without mention of alcohol”) needs to be mapped, though. Nothing wrong with it. It’s not clear why it got deprecated.

Here is the deal: The ICD9CM mapping need a final revision (before we stop touching them). It’s on the list. Not super urgent though. Till then, you should be able to use the existing maps. They will have a few booboos like the above, but 99.9% should be just fine.

I started again from the vocab_download_v5 zip file, ran the command “java -jar cpt4.jar” and did the consistency checks, and I still have several hundred thousand codes used in CONCEPT_RELATIONSHIP.csv, CONCEPT_ANCESTOR.csv, and CONCEPT_SYNONYM.csv that do not appear in CONCEPT.csv.

For example, if I run this python code from within the vocabulary directory it will print out all the CONCEPT_RELATIONSHIP records that have no matching CONCEPT:

d  = {}
f2 = open("CONCEPT.csv","r");
line = f2.readline()
for line in f2:
    v = line.split("\t");
    d[v[0]] = True

f1 = open("CONCEPT_RELATIONSHIP.csv","r")
line = f1.readline()
print line,
for line in f1:
    v = line.split("\t");
    if (not d.has_key(v[0])) or (not d.has_key(v[1])):
        print  line,

Have you tried retrieving and checking the vocabulary files through the Athena download process?

Thanks again for looking into these issues.