
Vexing issues with CDM v5 vocabulary files

I was intentionally careful to say ‘invalid relationships’ because there seem to be some cases where we ignore the INVALID_REASON on concepts when doing the mapping (such as NDC -> RxNorm: you look for the valid relationship but ignore that the NDC itself might be invalid… I think I’m remembering that properly).

I would like to drop validity information and consider all information in the vocabulary ‘valid’. In the same context, I’d like to get clarification on the valid_from -> valid_to. It seems to me that when doing the ETL mapping, we look for the NDC code that was the valid source concept on the date of the exposure (which could be many years in the past), and then look for the corresponding relationship that is valid for that same period of time and maps to the standard concept that is valid for that period. That sounds like a lot of ‘date validity’ checks, but I kind of like it. We were doing a mapping exercise for a study, and there was some debate about the purpose of one of the ICD9s involved in the study. Some people thought it was appropriate, others thought it was not, but as it turns out, they were both right: there was an amendment somewhere around 2010 that changed the intent of the ICD9 code, meaning that before 2010 it would have mapped to one standard concept, and afterwards, a different one…
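
To make that concrete, here is a minimal sketch of what such a date-aware lookup could look like. The function and field names are purely illustrative (not from any OHDSI tool), and it assumes the CONCEPT and CONCEPT_RELATIONSHIP rows have been loaded into lists of dicts with the dates already parsed:

def map_source_code_on_date(code, vocab, event_date, concepts, relationships):
    # Find the source concept whose validity window covers the event date.
    sources = [c for c in concepts
               if c["concept_code"] == code
               and c["vocabulary_id"] == vocab
               and c["valid_start_date"] <= event_date <= c["valid_end_date"]]
    # Follow a 'Maps to' relationship that was also valid on that date.
    for src in sources:
        for rel in relationships:
            if (rel["concept_id_1"] == src["concept_id"]
                    and rel["relationship_id"] == "Maps to"
                    and rel["valid_start_date"] <= event_date <= rel["valid_end_date"]):
                return rel["concept_id_2"]
    return 0  # no standard concept was valid on that date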

I might have overstated how many NDC codes are a problem. I found 2440 in the CMS SynPUF data (years 2008-2010) that have a deprecated CONCEPT_RELATIONSHIP record, but no undeprecated one. I will email this list to you.
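
For anyone who wants to reproduce this kind of check, here is a sketch (assuming tab-delimited v5 files with lowercase column headers; adjust to your copy):

import csv
from collections import defaultdict

# Group the 'Maps to' rows by source concept and remember whether each
# row is deprecated (non-empty invalid_reason).
maps_to = defaultdict(list)
with open("CONCEPT_RELATIONSHIP.csv") as f:
    for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row["relationship_id"] == "Maps to":
            maps_to[row["concept_id_1"]].append(row["invalid_reason"])

# Source concepts where every 'Maps to' row is deprecated; to restrict
# this to NDC, join against the NDC concept_ids from CONCEPT.csv.
orphaned = [cid for cid, reasons in maps_to.items() if all(reasons)]
print(len(orphaned))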

Thanks again!

For our data network we are seeing similar issues in the condition occurrence domain, where we have ICD9 and ICD10 codes that have a deprecated mapping to a SNOMED code.

In some cases the SNOMED code appears to have been updated by another SNOMED code, but there exists no mapping from the ICD9 code to the new SNOMED code.

Examples below:
CONCEPT_ID_1 CONCEPT_ID_2 INVALID_REASON RELATIONSHIP_ID VALID_END_DATE VALID_START_DATE
44821263 4209083 D Maps to 2014-09-30 00:00:00 2013-10-10 00:00:00
44832477 444246 D Maps to 2014-09-30 00:00:00 2012-08-31 00:00:00

SNOMED CODE
concept_class_id | concept_code | concept_id | concept_level | concept_name | domain_id | invalid_reason | standard_concept | valid_end_date | valid_start_date | vocabulary_id
-----------------+--------------+------------+---------------+---------------------------------+-----------+----------------+------------------+----------------+------------------+--------------
Clinical Finding | 440181000 | 4209083 | | Apparent life-threatening event | Condition | U | | 2015-07-30 | 2009-01-31 | SNOMED
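
In the meantime, one could in principle chase the upgrade chain by hand. Here is a sketch, assuming a dict prebuilt from the CONCEPT_RELATIONSHIP rows whose relationship_id is ‘Concept replaced by’ (which, as far as I can tell, is how such ‘U’ upgrades are recorded):

def follow_replacements(concept_id, replaced_by, max_hops=10):
    # Walk 'Concept replaced by' links until reaching a concept with no
    # successor; guard against cycles and runaway chains.
    seen = set()
    while concept_id in replaced_by and concept_id not in seen and max_hops > 0:
        seen.add(concept_id)
        concept_id = replaced_by[concept_id]
        max_hops -= 1
    return concept_id

Of course that only repairs the SNOMED side; the ICD9 code itself still has no valid ‘Maps to’ row.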

Reading the thread, the advice given is that we should map these codes to 0? Will these codes receive a mapping going forward? How should we proceed?

@Christophe_Lambert:

Is it possible you never ran the CPT4 utility? Without it, you won’t have any CPT4 in the CONCEPT table (but you would in the CONCEPT_RELATIONSHIP and CONCEPT_ANCESTOR tables, because they don’t contain concept_name fields).

@burrowse:

Well, the first is really a dumb code: “Apparent life threatening event in infant”. How do you do analytics on that? The second one (“Cirrhosis of liver without mention of alcohol”) needs to be mapped, though. Nothing wrong with it. It’s not clear why it got deprecated.

Here is the deal: The ICD9CM mappings need a final revision (before we stop touching them). It’s on the list. Not super urgent, though. Till then, you should be able to use the existing maps. They will have a few booboos like the above, but 99.9% should be just fine.

I started again from the vocab_download_v5 zip file, ran the command “java -jar cpt4.jar” and did the consistency checks, and I still have several hundred thousand codes used in CONCEPT_RELATIONSHIP.csv, CONCEPT_ANCESTOR.csv, and CONCEPT_SYNONYM.csv that do not appear in CONCEPT.csv.

For example, if I run this Python code from within the vocabulary directory, it will print out all the CONCEPT_RELATIONSHIP records that have no matching CONCEPT:

# Collect the concept_ids present in CONCEPT.csv (first column).
concept_ids = set()
with open("CONCEPT.csv", "r") as f:
    f.readline()  # skip the header
    for line in f:
        concept_ids.add(line.split("\t")[0])

# Print every CONCEPT_RELATIONSHIP row whose concept_id_1 or
# concept_id_2 does not appear in CONCEPT.csv.
with open("CONCEPT_RELATIONSHIP.csv", "r") as f:
    print(f.readline(), end="")  # header
    for line in f:
        v = line.split("\t")
        if v[0] not in concept_ids or v[1] not in concept_ids:
            print(line, end="")

Have you tried retrieving and checking the vocabulary files through the Athena download process?

Thanks again for looking into these issues.

@Christian_Reich: re: “Well, the first is really a dumb code: “Apparent life threatening event in infant”. How do you do analytics on that?” Term of art - think of it as “chest pain” for infants.

For additional analytic fun, what I’ll call “sociology” is keeping the diagnostic landscape interesting: ALTEs are now officially BRUEs (http://emedicine.medscape.com/article/1418765-overview). But I trust in the ability of concept_relationship to iron out the bumps between “near-SIDS”, “ALTE”, and “BRUE” and make standard analytics possible. :slight_smile:

Interesting. I learned something new.

Yes. On our side it passes the Oracle constraints requiring a concept_id for every concept_relationship record. Can you give us the list it prints?

I tracked down the problem. On Linux:

java -jar cpt4.jar

does not properly append to CONCEPT.csv, but rather produces a lower-case concept.csv. I thought this problem had been fixed, per this post. If I append concept.csv to CONCEPT.csv, the problems with CONCEPT_RELATIONSHIP.csv, CONCEPT_ANCESTOR.csv, and CONCEPT_SYNONYM.csv having terms that do not appear in CONCEPT.csv go away.
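
For anyone else hitting this before the jar is fixed, the append itself can be done in a couple of lines, e.g. in Python (this assumes the generated concept.csv does not carry its own header row; check your copy first):

# Concatenate the lower-case concept.csv produced by cpt4.jar
# onto the end of CONCEPT.csv.
with open("CONCEPT.csv", "ab") as out, open("concept.csv", "rb") as extra:
    out.write(extra.read())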


@Christophe_Lambert:

Ah. Regression problem. Sorry about that. Glad it works. Keep it coming.

We are still on the NDC problem, but it looks like RxNorm made a change to prefabricated injectables (like syringes with the substance in there as powder): instead of the concentration, they now provide the total amount. They provide new codes for these and proper upgrades for the old ones, but only for those that are currently prescribable. I am still finding out and will ask RxNorm what’s going on. It may take a little.


We have examples of condition codes from various sites’ source data in our network that currently have a deprecated mapping in concept_relationship. Would you like us to send you our full list for your team to investigate?

@burrowse: Be so kind.

Hi @Christian_Reich,

I believe cpt4.jar is either broken again, or there is a problem with the server side. I downloaded the vocabulary with the default vocabularies on 6/12/2016, and got this error:

$java -jar cpt4.jar 5
Exception in thread "main" com.sun.xml.internal.ws.fault.ServerSOAPFaultException: Client received SOAP Fault from server: Java heap space Please see the server log to find more detail regarding exact cause of the failure.
at com.sun.xml.internal.ws.fault.SOAP11Fault.getProtocolException(SOAP11Fault.java:178)
at com.sun.xml.internal.ws.fault.SOAPFaultBuilder.createException(SOAPFaultBuilder.java:116)
at com.sun.xml.internal.ws.client.sei.StubHandler.readResponse(StubHandler.java:238)
at com.sun.xml.internal.ws.db.DatabindingImpl.deserializeResponse(DatabindingImpl.java:189)
at com.sun.xml.internal.ws.db.DatabindingImpl.deserializeResponse(DatabindingImpl.java:276)
at com.sun.xml.internal.ws.client.sei.SyncMethodHandler.invoke(SyncMethodHandler.java:104)
at com.sun.xml.internal.ws.client.sei.SyncMethodHandler.invoke(SyncMethodHandler.java:77)
at com.sun.xml.internal.ws.client.sei.SEIStub.invoke(SEIStub.java:147)
at com.sun.proxy.$Proxy34.getCode(Unknown Source)
at org.odhsi.utils.cpt.Application.process(Application.java:113)
at org.odhsi.utils.cpt.Application.main(Application.java:84)

Has anyone else had this problem? It occurs on both Windows and Linux.

Hello @Christophe_Lambert!

Can you try to run it with “java -Xmx512M -XX:MaxPermSize=128m -jar cpt4.jar 5”?

It now works both with and without specifying the extra command line arguments. Perhaps it was an intermittent issue with the server.

@Vladimir_Nikolaenko:

Can we improve the error messaging of the utility? The dump of the stack is probably not that useful.

@Vladimir_Nikolaenko
Also, I was surprised to see that when running cpt4.jar twice on the same data, I got the 14580 CPT4 codes tacked onto the CONCEPT.csv file in a different order each time. I would recommend that the output be stable from one run to the next – perhaps printing the codes in the same order as they appear in CONCEPT_CPT4.csv. That way, we can readily run a diff between changing versions of the vocabularies. Note: I am not 100% sure that it is not a matter of CONCEPT_CPT4.csv coming back in two different orders (sorry).
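
Here is a sketch of the kind of post-processing that would make the diffs stable today, sorting just the appended block by concept_id (the 14580 count is from my run above; illustrative only):

# Re-sort the freshly appended CPT4 rows by concept_id so two runs of
# cpt4.jar over the same data yield byte-identical CONCEPT.csv files.
with open("CONCEPT.csv") as f:
    lines = f.readlines()
header, rows = lines[0], lines[1:]
n_cpt4 = 14580  # number of rows the utility appended in my run
stable = rows[:-n_cpt4] + sorted(rows[-n_cpt4:],
                                 key=lambda l: int(l.split("\t")[0]))
with open("CONCEPT.csv", "w") as f:
    f.writelines([header] + stable)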

Yes, we’ll do that.

Thx, we’ll think about it
