I would like to know if the codes in the International Classification of Diseases for Oncology (ICD-O) are included in the OMOP vocabularies? I did a quick search on some of the code descriptions and there are matches against SNOMED, but not against ICD10CM. These codes look like Mxxxx/x.
Hello @ericaVoss,
If you happen to know, could you please share the timeframe in which we may expect to see the ICD-O codes included in the OMOP vocabulary?
Thank you!
This is good stuff! If it worked, it would make our life a lot easier.
Couple of questions and observations:
We should have this debate in the Forum, for example here. So everybody can see it. If you don’t mind, I’ll transfer it there. Discussing this in comments to a Google doc is also awkward.
I was trying to follow your example. I can go from the OMOP 4187868 “pTis: Ductal carcinoma in situ” (which is SNOMED 373176000) to the UMLS CUI C0007124, and from there to the NCI code C2924. But that’s already funny: C2924 is “Ductal Carcinoma In Situ of Breast”. The original SNOMED concept had no breast in it. Obviously, all ductal carcinomas are in the breast, as no other organ has “ducts” that can degenerate into a malignant disease, but it just isn’t a synonym. The ICD-O-3 “Synonym” in the above example is indeed 8500/2, but it only represents the morphology. The breast is now dropped.
I tried to figure out what is going on and it looks like this:
Which means that the pre-coordinated concepts of Morphology and Topography are not covered through this NCI crosswalk, but the SNOMED Morphology Concepts are. Which is good, because SNOMED gives us those links.
I am going to have the NCI Metathesaurus downloaded and linked to SNOMED through UMLS. Maybe, on Tuesday we can have some numbers of what it looks like in real life.
Rimma is worried about two aspects: that the linking of ICD-O-3 to SNOMED goes through those duplicate ambiguous codes, and that the level of granularity of Morphology and Topography will not be properly reflected in SNOMED.
I’m more than happy to continue the discussion here @Christian_Reich, and thanks a lot for taking the time to go through the example. I’ll add my thoughts re. each of your points
Re. 1. I agree, this is better than the Google doc
Re. 2. Well, the relationship between UMLS CUI C0007124 and SNOMED 373176000 comes via UMLS itself, atom A3805436. And the one with NCI C2924 comes via atom A7644506 (this is not something I came up with). You can check this online on the UMLS website, or I can give you the SQL statement if you have a local instance of UMLS.
So far we agree that this would give you only the histology code from ICDO. The way you get to the site code is via a relationship called ‘disease_has_associated_anatomic_site’. Below is how it looks in UMLS RRF format
just search in that page for ‘Disease_Has_Associated_Anatomic_Site’ then click on the Breast link. There (under by source, ICDO) you will see that breast maps to ICDO site C50-C50.9 (meaning 50, 50.1, …, 50.9)
That’s the way we currently link together histology and site ICDO codes
Re. 3. Please refer to the previous example, and use the ‘Disease_Has_Associated_Anatomic_Site’ relationship to find the corresponding ICDO site codes for those histologies
Re. 4. Great, let me know if you need help with that, I have some custom scripts to load it into our databases
Re 5. Not sure which ambiguous codes you are referring to. In terms of ICD-O-3 to SNOMED, I agree, most likely you won’t find all ICD-O histology codes in SNOMED since they serve different purposes.
Finally, I would like to add why I would encourage you to consider the approach that UMLS and NCI use, i.e. ‘Disease_Has_Associated_Site’ instead of creating a new OMOP concept per each histology and site combination. The reason is that in all tumor registries, including ours (MSK), you will encounter a histology code, e.g. DCIS 8500/2, in combination with any site code corresponding to breast (C50 - C50.9).
If I understand the proposal correctly you are basically saying
let’s create a new OMOP concept for DCIS histology, e.g. 12345 (I’m just making it up)
let’s create a new OMOP concept for Upper-inner quadrant of breast, e.g. 23456
then, let’s create a new OMOP concept that associate both, something like 'DCIS of the Upper-inner quadrant of breast’ (which I’ve personally never seen), e.g. 34567
My point is that with this approach you will have to create one new concept for each histology and site combination (which I have personally never seen before and will become non-practical in my opinion).
For instance
create a new OMOP concept for Lower-inner quadrant of breast (LIQ), e.g. 4444
create a new OMOP concept for 'DCIS of the Lower-inner quadrant of breast’, e.g. 5555 (which once again looks awkward to me)
Instead I would add the histology codes, then the site codes, and bring them together via a possible new relationship called hasSite with inverse isSiteOf
This will avoid the creation of unnecessary concepts: in the example above you would avoid the creation of concepts 34567 and 5555. In my opinion it is much cleaner, and easier to maintain, since both categories of concepts could evolve separately
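To make the hasSite / isSiteOf idea concrete, here is a minimal sketch in Python. It is not the OMOP schema: the concept IDs are the made-up numbers from this post, and the `link_site` and `sites_of` helpers are hypothetical names used only for illustration.

```python
# Hypothetical concepts, reusing the made-up IDs from the post above.
concepts = {
    12345: "DCIS (ICD-O-3 histology 8500/2)",
    23456: "Upper-inner quadrant of breast",
    4444: "Lower-inner quadrant of breast",
}

# Each row is (concept_id_1, relationship_id, concept_id_2).
relationships = []


def link_site(histology_id, site_id):
    """Record the forward relationship together with its inverse."""
    relationships.append((histology_id, "hasSite", site_id))
    relationships.append((site_id, "isSiteOf", histology_id))


def sites_of(histology_id):
    """Return the site concept IDs linked to a histology concept."""
    return sorted(
        c2 for c1, rel, c2 in relationships
        if c1 == histology_id and rel == "hasSite"
    )


link_site(12345, 23456)
link_site(12345, 4444)

for site_id in sites_of(12345):
    print(concepts[12345], "hasSite", concepts[site_id])
```

Two sites are attached to one histology with three concepts in total; the combined concepts 34567 and 5555 are never created.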
Again, just my personal opinion based on my understanding of the proposal (just trying to help here).
The proposal does not suggest creating a new concept for every histology and site combination. Just every combination deemed by an authoritative source to be possible and valid. That authoritative source being the ICD-O-3 SEER Site/Histology Validation List. Retrievable from here: https://seer.cancer.gov/icd-o-3/.
I think it is important to measure the data loss that would be caused by mapping to SNOMED via UMLS/NCI Thesaurus/Metathesaurus. How many ICD-O-3 histologies would be mapped to the same SNOMED code paired for the same site? My own experience with oncology professionals is that a vocabulary like ICD-10 is too ambiguous. There is a perception it floats free from the working language of how pathologists declare cancer diagnoses. My instinct is that SNOMED will fail on the same point. Though only SQL queries measuring the data loss will settle that.
I believe promoting ICD-O-3 as a first class vocabulary will make OHDSI/OMOP more generally useful to the oncology community. I don’t believe SNOMED is on the radar of most users of oncology data. I believe forcing its use will turn people off of choosing OHDSI/OMOP as a platform for oncology data. Though this could be just an America-centric perspective.
I just want to clarify some of the terminology here. ICD-O-3 is a combination of vocabularies. It is kind of like Snomed in that it requires several codes to communicate the complete clinical situation.
By definition, ICD-O-3 includes location (where is the cancer), histology (what does the cell look like), behavior (malignant, benign, etc), and grade (amount of differentiation). Each of these is a set of codes – its own vocabulary. Location is adapted from ICD-10. Histology and behavior come combined (8500/2 in the example above where /2 means in situ and 8500 is the cell type), although many people think of them separately. Grade is sometimes left out (IARC doesn’t seem to use it). And laterality is not included in ICD-O-3.
I don’t have a lot of experience with Snomed, but in my brief search, I did not see that Snomed was organized in a compatible way. Sometimes some pieces were grouped together in ways that didn’t align with the way the oncology world thinks about cancer. I don’t have a specific example, but it was something like this: Snomed would have a code for malignant neoplasm of the breast, which isn’t something that ICD-O-3 can say (because Snomed has location and the behavior part of histology, but no histology). Also, blood cancers (lymphoma and leukemia) are mostly grouped on histology whereas solid tumors are mostly grouped by location.
After thinking about it for a long time, we decided that we would use oncology vocabularies for our own data modeling. It was the only way to guarantee that we could query what we wanted. So, I agree with Michael. I would be happy if a Snomed expert would solve the problem, but I couldn’t do it.
I should have been a little more precise with my terminology above. Morphology refers to the combination of histology, behavior and grade. Though histology and behavior are together in one set of codes and grade is its own set of codes.
Ah! Excellent. I looked briefly, and on first glance it looks like those who have the topography pre-coordinated contain that link, and the others morphology-only don’t. That’s what you would expect.
Except that you have to link “Small cell carcinoma” and “Small cell carcinoma of the lung”. Because the former has the histology “property”, and the latter has the site “relationship”. How do you do that?
Look for yourself. There is a missing link. Unless I am missing something…
We were thinking of the following script:
Take the SNOMED parent of all malignant disorders: 443392 Malignant neoplastic disease SNOMED-code 363346000 and pull all descendants from the CONCEPT_ANCESTOR table
Do the crosswalk to the NCI code through UMLS
Pull the ICDO histology(ies) and the ICDO site(s) from the NCI Metathesaurus
Assess coverage and quality
We can easily do 1-2. But it looks like the NCI download will take a couple days. If you want, we can divide the labor and do it together. Let me know.
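Step 1 of the script above could look roughly like the following sketch. It uses an in-memory SQLite database so it runs on its own; the table and column names follow the OMOP CONCEPT and CONCEPT_ANCESTOR tables (trimmed to the columns needed), 443392 is the concept named in the post, and the other rows (900001, 900002) are invented toy data, not real OMOP concepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id    INTEGER PRIMARY KEY,
    concept_name  TEXT,
    vocabulary_id TEXT,
    concept_code  TEXT
);
CREATE TABLE concept_ancestor (
    ancestor_concept_id      INTEGER,
    descendant_concept_id    INTEGER,
    min_levels_of_separation INTEGER,
    max_levels_of_separation INTEGER
);
""")

# 443392 comes from the post; 900001 and 900002 are toy rows for illustration.
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?, ?)",
    [
        (443392, "Malignant neoplastic disease", "SNOMED", "363346000"),
        (900001, "Toy malignant descendant", "SNOMED", "TOY-A"),
        (900002, "Toy unrelated concept", "SNOMED", "TOY-B"),
    ],
)
conn.executemany(
    "INSERT INTO concept_ancestor VALUES (?, ?, ?, ?)",
    [
        (443392, 443392, 0, 0),  # every concept is its own ancestor at level 0
        (443392, 900001, 1, 1),
        (900002, 900002, 0, 0),
    ],
)

# Pull every SNOMED descendant of the malignant neoplasm parent.
descendants = conn.execute(
    """
    SELECT c.concept_id, c.concept_code
    FROM concept_ancestor ca
    JOIN concept c ON c.concept_id = ca.descendant_concept_id
    WHERE ca.ancestor_concept_id = ?
      AND c.vocabulary_id = 'SNOMED'
    ORDER BY c.concept_id
    """,
    (443392,),
).fetchall()

print(descendants)
```

The resulting SNOMED codes would then be the input to step 2, the crosswalk to NCI codes through UMLS, which is not shown here.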
No need for that, Iker. We prefer stealing over recreating by a factor of infinite to one.
Totally understood. What we wanted to do is to see how bad it is. If it is bad, we will create separate fields, if it is feasible (because the number of actually existing combos is in the 1000s, not millions) we can pre-coordinate and save on new fields. And stay backwards-compatible. This is an open-ended investigation where we collectively have to make a decision.
Please keep doing this. We need every hand on board of folks knowing this stuff.
Great! That’s our number: 10500. Totally doable. Let’s check it against real data.
Yes. That’s what I am proposing to do with the exhaustive script. That’s also what @rimma was saying.
Agreed. And the idea is to augment or extend what’s in SNOMED with self-made ICD-O-derived concepts. Not to replace one with the other or to abandon one entirely.
We know. The debate is whether or not we should pre-coordinate (permute all possible and use all useful) combinations.
It actually is, even though it’s a well-kept secret. The pre-coordinated concepts of a diagnosis (histology-site-grade) have links to the components. In fact, SNOMED does that with all diseases, not just neoplasms. But regarding quality and comprehensive coverage - we need to evaluate.
Bad boy!!! Don’t do anything without your friends here! What did you end up using?
@Christian_Reich The SEER validation list represents more than 10,500 combinations. The Site recode column in the spreadsheet uses comma-separated ranges of ICD-O-3 site codes. For example, this value:
‘C000-C006,C008-C009’
actually represents 9 ICD-O-3 site codes:
C00.0 External upper lip
C00.1 External lower lip
C00.2 External lip, NOS
C00.3 Mucosa of upper lip
C00.4 Mucosa of lower lip
C00.5 Mucosa of lip, NOS
C00.6 Commissure of lip
C00.8 Overlapping lesion of lip
C00.9 Lip, NOS
Not sure why SEER decided to format the file in such an obtuse manner.
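For counting, the Site recode ranges have to be expanded first. Below is a small sketch of that expansion; `expand_site_recode` is a hypothetical helper, and it assumes every value follows the letter-plus-three-digits pattern shown above (e.g. ‘C006’ meaning C00.6). It also assumes every number inside a range is a real site code, which may not hold for ranges that span gaps, so the output should be filtered against the actual ICD-O-3 site list.

```python
def expand_site_recode(recode):
    """Expand a value like 'C000-C006,C008-C009' into dotted site codes."""
    codes = []
    for part in recode.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        end = end or start  # a single code is a range of length one
        if start[0] != end[0]:
            raise ValueError(f"range crosses letter prefixes: {part}")
        first, last = int(start[1:]), int(end[1:])
        for n in range(first, last + 1):
            # 'C' + 2-digit category + '.' + 1-digit subsite
            codes.append(f"{start[0]}{n // 10:02d}.{n % 10}")
    return codes


lip_codes = expand_site_recode("C000-C006,C008-C009")
print(len(lip_codes), lip_codes)
```

For the lip example this yields the 9 codes listed above, with C00.7 correctly skipped because it falls between the two ranges.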
Ah. got it. Looks like we need to multiply with a factor of 10 or so. Need to do a proper join and count it up. Tomorrow I’ll have it. 100k codes is a heavier lift, but still doable, if we collectively believe this is the right thing to do. I don’t think we are there, yet. I like the alternative just as much, to be honest.
1- if you start from histology, then you get its children and the site ICDO codes for them. The first level child with the correct site code is the one you are looking for
2- if you start from concepts which have ‘Disease_has_associated_anatomic_site’ then you traverse the tree upwards (get the parents) until you find the corresponding histology code. The concept you are after is the one whose first ancestor has the corresponding histology code you are interested in (Small Cell Carcinoma) in this case
This is the approach we follow to provide mappings for our internal controlled terminologies via ICDO histology and site combination, see http://oncotree.mskcc.org/oncotree-mappings/
Also, if you think about it this is actually the way ICDO works. A type of histology can occur in multiple sites, therefore it makes total sense that you start from the histology and look for the more specialized concept, i.e. child, that contains the proper site information
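Option 1 above (start from the histology and walk down to the nearest child that carries the site) can be sketched as a breadth-first search. The hierarchy and site assignments below are toy data typed in by hand, not pulled from the NCI Metathesaurus, and `nearest_descendant_with_site` is a hypothetical helper name.

```python
from collections import deque

# Toy parent -> children hierarchy (illustration only).
children = {
    "Small cell carcinoma": [
        "Small cell carcinoma of the lung",
        "Small cell carcinoma of a toy site",
    ],
    "Small cell carcinoma of the lung": ["Toy lung subtype"],
}

# Toy 'Disease_Has_Associated_Anatomic_Site' data: concept -> site codes.
sites = {
    "Small cell carcinoma of the lung": {"C34"},
    "Toy lung subtype": {"C34"},
    "Small cell carcinoma of a toy site": {"SITE-X"},
}


def nearest_descendant_with_site(histology_concept, site):
    """Return the shallowest descendant associated with the given site."""
    queue = deque(children.get(histology_concept, []))
    while queue:
        node = queue.popleft()
        if site in sites.get(node, set()):
            return node
        queue.extend(children.get(node, []))
    return None


print(nearest_descendant_with_site("Small cell carcinoma", "C34"))
```

Breadth-first order matters here: it guarantees the first-level child is returned before any deeper, more specialized concept with the same site.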
I can try to work on 3, and 4. Email me separately and let me know what exactly needs to be done, happy to help
I can see that
100% agree, definitely use ICDO rather than SNOMED for oncology. The point is not to replace ICDO with SNOMED, but to leverage both in a sense. The most simple use case you need to empower is for whoever has a tumor registry (histology and site combinations) to be able to bring that tumor registry into the CDM easily. In order to do that I would (in the short term)
1- Identify the overlap, i.e. see for how many histology codes you already have OMOP concepts
2- for those, create a property called OMOP_to_ICDO_3 or something like that
3- for the rest create their own OMOP concept
4- Use the hasSite and isSiteOf relationship to specify ICDO site information
After this, the natural next step for those who mapped a tumor registry would be to bring billing data (ICD10) together with the tumor registry. If you do this, then you could easily know that ICD10 D05 (DCIS) is the same OMOP concept id as ICDO 8500/2. Otherwise (if you create new concepts for all histology and site combinations) these will end up being two OMOP concepts and there will be no way to reconcile them. Effectively, these institutions will keep having the same problem with the CDM as without it.
In the longer term though, I would suggest revisiting one of the main design concept behind the OMOP vocabulary. For each concept ID in the concept table only one source vocabulary is identified, e.g. 443392 Malignant neoplastic disease SNOMED-code 363346000. Then mappings/relationships from this concept to other controlled terminologies are specified via relationship_id in the concept_relationship table, e.g. ICD9 - SNOMED eq
So if my data is in ICD9 I need to leverage this relationship in order to bring it into the CDM. I think this approach is ok when your vocabularies i) are complementary and kind of belong to different subject areas, i.e. ICD10 for dx and LOINC for labs, and ii) there are just a few that overlap a little, e.g. SNOMED and ICD9/10
Over time, and as you try to integrate more vocabularies from the same/similar subject area, like ICDO and SNOMED or ICD10, I think that the right approach is the UMLS/NCI like approach where you have high level concept ids and everything folds underneath them.
I remember I spoke with @Christian_Reich about this some time ago, and he mentioned he made this conscious decision at the very beginning of OMOP, just wondering if maybe the project has evolved and it might be time to revisit. Hope I’m not opening Pandora’s box here
Thanks
Iker
P.S:
We (MSK) extensively use NCI’s work and created our own internal terminologies for tumor types (mainly driven by path) http://oncotree.mskcc.org/oncotree/#/home and several others
Sorry for the naïve question, but I am having trouble understanding where to draw the line of where things go.
For cardiology, we have heart failure under condition, but the ejection fraction, which quantifies the heart failure, is in measurement.
For ID, we have infection under condition, and often the organism, but only rarely the sensitivity (other than methicillin resistant, etc.). Sensitivity usually goes with the micro data.
If the histology worsens over time, or if the metastatic location grows over time, is that analogous to ejection fraction?
And then there is:
3. Traverse a similar relationship within SNOMED.
We should evaluate all 3.
Understood. And SNOMED works that way too. We need to figure out where the former takes over, where the latter is not detailed enough. Which is what you are saying below.
Will do.
Agreed.
This is a big subject for a new Forum posting, Iker, called “Pandora’s Box”. But yes, we will have to go there and debate. Probably for a face-to-face meeting with a couple of folks who are interested in this.
Ejection fraction is a way to diagnose, and to establish the severity of, congestive heart failure. But the disease is still fully characterized by calling it “congestive heart failure”. In oncology, it is different. The histology/morphology is a disease characterization itself. E.g. for skin cancer, it makes a big difference whether it is derived from the basal cells or the melanocytes. It affects progression, prognosis and treatment. The anatomical site is also a characterizing factor, because it will also define progression, prognosis and treatment. And finally, so is the stage. These are not observations. They are part of the Condition and causal to everything that follows. We are trying to figure out how to incorporate the level of detail that is necessary to describe these Conditions properly.
Infections are characterized by cause (which bug), morphology (granulomatous or ordinary purulent), and anatomical site. But SNOMED gives us all we need there in terms of pre-coordinated concepts. Sensitivity, like ejection fraction, is a measurement. So are genetic aberrations in cancers (though that is increasingly becoming part of the Condition definition as well).
…then the Condition changes. It’s no longer the same problem. Treatment options change with it.
It comes to 49,831 pairings. Looks like this would be the minimum number of pairings needing to be mapped to a SNOMED code. However, looking at the results, the pairings appear to be restricted to ‘reportable’ pairings. For example, ‘8000/0’ is only paired with primary CNS sites, I believe because only benign primary CNS tumors are reportable. So the number of pairings will increase if benign tumors are expanded beyond primary CNS sites.
I have the NCI Metathesaurus installed locally. Does anybody have the SQL already crafted to take a site/histology pairing and query the NCI Metathesaurus table structures (which seem to mirror the UMLS) to yield a SNOMED code?
This is great. From an ICD-O-3 perspective, this is the total space we need to cover. We now need to map it to SNOMED. We also need to map it to ICD10, to make sure their mapping to SNOMED doesn’t lead us awry, and the cycle closes nicely.
I haven’t installed the NCI Metathesaurus yet. The UMLS doesn’t have the links to ICD-O-3, only between SNOMED and the NCI code. @ihuerga has that piece, and he is also the one who pulled the initial mapping for us.
@rimma @Christian_Reich @ihuerga
I did a little research on SNOMED and ICD-O-3. As Christian mentioned SNOMED, like ICD-O-3 is multi-axial. Two of the top-level SNOMED axes are Topography and Morphology. SNOMED already has a mapping to all the codes in both ICD-O-3 axes. I downloaded the latest version of SNOMED and ICD-O-3 and put them in the same database. Then queried the SNOMED mapping table, ‘curr_simplemaprefset_f’, for how many times each ICD-O-3 code is mapped to a SNOMED code. Here are the results:
All ICD-O-3 codes are covered, but sometimes more than once. In other words, some ICD-O-3 codes map to multiple SNOMED codes. It appears there is no map between ICD-O-3 site/histology pairings and SNOMED codes, but perhaps we could ask SNOMED to create them, starting with combinations based on the SEER site/histology validation list. We would need to look elsewhere for the same pairings for non-CNS sites and benign histologies.
There is a SNOMED confluence project page describing SNOMED work related to ICD-O-3. See here:
It might be worth reaching out to this group to get their thoughts. Also, SNOMED seems to be flexible about tilting toward both pre- and post-coordination. SNOMED has a process for submitting change requests. I would advocate cutting out the middle man of the NCI Thesaurus/Metathesaurus and asking SNOMED to create a coordinated pairing for each valid Site/Histology pairing. While we are waiting for this, I would simply stick with @rimma’s original draft version of creating an OMOP/OHDSI placeholder vocabulary in the interim.
I have a file that contains all the ICD-O-3 codes to SNOMED codes as well, but sharing it publicly might not be compatible with the SNOMED license. But I can share it with those interested.
@rimma @Christian_Reich @ihuerga
I made some mistakes in my SQL to calculate the counts of ICD-O-3 codes to SNOMED codes. I replaced the CSV files with new ones:
The correction in my SQL revealed some ICD-O-3 codes that have no mapping to a SNOMED code. In the CSV files, these are the entries with a snomed_code_map_count = 0. After the correction there are 6 histology ICD-O-3 codes without a mapping to a SNOMED code, and 111 site ICD-O-3 codes without a mapping to a SNOMED code. However, 70 of the unmappable ICD-O-3 site codes are the top level 3 character site categories, like ‘C00’=‘lip’ or ‘C09’=‘tonsil’. I don’t believe these higher level categorization codes are used in the wild, so the problem boils down to 41 bottom level ICD-O-3 5 character site codes not mappable to a SNOMED code.
One other thing I need to verify is which version of ICD-O-3 is available from the WHO site for download, as that is what I used. From here:
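For anyone who wants to reproduce this kind of count, here is a hedged sketch of one way to compute a per-code map count that keeps the zero rows. It is not the SQL used for the CSV files: the map rows are toy (SNOMED code, ICD-O-3 code) pairs with placeholder SNOMED identifiers, and `snomed_map_counts` is a hypothetical helper. The point is to start from the full ICD-O-3 code list rather than from the map table, so unmapped codes show up with a count of 0 instead of disappearing.

```python
def snomed_map_counts(all_icdo_codes, map_rows):
    """Count distinct SNOMED codes per ICD-O-3 code, keeping zero counts."""
    targets = {code: set() for code in all_icdo_codes}
    for snomed_code, icdo_code in map_rows:
        if icdo_code in targets:
            targets[icdo_code].add(snomed_code)
    return {code: len(snomed) for code, snomed in targets.items()}


# Toy data: placeholder SNOMED identifiers, not real mappings.
icdo_codes = ["8500/2", "C00.0", "C00"]
toy_map_rows = [
    ("SNOMED-TOY-1", "8500/2"),
    ("SNOMED-TOY-2", "8500/2"),
    ("SNOMED-TOY-1", "8500/2"),  # duplicate row, counted once
    ("SNOMED-TOY-3", "C00.0"),
]

counts = snomed_map_counts(icdo_codes, toy_map_rows)
unmapped = [code for code, n in counts.items() if n == 0]
print(counts)
print("unmapped:", unmapped)
```

In SQL terms the same effect comes from a LEFT JOIN from the ICD-O-3 code table to the map table with COUNT(DISTINCT ...) on the SNOMED side.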
If anybody wants an updated ICD-O-3 to SNOMED code mapping file, let me know.
Also, if anybody has a contact at SNOMED to help discuss the current state/future plans of the SNOMED to ICD-O-3 code mapping, please forward me their contact information.