
International Classification of Diseases for Oncology (ICD-O)

(Christian Reich) #9

Ah! Excellent. I looked briefly, and at first glance it looks like the concepts that have the topography pre-coordinated contain that link, and the morphology-only ones don’t. That’s what you would expect.

Except that you have to link “Small cell carcinoma” and “Small cell carcinoma of the lung”. Because the former has the histology “property”, and the latter has the site “relationship”. How do you do that?

Look for yourself. There is a missing link. Unless I am missing something…

We were thinking of the following script:

  1. Take the SNOMED parent of all malignant disorders: 443392 Malignant neoplastic disease SNOMED-code 363346000 and pull all descendants from the CONCEPT_ANCESTOR table
  2. Do the crosswalk to the NCI code through UMLS
  3. Pull the ICDO histology(ies) and the ICDO site(s) from the NCI Metathesaurus
  4. Assess coverage and quality

We can easily do 1-2. But it looks like the NCI download will take a couple days. If you want, we can divide the labor and do it together. Let me know.

No need for that, Iker. We prefer stealing over recreating by a factor of infinity to one. :smiley:

Totally understood. What we wanted to do is to see how bad it is. If it is bad, we will create separate fields, if it is feasible (because the number of actually existing combos is in the 1000s, not millions) we can pre-coordinate and save on new fields. And stay backwards-compatible. This is an open-ended investigation where we collectively have to make a decision.

Please keep doing this. We need all hands on deck from folks who know this stuff.

Great! That’s our number: 10500. Totally doable. Let’s check it against real data.

Yes. That’s what I am proposing to do with the exhaustive script. That’s also what @rimma was saying.

Agreed. And the idea is to augment or extend what’s in SNOMED with self-made ICD-O-derived concepts. Not to replace one with the other or to abandon one entirely.

We know. The debate is whether or not we should pre-coordinate (permute all possible and use all useful) combinations.

It actually is, even though it’s a well-kept secret. The pre-coordinated concepts of a diagnosis (histology-site-grade) have links to the components. In fact, SNOMED does that with all diseases, not just neoplasms. But regarding quality and comprehensive coverage - we need to evaluate.

Bad boy!!! Don’t do anything without your friends here! :smiley: What did you end up using?

(Michael Gurley) #10

@Christian_Reich The SEER validation list represents more than 10,500 combinations. The Site recode column in the spreadsheet uses comma-separated ranges of ICD-O-3 site codes. For example, this value:


actually represents 9 ICD-O-3 site codes:

C00.0 External upper lip
C00.1 External lower lip
C00.2 External lip, NOS
C00.3 Mucosa of upper lip
C00.4 Mucosa of lower lip
C00.5 Mucosa of lip, NOS
C00.6 Commissure of lip
C00.8 Overlapping lesion of lip
C00.9 Lip, NOS

Not sure why SEER decided to format the file in such an obtuse manner.
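A minimal sketch of how such a range could be expanded back into individual site codes. The example value did not survive in this post, so the `C000-C009` dotless range format is an assumption, and `VALID_SITE_CODES` and `expand_site_ranges` are hypothetical names. The idea is to keep only codes that actually exist in ICD-O-3, which is why a range spanning ten numbers yields nine codes.

```python
import re

# The nine lip codes listed above, written without the dot.
# Assumption: the spreadsheet stores codes in this dotless form.
VALID_SITE_CODES = [
    "C000", "C001", "C002", "C003", "C004",
    "C005", "C006", "C008", "C009",
]


def expand_site_ranges(value, valid_codes):
    """Expand a comma-separated list of codes and ranges into valid codes."""
    expanded = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        # Accept a single code ("C008") or a range ("C000-C009").
        match = re.fullmatch(r"C(\d{3})(?:\s*-\s*C(\d{3}))?", part)
        if match is None:
            raise ValueError(f"unrecognised site range: {part!r}")
        low = int(match.group(1))
        high = int(match.group(2)) if match.group(2) else low
        for code in valid_codes:
            # Keep only codes that exist; C007 is skipped because it is not listed.
            if low <= int(code[1:]) <= high and code not in expanded:
                expanded.append(code)
    return expanded


lip_codes = expand_site_ranges("C000-C009", VALID_SITE_CODES)
print(len(lip_codes))  # 9
```

Filtering against a list of known codes, rather than generating every number in the range, avoids inventing codes such as C00.7 that do not exist.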

(Christian Reich) #11


Ah, got it. Looks like we need to multiply by a factor of 10 or so. Need to do a proper join and count it up. Tomorrow I’ll have it. 100k codes is a heavier lift, but still doable, if we collectively believe this is the right thing to do. I don’t think we are there yet. I like the alternative just as much, to be honest.

(Iker Huerga) #12

Good question. There are two ways to do that:

1- If you start from the histology, you get its children and the ICDO site codes for them. The first-level child with the correct site code is the one you are looking for.

2- If you start from concepts that have ‘Disease_has_associated_anatomic_site’, you traverse the tree upwards (get the parents) until you find the corresponding histology code. The concept you are after is the one whose first ancestor has the histology code you are interested in (Small Cell Carcinoma in this case).

This is the approach we follow to provide mappings for our internal controlled terminologies via ICDO histology and site combination, see http://oncotree.mskcc.org/oncotree-mappings/

Also, if you think about it this is actually the way ICDO works. A type of histology can occur in multiple sites, therefore it makes total sense that you start from the histology and look for the more specialized concept, i.e. child, that contains the proper site information
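A rough sketch of the first approach (start from the histology and walk down to the nearest child carrying the right site). The hierarchy, the concept names and the `first_child_with_site` helper are made up for illustration; a real run would read the parent/child links and the ‘Disease_has_associated_anatomic_site’ values from the NCI Metathesaurus instead.

```python
from collections import deque

# Hypothetical toy hierarchy: parent concept -> child concepts.
CHILDREN = {
    "Small cell carcinoma": [
        "Small cell carcinoma of bladder",
        "Small cell carcinoma of lung",
    ],
    "Small cell carcinoma of lung": [
        "Small cell carcinoma of lung, limited stage",
    ],
}

# Hypothetical site annotations: concept -> ICD-O site codes.
SITES = {
    "Small cell carcinoma of bladder": {"C67.9"},
    "Small cell carcinoma of lung": {"C34.9"},
    "Small cell carcinoma of lung, limited stage": {"C34.9"},
}


def first_child_with_site(histology_concept, site_code, children, sites):
    """Breadth-first search below a histology concept for the nearest
    descendant annotated with the requested site code."""
    queue = deque(children.get(histology_concept, []))
    seen = set()
    while queue:
        concept = queue.popleft()
        if concept in seen:
            continue
        seen.add(concept)
        if site_code in sites.get(concept, set()):
            return concept
        queue.extend(children.get(concept, []))
    return None


match = first_child_with_site("Small cell carcinoma", "C34.9", CHILDREN, SITES)
print(match)  # Small cell carcinoma of lung
```

Breadth-first order is what makes the result the first-level child: the more specific "limited stage" concept carries the same site but sits one level deeper, so it is never reached.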

I can try to work on 3 and 4. Email me separately and let me know what exactly needs to be done; happy to help.

I can see that

100% agree, definitely use ICDO rather than SNOMED for oncology. The point is not to replace ICDO with SNOMED, but to leverage both in a sense. The simplest use case you need to enable is for whoever has a tumor registry (histology and site combinations) to be able to bring that tumor registry into the CDM easily. In order to do that I would (in the short term):

1- Identify the overlap, i.e. see for how many histology codes you already have OMOP concepts
2- for those, create a property called OMOP_to_ICDO_3 or something like that
3- for the rest create their own OMOP concept
4- Use the hasSite and isSiteOf relationship to specify ICDO site information

After this, the natural next step for those who mapped a tumor registry would be to bring billing data (ICD10) together with the tumor registry. If you do this, then you could easily know that ICD10 D05 (DCIS) is the same OMOP concept id as ICDO 8500/2. Otherwise (if you create new concepts for all histology and site combinations) these will end up being two OMOP concepts and there will be no way to reconcile them. Effectively, these institutions will keep having the same problem with the CDM as without it.

In the longer term though, I would suggest revisiting one of the main design concepts behind the OMOP vocabulary. For each concept ID in the concept table only one source vocabulary is identified, e.g. 443392 Malignant neoplastic disease SNOMED-code 363346000. Then mappings/relationships from this concept to other controlled terminologies are specified via relationship_id in the concept_relationship table, e.g. ICD9 - SNOMED eq

So if my data is in ICD9 I need to leverage this relationship in order to bring it into the CDM. I think this approach is ok when your vocabularies i) are complementary and kind of belong to different subject areas, i.e. ICD10 for dx and LOINC for labs, and ii) there are just a few that overlap a little, e.g. SNOMED and ICD9/10

Over time, as you try to integrate more vocabularies from the same or a similar subject area, like ICDO and SNOMED or ICD10, I think the right approach is the UMLS/NCI-like approach, where you have high-level concept ids and everything folds underneath them.

I remember I spoke with @Christian_Reich about this some time ago, and he mentioned he made this conscious decision at the very beginning of OMOP. Just wondering if maybe the project has evolved and it might be time to revisit. Hope I’m not opening Pandora’s box here :smile:



We (MSK) extensively use NCI’s work and created our own internal terminologies for tumor types (mainly driven by path) http://oncotree.mskcc.org/oncotree/#/home and several others

(Mark Danese) #13

That is very nice, as is your other link above.

(George Hripcsak) #14

Sorry for the naïve question, but I am having trouble understanding where to draw the line of where things go.

For cardiology, we have heart failure under condition, but the ejection fraction, which quantifies the heart failure, is in measurement.

For ID, we have infection under condition, and often the organism, but only rarely the sensitivity (other than methicillin resistant, etc.). Sensitivity usually goes with the micro data.

If the histology worsens over time, or if the metastatic location grows over time, is that analogous to ejection fraction?


(Christian Reich) #15

That’s in the NCI Metathesaurus?

And then there is:
3. Traverse a similar relationship within SNOMED.

We should evaluate all 3.

Understood. And SNOMED works that way too. We need to figure out where the former takes over, where the latter is not detailed enough. Which is what you are saying below.

Will do.


This is a big subject for a new Forum posting, Iker. Called “Pandora’s Box” :smile: But yes, we will have to go there and debate. Probably for a face-to-face meeting with a couple of folks who are interested in this.

(Christian Reich) #16

Ejection fraction is a way to diagnose, and to establish the severity of, congestive heart failure. But the disease is still fully characterized by calling it “congestive heart failure”. In oncology, it is different. The histology/morphology is a disease characterization itself. E.g. for skin cancer, it makes a big difference whether it is derived from the basal cells or the melanocytes. It affects progression, prognosis and treatment. The anatomical site is also a characterizing factor, because it will also define progression, prognosis and treatment. And finally, so is the stage. These are not observations. They are part of the Condition and causal to everything that follows. We are trying to figure out how to incorporate the level of detail that is necessary to describe these Conditions properly.

Infections are characterized by cause (which bug), morphology (granulomatous or ordinary purulent), and anatomical site. But SNOMED gives us all we need there in terms of pre-coordinated concepts. Sensitivity, like ejection fraction, is a measurement. So are genetic aberrations in cancers (though they are increasingly becoming part of the Condition definition as well).

…then the Condition changes. It’s no longer the same problem. Treatment options change with it.

My 2 cents.

(Michael Gurley) #17

Hello, I created a script to create pairings of site/histology from the SEER site/histology validation list. Here are the results in CSV format:


The SEER site/histology validation list is published here:


It comes to 49,831 pairings. Looks like this would be the minimum number of pairings needing to be mapped to a SNOMED code. However, looking at the results, the pairings appear to be restricted to ‘reportable’ pairings: ‘8000/0’ is only paired with primary CNS sites, I believe because only benign primary CNS tumors are reportable. So the number of pairings will increase if benign tumors are expanded beyond primary CNS sites.
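For anyone who wants to reproduce the count, here is a minimal sketch of the pairing step, assuming each row of the validation list has already been reduced to a list of site codes and a list of histology/behavior codes. The two sample rows and the `build_pairings` name are hypothetical and not taken from the SEER file.

```python
from itertools import product

# Hypothetical rows: (expanded site codes, histology/behavior codes).
rows = [
    (["C000", "C001"], ["8070/3", "8140/3"]),
    (["C001", "C002"], ["8070/3"]),
]


def build_pairings(rows):
    """Cross every site with every histology in the same row,
    de-duplicating pairs that appear in more than one row."""
    pairings = set()
    for sites, histologies in rows:
        pairings.update(product(sites, histologies))
    return sorted(pairings)


pairings = build_pairings(rows)
print(len(pairings))  # 5: ("C001", "8070/3") occurs in both rows but counts once
```

Collecting into a set matters if site ranges overlap between rows; without it the total would be inflated by repeated pairs.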

I have the NCI Metathesaurus installed locally. Does anybody already have SQL crafted that takes a site/histology pairing and queries the NCI Metathesaurus table structures (which seem to mirror the UMLS) to yield a SNOMED code?

(Christian Reich) #18


This is great. From an ICD-O-3 perspective, this is the total space we need to cover. We now need to map it to SNOMED. We also need to map it to ICD10, to make sure their mapping to SNOMED doesn’t lead us awry, and the cycle closes nicely.

I haven’t installed the NCI Metathesaurus yet. The UMLS doesn’t have the links to ICD-O-3, only between SNOMED and the NCI code. @ihuerga has that piece, and he is also the one who pulled the initial mapping for us.

(Michael Gurley) #19

@rimma @Christian_Reich @ihuerga
I did a little research on SNOMED and ICD-O-3. As Christian mentioned, SNOMED, like ICD-O-3, is multi-axial. Two of the top-level SNOMED axes are Topography and Morphology. SNOMED already has a mapping to all the codes in both ICD-O-3 axes. I downloaded the latest versions of SNOMED and ICD-O-3 and put them in the same database. Then I queried the SNOMED mapping table, ‘curr_simplemaprefset_f’, for how many times each ICD-O-3 code is mapped to a SNOMED code. Here are the results:


All ICD-O-3 codes are covered, but sometimes more than once. In other words, some ICD-O-3 codes map to multiple SNOMED codes. It appears there is no map between ICD-O-3 site/histology pairings and SNOMED codes, but perhaps we could ask SNOMED to create them, starting with combinations based on the SEER site/histology validation list. We would need to look elsewhere for the same pairings for non-CNS sites and benign histologies.
There is a SNOMED confluence project page describing SNOMED work related to ICD-O-3. See here:


It might be worth reaching out to this group to get their thoughts. Also, SNOMED seems to be flexible about tilting toward both pre- and post-coordination. SNOMED has a process for submitting change requests. I would advocate cutting out the middle man of the NCI Thesaurus/Metathesaurus and asking SNOMED to create a coordinated pairing for each valid site/histology pairing. While we are waiting for this, I would simply stick with @rimma’s original draft version of creating an OMOP/OHDSI placeholder vocabulary in the interim.

I also have a file that maps all the ICD-O-3 codes to SNOMED codes. Posting it publicly might not be compatible with the SNOMED license, but I can share it with those interested.

(Michael Gurley) #20

@rimma @Christian_Reich @ihuerga
I made some mistakes in my SQL to calculate the counts of ICD-O-3 codes to SNOMED codes. I replaced the CSV files with new ones:


The correction in my SQL revealed some ICD-O-3 codes that have no mapping to a SNOMED code. In the CSV files, these are the entries with a snomed_code_map_count = 0. After the correction there are 6 histology ICD-O-3 codes without a mapping to a SNOMED code, and 111 site ICD-O-3 codes without a mapping to a SNOMED code. However, 70 of the unmappable ICD-O-3 site codes are the top-level 3-character site categories, like ‘C00’=‘lip’ or ‘C09’=‘tonsil’. I don’t believe these higher-level categorization codes are used in the wild, so the problem boils down to 41 bottom-level ICD-O-3 5-character site codes not mappable to a SNOMED code.
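A small sketch of the counting logic behind those numbers, in case it helps anyone check my work. The code list, the mapping rows and the `snomed-*` identifiers are placeholders rather than real data; only the shape of the calculation is the point.

```python
from collections import Counter


def map_counts(icdo_codes, mapping_rows):
    """Count how many SNOMED codes each ICD-O-3 code maps to (0 if none)."""
    counts = Counter(icdo_code for icdo_code, _snomed_code in mapping_rows)
    return {code: counts.get(code, 0) for code in icdo_codes}


# Placeholder data: two 3-character categories and three specific site codes.
site_codes = ["C00", "C00.0", "C00.1", "C09", "C09.0"]
mapping_rows = [
    ("C00.0", "snomed-a"),
    ("C00.0", "snomed-b"),
    ("C09.0", "snomed-c"),
]

counts = map_counts(site_codes, mapping_rows)
unmapped = [code for code, n in counts.items() if n == 0]

# Separate the 3-character categories from the specific site codes.
categories = [code for code in unmapped if len(code) == 3]
specific = [code for code in unmapped if len(code) > 3]

print(categories)  # ['C00', 'C09']
print(specific)    # ['C00.1']

# The same split applied to the reported totals: 111 unmapped, 70 categories.
print(111 - 70)    # 41
```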

One other thing I need to verify is which version of ICD-O-3 is available for download from the WHO site, as that is what I used. From here:


If anybody wants an updated ICD-O-3 to SNOMED code mapping file, let me know.

Also, if anybody has a contact at SNOMED to help discuss the current state/future plans of the SNOMED to ICD-O-3 code mapping, please forward me their contact information.

(Christian Reich) #21


Where is that thing? Not in the SNOMED distribution file SnomedCT_InternationalRF2_Production_20170131T120000.zip.

You tossed them into the spreadsheet, which tries to be smart and recognizes the ICD-O histology codes as dates (the year 8001 etc.). I am staying away from Excel and its derivatives these days because they are always trying to do that (usually creating scientific notation out of long identifiers) and I don’t notice until much later.

Absolutely. Let’s bring it up with Jim Case. From time to time we are submitting bad concepts or relationships to a website he is running, so he knows us.

Interesting. We should definitely talk to them.

That would be good.

(Michael Gurley) #22

The data for the ‘curr_simplemaprefset_f’ table is located in the
‘Full/Refset/Map/der2_sRefset_SimpleMapFull_US1000124_20170301.txt’ file within the unzipped SNOMED download file. Open this file and look for some ICD-O-3 codes and you will find the gold.

I used this project as my DDL compiler/data loader: https://github.com/rorydavidson/SNOMED-CT-Database. If you look at the scripts to load data from this project you will see the above referenced file being loaded into the ‘curr_simplemaprefset_f’ table. Not sure how official the table names from this project are; I am a SNOMED newbie.

As for the spreadsheet, it is a CSV format that you should be able to download from Google Drive and open with a clean text editor to inspect undisturbed by any “smart” coercions. I only see the date coercion upon preview in Google Drive. I guess Google is not perfect.
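If you would rather read the file programmatically than in an editor, here is a minimal sketch using Python's standard `csv` module, which hands every field back as a plain string and so leaves codes like 8001/3 alone. The `icdo3_code` column name and the sample rows are assumptions; `snomed_code_map_count` is the column mentioned earlier in the thread.

```python
import csv
import io

# Stand-in for the downloaded file; the real CSV may use other column names.
SAMPLE = "icdo3_code,snomed_code_map_count\n8001/3,1\n8000/0,2\n"


def read_counts(handle):
    """Read code -> map count, keeping the code exactly as written."""
    reader = csv.DictReader(handle)
    return {
        row["icdo3_code"]: int(row["snomed_code_map_count"])
        for row in reader
    }


counts = read_counts(io.StringIO(SAMPLE))
print(counts)  # {'8001/3': 1, '8000/0': 2}
```

Only the count column is converted explicitly; nothing guesses a type for the code column, which is the step where spreadsheets go wrong.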

I would be happy to propose my “great” ideas to Jim. Joking aside, I think the world could use a pre-coordinated ICD-O-3 site/histology pairing. It would make the job of people fitting ICD-O-3 data into data models like the OHDSI/OMOP CDM much easier.

(Christian Reich) #23


Great. You are making life so much easier. I found the file.

Come to think of it: this is non-trivial, because they mustn’t create these pairs if there already is an existing precoordinated concept from the 2 axes. If the links from all SNOMED concepts to their two dimensions were reliable it would work well. But I have my doubts. We’ll look into this.

(Michael Gurley) #24

@Christian_Reich Looks like the CSV download from the WHO website that I used to map to SNOMED is for ICD-O-3, not ICD-O-3.1. Does not look like WHO provides a CSV download of ICD-O-3.1. Very annoying. WHO does provide a document that spells out in Appendix 7 the differences between ICD-O-3 and ICD-O-3.1. See here:


From querying my local copy of the NCI Metathesaurus, it looks like it contains ICD-O-3.1. So I will recalculate the ICD-O-3.1 to SNOMED map counts using ICD-O-3.1 from the NCI Metathesaurus.

(Christian Reich) #25


Oh man. Sounds like fun. The WHO is notorious for not providing proper digital distribution files, but this PDF nonsense instead. Same thing with ATC, same thing with ICD-10. Thank God for UMLS and NCI, here.

(Michael Gurley) #26

@Christian_Reich @rimma @ihuerga

Using histologies for ICD-O-3.1 from the NCI Metathesaurus, the number of histologies unmapped to SNOMED is now none. There is one inactive ICD-O-3.1 code that can’t be mapped: 8240/1. Updated the CSV files:


Also mapping retired ICD-O-3 or even ICD-O-2 codes would set a higher bar.

(Michael Gurley) #27

@Christian_Reich @rimma @Vojtech_Huser @ihuerga

I have learned a little more about SNOMED CT. My previous ICD-O-3.1 to SNOMED mapping efforts were not taking into account expired mappings. @Vojtech_Huser helped me figure that out. I will post the results of my direct mappings later this week. It looks like there are two possible SNOMED CT maps/refsets to use for histology within SNOMED CT and one for sites. The results will be different from what I previously reported.
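A sketch of what taking expired mappings into account means for a Full release file: every refset member can appear several times, once per version, and only the row with the latest effectiveTime says whether the map is still active. The sample rows, member ids and component ids are invented, and `current_active_maps` is a hypothetical helper name.

```python
def current_active_maps(rows):
    """Keep the newest version of each refset member, then drop inactive ones."""
    latest = {}
    for row in rows:
        member_id = row["id"]
        # effectivetime is YYYYMMDD text, so string comparison orders it correctly.
        if (member_id not in latest
                or row["effectivetime"] > latest[member_id]["effectivetime"]):
            latest[member_id] = row
    return [row for row in latest.values() if row["active"] == "1"]


# Invented rows: member m1 was created in 2002 and retired in 2015.
rows = [
    {"id": "m1", "effectivetime": "20020131", "active": "1",
     "referencedcomponentid": "111", "maptarget": "8000/0"},
    {"id": "m1", "effectivetime": "20150131", "active": "0",
     "referencedcomponentid": "111", "maptarget": "8000/0"},
    {"id": "m2", "effectivetime": "20020131", "active": "1",
     "referencedcomponentid": "222", "maptarget": "8001/3"},
]

current = current_active_maps(rows)
print([row["id"] for row in current])  # ['m2']
```

Filtering on `active = 1` alone would keep the 2002 row for m1 and report an expired map as current, which is the mistake in my earlier counts.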

Also, I discovered that topography and morphology are under the larger SNOMED CT axis ‘Body Structure (body structure)’ and that SNOMED CT does indeed precoordinate topography/morphology or site/histology pairings into SNOMED codes in the sub-axis ‘Disease (disorder)’ under the ‘Clinical Finding’ top-level SNOMED CT axis. Which makes sense to me. I will post results later this week on how well the SNOMED CT precoordinations cover the SEER site/histology validation list pairings. You can see the precoordination in this example of ‘Anaplastic astrocytoma of brain (disorder)’ in the SNOMED CT browser (click the Diagram tab):


(Michael Gurley) #28

@Christian_Reich @rimma

Here are my latest ICD-O-3 to SNOMED mapping results:

I found that SNOMED has 3 axes of interest

  1. Morphologically abnormal structure (morphologic abnormality) axis, which is mapped via a SNOMED refset to ICD-O-3 morphology codes.
  2. Anatomical structure (body structure) axis, which is mapped via a SNOMED refset to ICD-O-3 site codes.
  3. Disorder axis (which can be a pre-coordinated combination of a ‘Finding Site’ attribute relationship and an ‘Associated Morphology’ attribute relationship).

Which is exactly what we are looking for, I believe. To see this visually, do the following:

  1. Go to http://browser.ihtsdotools.org/?perspective=full&conceptId1=8551000119100&edition=us-edition&release=v20170301&server=https://prod-browser-exten.ihtsdotools.org/api/snomed&langRefset=900000000000509007

  2. Click ‘Diagram’ tab

  3. Go to http://browser.ihtsdotools.org/?perspective=full&conceptId1=7712004&edition=us-edition&release=v20170301&server=https://prod-browser-exten.ihtsdotools.org/api/snomed&langRefset=900000000000509007

  4. Click the ‘Refsets’ tab

  5. Go to http://browser.ihtsdotools.org/?perspective=full&conceptId1=3898006&edition=us-edition&release=v20170301&server=https://prod-browser-exten.ihtsdotools.org/api/snomed&langRefset=900000000000509007

  6. Click the ‘Refsets’ tab

So far I have:

  1. Created the list of combinations of ICD-O-3 site/histology via the SEER site/histology validation list.


This represents 49,831 site/histology combinations, though it only covers SEER reportable combinations. Benign neoplasms outside the primary CNS sites are not covered.

  2. Mapped each of the ICD-O-3 Site/Morphology axes to SNOMED codes via the SNOMED refsets. For the histology axis, there are two possible refsets:

ICD-O simple map reference set (foundation metadata concept) 446608001

– 13 unmapped
– 854 mapped to one
– 198 mapped to more than one


– 43 unmapped
– 4 mapped to one
– 283 mapped to more than one


CTV3 simple map reference set (foundation metadata concept) 900000000000498005

– 5 unmapped
– 1060 mapped to one
– 0 mapped to more than one


  3. Found all the SNOMED disorders that have a pre-coordinated relationship via the ‘Finding Site’ attribute relationship and an ‘Associated Morphology’ attribute relationship.

69,824 disorders

  4. Matched the SEER site/histology pairings, mapped to SNOMED codes, against the pre-coordinated SNOMED disorder codes.


1,924 mappings from ICD-O-3 site/histology pairings to pre-coordinated SNOMED codes
973 distinct ICD-O-3 site/histology pairings

46 mapped to one
927 mapped to more than one

So the final upshot is that 973 out of 49,831 pairings can be mapped, or 2%. And 927 out of the 973 can be mapped to more than one. Not a very impressive result.
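For reference, a minimal sketch of the matching step and the coverage arithmetic. The dictionaries stand in for the refset maps and the disorder query results; the `s-*`, `m-*` and `d*` identifiers are placeholders, and `match_pairings` is a hypothetical name.

```python
from collections import defaultdict


def match_pairings(pairings, site_map, histology_map, disorders):
    """Return the ICD-O-3 pairings that land on at least one pre-coordinated disorder.

    disorders holds (disorder_id, morphology_id, site_id) tuples, i.e. one
    tuple per 'Associated morphology' / 'Finding site' combination.
    """
    index = defaultdict(set)
    for disorder_id, morphology_id, site_id in disorders:
        index[(site_id, morphology_id)].add(disorder_id)

    matches = {}
    for site, histology in pairings:
        found = set()
        # An ICD-O-3 code may map to several SNOMED codes, so try every combination.
        for snomed_site in site_map.get(site, ()):
            for snomed_morphology in histology_map.get(histology, ()):
                found |= index.get((snomed_site, snomed_morphology), set())
        if found:
            matches[(site, histology)] = found
    return matches


# Placeholder data only.
pairings = [("C71.9", "9401/3"), ("C00.0", "8070/3")]
site_map = {"C71.9": ["s-brain"], "C00.0": ["s-lip"]}
histology_map = {"9401/3": ["m-astro"], "8070/3": ["m-scc"]}
disorders = [("d1", "m-astro", "s-brain"), ("d2", "m-astro", "s-brain")]

matches = match_pairings(pairings, site_map, histology_map, disorders)
print(matches)  # one pairing matched, to both d1 and d2

# Coverage from the reported totals.
coverage = round(100 * 973 / 49831, 1)
print(coverage)  # 2.0
```

The multiplication of candidate maps on both axes is one plausible reason most matched pairings end up with more than one disorder code.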

Here is some of the SQL I used for anyone interested:

-- Disorders with an active 'Associated morphology' and 'Finding site'
-- relationship in the same relationship group.
SELECT distinct d.conceptid
, r.destinationid AS histology_destinationid
, r2.destinationid AS site_destinationid
FROM curr_description_f d
join curr_relationship_f r on d.conceptid = r.sourceid and r.active = '1' and r.typeid = '116676008' -- "Associated morphology (attribute)"
join curr_relationship_f r2 on d.conceptid = r2.sourceid and r2.active = '1' and r2.typeid = '363698007' and r.relationshipgroup = r2.relationshipgroup -- "Finding site (attribute)"
where d.typeid = '900000000000003001' -- fully specified name
and d.active = '1'
--and r.destinationid = '21964009'
--and r2.destinationid = '57171008'
--and d.conceptid = '188502002'
-- skip r if a later version of the same relationship exists
and not exists(
select 1
from curr_relationship_f r3
where r.moduleid = r3.moduleid
and r.sourceid = r3.sourceid
--and r.destinationid = r3.destinationid
and r.relationshipgroup = r3.relationshipgroup
and r.typeid = r3.typeid
and r.characteristictypeid = r3.characteristictypeid
and r.modifierid = r3.modifierid
--and r3.active = '0'
and r3.effectivetime > r.effectivetime
)
-- skip r2 if a later version of the same relationship exists
and not exists(
select 1
from curr_relationship_f r4
where r2.moduleid = r4.moduleid
and r2.sourceid = r4.sourceid
--and r2.destinationid = r4.destinationid
and r2.relationshipgroup = r4.relationshipgroup
and r2.typeid = r4.typeid
and r2.characteristictypeid = r4.characteristictypeid
and r2.modifierid = r4.modifierid
--and r4.active = '0'
and r4.effectivetime > r2.effectivetime
)
order by d.conceptid

-- Active maps for one refset and one ICD-O-3 code ('?' are placeholders),
-- excluding maps that a later row has inactivated.
SELECT curr_simplemaprefset_f.*
FROM curr_simplemaprefset_f
WHERE curr_simplemaprefset_f.refsetid = '?'
AND curr_simplemaprefset_f.maptarget = '?'
AND curr_simplemaprefset_f.active = '1'
AND NOT EXISTS (
SELECT 1
FROM curr_simplemaprefset_f AS snomed_maps
WHERE snomed_maps.moduleid = curr_simplemaprefset_f.moduleid
AND snomed_maps.refsetid = curr_simplemaprefset_f.refsetid
AND snomed_maps.referencedcomponentid = curr_simplemaprefset_f.referencedcomponentid
AND snomed_maps.maptarget = curr_simplemaprefset_f.maptarget
AND snomed_maps.effectivetime > curr_simplemaprefset_f.effectivetime
AND snomed_maps.active = '0')