We are working on getting NAACCR data into the OMOP CDM. The biggest hurdle thus far has been how to represent the characteristics of tumors, which @Mark_Danese talks about here. Our original plan was to map to LOINC but, because of functionality concerns, we have since shifted towards SNOMED.
In NAACCR, site, histology and behavior are all represented as ICD-O-3 codes. The SEER website provides a document that maps from ICD-O-3 to ICD-9-CM, ICD-10 (cause of death) and ICD-10-CM. The plan would be to rework this into a means of automating the mapping from ICD-O-3 to ICD-9/10, and then map from ICD-9/10 to SNOMED.
@Patrick_Ryan mentions in a post that it's possible to map from ICD9 to SNOMED using the CONCEPT_RELATIONSHIP table. Is the same true for ICD-10-CM? If needed, nlm.nih.gov also provides a mapping from SNOMED CT to ICD-10-CM.
ICD10CM to SNOMED is a solved problem. It's part of the OMOP Standardized Vocabularies (read about it here).
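To illustrate the lookup, here is a minimal Python sketch against an in-memory SQLite stand-in for the CONCEPT and CONCEPT_RELATIONSHIP tables. The rows, concept_ids and codes are made up for illustration, and the function name `map_to_standard` is hypothetical; the only thing taken from the Standardized Vocabularies is the idea of following a 'Maps to' relationship from a source concept to a standard one.

```python
import sqlite3

# Toy stand-in for the OMOP vocabulary tables.
# All rows, concept_ids and codes below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id       INTEGER PRIMARY KEY,
    concept_name     TEXT,
    vocabulary_id    TEXT,
    concept_code     TEXT,
    standard_concept TEXT
);
CREATE TABLE concept_relationship (
    concept_id_1    INTEGER,
    concept_id_2    INTEGER,
    relationship_id TEXT
);
INSERT INTO concept VALUES
    (1, 'Source diagnosis (illustrative)',   'ICD10CM', 'X00.0',  NULL),
    (2, 'Standard diagnosis (illustrative)', 'SNOMED',  '000000', 'S');
INSERT INTO concept_relationship VALUES (1, 2, 'Maps to');
""")


def map_to_standard(conn, vocabulary_id, concept_code):
    """Follow 'Maps to' from a source code to its standard concept(s)."""
    query = """
        SELECT target.concept_id, target.vocabulary_id, target.concept_code
        FROM concept AS source
        JOIN concept_relationship AS cr
          ON cr.concept_id_1 = source.concept_id
         AND cr.relationship_id = 'Maps to'
        JOIN concept AS target
          ON target.concept_id = cr.concept_id_2
        WHERE source.vocabulary_id = ?
          AND source.concept_code = ?
    """
    return conn.execute(query, (vocabulary_id, concept_code)).fetchall()


print(map_to_standard(conn, "ICD10CM", "X00.0"))
```

A source code with no 'Maps to' row simply returns an empty list, which is a convenient way to find the codes that still need attention.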
The problem is slightly different, I believe. For oncology, we need more information than the diagnosis. We need to cover the TNM system in a reasonable way. Right now, we only have one condition_concept_id, and SNOMED does not (and cannot) create a code for all possible combinations of histology and TNM.
If you are interested, we should start figuring this out and make a proposal of how the model could satisfy these needs. Let us know.
NCI is working on an updated ICD-O-3 to ICD-10 mapping. That is what we were planning on using in the condition occurrence table for identifying the tumor with a reasonable degree of specificity (sorry for not mentioning this before, Robert). Then we were planning on putting the detailed information in the observation table. This would include staging information (which includes TNM as well as the full stage). To do this, we should probably create a visit to wrap everything together, although this is not technically required.
By way of background, ICD-O-3 is not a vocabulary in the same sense as ICD-10. It does not have a closed set of codes. It is a concatenation of location (essentially a subset of 3-digit ICD-10 codes), histology (a 4-digit code, for things like "small cell" or "adenocarcinoma"), behavior (a 1-digit code meaning "malignant", "in situ", etc.) and sometimes grade (a 1-digit code for things like "well differentiated", etc.).
Thinking out loud here, I guess we could do all of the unique combinations of these codes. Histology + behavior is actually uniquely defined already by ICD-O-3. And we can drop grade since most people don't use it as part of ICD-O-3 (for example IARC does not as far as I can tell). So it becomes location plus histology/behavior.
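The decomposition described above can be sketched in Python roughly as follows. The input format is an assumption about how the source data is laid out: a topography string such as `C50.9` and a morphology string such as `8500/3`, optionally carrying a grade digit as in `8070/33`. NAACCR files may store these fields differently. The names `IcdO3`, `parse_icdo3` and `condition_key` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layouts: "C" + two digits + optional ".digit" for the site,
# and "hhhh/b" or "hhhh/bg" for histology / behavior / optional grade.
_TOPOGRAPHY = re.compile(r"C\d{2}(?:\.\d)?")
_MORPHOLOGY = re.compile(r"(\d{4})/(\d)(\d)?")


@dataclass(frozen=True)
class IcdO3:
    site: str
    histology: str
    behavior: str
    grade: Optional[str] = None

    def condition_key(self) -> str:
        """Location plus histology/behavior; grade is deliberately dropped."""
        return f"{self.site}|{self.histology}/{self.behavior}"


def parse_icdo3(topography: str, morphology: str) -> IcdO3:
    """Split the two ICD-O-3 strings into their component codes."""
    if _TOPOGRAPHY.fullmatch(topography) is None:
        raise ValueError(f"unexpected topography format: {topography!r}")
    match = _MORPHOLOGY.fullmatch(morphology)
    if match is None:
        raise ValueError(f"unexpected morphology format: {morphology!r}")
    histology, behavior, grade = match.groups()
    return IcdO3(topography, histology, behavior, grade)


tumor = parse_icdo3("C34.1", "8070/33")
print(tumor.condition_key())
```

Keeping grade as a separate attribute rather than part of the key means it can still be stored elsewhere if it turns out to be needed.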
The trick is figuring out which combinations of codes actually occur (most histology codes are limited to a small set of tumor locations). Such a list is easily built from SEER data, although there are probably some combinations that are possible but have never been observed.
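As a rough sketch of how such a list could be built, the Python below counts the distinct (site, histology, behavior) combinations in a CSV extract. The column names and sample rows are invented for illustration and are not the actual SEER layout; `count_combinations` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Hypothetical extract: column names and rows are made up for illustration.
SAMPLE = """site,histology,behavior
C50.9,8500,3
C50.9,8500,3
C50.9,8520,3
C34.1,8070,3
C50.9,8500,2
"""


def count_combinations(csv_text):
    """Count how often each (site, histology, behavior) triple occurs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        (row["site"], row["histology"], row["behavior"]) for row in reader
    )


counts = count_combinations(SAMPLE)
for (site, histology, behavior), n in counts.most_common():
    print(site, f"{histology}/{behavior}", n)
```

Only combinations that appear in the data are listed, so anything possible but never observed will be missing from the result, as noted above.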
Then mapping these to SNOMED is a different task. But creating an ICD-O-3 vocabulary would be great.
I figured it would still be useful to parse the ICD-O-3 to ICD-10-CM worksheets into a usable map.
It was fairly successful: 17,478 out of 50,011 (~35%) unique combinations of site, histology and behavior codes matched to ICD-10-CM codes. I've briefly tested it by looking up 5 or so of the resulting ICD-10-CM codes, which seem to match the ICD-O-3 descriptions.
@Mark_Danese My guess is that this is a superset of what you are looking for. Regardless, I'll post it with the hope that you find it helpful.
This is great. Just one thing to keep in mind about the existing SEER to ICD9/10 mappings. The mappings are based on converting ICD-O-3 to ICD-O-2 and using a mapping from around 1999.
Here are two csv files. One with all of the unique cancers in SEER and one with all the unique cancers since 2001 (the one with 2001 in the file name). They have the location, histology, and behavior, plus the count in the SEER data.
@Christian_Reich I'm hoping you can elaborate on this a little more. As I understand it, the problem stems from needing to lump multiple factors into a single condition entity, given that each of the TNM variables is represented as a condition in the CDM.
After digging around in the vocabularies and SNOMED documentation, it appears that the TNM variables exist in multiple locations within SNOMED. "Tumor Observable" is a branch of the Observable Entity tree in SNOMED which I believe also holds the variables we are after.
If we regard TNM variables as "Clinical Findings" then the corresponding domain is Condition. However, if instead we consider them to be "Observable Entities" then the corresponding domain is Observation.
Would representing the diagnosis data as observations resolve the issue you brought up? It's not clear to me how it would all tie together; perhaps using a visit, as @Mark_Danese suggested, would work. Additionally, would it still be considered an awkward workaround if the data is stored in the observation table but maps to standard concepts?
Not sure I understand. Are you saying that because some of the TNM attributes are observable entities, they are Observations, and that solves it all?
Here is why that would be awkward:
A metastatic tumor is not an observation, but a condition. A pretty grim one, actually. The same is true if lymph nodes are affected or the grade is high. OMOP's principle is to put concepts where they belong, not where the source vocabulary's labels would put them. That is why we have the Domains. We don't use the SNOMED classifiers like "observable entity".
If a patient's condition is scattered between different records of different tables, it becomes very hard to create standardized analytics. How is a tool supposed to know it should look in the OBSERVATION table for pieces of a condition?
Traditionally, this problem is solved through pre-coordination, which means different attributes are combined in a single code. SNOMED does that a lot, which is why it is so gigantic. But encoding all the possible combinations in cancer seems too much even for them.
Let's see what we can get out of Mark's files. I am on vacation, but later in the month we could take a look.
I am not sure that complete pre-coordination is the right answer for oncology data. I agree with the notion of putting concepts where they belong. But my guess is that most people familiar with oncology data would intuitively pre-coordinate only the combination of ICD-O-3 histology and topography in the condition_occurrence table, and would place data points like staging and grade as measurements or observations (wherever the most widely agreed-upon standardized vocabulary dictates they belong).
I agree that complete pre-coordination would produce a forest of codes, so I like @mgurley's thinking. I'd pre-coordinate primary cell type (@mgurley's topography?), histologic subtype, and grade into a principal term, and stage into a separate term. This split also helps because a given disease state can have different meaningful stages (e.g. surgical vs pathologic).
I can see arguments for putting stage either in observation (semantically fits nicely) or condition_occurrence (keeps the parts of the diagnosis together), and would be happy with either approach. If you make me pick, I see a slight advantage for condition_occurrence, in terms of query formulation, but I could be persuaded otherwise.
Sig: NaCl qs po prn disorientation from pediatric oncologist's perspective.
Most NAACCR information is in LOINC. See below for a response from Regenstrief that I got about a year and a half ago. Based on this, I am thinking that most of the cancer data are observations, and most of these are already in the OHDSI vocabulary.
Technically, something like tumor size could be a measurement, along with number of nodes involved, and perhaps a few other quantitative things. But in general, things like stage (including TNM), grade, behavior, and location seem like observations. Most site specific factors (e.g., ER, PR, HER2, etc) are from quantitative assays, but are stored as categorical values (normal, abnormal, borderline, missing).
From Regenstrief:
I'm not sure if 100% of the SEER data variables are there or not. We've worked closely with NAACCR so that all of the variables in their NAACCR Data Standards and Data Dictionary are represented in LOINC. As you probably know, these are the variables sent to central cancer registries so I would presume there is a lot in common with the SEER set.
Here is a query that will show the top level panel codes (underneath which all of the individual data elements are linked) for the NAACCR collections:
Also, just FYI, in the upcoming LOINC release we'll have a few updates to the answer lists associated with the TNM clinical stage codes based on joint work of NAACCR and CDC. It came out of an effort now in HL7 to develop a CDA implementation guide for cancer registry reporting that is using LOINC codes for the observation variables.
I have a generic reply to the problem of one concept per column.
Here is what Christian wrote:
Right now, we only have one condition_concept_id, and SNOMED does not (and cannot) create a code for all possible combinations of histology and TNM
I think this problem will come up in different shapes for multiple columns (procedures, for example). We may not be able to anticipate all permutations.
Just to open Pandora's box and make a radical suggestion: SNOMED has an expression language. We could allow some limited post-coordination and store an expression where we now expect a single concept, with some restrictions on what people can express (for example, initially allowing combinations of only two concepts).
This means that every query would then have to handle subsumption and equivalence, which is outside the capabilities of current database engines.
However, current queries could simply operate only on rows that contain a single concept (and still provide a working solution).
@rtmill What functionality concerns did you run into when mapping to LOINC? @Mark_Danese has stated that LOINC seems to have better coverage of oncology data points.
After a lot of internal discussion, we realized that each data element in SEER/NAACCR data might be its own vocabulary. For example, morphology (histology + behavior) is ICD-O-3, as is grade. Staging (including TNM) is Collaborative Staging (or AJCC depending on how you want to refer to it). Location is available as ICD-O-3 and can be readily mapped to SNOMED through an existing mapping. Tumor location is ALSO available in a number of different vocabularies including a WHO 2008 definition (see this link). Some staging is SEER specific (e.g., SEER historic stage).
So, it may be easier to think about the data elements in smaller chunks, and to map them as separate vocabularies. That is what we are starting to do internally for our ETL work.
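A minimal Python sketch of that idea, assuming a registry record arrives as a dictionary: each data element is routed to the vocabulary it is coded in, so that every element can be mapped on its own. The field names and the sample record are hypothetical (they are not actual NAACCR item names), and `split_by_vocabulary` is an illustrative helper, not part of any existing ETL.

```python
# Which vocabulary each data element is coded in, following the post above.
# Keys are hypothetical field names, not actual NAACCR item names.
ELEMENT_VOCABULARY = {
    "primary_site": "ICD-O-3",
    "morphology": "ICD-O-3",
    "grade": "ICD-O-3",
    "tnm_stage": "Collaborative Staging / AJCC",
    "seer_historic_stage": "SEER",
}


def split_by_vocabulary(record):
    """Turn one registry record into (element, vocabulary, value) rows."""
    rows = []
    for element, value in record.items():
        vocabulary = ELEMENT_VOCABULARY.get(element)
        if vocabulary is None or value in (None, ""):
            continue  # not a coded element, or no value recorded
        rows.append((element, vocabulary, value))
    return rows


# Made-up record for illustration only.
record = {
    "patient_id": "123",
    "primary_site": "C50.9",
    "morphology": "8500/3",
    "grade": "",
    "tnm_stage": "T1",
}

for row in split_by_vocabulary(record):
    print(row)
```

Each resulting row can then be sent to its own vocabulary-specific mapping step, independently of the others.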