
Cancer diagnosis and stage ingestion process - what to do when data sources vary in specificity and new data enables more specificity

(John J Brusk, MPH) #1

See below email text for context as I was originally responding to that…

I have been going down the path of option 2, leveraging the LOINC codes in the Measurement domain for stage. This is pragmatic for us because basic staging for solid tumors is readily available in the feed we get from oncology provider offices seeking verification to use a non-NCCN compendia protocol for a particular patient. I can automate ingestion of this staging from the feed, but TNM specificity requires either manual clinician input based on the path reports we receive or NLP extraction, which is currently asynchronous for us and requires the ability to input those measures later, once the extraction is complete. Thus, we will always have stage group (LOINC 21908-9), and we hope to subsequently extract TNM for as many cases as possible, for which I plan to use LOINC 21905-5, 21906-3 and 21907-1 to record T, N and M, respectively. However, this is yet another option from what you described. Is this wrong? Also, we are relating this in the Measurement domain to the visit occurrence id associated with the condition occurrence id when the patient’s cancer diagnosis was first documented. Shouldn’t that work?

As for ICDO, what you have written makes complete sense. We currently get ICD10 from claims and via our protocol review application and our clinicians document the specific histology (usually and unfortunately as simply a more descriptive diagnosis rather than a particular code), and we are in process of mapping these to ICDOs which will enable us to use those for condition occurrence rather than just the ICD10… however, if I already have ICD10 based condition occurrences loaded, would it be appropriate to associate another condition occurrence ID based on ICDO to the same visit occurrence id to which we are associating a condition occurrence using ICD10? Or would I update the existing condition occurrence ids from ICD10 to ICDO?

These seem like two different questions but are related: If the best information available on initial load was less than desired specificity (e.g. ICD10s or LOINC stage group) and, after analytics and clinical review, we get more specific information, should the condition occurrences and measurements be updated to reflect the more specific information, or do I just add additional measurement ids and condition occurrence ids with the greater specificity (e.g. ICDOs or LOINC TNM codes) associated with the same visit occurrence ids? Thanks, JB

Email text:
The OHDSI Oncology Group is trying to converge on a standard way to represent oncology diagnoses and oncology diagnosis modifiers (like staging, grading, biomarkers etc.) within OMOP. We have created the OMOP Oncology Extension (the oncology extension). In the oncology extension, we have adopted ICDO as the standard way to represent oncology diagnoses. We have pre-coordinated the most common ICDO histology/site combinations and mapped or subsumed them to SNOMED concepts. The extension recommends that these pre-coordinated ICDO histology/site combination concepts land in the CONDITION_OCCURRENCE.condition_concept_id field and the EPISODE.episode_object_concept_id field.
However, we want to be able to further refine these oncology diagnoses with oncology diagnosis modifiers. Unfortunately, unlike with ICDO-based oncology diagnoses, there is no standardized vocabulary of oncology diagnosis modifiers. Further, OMOP currently contains vocabularies with duplicative, overlapping options for representing many oncology diagnosis modifiers. Thus, currently, an OMOP ETL developer is forced to choose whatever “seems” right.

Let’s say an ETL developer wants to record an oncology diagnosis of ICDO histology ‘8140/3 Adenocarcinoma, NOS’ and ICDO site ‘C18.2 Ascending colon’. Currently the oncology extension recommends mapping this ICDO histology/site combination of 8140/3-C18.2 to ‘Adenocarcinoma of ascending colon’, OMOP Concept ID 44502439. But if the ETL developer wants to further modify this oncology diagnosis with, for example, pathological TNM Staging for AJCC Version 6 T=pT3, the ETL developer is faced with a couple of options:

Option 1: Map T=pT3 to SNOMED code 395707006 ‘pT3: Tumor invades through the muscularis propria into the subserosa or into non-peritonealized pericolic or perirectal tissues’ OMOP Concept ID 4193681 in the Condition domain. But then how does the ETL developer relate this entry in CONDITION_OCCURRENCE to the entry in CONDITION_OCCURRENCE for the base oncology diagnosis? FACT_RELATIONSHIP?

Option 2: Map T=pT3 to LOINC code 21899-0 ‘Primary tumor.pathology Cancer’ OMOP Concept Id 3016308 in the Measurement domain and LOINC Answer ID LA3624-9 ‘T3’ OMOP Concept Id 45876313 in the Meas Value domain. Again, how does the ETL developer relate this entry in the MEASUREMENT table to the entry in CONDITION_OCCURRENCE for the base oncology diagnosis? FACT_RELATIONSHIP?

Either of these options could work. But the OMOP community needs to converge on a ‘standard’ representation.

The oncology extension recommends placing oncology diagnosis modifiers within the MEASUREMENT table. These MEASUREMENT entries should point to a parent CONDITION_OCCURRENCE or EPISODE entry by populating the new polymorphic foreign key: MEASUREMENT.modifier_of_field_concept_id and MEASUREMENT.modifier_of_event_id. This would seem to point to using ‘Option 2’. Unfortunately, most EHR/LIMS source systems do not contain discrete oncology diagnosis modifiers. The College of American Pathologists (CAP) publishes the CAP Cancer Protocols, which have a machine-parseable distribution named CAP eCC. The CAP Cancer Protocols standard is a comprehensive, frequently updated vocabulary of oncology diagnosis modifiers that is tightly bound to actual clinical practice. Adherence to the CAP Cancer Protocols is mandatory for CAP and COC accreditation. Unfortunately, CAP eCC is a proprietary standard that is not mapped to a standardized vocabulary (like SNOMED or LOINC), and most EHR/LIMS do not discretely encode oncology diagnosis modifiers. The University of Nebraska Medical Center is working on normalizing the CAP eCC proprietary format to open-source and standardized vocabularies via the Nebraska Lexicon project. However, at this time the Nebraska Lexicon only covers a small number of anatomic sites.
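To make the ‘Option 2’ linkage concrete, here is a minimal Python sketch of a MEASUREMENT row for T=pT3 that points back at one specific CONDITION_OCCURRENCE row. The two staging concept IDs are the ones quoted in Option 2; `MeasurementRow`, `stage_modifier` and the placeholder value of `CONDITION_OCCURRENCE_FIELD_CONCEPT_ID` are illustrative assumptions, not part of the extension.

```python
from dataclasses import dataclass
from datetime import date

# Concept IDs quoted in Option 2 of the email.
PATHOLOGY_T_CONCEPT_ID = 3016308  # LOINC 21899-0 'Primary tumor.pathology Cancer'
T3_ANSWER_CONCEPT_ID = 45876313   # LOINC answer LA3624-9 'T3'

# Hypothetical placeholder for the concept that identifies
# CONDITION_OCCURRENCE.condition_occurrence_id as the referenced field.
# Look up the real value in your own vocabulary release.
CONDITION_OCCURRENCE_FIELD_CONCEPT_ID = 0


@dataclass
class MeasurementRow:
    """Only the MEASUREMENT columns relevant to this example."""
    measurement_id: int
    person_id: int
    measurement_concept_id: int
    value_as_concept_id: int
    measurement_date: date
    modifier_of_event_id: int
    modifier_of_field_concept_id: int


def stage_modifier(measurement_id, person_id, condition_occurrence_id, measured_on):
    """Build a pT3 modifier row that points at one specific diagnosis row."""
    return MeasurementRow(
        measurement_id=measurement_id,
        person_id=person_id,
        measurement_concept_id=PATHOLOGY_T_CONCEPT_ID,
        value_as_concept_id=T3_ANSWER_CONCEPT_ID,
        measurement_date=measured_on,
        modifier_of_event_id=condition_occurrence_id,
        modifier_of_field_concept_id=CONDITION_OCCURRENCE_FIELD_CONCEPT_ID,
    )


row = stage_modifier(1, 42, 1001, date(2019, 6, 1))
```

Because the row carries the ID of the diagnosis it modifies, no FACT_RELATIONSHIP entry is needed to find its parent.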
In the US, the most widely available source system containing “discrete” oncology diagnosis modifiers is NAACCR formatted tumor registry data. NAACCR is a data dictionary format for the tracking of oncology diagnoses, oncology diagnosis modifiers and oncology treatments. All US facilities diagnosing and treating cancer patients are mandated to report their data in the NAACCR format to federal and state agencies. Most NAACCR data is manually abstracted from patient charts by certified tumor registrars.
The oncology extension wants to be able to support the ingestion of oncology diagnosis modifiers from NAACCR tumor registry data. To enable this use case, the oncology extension recommends adopting the NAACCR tumor registry vocabulary as the standard OMOP oncology diagnosis modifier vocabulary. Thus, an Option 3 arises that dictates that cancer staging data should be encoded in OMOP in the NAACCR vocabulary. The ingestion of the NAACCR data format is currently under construction by the OMOP vocabulary team.
In the future, the hope is that the NAACCR vocabulary could be transitioned from the standard OMOP oncology diagnosis modifier vocabulary to a source vocabulary. The current vision is that the Nebraska Lexicon will be adopted as the standard oncology diagnosis modifier vocabulary, and a mapping will be needed from the NAACCR vocabulary to the standardized SNOMED/LOINC concepts specified by the Nebraska Lexicon.
We are going to discuss oncology diagnosis modifiers at our next OHDSI Oncology Workgroup meeting. We will review a concrete SQL example mapping cancer staging to NAACCR in OMOP. Either at the meeting or via email, please let us know your thoughts on the current direction we are taking. We definitely want to hear other perspectives.

(Christian Reich) #2

The problem is that these are cancer specific. Take bladder cancer and regional lymph nodes. They are:

  • NX: The regional lymph nodes cannot be evaluated.
  • N0: No spread to the regional lymph nodes.
  • N1: Single regional lymph node in the pelvis.
  • N2: 2 or more regional lymph nodes in the pelvis.
  • N3: Common iliac lymph nodes, which are located behind the major arteries in the pelvis, above the bladder.

LOINC doesn’t provide that kind of detail.

ICD10 is not a Standard vocabulary. We standardize the semantic meaning of a concept into so-called standard concepts. For cancer diagnoses, they are derived from SNOMED and ICDO (for the more granular concepts). ICD10 gets mapped to these standard concepts.

And besides: ICD10 describes the anatomical structure of the tumor, I’d say, OK. Not great. But the histology is practically missing. So you cannot distinguish between adenocarcinoma of the lung and small cell lung cancer, which makes ICD10 pretty useless for systematic cancer research.

Bottom line: ICD10, SNOMED and LOINC are not granular enough for cancer research. We need vocabularies that reach deeper. For histology and topography we picked ICDO. For all other cancer attributes we picked the NAACCR pathology report terminology. For treatment we picked HemOnc. Now we are validating how well these work in real data.

Good question. The way we are planning on doing it, which is the standard OMOP Vocabulary approach, is to have a hierarchy of concepts which are higher up (less granular) and further down (more granular). You then map whatever your data have. Which means, if the initial diagnosis is ICD10-based you will have the equivalent high-level SNOMED concept, and later on, when they have done the diagnostic work-up, you get a whole lot of detail and represent it in the new vocabularies. In either case you will know the disease is lung cancer, because they are connected through the hierarchical system.
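As a rough illustration of why both the early, coarse record and the later, granular record still answer the question "does this patient have lung cancer", here is a small Python sketch. The ancestor pairs stand in for the CONCEPT_ANCESTOR table; every concept ID, person ID and the `persons_with` helper are made up for the example.

```python
# Toy (ancestor, descendant) pairs standing in for CONCEPT_ANCESTOR.
# All IDs below are invented for illustration only.
LUNG_CANCER = 1          # broad concept an ICD10 code might map to
LUNG_ADENOCARCINOMA = 2  # granular, ICDO-derived concept
SMALL_CELL_LUNG = 3      # granular, ICDO-derived concept
COLON_CANCER = 4

CONCEPT_ANCESTOR = {
    (LUNG_CANCER, LUNG_CANCER),
    (LUNG_CANCER, LUNG_ADENOCARCINOMA),
    (LUNG_CANCER, SMALL_CELL_LUNG),
    (LUNG_ADENOCARCINOMA, LUNG_ADENOCARCINOMA),
    (SMALL_CELL_LUNG, SMALL_CELL_LUNG),
    (COLON_CANCER, COLON_CANCER),
}


def persons_with(ancestor_concept_id, condition_rows):
    """Return person_ids having any condition at or below the given concept."""
    return {
        person_id
        for person_id, concept_id in condition_rows
        if (ancestor_concept_id, concept_id) in CONCEPT_ANCESTOR
    }


# (person_id, condition_concept_id) pairs
condition_rows = [
    (101, LUNG_CANCER),          # initial, ICD10-derived record
    (101, LUNG_ADENOCARCINOMA),  # added later, after the diagnostic work-up
    (102, SMALL_CELL_LUNG),
    (103, COLON_CANCER),
]

lung_cancer_patients = persons_with(LUNG_CANCER, condition_rows)
adenocarcinoma_patients = persons_with(LUNG_ADENOCARCINOMA, condition_rows)
```

Querying at the broad concept picks up persons 101 and 102 regardless of how granular their records are, while querying at the granular concept picks up only person 101.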

Neither Option 1 nor 2. There are the new NAACCR-based tumor attributes. They will live in the MEASUREMENT table and be connected to a record in the new EPISODE table. We’ll publish the exact details soon. In the meantime, please join the Oncology WG and talk to team members about implementation details.

(John J Brusk, MPH) #3

Thank you this helps. I will leverage the new NAACCR based tumor attributes. To confirm, these would still leverage LOINC vocabulary right? Specifically 21899-0 Primary tumor.pathology Cancer, 21900-6 Regional lymph nodes.pathology [Class] Cancer, 21901-4 Distant metastases.pathology [Class] Cancer and 21902-2 Stage group.pathology Cancer for T, N, M and stage group, respectively.

As for using HemOnc for treatment, I find it difficult as the majority of the protocols we construct and recommend are non-compendia and will likely not be able to be identified there… is there an issue with just using the HCPCS JCodes input into drug exposure and drug exposure era domains to record the treatment rendered? Thanks again and I joined my first oncology WG call this week and plan to join subsequent meetings. JB

(Christian Reich) #4

Right now, no. They would need to be mapped. Or placed into the hierarchy.

Understood. Can you show us an example?

That’s not changing (except we map HCPCS to RxNorm, but that’s working). You will still keep the detailed drug information. Whether or not we can infer the regimen - that’s what we want to test.

Please come. We need people like you with real data and use cases for getting this right.

(John J Brusk, MPH) #5

Thanks again. As for a few protocol examples… first is an ancillary protocol (for nausea) that I could construct using separate component classes in HemOnc but the regimen does not exist: Ondansetron PO, Palonosetron, and Dexamethasone. Would you suggest using the 3 components to represent the regimen? Another example is a main protocol for acute lymphoblastic leukemia: Dasatinib, Prednisone, Vincristine, Methotrexate, Mercaptopurine (D-POMP)… in this case the D, P, V regimen is present but not with the Methotrexate and Mercaptopurine… and the latter two Ms are present together as a regimen… so do I encode all of the components separately or the two regimens that do exist to represent the full combination? Finally, I understand emerging drugs (like polatuzumab) will not be there (doesn’t even have a q code yet), but with newer drugs that have a j code (Palifermin) but are not yet represented as a component in HemOnc, I feel that I can only represent them as drugs or procedures in drug exposure and/or procedure occurrence. Is that appropriate? I fear inconsistency will grow between the drug exposure, procedure occurrence and the HemOnc regimens I am able to find or construct for ingestion into the new episode domain. Thanks again, JB

(Michael Gurley) #6

For those regimens that you cannot comfortably map to an existing HemOnc regimen, I think the recommendation would be to map EPISODE.episode_object_concept_id to a higher-level concept like ‘Chemotherapy’ or ‘Immunotherapy’, but with links to the DRUG_EXPOSURE table that could reveal the RxNorm ingredients through the OMOP vocabulary hierarchy. If enough people did this across enough data sources, it would be a way to surface novel regimens from data, which could in turn be fed back to HemOnc.org to improve their breadth: breadth that would now be driven by real world evidence.
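A minimal sketch of the "surface regimens from data" idea, assuming each treatment episode has already been linked to the ingredients of its DRUG_EXPOSURE rows. The `ingredient_combinations` helper, the episode IDs and the sample links are hypothetical; the drug names come from the D-POMP example earlier in the thread.

```python
from collections import Counter


def ingredient_combinations(episode_drug_links):
    """Count how often each distinct set of ingredients occurs across episodes.

    episode_drug_links: iterable of (episode_id, ingredient_name) pairs.
    """
    by_episode = {}
    for episode_id, ingredient in episode_drug_links:
        by_episode.setdefault(episode_id, set()).add(ingredient)
    return Counter(frozenset(ingredients) for ingredients in by_episode.values())


# Invented sample data: three generic 'Chemotherapy' episodes.
links = [
    (1, "dasatinib"), (1, "prednisone"), (1, "vincristine"),
    (2, "vincristine"), (2, "dasatinib"), (2, "prednisone"),
    (3, "methotrexate"), (3, "mercaptopurine"),
]

combos = ingredient_combinations(links)
most_common_combo, episode_count = combos.most_common(1)[0]
```

Combinations that recur across many episodes and data sources would be the candidates to report back as potential new regimens.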

(Christian Reich) #7


All good comments.

Define “you could construct”. Usually, regimens are generally accepted drug cocktails that come with a certain set of indications. But there is nothing preventing you from having patients receiving combination therapies. To define them for research, create a cohort.

As @mgurley said: The HemOnc folks will take care of these requests. But like with all concepts: If you need your own private one right now today you need to create it in the >2Billion space. Won’t be interoperable though. Consider this one submitted.
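For anyone who does need a private concept right away, here is a small sketch of allocating IDs in the >2Billion space. The `next_local_concept_id` helper is hypothetical, and the sketch assumes local IDs are simply handed out sequentially above 2,000,000,000.

```python
# Concept IDs above this value are the '>2Billion space' used for local concepts.
LOCAL_CONCEPT_FLOOR = 2_000_000_000


def next_local_concept_id(existing_concept_ids):
    """Return the next unused concept_id above the local-concept floor."""
    local_ids = [cid for cid in existing_concept_ids if cid > LOCAL_CONCEPT_FLOOR]
    return max(local_ids, default=LOCAL_CONCEPT_FLOOR) + 1


first_local_id = next_local_concept_id([44502439, 4193681])
following_local_id = next_local_concept_id([44502439, 2_000_000_005])
```

Such IDs only mean something inside your own instance, which is the interoperability cost mentioned above.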

Not true. HemOnc scans all drugs, also those that are still in development. But again, they will depend on the community to help them.

That’s irrelevant. The CMS decides to create their codes when they need providers to be able to charge for the treatment. We will have the compound way earlier through RxNorm and HemOnc.

(John J Brusk, MPH) #8

Thank you! I like the idea of defining the specific combo therapies in cohorts. This will work out very well. Thanks again, JB

(Christian Reich) #9


Answering on behalf of Jeremy (father of HemOnc) while he gets his OHDSI Forum account (copy and paste from his email):

We have some of them on HemOnc.org, for example on this page. But, they are not in the standard format of the rest of the site so they don’t get ingested, yet…if there were a demand for them, we could transform content on these pages to enable parsing. Our focus to date has been on cancer-modifying protocols, and to a lesser extent protocols used for benign hematologic conditions.

Send us a reference and we’ll be happy to add it! A cursory review of the literature doesn’t reveal anything obvious; this paper sort of describes D-POMP (Dasatinib and low-intensity chemotherapy in elderly patients with Philadelphia chromosome–positive ALL; Rousselot et al. 2016) but it looks pretty different from standard POMP…

We already have it on the website, so it’ll be picked up the next time we run a parse! Since the drug was approved 7 days ago (the same day the vocabulary was released?) it’s not that fair to expect that we would already have it in the ontology!

I’ve created a page so that it will get pulled into the vocabulary at the next parse: https://hemonc.org/wiki/Palifermin_(Kepivance). But, as with the ancillary protocols comment above, it is unlikely that we would add protocols of palifermin anytime in the foreseeable future, unless we see it listed as a supportive medication for a particular chemotherapy regimen.

(Michael Gurley) #10


It sounds like you are first obtaining the “clinical” stage group (LOINC code 21908-9) from the “feed you(sic) get from oncology provider”. “Clinical” staging of cancer is different from “pathological” staging. The LOINC codes you reference (21905-5, 21906-3 and 21907-1) apply to “clinical” staging.

  • 21905-5 Clinical T
  • 21906-3 Clinical N
  • 21907-1 Clinical M

The LOINC code from my Option 2 (21899-0) is appropriate for tracking “pathological” staging.

  • 21899-0 Pathological T
  • 21900-6 Pathological N
  • 21901-4 Pathological M

It sounds like you are then later trying to obtain the “pathological” staging from NLP-extraction/assisted chart abstraction. If you are truly getting “pathological” staging, you should use the “pathological” staging LOINC codes. Ideally, each tumor in an oncology analytic data set would have both “clinical” and “pathological” staging tracked and separately stored.
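The clinical versus pathological split can be kept explicit in an ETL with a simple lookup, sketched below in Python. The LOINC codes are the ones cited in this thread; the `STAGING_LOINC` table and the `staging_loinc` helper are illustrative names only.

```python
# LOINC codes as cited in this thread, keyed by (staging basis, component).
STAGING_LOINC = {
    ("clinical", "T"): "21905-5",
    ("clinical", "N"): "21906-3",
    ("clinical", "M"): "21907-1",
    ("clinical", "stage_group"): "21908-9",
    ("pathological", "T"): "21899-0",
    ("pathological", "N"): "21900-6",
    ("pathological", "M"): "21901-4",
    ("pathological", "stage_group"): "21902-2",
}


def staging_loinc(basis, component):
    """Return the LOINC code for a staging component, failing loudly if unknown."""
    try:
        return STAGING_LOINC[(basis, component)]
    except KeyError:
        raise ValueError(
            f"no LOINC code for basis={basis!r}, component={component!r}"
        ) from None


feed_stage_code = staging_loinc("clinical", "stage_group")  # from the provider feed
nlp_t_code = staging_loinc("pathological", "T")             # from path-report NLP
```

Forcing the caller to state the basis means a pathology-derived T can never silently land under a clinical code.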

As for relating the capture of staging in “the Measurement domain to the visit occurrence id associated with the condition occurrence id when the patient’s cancer diagnosis was first documented”, the problem with that approach is what if there is more than one cancer diagnosis associated to the chosen visit_occurrence_id? You would have an ambiguity as to which cancer diagnosis owns the staging Measurement entry.
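To see how often this ambiguity would actually bite in a given data set, one could check for visits that carry more than one cancer diagnosis. A rough Python sketch follows; the `ambiguous_visits` helper and all IDs are invented for illustration.

```python
from collections import defaultdict


def ambiguous_visits(cancer_conditions):
    """Find visits linked to more than one cancer diagnosis.

    cancer_conditions: iterable of (visit_occurrence_id, condition_occurrence_id).
    Returns {visit_occurrence_id: sorted list of condition_occurrence_ids}.
    """
    by_visit = defaultdict(set)
    for visit_id, condition_id in cancer_conditions:
        by_visit[visit_id].add(condition_id)
    return {
        visit_id: sorted(condition_ids)
        for visit_id, condition_ids in by_visit.items()
        if len(condition_ids) > 1
    }


# Invented sample: visit 500 has two cancer diagnoses, visit 501 has one.
rows = [(500, 1001), (500, 1002), (501, 1003)]
conflicts = ambiguous_visits(rows)
```

A staging MEASUREMENT tied only to visit 500 could belong to either diagnosis 1001 or 1002.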

This is why the OMOP Oncology extension recommends associating oncology diagnosis modifiers with an entry in CONDITION_OCCURRENCE or EPISODE. The ambiguity is thus removed. In fact, the logic you are applying to find when “the patient’s cancer diagnosis was first documented” will likely be the recommended algorithm to create an entry in the EPISODE table from the stream of entries in CONDITION_OCCURRENCE. All connected and swept together by EPISODE_EVENT. This strategy will be forced upon folks (like you) that don’t have access to tumor registry data that explicitly declares the (normally pathology-confirmed) begin date of an oncology diagnosis in the fine-grained ICDO vocabulary.

When your “clinicians document the specific histology”, if you then can map to ICDO, the OMOP Oncology extension would recommend placing that new more fine-grained condition as a new entry in the CONDITION_OCCURRENCE but then updating the existing entry in the EPISODE table to the more fine-grained ICDO oncology diagnosis.

The case you are bringing up of using “the best information available on initial load with(sic) less than desired specificity” but then “after analytics and clinical review, getting more specific information”, is exactly the problem that the EPISODE table is trying to address. Picking and promoting from the observational welter a definitive begin date and the most fine grained representation of an oncology diagnosis. And then ideally associating as many oncology diagnosis modifiers to the chosen EPISODE entry.
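A minimal sketch of that "pick and promote" step, assuming the condition rows for one disease have already been grouped together. The `derive_episode` helper, the `start_date` key, the concept IDs and the toy ancestor pairs are all hypothetical; granularity is approximated by counting how many of the other recorded concepts are ancestors of a given concept.

```python
from datetime import date


def derive_episode(conditions, ancestor_pairs):
    """Summarise one disease's condition rows into a single episode.

    conditions: list of (condition_start_date, condition_concept_id).
    ancestor_pairs: set of (ancestor_concept_id, descendant_concept_id).
    """
    start = min(start_date for start_date, _ in conditions)
    concepts = {concept_id for _, concept_id in conditions}

    def recorded_ancestors(concept_id):
        # More recorded ancestors means the concept sits deeper in the hierarchy.
        return sum(
            (other, concept_id) in ancestor_pairs
            for other in concepts
            if other != concept_id
        )

    most_granular = max(sorted(concepts), key=recorded_ancestors)
    return {"start_date": start, "episode_object_concept_id": most_granular}


# Invented sample: concept 10 is the broad initial diagnosis,
# concept 20 is the ICDO-level diagnosis added after clinical review.
conditions = [(date(2019, 3, 1), 10), (date(2019, 4, 15), 20)]
episode = derive_episode(conditions, ancestor_pairs={(10, 20)})
```

The episode keeps the earliest documented date while its diagnosis is promoted to the more granular concept once that arrives.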

(Jeremy Warner) #11

Hi everyone, I’m on the forum now! Please @ me with questions related to HemOnc and I’ll do my best to answer quickly!

(Dramacloak) #12

When is the next Oncology working group meeting? I’d love to join the effort but can’t find the meeting details. Thanks.

(Michael Gurley) #13

Normally every Tuesday at 11:00AM EST. But we are off next Tuesday because of July 4th. So next one will be July 9th. See here for connection information:

Email me your email address to m-gurley@northwestern.edu to get on the distribution list.