
Phenotype Phebruary Day 26 - Non-Small Cell Lung Cancer

Happy Saturday everyone!

It is non-small cell lung cancer (NSCLC) day.

I am presenting on behalf of a phantastic group: @dkosareva, @mgurley, @Ajit_Londhe, @CarolynB, Xerxes Pundole, and @jshaw03. And special thanks to @Patrick_Ryan.

I am going to make a little bit of a change today and start by talking about what we want to define.

Our goal today is to come up with a phenotype for NSCLC. We would like to identify patients with a diagnosis of NSCLC, irrespective of the disease stage. We are also interested in all NSCLC patients (prevalent cases) and NOT just newly diagnosed (incident) cases.

I have a very long summary of the clinical definition of NSCLC, but I am afraid my summary will not do a fair job of describing the disease. NSCLC is not just one disease but many, each with its own characteristics, risk factors, behaviors and treatments, so I could not convince myself to share the long summary here with you. This Nature Review on NSCLC gives a good overview of the disease. Here, I will focus on a broad definition of the disease as it pertains to our phenotype of interest.

What is NSCLC? NSCLC is a type of cancer that originates in the lung. The NCI dictionary defines NSCLC as “a group of lung cancers named for the kinds of cells found in the cancer and how the cells look under a microscope. The three main types of non-small cell lung cancer are adenocarcinoma (most common), squamous cell carcinoma, and large cell carcinoma. Non-small cell lung cancer is the most common of the two main types of lung cancer (non-small cell lung cancer and small cell lung cancer (SCLC)).”

Sounds simple and straightforward. All I need to do to find my cohort of NSCLC patients is to first identify tumors that originate in the lung (location or topology of the tumor) and then identify the subset of lung cancer patients with a specific tumor histology, namely adenocarcinoma, squamous cell carcinoma, or large cell carcinoma. The combination of topology+histology is what is called the base diagnosis, and it was introduced as a part of the Cancer Diagnosis Model in the Oncology Module.

Not that easy :disappointed_relieved:
The ICD-9 and ICD-10 coding systems do not distinguish between histological types. So we cannot really differentiate NSCLC from SCLC.

What does the literature tell us?

  1. In 2008, Duh et al developed an algorithm to identify SCLC cases among all lung cancer cases in administrative claims databases. They used American Cancer Society (ACS) and National Comprehensive Cancer Network (NCCN) treatment guidelines and clinical expertise to identify SCLC. Later on, this algorithm was modified by reversing the inclusion and exclusion criteria to identify NSCLC patients in claims databases. The Modified Duh Algorithm uses chemotherapy regimens applied to patients with SCLC as exclusion criteria and first-line chemotherapy regimens administered to patients with NSCLC as inclusion criteria. Although widely used, the algorithm has not been validated.

  2. In 2017, Turner et al took the existing Modified Duh Algorithm and updated it to include first-line treatments and test recommendations for patients with NSCLC and SCLC according to the 2015 ACS and 2016 NCCN guidelines. The validation was performed using the HealthCore Integrated Research Environment (HIRE)-Oncology clinical data linked to the HealthCore Integrated Research Database (HIRD). They reported a sensitivity of 94.8%, a specificity of 81.1%, and a positive predictive value (PPV) of 95.3%.

  3. In 2020, Balzi et al proposed another algorithm to distinguish SCLC from NSCLC. They identified lung cancer patients using ICD-9-CM diagnosis codes and excluded patients with other malignancies and those who initiated etoposide or lanreotide in the first 180 days after diagnosis. They reported a sensitivity of 88.8% and a high PPV of 90.2%, but a suboptimal specificity (53.7%) and NPV (50%).

    In summary, the NSCLC algorithms in the literature are a combination of a lung cancer diagnosis with SCLC and NSCLC treatments.
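For readers less familiar with the validation figures quoted above, here is a minimal Python sketch of how sensitivity, specificity, PPV and NPV fall out of a 2x2 table of algorithm result versus reference standard. The function name and the counts are illustrative assumptions, not values from any of the cited studies.

```python
def validation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard accuracy metrics from a 2x2 confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives against a reference standard.
    """
    return {
        "sensitivity": tp / (tp + fn),  # true cases the algorithm finds
        "specificity": tn / (tn + fp),  # non-cases the algorithm rejects
        "ppv": tp / (tp + fp),          # flagged patients who are true cases
        "npv": tn / (tn + fn),          # unflagged patients who are true non-cases
    }


# Hypothetical counts, for illustration only.
metrics = validation_metrics(tp=90, fp=10, fn=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that PPV and NPV depend on how common the disease is in the validation sample, so they do not transfer directly between data sources.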

Can we use these learnings to come up with a good phenotype for NSCLC? Let’s dissect this a bit more:

  1. Not all lung cancer patients initiate treatment. The proportion of untreated patients with NSCLC varies by stage, ranging from 7% to 45%. This figure can be as high as 90% for older and more frail patients (David et al). Similar rates have been reported for SCLC. Using any of the proposed algorithms means that we will miss a substantial proportion of our NSCLC patients who did not receive treatment (impacting sensitivity). Remember, our intention was to build an NSCLC phenotype, not a phenotype for NSCLC patients who initiated a treatment. An algorithm based on future treatment would therefore give us an unrealistic picture of all NSCLC patients.

  2. Using treatment information to include/exclude patients can help distinguish the diseases as long as the treatments are distinct. Overlap between treatments compromises the specificity of the definition. Is that the case?
    The treatment landscape of NSCLC and SCLC and the clinical practice guidelines are constantly changing. Developing a universal and reproducible phenotype for NSCLC using treatment information requires regular updates and maintenance to ensure the definition captures all the newly approved treatments across geographies. In addition, the performance of the definition needs to be assessed regularly. Introduction of new distinct treatments for each tumor type would lead to better specificity of the definition, while approval of the same drug for both conditions would lead to a loss in specificity.

    How similar or different are the SCLC and NSCLC treatments? Let’s take a look at the NSCLC and SCLC treatments as described in HemOnc.

    The following regimens are used for both diseases:

| Regimen | NSCLC indication(s) | SCLC indication(s) |
|---|---|---|
| Carboplatin & Etoposide (CE) | 1L for advanced or metastatic disease | LS induction, LS definitive therapy, LS adjuvant therapy, ES induction, ES relapse refractory disease |
| Carboplatin & Paclitaxel (CP) & Ipilimumab | 1L for advanced or metastatic disease | ES induction |
| Cisplatin & Etoposide (EP) | Adjuvant and neo-adjuvant therapy, 1L for advanced or metastatic disease | LS induction, LS definitive therapy, ES induction, ES relapse refractory disease |
| Cisplatin & Irinotecan (IC) | 1L for advanced or metastatic disease | LS consolidation for upfront therapy, ES induction |
| Radiation therapy | Definitive therapy for locally advanced disease | LS consolidation for upfront therapy |
| Docetaxel monotherapy | 1L for advanced or metastatic disease (elderly or poor performance status), maintenance after 1st, 2nd line, 3rd line and subsequent lines of therapy | ES induction |
| Gemcitabine monotherapy | 1L for advanced or metastatic disease (elderly or poor performance status), maintenance after 1st, 3rd line and subsequent lines of therapy | Relapse refractory disease |
| Bevacizumab monotherapy | Maintenance therapy after 1st line | ES maintenance |
| Ipilimumab monotherapy | Maintenance therapy after 1st line | ES maintenance |
| Amrubicin monotherapy | Advanced or metastatic disease, subsequent lines of therapy | Relapse refractory disease |
| Cisplatin, Etoposide, RT | Definitive therapy for locally advanced disease | LS definitive therapy |

So technically, we cannot differentiate NSCLC and SCLC for patients on any of these regimens. We can infer the disease based on the treatment and our prior knowledge of how frequently it is used in either patient population (mmm sounds like a lot of guesswork and not really reproducible :face_with_raised_eyebrow:). Also, the impact of treatment overlap on the specificity and accuracy of the definition depends on how frequently these treatments are used in each setting and, as mentioned above, this can easily change over time. But similar to Turner et al, we can create a phenotype based on a combination of LC diagnosis, SCLC treatment and NSCLC treatment: lung cancer patients WITH NSCLC treatment AND NOT SCLC treatment. We can test it out and see what the data tell us.

But wait! There is another problem here: most cancer treatments are administered in chemotherapy regimens with complex dosing and scheduling in multiple cycles and are often combined with targeted therapies, immunotherapies, surgery or radiotherapy. We do not have that in the data to use for our definition. What we have is a list of individual drugs or procedures, basically the components of the regimens. There are several ongoing efforts in the community to abstract treatment regimens from different combinations of drugs and procedures. But until then, we are limited to the list of drugs and procedures. OK! Let’s take another look and do a quick comparison between NSCLC and SCLC treatments:

There seems to be a substantial overlap between the drugs used to treat the two conditions, and some of them are commonly used in both conditions (as a part of different regimens).

  1. Using treatment to accurately distinguish SCLC and NSCLC treated populations depends on the completeness of treatment information in our data sources. For example, oncology EMR data coming from oncology clinics (a common source of data used for observational oncology research) have good capture of systemic antineoplastic use but lack information on surgery, radiotherapy, and other inpatient procedures. Similarly, claims data are limited in capturing inpatient administration. Each data source shows us one snapshot and a different view of the patient's treatment journey, not the whole journey. Can we rely on these limited snapshots to get to our NSCLC patients?

  2. What about multiple primaries? Up to 20% of lung cancer patients develop a secondary tumor, and the most common secondary malignancy in patients with lung cancer is in the lung: ~57% of patients with SCLC develop NSCLC. Additionally, a history of previous malignancy in patients with lung cancer is reported in ~11% of cases. Having multiple primary tumors complicates the use of treatment for identification of NSCLC patients. But if we exclude them, our definition will only identify the subset of NSCLC patients with no other primaries who initiated treatment.

  3. A definition based on treatment uses future events (here, treatment) to define the patient population. This means we will only capture the subset of patients who survived and received a treatment. Sounds like immortal time (AVOID IT BY ALL MEANS). If I want to use this definition to characterize NSCLC patients and their outcomes after diagnosis of NSCLC, I will be looking at the subset who have survived and are being treated (selection bias). If I want to describe treatment patterns, I am basically describing what I have used to define my cohort. I cannot see any off-label use or disparities in treatment. The definition would work if I were interested in NSCLC treated patients as a phenotype and correctly identified the index date, but that was not what we intended to create.

How should we move forward?

  1. Do not create a phenotype in the presence of this much uncertainty. Instead, let’s work together to enrich the data, either through linkage to a tumor registry or other sources with information on tumor histology and other necessary attributes, or by looking into the notes. While we strongly recommend and encourage the community to join the Oncology WG and others in enhancing oncology data, we still want to comprehensively assess the limitations and drawbacks of the current data in defining NSCLC.

  2. Create the following phenotypes:

    • Probably NSCLC treated cohort: Lung cancer diagnosis AND NSCLC specific treatment initiation
    • Probably SCLC treated cohort: Lung cancer diagnosis AND SCLC specific treatment initiation
    • We have no clue lung cancer cohort: Lung cancer diagnosis AND initiation of treatments commonly used in both conditions

    We will look at the cohort diagnostics results and also do a simple characterization to further understand the attrition based on different treatments and potential differences in the patient population.

  3. Create a definition for NSCLC and SCLC following the Oncology Module recommendation from a combination of histology and topology. We will use these definitions (in addition to the previous ones) on a couple of our oncology data sources to investigate the potential overlap and the degree of misclassification associated with using treatments for defining the cohorts.
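To make the three treatment-based cohorts in option 2 a bit more concrete, here is a rough Python sketch of the assignment logic. Everything in it is an assumption for illustration: the drug names are placeholders rather than real concept sets, and sending patients who received both an NSCLC specific and an SCLC specific drug to the "no clue" cohort is one possible choice, not something we have settled on.

```python
from dataclasses import dataclass, field

# Placeholder drug lists (hypothetical names). Real lists would come from
# curated concept sets and would need regular maintenance.
NSCLC_ONLY = {"nsclc_specific_drug"}
SCLC_ONLY = {"sclc_specific_drug"}
SHARED = {"shared_drug"}


@dataclass
class Patient:
    has_lung_cancer_dx: bool
    drugs: set = field(default_factory=set)  # drugs initiated after diagnosis


def classify(p: Patient) -> str:
    """Assign a patient to one of the three treatment-based cohorts."""
    if not p.has_lung_cancer_dx:
        return "not in any cohort"
    nsclc = bool(p.drugs & NSCLC_ONLY)
    sclc = bool(p.drugs & SCLC_ONLY)
    shared = bool(p.drugs & SHARED)
    if nsclc and not sclc:
        return "probably NSCLC treated"
    if sclc and not nsclc:
        return "probably SCLC treated"
    if nsclc or sclc or shared:
        # Both specific drug types, or only drugs used in both conditions.
        return "no clue lung cancer"
    # Untreated lung cancer patients fall out of all three cohorts.
    return "not in any cohort"


print(classify(Patient(True, {"nsclc_specific_drug"})))
print(classify(Patient(True, {"shared_drug"})))
print(classify(Patient(True)))
```

The last call shows the sensitivity problem discussed earlier: a lung cancer patient with no recorded treatment does not land in any of the three cohorts.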


@agolozar this is a tremendous lesson that we can all learn from: sometimes you have to consider that one of your options is to decide that the data are not appropriate for the question of interest and the best path forward might be to stop. It is also a nice demonstration of the important work that the Oncology WG is doing to provide better modeling of oncology disease states and treatment regimens, which can further enable exploration.

In an ideal world, we would have a formal objective diagnostic that we could apply in situations like this so that we would have evidence to support a ‘do nothing’ decision.

I think this example really nicely illustrates the importance of quantifying measurement error and then making decisions for your analysis based on the acceptability of that error in your particular question. What I learned from reviewing your work and the cited papers was that there seems to be general consensus for phenotyping ‘lung cancer’ (LC) generally, but not the subtypes of ‘small cell lung cancer’ (SCLC) and ‘non-small cell lung cancer’ (NSCLC). Treatment guidelines do not clearly separate these subtypes, but have been used as a rough proxy. It seems most diagnosis codes or diagnostic procedures aren’t helpful for differentiation.

I don’t know this space well, but if I frame this in measurement error terms alone, then I wonder if a phenotype algorithm for LC can be sufficient for use as a model for NSCLC. If NSCLC makes up 80% of all lung cancers (as some estimate), then that would mean that the LC algorithm would yield high sensitivity (as long as a lung cancer is clinically recognized) and have an expected positive predictive value of <80% (because 1 in 5 cases will actually be SCLC, plus whatever other false positives the LC algorithm generates). Specificity wouldn’t be so bad, given that the incidence of SCLC is small in relation to the general population of true negatives. In fact, I think if we were to use PheValuator (thanks again @jswerdel) for LC, then it would seem that we could correct the measurement error estimates using the 80/20 NSCLC/SCLC ratio and get a decent approximation.
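As a back-of-the-envelope version of that correction, here is a small Python sketch. It assumes the LC algorithm is equally sensitive for both subtypes, and the input values in the example call are made up for illustration; they are not PheValuator output.

```python
def nsclc_proxy_metrics(lc_sens: float, lc_spec: float, lc_prev: float,
                        nsclc_share: float = 0.8) -> dict:
    """Expected error of an LC algorithm when reused as an NSCLC phenotype.

    lc_sens, lc_spec: sensitivity and specificity of the LC algorithm
    lc_prev: prevalence of lung cancer in the source population
    nsclc_share: fraction of true lung cancers that are NSCLC (80/20 split)
    All quantities are expressed as proportions of the whole population.
    """
    tp_lc = lc_prev * lc_sens               # true lung cancers flagged
    fp_lc = (1 - lc_prev) * (1 - lc_spec)   # non lung cancers flagged
    tp = tp_lc * nsclc_share                # flagged and truly NSCLC
    fp = fp_lc + tp_lc * (1 - nsclc_share)  # flagged but not NSCLC (incl. SCLC)
    non_nsclc = 1 - lc_prev * nsclc_share   # everyone without NSCLC
    return {
        "sensitivity": lc_sens,             # unchanged under the equal-sensitivity assumption
        "specificity": 1 - fp / non_nsclc,
        "ppv": tp / (tp + fp),
    }


# Hypothetical inputs, for illustration only.
print(nsclc_proxy_metrics(lc_sens=0.9, lc_spec=0.999, lc_prev=0.01))
```

Under these assumptions the NSCLC PPV is simply the LC PPV multiplied by the NSCLC share, which is why it can never exceed 80%.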

I suspect these error estimates won’t look all that bad in relation to many other phenotypes we’ve seen throughout Phenotype Phebruary. And depending on the use case, that error may be acceptable. For example, if NSCLC is an outcome, this error would likely be in range of other events we’ve seen. If NSCLC is being used as an indication for treatment comparative effectiveness studies, it is likely fine. As you highlight, if you desire to make a treatment pathway, then it could be a little noisy, but I wonder how bad it would be with only 20% contamination.

As you nicely highlight, we could employ the tricks that others have used to exclude SCLC treatments in an attempt to improve PPV and specificity, though we know SCLC is not always treated, so the sensitivity issues of SCLC treatments will result in persistent specificity issues in NSCLC = LC - SCLC. And for treatments of SCLC that can also be considered for NSCLC, we need to be mindful that exclusion there could also sacrifice sensitivity in NSCLC. CohortDiagnostics could be used to compare LC and NSCLC to see how much impact those exclusions really make.
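To get a feel for that trade-off, here is a toy Python calculation. It assumes the LC cohort has no false positives other than SCLC, and the exclusion fractions passed in are invented inputs, not estimates from any data source.

```python
def exclusion_effect(nsclc_share: float, sclc_excluded: float,
                     nsclc_excluded: float) -> dict:
    """Effect of dropping patients with SCLC-type treatments from an LC cohort.

    nsclc_share: fraction of the LC cohort that is truly NSCLC
    sclc_excluded: fraction of SCLC patients removed by the exclusion
    nsclc_excluded: fraction of NSCLC patients removed by the same exclusion
                    (because the treatment is used in both conditions)
    """
    nsclc_left = nsclc_share * (1 - nsclc_excluded)
    sclc_left = (1 - nsclc_share) * (1 - sclc_excluded)
    return {
        "ppv": nsclc_left / (nsclc_left + sclc_left),
        "nsclc_retained": 1 - nsclc_excluded,  # sensitivity relative to the LC cohort
    }


# No exclusion at all versus a partial exclusion (hypothetical fractions).
print(exclusion_effect(0.8, sclc_excluded=0.0, nsclc_excluded=0.0))
print(exclusion_effect(0.8, sclc_excluded=0.5, nsclc_excluded=0.1))
```

The PPV gain is capped by the share of SCLC patients who are actually treated, while every shared treatment on the exclusion list costs NSCLC patients directly.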

It’s all use case dependent, but I suspect that sometimes a ‘noisy’ answer using an LC cohort as a proxy for NSCLC may be better than no answer at all, particularly given that most datasets won’t have the elements necessary to get less noisy. But if the answer you need demands precision, then knowing that the error is too large is good evidence to use to justify the ‘no answer’.


My takeaway from this NSCLC phenotype and the Triple negative breast cancer phenotype is a call to arms to obtain the missing data. And try to refrain from treatment-based surrogates or proxy cohorts.

With some data assets, obtaining the missing data may not be possible (likely claims data), but with EHR data sets, it is possible to enrich tumor site, histology, ER status, PR status, and HER2 status (along with other pathology findings). Either by using:

  • NLP abstraction/curation pipelines applied to pathology reports in clinical notes.
  • Tumor registry data that has these data points pre-abstracted.

Regarding NLP, a proposal is brewing in the NLP working group that advocates rescuing NLP outputs from being stranded in the NOTE_NLP table and promoting them to the standardized clinical tables (when a threshold of confidence is met).
