
Phenotype Phebruary Day 19 - Triple negative breast cancer

First I want to thank everyone for the informative phenotype discussions so far. I have learned a lot from reading them.

Along with lots of help from my colleagues Darya Kosareva and @agolozar I have created a few cohort definitions for triple negative breast cancer to share. In the process I have learned some lessons about the importance of understanding the vocabulary and how clinical concepts are recorded in actual data. I'd like to summarize the cohorts first, then put the clinical definition and references in the next post. Keep in mind that I have no clinical training and rely on the significant clinical expertise of the many people around me.

Triple negative breast cancer (TNBC) is an aggressive form of breast cancer that is estrogen receptor (ER) negative, progesterone receptor (PR) negative, and HER2 negative. There are few treatment options and the prognosis is poor. I'm trying to identify incident cases of TNBC. I'll put a more complete clinical description in the next post.

Initially I expected standardization to allow for the creation of one cohort definition that captures persons meeting a particular phenotype definition across a network of databases without having direct access to the data or knowing how the source data was recorded. This expectation is based on two assumptions.

  1. We have a standard vocabulary where each clinical idea has a unique representation
  2. Source data is accurately mapped into that standard representation

There is a lot to unpack in those two assumptions (e.g., what do I mean by 'accurately mapped'?). Putting that aside for a moment, I started with a naively simple cohort definition that I think represents how some OMOP users would like to use a standardized clinical database. TNBC is very clearly a condition. When searching the vocabulary we do indeed find a single standard concept for Triple-negative breast cancer (45768522). I might expect that this concept would provide good sensitivity and specificity for the TNBC patients in any OMOP database. So the simplest possible cohort definition, one that I'd expect to be entirely database agnostic, is based on this single condition concept.

Triple negative breast cancer (single diagnosis)

This definition is problematic for a few reasons. One issue with 45768522 is that no ICD10 codes are mapped to it (I don't think there are any ICD10 codes specific to triple negative disease), so this concept isn't really used in OHDSI databases and will miss many, if not most, TNBC patients. PHOEBE shows only about 1700 records with concept ID 45768522 across the OHDSI network.

We instead need to use the more general clinical concept of "primary malignant breast cancer" and then narrow in on TNBC using measurements (ER/PR/HER2). This requires our data to have measurement data, which excludes some claims databases. We also have to decide whether to exclude breast cancer patients who are clearly not triple negative, or to include only patients who have negative test results for ER/PR/HER2. When we started looking into these tests we found many relevant measurement concepts. Some are specific to one of ER/PR/HER2, while others are panels that include all three.

We also have the option of using measurement values. OMOP has value concepts for ER negative, ER positive, PR negative, PR positive, HER2 negative, and HER2 positive, as well as a measurement value for triple negative. However, as Christian points out below, these are legacy concepts that will be deprecated soon, so we should also include the new OMOP Genomic concepts. But if we require the new measurement concepts to be present, we will lose patients in databases that are not using these measurement concepts yet. In short, it gets complicated quickly.
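To make the sensitive-versus-specific trade-off concrete, here is a minimal Python sketch of the two rules. The record format and marker labels are invented for illustration; they are not actual OMOP CDM fields or concept IDs.

```python
# Illustrative sketch: classifying breast cancer patients as TNBC under a
# "sensitive" rule (exclude only patients with a recorded positive marker)
# vs. a "specific" rule (require explicit negative results for all three).
# The dict-of-results record format is hypothetical, not OMOP CDM schema.

MARKERS = ("ER", "PR", "HER2")

def tnbc_sensitive(results):
    """Keep the patient unless any marker is recorded positive."""
    return not any(results.get(m) == "positive" for m in MARKERS)

def tnbc_specific(results):
    """Keep the patient only if all three markers are recorded negative."""
    return all(results.get(m) == "negative" for m in MARKERS)

# A breast cancer patient with no recorded marker results at all:
no_data = {}
# A patient with explicit triple-negative results:
triple_neg = {"ER": "negative", "PR": "negative", "HER2": "negative"}
# A patient who is HER2 positive (clearly not TNBC):
her2_pos = {"ER": "negative", "PR": "negative", "HER2": "positive"}
```

The patient with no marker data is kept by the sensitive rule but dropped by the specific one, which is exactly the sensitivity/specificity tension described above.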

We created a sensitive (exclusion-based) cohort and a specific (inclusion-based) cohort using a general condition concept plus measurements/measurement values. We also used HER2-targeting drugs to further exclude non-triple-negative patients.

triple negative breast cancer condition + measurement (sensitive definition)

triple negative breast cancer condition + measurement (specific definition)

We also have measurement values we can use, so I tried creating another specific cohort based on the occurrence of ER-, PR-, and HER2-negative measurement values alone. I tried another cohort using just the combined triple negative measurement value.

Triple negative breast cancer (single diagnosis plus ER-PR-HER2 negative measurement values)

Triple negative breast cancer (single diagnosis plus single ER-PR-HER2 negative measurement value)

I think these might work on some databases, but I also believe I have only scratched the surface of this phenotype definition. I didn't get into the issue of finding the correct index date described in this poster, but I think requiring an ER/PR/HER2 measurement within some time frame around the diagnosis helps to anchor on the correct incident index date. Some of the published algorithms we looked at required two diagnoses at least X days but no more than Y days apart to exclude rule-out diagnoses. This logic makes sense for claims and EHR databases but not for registry databases.
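The two-diagnosis rule used by those published algorithms could be sketched like this. The function and its thresholds are hypothetical; the actual X/Y values vary by algorithm, so they are left as parameters.

```python
from datetime import date

def has_confirming_pair(dx_dates, min_days, max_days):
    """Return True if any two diagnosis dates are at least min_days but no
    more than max_days apart, i.e. a repeat diagnosis that rules out a
    one-off "rule-out" code. Thresholds are study-specific parameters."""
    dates = sorted(dx_dates)
    for i in range(len(dates)):
        for j in range(i + 1, len(dates)):
            gap = (dates[j] - dates[i]).days
            if min_days <= gap <= max_days:
                return True
    return False

# A patient with two diagnoses one month apart qualifies (with placeholder
# thresholds of 7 and 90 days); a single diagnosis, or two diagnoses on
# consecutive days, does not.
```

As noted above, this kind of logic would be wrong for a registry source, where a single record is already a confirmed diagnosis, so it belongs in database-specific variants of the definition.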

The main takeaway for me is that phenotyping is not database agnostic. There are many possible approaches, and the best one depends on the type of source data and how it was mapped. I'm not sure it is possible to develop a single rule-based TNBC phenotype algorithm in Atlas that works well on any OMOP database. Phenotyping algorithms sit between ETL and standardized analytics, where ETL is entirely database specific and standardized analyses are database agnostic. I'm glad the community is building a library of phenotypes for us to use in standardized analytics!

1 Like

Clinical description


Triple negative breast cancer is a subtype of breast cancer, classified based on DNA microarray profiles, in which the tumor is characterized by a lack of expression of three primary breast tumor markers: estrogen, progesterone, and ERBB2 (HER2) receptors, when tested using immunohistochemistry and/or fluorescence in situ hybridization (FISH) on formalin-fixed, paraffin-embedded tissue. There appears to be a phenotypic crossover between basaloid tumors and TN tumors, as basaloid tumors are most often TN tumors, per Chacón et al. (2010).

TNBC accounts for about 10-20% of all breast cancers and close to 25% of high-grade tumors. It is found more often in younger premenopausal women and in women of African descent, and it has a strong association with obesity.


Histologic and immunohistologic features of triple negative breast cancer

  • The neoplasms typically have pushing margins, with central necrotic areas.
  • A prominent lymphocytic infiltrate can sometimes be seen at the periphery of the tumor.
  • The neoplastic cells are arranged in solid sheets or nests. Numerous mitotic figures are visualized.
  • The neoplastic cells are negative for estrogen receptor, progesterone receptor, and HER2 immunohistochemical staining.


Diagnostic workup

  • Imaging (MRI, CT, ultrasound)
  • Biopsy of the tumor
  • Immunohistochemical examination of tumor cells to determine ER, PR, and HER2 status
  • Genetic testing (BRCA mutation)


Treatment

  • Surgery (lumpectomy or mastectomy) with lymph node biopsy
  • Chemotherapy (adriamycin + cytoxan, paclitaxel with bevacizumab)
  • Possible immunotherapy (PD-L1)
  • Possible neo-adjuvant chemotherapy


Prognosis

TNBC tends to be the most aggressive of all breast cancers, showing rapid growth. It has a relatively poor prognosis in terms of overall survival and disease-free survival, as treatments are limited by a lack of effective therapeutic targets, leaving systemic chemotherapy as the treatment of choice to improve outcomes. There is a weak association between tumor size and lymph node involvement, as even small tumors tend to involve lymph nodes. TNBC carries a high risk of early recurrence, with a peak between 1 and 3 years after diagnosis. After progression to metastatic disease, TNBC has a higher prevalence of brain metastases, and there is rapid progression from the onset of metastatic disease to death. Most TNBC deaths occur in the 5 years after diagnosis.


References

Cleator, S., Heller, W., & Coombes, R. C. (2007). Triple-negative breast cancer: therapeutic options. The Lancet Oncology, 8(3), 235-244.

Kapp, A. V., Jeffrey, S. S., Langerød, A., Børresen-Dale, A. L., Han, W., Noh, D. Y., … & Tibshirani, R. (2006). Discovery and validation of breast cancer subtypes. BMC genomics, 7(1), 1-15.

Foulkes, W. D., Smith, I. E., & Reis-Filho, J. S. (2010). Triple-negative breast cancer. New England journal of medicine, 363(20), 1938-1948.

Chacón, R. D., & Costanzo, M. V. (2010). Triple-negative breast cancer. Breast cancer research, 12(2), 1-9.

Dai, X., Li, T., Bai, Z., et al. (2015). Breast cancer intrinsic subtype classification, clinical use and future trends. American Journal of Cancer Research, 5(10), 2929-2943.


Thank you @Adam_Black for leading this thread. Do you have a draft of the clinical description we can work on?

We could use MS teams to host the document and link it here for others to see.

We can then move the final version here

1 Like


Are you employing the proper Oncology Extension? The diagnosis of breast cancer, like all tumors, is histology+topography only. All other attributes are Cancer Modifiers and OMOP Genomic concepts:

All combined with the value_as_concept_id Negative.

Of course, it makes sense to add the legacy concepts like ER Negative, PR Negative, HER2 Negative (Triple Negative) and Triple-negative breast cancer, which we haven’t yet de-standardized. That will happen soon.

Ah, see! I forgot one intention for complex phenotype definitions in the other posts: a workaround for vocabulary issues.

1 Like

Thanks @Adam_Black , great stuff! I like how you’ve laid out the logic for how you arrived at your alternative definitions. To reinforce some of the points you’ve made, I’ll reframe the problem as best I understand it.

Triple negative breast cancer has a clear clinical description, which you've nicely laid out. It's breast cancer in the absence of three markers, each of which can be directly measured. Measurement of these markers may be considered because identification of the appropriate cancer subtype may impact the treatment plan and prognosis.

The issue you raise here is the distinction between what is biologically observable in the patient vs. what is actually observed in the patient vs. what is actually recorded in the patient's data. TNBC is directly observable, sometimes observed, but rarely recorded directly in the data. 'Breast cancer' is directly observable, often observed and recorded in the data, but the associated markers for estrogen receptor, progesterone receptor, and HER2 negativity may not be observed or recorded. Effectively, the marker measurements represent 'missing data'. For those without missing data, we can use the measurements to identify cases with negative markers and exclude cases with positive markers. For those with missing data, we need to decide if we want to include them - increasing sensitivity at the cost of specificity - or exclude them - decreasing sensitivity to maintain specificity. Given that TNBC is 10-20% of breast cancers, the likelihood is that most patients with 'missing data' on the markers are not TNBC. Further refinement may be possible by developing criteria based on treatments post-diagnosis, to whatever extent that can differentiate the breast cancer subtypes.
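A back-of-envelope calculation makes the dilution concrete: if marker results are missing roughly at random, patients with missing data are TNBC at about the base rate, so including them pulls down the cohort's positive predictive value. All numbers below are illustrative assumptions, not measured values.

```python
# Illustrative PPV calculation for a cohort that includes confirmed TNBC
# cases plus all missing-marker breast cancer patients, assuming the
# missing-marker group is TNBC at the population base rate.

def ppv_if_included(n_confirmed_tnbc, n_missing, tnbc_prevalence):
    """Expected PPV when all missing-marker patients are included."""
    expected_true_pos = n_confirmed_tnbc + n_missing * tnbc_prevalence
    total = n_confirmed_tnbc + n_missing
    return expected_true_pos / total

# Hypothetical example: 100 confirmed TNBC cases, 400 breast cancer
# patients with no marker results, and a 15% TNBC base rate.
ppv = ppv_if_included(100, 400, 0.15)  # (100 + 60) / 500 = 0.32
```

Under these made-up numbers, a cohort that was 100% TNBC before adding the missing-marker patients drops to roughly one-third TNBC afterward, which is the sensitivity-for-specificity trade described above.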

I would make a slight friendly amendment to your main takeaway: phenotype PERFORMANCE is not database agnostic. We clearly need to apply phenotype algorithms to different data to understand their performance, and we should be very careful in generalizing that performance to other data. But I remain optimistic that we can, as a community, establish a phenotype development and evaluation process that can be applied consistently across databases. I don't think we should make presumptions about which phenotype algorithm will work best in which database; I'd argue we really need to drive to make this a more objective and empirical conclusion, derived from the systematic application of algorithms against sources and comparative assessment of the results.

1 Like

Thank you for the feedback @Patrick_Ryan.

I completely agree.

Then each database will have a set of phenotypes and algorithms that we know work well on it (PheQD).

Shouldn’t the analytic layer then be able to automatically use the best performing phenotype algorithm for each database? Or run every analysis using every phenotype algorithm on every database?

Thank you for demonstrating another use of concept prevalence data. When we take 1700 in the context of all the records for the malignant breast tumor concepts across the OHDSI network, it does look small.

This is an excellent discussion that ends with what some of us think of as the question of a data source being "fit for use". This thinking is independent of OMOP (data model/vocabulary) or OHDSI tools.

As @Patrick_Ryan said, this is a missing data problem, leading to a trade-off among various misclassification errors.

Just a comment on our tools: the missing data problem is not unique to OHDSI/OMOP; it is a universal problem for all alternative systems, even when the data is in its native format.

And we need to somehow stratify the data sources based on the populations they represent and the estimated missingness of the attributes needed for the phenotype. My thinking is that the operating characteristics of a cohort definition should be similar across data sources that are similar (in terms of underlying population and missingness).
See this issue on Cohort Diagnostics https://github.com/OHDSI/CohortDiagnostics/issues/480

So, I think phenotype development and evaluation is an activity independent of any particular study. It starts from a clinical description and ends with one or more evaluated cohort definitions that attempt to implement that description. The evaluation should report the operating characteristics observed on the data sources against which the cohort definition was evaluated.

The analytic layer, or study, should specify the clinical description. It should then pick a cohort definition based on a review of the trade-offs among the various measurement/misclassification errors; e.g., some studies (such as disease surveillance) may want a more sensitive cohort definition, while others may need a more specific one.

I think we have an opportunity to make this decision more clear and more research is needed.

1 Like

For purposes of phenotype EVALUATION, I think it would be useful to aspire to have all algorithms run on all databases, so that we can empirically learn which algorithm is best for which database (which could be context specific). Then, for a given analysis, a researcher could use this evidence to choose which of the available phenotypes to run.


@agolozar had an informative anecdote from PIONEER about the need to be able to use different algorithms on different databases in a single study. The logic we use to identify a cohort/phenotype in one database might be different from the logic we use to identify the same cohort/phenotype in a different database. Both databases might be fit for use, but with different cohort logic. The idea I'm getting at is the ability to design studies at the phenotype level and use different cohort definitions on different databases. It seems like a study definition could be completely database agnostic, but I'm not sure about that.

These cohorts are a work in progress and I should be able to run diagnostics on this and other oncology cohorts and share the results. We should have some good cancer cohort definitions by the end of Phenotype Phebruary.

I can't yet agree with you here, @Adam_Black.

Scenario 1: Let's say we have two data sources, neither of which has a missing data problem (i.e., both have complete capture of all the elements requested by the clinical description). Then, after OMOP standardization, one cohort definition would be sufficient.

Scenario 2: We have two data sources, but one of them has a lot more missingness than the other, and we want to make sure both data sources ARE used in the study. The only solution then would be to somehow use 'surrogates' that approximate the missing variables in the data source with missing data. This means we have to do a lot of cohort definition acrobatics, and now we cannot use the same cohort definition.

I think scenario 2 is something we should avoid. Why would we want to force-fit a data source with a known problem? We may be doing more harm than good.


Great work @Adam_Black!

I think this post provides a really nice example of a phenotype whose definition could potentially benefit from the ability to integrate and/or leverage external knowledge, specifically regarding the underlying molecular mechanisms that have been identified as important in this disease. We have spoken about this type of approach briefly in the past (i.e., leveraging the OMOP2OBO mappings to connect to a knowledge graph like PheKnowLator), but I was curious to hear what you think/feel about this given your experience with constructing this phenotype?

Do you still think that type of approach could potentially be impactful? I have ideas (and hopes) on how a knowledge-based approach might potentially help but wanted to check in with you first before I potentially bias the conversation :blush:.

1 Like

Totally agree with you @Patrick_Ryan. A lot of this comes from making unspoken assumptions about what each data source has and what it doesn't. For example, based on "our experience", tumor diagnoses are accurately captured in a tumor registry, but diagnostic procedures and detailed treatment information are not represented with high accuracy. On the other hand, "we know" claims should have procedure codes but not accurate diagnosis information (you get a diagnosis code for breast cancer, not TNBC). And we use all these subjective facts when creating and evaluating the performance of phenotypes. I think the solution is an empirical approach to capturing these "known facts" and applying them to the definitions and phenotype assessment.

1 Like

Thanks for the discussion. We are working on running CohortDiagnostics for these and other cancer cohort definitions. In the absence of that data, I'd like to consider a limit-case thought experiment. Consider a hypothetical representative TNBC patient named Lauren. She notices a lump in her breast, gets a mammogram, an MRI, and a biopsy that confirms TNBC. She has a mastectomy and starts chemotherapy. Someone in the hospital is doing a survey study on breast cancer patients, and she fills out a survey describing her diagnosis, treatment, and other quality of life measures.

There are multiple observational data capture systems tracking Lauren's experience (insurance claims, hospital EHR, cancer registry, survey study). Suppose that these systems are siloed (not integrated) and are each mapped into the OMOP CDM format (4 separate CDMs). The EHR system includes her diagnosis, procedures, and measurements. If she is at JHU then it might also include structured data extracted from image data. :slight_smile: The claims billing system includes diagnosis, procedure, and drug codes. The cancer registry includes a code for TNBC with a highly accurate date of diagnosis, along with stage and grade information. The survey includes a custom coding system for TNBC and dates of diagnosis as well as symptom onset.

Each system represents Lauren's journey in a different way. Would we say that any of these CDMs is missing data on Lauren? In a sense they are all missing something. Missing data is a tricky idea in observational data, though. Harlan Krumholz has an interesting perspective here: "There is no missing data." Google probably knows more about me than I know about myself. When I first created an account, were they missing data? Are they still missing data after tracking every search query and every website I've visited for the last several years? They start with a prior about who I am and then keep updating that prior with each new observation.
This is one way of thinking about phenotyping that has been discussed in other threads. If my goal were to capture Lauren (a cohort of 1) in each CDM using a rule-based approach, I would write a different algorithm for each database based on the underlying process that generated the data in that CDM. If I had to use only one algorithm on all four CDMs, then I would combine all that logic into a single, more complicated algorithm.

An alternative approach would be to have a general algorithm that would look at all the available data for a person and assign them to a point in a 100 or 200 dimensional embedding space that would represent their phenotype from a clinical, biological, and/or molecular perspective. Then we could define phenotypes based on clusters in that space. Each new observation would lead to an update of the state vector. The state vectors might also be considered completely free of PHI which would be a plus.
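A minimal sketch of the cluster-assignment step in that idea, with toy 3-dimensional vectors and made-up centroid labels standing in for the 100-200 dimensional embedding (everything here is illustrative, not a real phenotype model):

```python
import math

# Toy version of "phenotypes as clusters in an embedding space": each
# phenotype is a labeled centroid, and a patient's state vector is
# assigned to the nearest one. Labels and coordinates are invented.

def nearest_phenotype(vector, centroids):
    """Assign a patient state vector to the closest phenotype centroid
    by Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda name: dist(vector, centroids[name]))

centroids = {
    "tnbc": (1.0, 0.0, 0.0),
    "luminal_a": (0.0, 1.0, 0.0),
}

# A patient vector near the hypothetical TNBC centroid:
assignment = nearest_phenotype((0.9, 0.1, 0.0), centroids)
```

A real system would update each patient's vector with every new observation, as described above; this sketch only shows the final assignment step.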

@callahantiff I’m so glad you’re joining this discussion! Yes, I think your work is on the cutting edge of phenotyping. Please bias away! How could we use OMOP2OBO to capture TNBC patients in data without a clear TNBC diagnosis?

1 Like

The idea of phenotype clusters in a multidimensional space is fascinating. You mention clinical, biological and/or molecular perspectives, which I interpret as groups of dimensions or domains. This idea might build on the existing high-dimensional propensity scores.

1 Like

@Adam_Black , I’ll take the Lauren challenge: here’s my single cohort specification that would work on all 4 CDMs:

  • entry event: earliest of
    – diagnosis of breast cancer + descendants (which includes TNBC)
    – biopsy procedure

  • inclusion criteria:
    – has TNBC diagnosis OR (measurement of ER negative AND measurement of progesterone negative AND measurement of HER2 negative) within 90d of index

This definition will identify your TNBC cases in the EHR if you have the measurements; it'll show you that you can't find cases in claims unless you get the specific TNBC diagnosis code, because you don't have measurements; and it'll find your TNBC case in both your registry and your survey based on the specific diagnosis coding (assuming proper mapping into SNOMED). CohortDiagnostics will tell you that claims isn't an appropriate source to find TNBC (which could very well be true based on the available data; it's not the algorithm's problem that the data source doesn't have sufficient granularity).
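As a sketch of how that specification might behave across the four hypothetical CDMs, here is a toy Python implementation. The event labels stand in for real concept sets (they are not OMOP concept IDs), and the 90-day window follows the spec above.

```python
from datetime import date, timedelta

# Hypothetical per-patient record: a list of (event_label, date) tuples.
# Labels like "tnbc_dx", "breast_cancer_dx", "biopsy", "er_neg", "pr_neg",
# and "her2_neg" stand in for actual concept sets.

def cohort_entry(events, window_days=90):
    """Return the cohort entry date if the patient qualifies, else None.
    Entry event: earliest breast cancer diagnosis (incl. TNBC) or biopsy.
    Inclusion: a TNBC diagnosis, or ER/PR/HER2 all negative within the
    window around index."""
    anchors = [d for c, d in events
               if c in ("breast_cancer_dx", "tnbc_dx", "biopsy")]
    if not anchors:
        return None
    index = min(anchors)
    if any(c == "tnbc_dx" for c, _ in events):
        return index
    window = timedelta(days=window_days)
    def in_window(label):
        return any(c == label and abs(d - index) <= window
                   for c, d in events)
    if all(in_window(m) for m in ("er_neg", "pr_neg", "her2_neg")):
        return index
    return None
```

Run against toy records, a registry-style record with a TNBC code qualifies, an EHR-style record qualifies through the three negative measurements, and a claims-style record with only a generic breast cancer code does not, matching the behavior described in the post.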

1 Like

YES! Really great ideas here, I can tell you are ready to come over to the translational dark side with me!

I have some experience using these approaches, including the use of one-way hashing for PHI. Since you laid all of this out for me, I want to re-work my response to be tailored to your proposed plan. I will be back tomorrow to post my ideas and will try to demonstrate where I think I can address some of your specific concerns.

Sorry for the delay @Adam_Black!

I think there are a couple of ways one could approach the challenge of identifying patients where a TNBC diagnosis is not clear. Some of these come from the incorporation of external knowledge and some from the incorporation of additional phenotyping algorithms.

Incorporation of Additional Phenotyping Algorithms
This is absolutely not me trying to promote today's Phenotype Phebruary post, but I think what we demonstrated in it could be of some value to you here. We implemented an Atlas definition and used APHRODITE. Doing this allowed us to create a Venn diagram highlighting the patients found by both methods and those found by only one. This could offer some clues into how you could expand your current Atlas-based definition. It does not guarantee you will find the "true" cases, but I think it could yield some interesting insight. @Juan_Banda – what do you think about this idea?

Leveraging External Knowledge
I think it could be useful to start from the biology, in terms of what we know about triple negative breast cancer. If you like this idea we can dive deeper, but here is my first pass. Consider this paper: Triple negative breast cancer: Deciphering the biology and heterogeneity - ScienceDirect, which discusses the heterogeneity among patients with breast cancer, specifically TNBC. The authors do an excellent job of highlighting these differences and propose subtypes with distinct molecular signatures. From my perspective, these distinctions offer opportunities you can leverage to find your patients. Let's look at some of their figures.

In the figure below, they highlight how different molecular techniques can be leveraged to identify different important biomarkers. Why you should care: each of these biomarkers is represented in the PheKnowLator KG and can be connected to features which we have mapped to OMOP standard terminologies.

[screenshot of the paper's figure]

A deeper dive was taken in the second figure shown below, where they applied the PAM50 algorithm to classify different types of breast cancer. From the paper directly:

A third test, consisting of an algorithm for the intrinsic molecular classification of BC, has been nominated PAM50 (Fig. 2). This was designed to improve IHC and microarray classification aggreance.24 This 50-gene signature can classify BCs as luminal A, luminal B, HER2 and basal-like. The PAM50 score was designed with the purpose of translating the different intrinsic subtypes into an associated prognostic value.24 An application of this score is the identification of patients who may benefit from the weekly addition of paclitaxel to conventional chemotherapy with anthracycline as an adjuvant treatment of operable BC with positive lymph nodes.25 In 2013, Prosigna (Seattle, Washington, U.S.) started commercializing a diagnostic kit which qualifies mRNA expression of the 50 genes used by the algorithm in order to calculate the risk of recurrence.26

The key takeaway here is translating "intrinsic subtypes into associated prognostic value": leveraging differences in the underlying biology to identify features that are distinct but that also have clinical utility when applied to the associated patient populations. Maybe you are missing important patients and including ones you shouldn't in your current definition. This leads to an interesting question – does the EHR reflect our current biological understanding of these patients? I bet not entirely…

[screenshot of the paper's figure]

Final picture, I promise. This one is really nice, and it is where I think you can directly benefit (I hope). It shows that there are likely many different subtypes of TNBC and that each subtype has known associated biological processes/molecular functions (which differ between subtypes) – in this analysis, derived from the Gene Ontology.

[screenshot of the paper's figure]

OK – so what does one do with all of this? Here's one relatively straightforward idea. Given that the molecular signatures for the different subtypes of TNBC have already been defined (think sets of genes identified as experimentally relevant to each subtype), you can use them as entry points into the PheKnowLator KG. For each gene, you could then follow different paths to relevant known symptoms/diagnoses and drugs (and other things), as well as obtain paths that are molecularly similar but might point to different symptoms/diagnoses and drugs. Then you can use our mappings from these KG entities to obtain the associated OMOP concepts. With these new concepts you could:

  • Consider updating your current definition and spend time examining how the resulting cohorts differ and where they are the same
  • Define new concept sets that capture the concepts unique to each TNBC subtype, build those cohorts, and then explore how they differ at the concept level but, more importantly, at the patient level. You could then examine the other concepts (not in your definition) that are important, including those unique to each subtype. Doing this would allow you to identify different intersections between the patients (think Venn diagram) and their differences, helping you capture new types of patients and/or concepts that you might have excluded before under the rule-based approach.
  • Use these concepts as seeds for APHRODITE, obtain probabilistic cohorts, and explore them with respect to your current cohort in a similar way as I mentioned above (or as we did in our depression cohort).

But there are MANY other ways you could explore this. One intuitive approach (which I am currently exploring in my work at Columbia) is to work from the concepts in your current definition, connect them through the mappings to the KG, and then look for similar and relevant "pathways" through the KG that could be used to expand or edit your current cohort definition.
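As a toy illustration of the gene-to-OMOP path-following idea, here is a tiny graph traversal sketch. Every identifier below is made up for illustration; a real knowledge graph like PheKnowLator is vastly larger and richer, and the real OMOP2OBO mappings are not a simple dict.

```python
from collections import deque

# Toy knowledge graph: directed edges from a TNBC-subtype gene through
# biological processes and phenotypes to OMOP-mapped concepts.
# All node identifiers here are invented for illustration only.
KG = {
    "gene:BRCA1": ["process:dna_repair"],
    "process:dna_repair": ["phenotype:basal_like"],
    "phenotype:basal_like": ["omop:concept_A"],  # hypothetical mapping
}

def reachable_omop_concepts(seed, graph):
    """Breadth-first search from a seed node, collecting any nodes that
    carry an OMOP mapping (here marked with an 'omop:' prefix)."""
    seen, queue, found = {seed}, deque([seed]), []
    while queue:
        node = queue.popleft()
        if node.startswith("omop:"):
            found.append(node)
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return found
```

The output of such a traversal (the OMOP concepts reachable from each subtype's gene set) is what would feed the concept-set updates suggested in the bullets above.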

Sorry for the long response. This is an area I am very excited about! :star_struck: Looking forward to seeing what you think about all of this, @Adam_Black and @Patrick_Ryan (given our recent discussions).

1 Like

Hi @callahantiff, I'm sorry for the very delayed response. I love your ideas and would like to work on them with you. First I need to find a data partner, though. @PriyaDesai - Would you be interested in partnering with us on developing this and other cancer phenotypes using STARR-OMOP?

Sounds great @Adam_Black! – I will need a few weeks before I will be able to get involved. Hope that’s OK!

When looking for a proper mapping of ER I was lucky to find this thread, but @Christian_Reich @mik it would be really good to have a table with at least the most common biomarkers mapped to the OMOP vocabulary.
Also, I couldn't find documentation on OMOP Genomic in the vocabulary GitHub wiki.