My takeaway from this NSCLC phenotype and the triple-negative breast cancer phenotype is a call to arms: obtain the missing data, and try to refrain from treatment-based surrogates or proxy cohorts.
With some data assets, obtaining the missing data may not be possible (claims data, most likely), but with EHR data sets it is possible to enrich tumor site, histology, ER status, PR status, and HER2 status (along with other pathology findings), using either:
- NLP abstraction/curation pipelines applied to pathology reports in clinical notes.
- Tumor registry data that has these data points pre-abstracted.
Regarding NLP, a proposal is brewing in the NLP working group that advocates rescuing NLP outputs from being stranded in the NOTE_NLP table and promoting them to the standardized clinical tables once a confidence threshold is met.
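
To make the promotion idea concrete, here is a rough sketch of what a threshold rule could look like. This is only my illustration, not the working group's proposal: the `confidence` score is hypothetical (NOTE_NLP has no such column as far as I know), `term_exists` is simplified to a boolean, `person_id` is assumed to be already resolved through the NOTE table, and the concept IDs are placeholders.

```python
from dataclasses import dataclass


@dataclass
class NlpFinding:
    note_nlp_id: int
    person_id: int            # assumed to be resolved via the NOTE table
    note_nlp_concept_id: int  # placeholder concept IDs below, not real ones
    term_exists: bool         # simplified; False when the mention is negated
    confidence: float         # hypothetical score, not a standard NOTE_NLP column


def promote(findings, threshold=0.9):
    """Return candidate rows for findings that are affirmed and meet the threshold."""
    promoted = []
    for f in findings:
        if f.term_exists and f.confidence >= threshold:
            promoted.append({
                "person_id": f.person_id,
                "concept_id": f.note_nlp_concept_id,
                # hypothetical provenance link back to the NLP output
                "source_note_nlp_id": f.note_nlp_id,
            })
    return promoted


findings = [
    NlpFinding(1, 101, 1111, True, 0.97),   # affirmed, high confidence
    NlpFinding(2, 101, 2222, True, 0.55),   # affirmed, low confidence
    NlpFinding(3, 102, 1111, False, 0.99),  # negated mention
]
promoted = promote(findings)
```

Only the first finding would be promoted here; the low-confidence and negated mentions stay behind in NOTE_NLP. Keeping a link back to the originating NLP row seems worthwhile so promoted records remain traceable, but where that link would live in the CDM is an open question.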