@jmethot, thank you for clearly outlining the need for episodes in oncology analytics. They are key for answering the most common cancer research questions. This has been established.
As for episode derivation, it is a rapidly evolving area. On one hand, there are many research registries in oncology that manually abstract this information. One of the most prominent and reliable is the US Tumor Registry. We will definitely be leveraging these sources. On the other hand, there are evolving deterministic and probabilistic algorithms that derive episodes from low-level events. None of these methods is perfect, and they do not conform to the same rules and conventions. However, they have been used outside of OHDSI and will continue to develop.
Therefore, we introduced the foundational structure for persisting episodes and an initial set of conventions for episode population/derivation. This platform enables testing different methods of episode derivation and using them in analysis.
@MPhilofsky, when you say ‘pregnancy is an episode’, can you help me understand how you would differentiate a ‘pregnancy episode’ from a ‘pregnancy cohort’ entry? I ask because we are actively working on phenotypes and cohort definitions to represent the span of time that a person belongs to a health state, and pregnancy is one of the specific use cases we have been using for cohorts, where we’d like for the cohort start date to be at conception and the cohort end to be when the pregnancy outcome is observed. There can be cohorts for specific pregnancy outcomes (e.g. livebirth, stillbirth, abortion) and also a composite cohort that combines all pregnancy outcomes to represent the collective spans of time that a woman is in the ‘pregnant state’.
@Christian_Reich - let’s say we have a standard repository of peer-reviewed algorithms (version controlled and enumerated) that can run on the core CDM tables (visit_occurrence, condition_occurrence, drug_exposure, etc.) and create an output that is an episode, defined as a continuous span of time during which the person had the episode,
and we export these algorithms with every installation of OHDSI software (for example, Atlas would ship with them built in).
Would that address these concerns?
Not an ETL job since highly variable and dependent on the study
Now, it is not study-specific. It is not highly variable; in fact, it is standardized.
Finite predefined list
This repository of algorithms (let’s call it the OHDSI Phenotype Library) would be a finite list. It would be credible and trusted.
It would be consistent across the OHDSI network.
It would be built from OMOP-converted data (see the sketch below).
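To make the proposal concrete, here is a minimal sketch of such a repository, assuming a simple in-process registry keyed by algorithm name and version. All names (`Episode`, `REGISTRY`, `chemo_episode`) are hypothetical illustrations, not an existing OHDSI API, and the derivation logic is deliberately trivial:

```python
# Hypothetical sketch: a version-controlled, enumerated registry of
# episode-derivation algorithms that run only on core CDM tables.
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Episode:
    person_id: int
    start_date: date   # continuous span of time the person had the episode
    end_date: date

AlgorithmKey = Tuple[str, str]  # (name, version) pins the exact logic used
REGISTRY: Dict[AlgorithmKey, Callable[[List[dict]], List[Episode]]] = {}

def register(name: str, version: str):
    """Enroll an algorithm in the library under a fixed name and version."""
    def wrap(fn):
        REGISTRY[(name, version)] = fn
        return fn
    return wrap

@register("chemo_episode", "1.0.0")
def chemo_episode_v1(drug_exposures: List[dict]) -> List[Episode]:
    # Input is rows from DRUG_EXPOSURE only; no source tables are consulted,
    # which is the point of the proposal.
    if not drug_exposures:
        return []
    starts = sorted(r["drug_exposure_start_date"] for r in drug_exposures)
    return [Episode(drug_exposures[0]["person_id"], starts[0], starts[-1])]

# Every site runs the identical, pinned algorithm:
rows = [{"person_id": 1, "drug_exposure_start_date": date(2021, 1, 4)},
        {"person_id": 1, "drug_exposure_start_date": date(2021, 2, 1)}]
print(REGISTRY[("chemo_episode", "1.0.0")](rows))
```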
You have a stake in this, don’t you? Well, let’s dissect.
Stop right here. Just like ETL, I claim you cannot build Episodes on the OMOP CDM standard tables alone. You need the source. Reason: depending on what the source captured, the definitions would differ. For example, take the episode “Progression”. You can get that from an abstracted record in a tumor registry, or from the pathology lab report, imaging report, or clinical record in an EHR (which may have to be NLPed out or may sit in some kind of structured place).
Also, it cannot be peer-reviewed: peers cannot see the source. Episodes are built against a set of requirements.
Very nice idea, but that doesn’t make it part of the CDM. That’s a convenience thing.
Well, come on. The cohorts are made for the studies. In your head, as you come up with the content of the library, you abstract from a multitude of typical studies and standardize to that. But there is a potentially infinite number of cohorts you could create and standardize. Or is the current list everything you could ever need?
Also, your standard cohorts are always Conditions. You don’t make cohorts for other domains. Why? Because we cannot rely on the diagnostic codes for reliable condition cohorts. They are over- and underreported, their timing stinks, and their definition may not match what you need. So, you do all the gymnastics to get around those shortcomings (without making transparent which gymnastic move is addressing which issue, as I have previously complained).
Episodes have a different purpose: They are abstracted conditions as well, but they also describe the dynamic nature of the disease, and they organize complex treatments.
Finally, the episodes keep the connection to the events they are built from, or they are related to (EPISODE_EVENT table). Cohorts do not.
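For readers following along, here is a minimal sketch of that difference, using plain dicts. The field names follow the CDM v5.4 EPISODE and EPISODE_EVENT columns as I understand them, but treat the details as illustrative; all IDs are placeholders:

```python
from datetime import date

# A cohort entry carries only membership and a span of time...
cohort_row = {"subject_id": 42,
              "cohort_start_date": date(2020, 1, 5),
              "cohort_end_date": date(2020, 6, 30)}

# ...whereas an episode keeps record-level links, via EPISODE_EVENT, to the
# clinical events it was built from (all IDs below are placeholders).
episode_row = {"episode_id": 1,
               "person_id": 42,
               "episode_concept_id": 0,       # e.g. a disease-phase concept
               "episode_start_date": date(2020, 1, 5),
               "episode_end_date": date(2020, 6, 30)}

episode_event_rows = [
    # one row per contributing event: the event's record ID plus a concept
    # naming which CDM table/field that ID refers to
    {"episode_id": 1, "event_id": 9001, "episode_event_field_concept_id": 0},
    {"episode_id": 1, "event_id": 9002, "episode_event_field_concept_id": 0},
]
```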
@Christian_Reich and @Gowtham_Rao,
As I described in this thread, we introduced a list of condition modifiers (e.g. 734306 = ‘Initial diagnosis’) that supports preserving available source information related to episode definition. We have already added these modifiers to the ETL from the Tumor Registry. We are also recommending this for the ETL of any data that has insights about episodes. This data, along with the type_concept_id fields, will be used for post-ETL derivation of episodes. Therefore, you can and should build Episodes on the OMOP CDM standard tables. Moreover, before we develop any tools that use Episodes, these modifiers can also be surfaced by available tools.
@Christian_Reich and @Gowtham_Rao, we are developing algorithms that are based solely on regular OMOP tables (e.g. Regimen Finder is based on DRUG_EXPOSURE). Therefore, they can be peer-reviewed. I have proposed an idea very similar to @Gowtham_Rao’s: a repository of algorithms where each algorithm has a concept assigned. This will allow for preserving the provenance of episodes in Episode.Episode_Type_Concept_ID.
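As an illustration of that proposal, here is a minimal sketch of a post-ETL derivation that anchors an episode on the ‘Initial diagnosis’ modifier (concept 734306, cited above) and stamps the result with the concept assigned to the algorithm. The algorithm concept, field names, and logic are assumptions for illustration, not the actual Regimen Finder:

```python
from datetime import date
from typing import List, Optional

INITIAL_DIAGNOSIS = 734306   # modifier concept cited in this thread
ALGORITHM_CONCEPT = 0        # placeholder: concept assigned to this algorithm

def derive_disease_episode(conditions: List[dict],
                           drugs: List[dict]) -> Optional[dict]:
    """Derive one disease episode from standard CDM rows only (post-ETL)."""
    # Anchor the episode start on the 'Initial diagnosis' modifier, if present.
    starts = [c["condition_start_date"] for c in conditions
              if c.get("modifier_concept_id") == INITIAL_DIAGNOSIS]
    if not starts:
        return None  # without the modifier, this algorithm abstains
    start = min(starts)
    # Extend the span through the last recorded treatment exposure.
    end = max((d["drug_exposure_end_date"] for d in drugs), default=start)
    return {"person_id": conditions[0]["person_id"],
            "episode_start_date": start,
            "episode_end_date": max(start, end),
            # provenance: which algorithm (and hence which rules) produced this
            "episode_type_concept_id": ALGORITHM_CONCEPT}
```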
Looks like @rimma disagreed with you. Also sounds like a claim that is not substantiated. So let’s strike this off.
What are we talking about: changing data representation (i.e., converting to OMOP form) or deriving new events (i.e., using some intelligence to generate new data from other data using an algorithm)? I am only interested in the latter, i.e., the algorithm. I do not want to empower the ETLer to do this, as then you are giving the ETLer too much power to interpret source data and derive new data, and this is NOT good for reproducible research (e.g., ETL rules may be unknown or have errors). By using the cohort approach, we are making them all available.
Let’s strike this off; same argument as above, i.e., derivation is not from source data but from core OMOP CDM tables.
Incorrect, @Christian_Reich. I think a few years ago that was correct, but right now we (the OHDSI Phenotype Development and Evaluation workgroup) are thinking of a ‘target’, and cohort definitions are algorithms trying to identify the cohort that matches that target. Any deviation from the target is error. An individual study is not a component of Phenotype Development and Evaluation. Happy to discuss that; come join the workgroup.
Finally, the episodes keep the connection to the events they are built from, or they are related to (EPISODE_EVENT table). Cohorts do not.
All valid points, and use of the Episode table is valid. My position is not about whether we need to populate/use the Episode table. I am arguing against the ETLer making undocumented/unreproducible algorithmic choices to populate a table with derived content. If you want to just do an ETL of source to target, sure, go for it, but the T should be minimal and have record-level referential integrity to the source where possible.
If instead you have an algorithmically derived summary of multiple records in the source, especially if it is produced by running on source tables, then I think that’s not good!
These tables are to be calculated from the OMOP core tables/clinical data tables and NOT source tables.
So, now some of the arguments make sense…
But we will have the age-old problem of not being able to fully trust these derived tables. E.g., in OHDSI network studies we rarely seem to use condition_era or drug_era, but instead use condition_occurrence and drug_exposure. This is the reason for THEMIS.
Nice debate here. But it is getting long. Let me see:
Correct. If the source tells us what the episodes are, we are all set. But the debate is about derivation when you don’t have that, à la @Gowtham_Rao’s phenotypes. He claims all you need is the regular OMOP tables and you can do it, reliably, from the Conditions and Modifiers.
That would be wonderful. But apart from the fact that we are far away from having the logic for such algorithms for disease episodes, the question remains: could they work without the context of the source data? Is this an ETL job, or a universal phenotype-like job?
You seem to be claiming that the Type concept will provide sufficient context for each Modifier (stage, grade, mets, nodes), telling us how much the algorithm should believe it. But there is more trouble lurking:
What about unstructured imaging and path lab reports?
What about contradictions between EHR and registries, or contradictions between different source information?
What about incomplete information? For example, some ambulatory clinic will record chemotherapy, but it will not record surgery in an adjuvant or neoadjuvant setting, or autologous stem cell transplantation. Similarly, administration of oral chemotherapy is often organized differently from parenteral.
In other words, all this is so messy that we need to give the ETLer some serious power to make the right choices. That is the point of the Episodes. The analyst using OMOP tables alone would be lost.
That’s a good thing. In our workgroup, we are actually debating things.
You realize that all OMOP databases are ETLed, don’t you? The ETLer has to make a ton of decisions about how to interpret the source data in such a way that it fits the intended representation of the CDM and vocabulary. And no, those decisions are not peer-reviewable. Plus: if anything, I do not mistrust the ETLer the way you do. But if algorithms can make her life easier, I am all for it.
What is that? And how is that not used for studies? Let me quote @Patrick_Ryan:
True. @Patrick_Ryan made a list and the community voted on them. Where did it get its wisdom from? They are needed for disease setting and outcomes in the studies folks are running all the time.
But you are right, I shouldn’t debate the phenotypes here, except whether or not they are the same thing as the episodes.
Back in Phebruary they certainly were. Do you have outcomes that are not conditions now?
You’d be surprised to hear that from me, but actually, if we could create standardized episodes purely from structured data in the OMOP tables, I don’t think we’d need them. We’d just use your phenotypes. Episodes only have a life if they need to be populated pre- or peri-OMOP.
That’s what it boils down to. I hope you and @rimma will be right. Till then, my hope is we can arrive at some mixture: standard algorithms that use OMOP tables but are configured with information the ETLer has obtained from the source data or by asking folks at the institution.
Great end to this focused discussion, @Christian_Reich. The key insights I have learned reinforced some of my positions:
The Episode table is a derived table. It is derived, like condition_era and drug_era, from the core clinical CDM tables.
Although built during pre-processing/set-up of the CDM, the Episode table does not interact with source data in any form; i.e., it is populated by a phenotype-like algorithm.
The output of the phenotype-like algorithm is different from a cohort, i.e., it is more than subject_id, cohort_start_date, and cohort_end_date, and includes elements that a cohort algorithm would not support.
Truly, the use case I think it supports is to algorithmically separate care events that may be unrelated. E.g., if I am getting knee surgery but during the same days also get a dental workup, the Episode table relates the events of the knee surgery together, i.e., bundles them, but does not link to the unrelated events, i.e., the dental ones (see the sketch below).
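Here is a minimal sketch of that bundling, assuming era-style collapsing with a fixed gap tolerance (the 30 days below echoes the drug_era persistence window but is purely an assumption) and a precomputed relatedness key, which in practice is the hard part an algorithm would have to supply:

```python
from datetime import date, timedelta
from itertools import groupby
from typing import List, Tuple

GAP = timedelta(days=30)  # assumption, echoing drug_era-style persistence

def collapse(events: List[dict], key: str) -> List[Tuple[str, date, date]]:
    """Bundle related events into continuous spans; unrelated keys stay apart."""
    spans = []
    events = sorted(events, key=lambda e: (e[key], e["start"]))
    for k, grp in groupby(events, key=lambda e: e[key]):
        grp = list(grp)
        start, end = grp[0]["start"], grp[0]["end"]
        for ev in grp[1:]:
            if ev["start"] - end <= GAP:   # close enough: same bundle
                end = max(end, ev["end"])
            else:                          # gap too large: start a new span
                spans.append((k, start, end))
                start, end = ev["start"], ev["end"]
        spans.append((k, start, end))
    return spans

events = [
    {"care": "knee surgery", "start": date(2022, 3, 1), "end": date(2022, 3, 1)},
    {"care": "knee surgery", "start": date(2022, 3, 20), "end": date(2022, 3, 20)},
    {"care": "dental",       "start": date(2022, 3, 1), "end": date(2022, 3, 1)},
]
print(collapse(events, "care"))
# -> the knee events bundle into one span; the same-day dental visit stays separate
```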
It would be nice to derive it from core tables, but I don’t see it. It is an ETL job. It could be an algorithmic job performed by the ETLer while the context is still fresh, but not something that can happen completely independently, like phenotypes.
We are not 100% sure about that either. There is still the possibility that we will have to attach Cancer Modifiers to Episodes, rather than Conditions.
Correct.
Yes, your use case is correct. Even though only somebody as industrious as @Gowtham_Rao can have a dental workup the same day that he is under the knife for knee surgery.
Is @Christian_Reich saying there is information in the source data that is not preserved in the CDM that could have been used to create episodes if we just created them early enough? Including, say, a source registry telling us that this group of events is an episode.
If there is information about episodes, where would we put it if not in the Episode table (because that is derived)? In the type field, or perhaps FACT_RELATIONSHIP? I worry that those are not used consistently.
Speaking as an ETLer, I’ve been following this thread because with version 5.4, I either have to fill the Episode and Episode_Event tables or I do not. If I don’t, I’m happy as a clam (assuming clams are happy). But if I do, I have to figure out where this information is stored in my source. In my ETL persona, if it’s not in the source, it doesn’t exist. However, if I DO have it in my source, am I supposed to pretend it doesn’t exist?
My source system is a little EHR called Epic which does have Episode information, and I’m trying to figure out if their definition of episode is the same as OMOP’s.
“This table contains high-level information on the episodes recorded in the clinical system for your patients. When a provider sees a patient several times for an ongoing condition, such as prenatal care, these encounters can be linked to a single Episode of Care.”
It kind of sounds the same to me.
It appears that, in Epic, it started as an episode specifically for pregnancy, but it has been expanded to include others like transplants, radiation therapy, nephrology, home infusion, anticoagulation, social care, and more. How well these and others have been implemented in my Epic system, I don’t know yet. Hence my interest in knowing whether or not I should bother.
So, if a physician determines that a set of conditions and treatments comprises an episode and defines it as such in the EHR, why is that not trustworthy?
I agree it’s a nuisance that EHRs have differing degrees of reliability in their implementation of clinical data, but if we can’t trust the source, what are we doing here? Ultimately, it ALL comes down to the decisions that we ETLers have to make with regard to what data goes into OMOP and where. Believe me, we don’t want the power to derive information, but we do it every day anyway.
Christian is always going on about use-cases, and it’s obvious there’s a pretty large use-case for oncology. But is that the only one? Can all use-cases use algorithms to derive episode data? Is it truly an either/or situation between source-derived and algorithm-derived? I do note that the Episode table includes the Episode_type_concept_id, which allows for differentiation between source- and algorithm-derived episodes.
I’m just concerned that the focus on oncology is narrowing the use of episodes too much.
This debate popping back up is intriguing. It is a subject we discussed at length in the Oncology working group over a year ago, where we came to consensus to push forward with the assumption that episodes would be derived ‘post ETL’, or rather, defined by leveraging the ontologies and referencing concepts and records already in standard OMOP tables. @Christian_Reich It would seem that you now feel otherwise, and I’d be interested in what changed, but it is clear you are not alone in this stance.
There are several reasons why this approach was preferred but what hasn’t been mentioned here yet is that the idea of creating episodes during the ETL is fundamentally impractical if you are merging data from more than one source with overlapping patient populations. The most common example we’ve encountered is where sites have both EHR and tumor registry data - the same patient population, the same disease occurrence, but different and complementary data. The more comprehensive your representation of the patient journey is, the more accurate your episode derivation can be. If you are deriving episodes during the ETL, without the full context of information available, and then try to merge that data together, you have conflicting and overlapping episodes that would need to be rebuilt with each additional data source.
Here is an image I created to illustrate the point (from an email in July of last year)
For context: clinical events are what bookend episodes.
Outside of that, it boils down to a choice of the general approach: limiting bias and increasing interoperability by not requiring that the individual ETL developer make all of the judgement calls that would be required to infer these episodes. In other words, asking ETL development to only relay the data that exists in the source. As mentioned above, it is unclear how we can have any faith in episode analyses across multiple sites if there isn’t consistency in the manner in which they were created.
Not to say that this will be easy, in either approach, or even that the definition of these episodes will be consistent, say if one site defines progression in a different manner than another; but if we are able to keep track of and codify these derivation methods, we enable the ability to run studies with confidence that the episodes were derived using the same criteria.
With the stated hurdle of the ‘during ETL’ approach illustrated above, unless there is some approach to work around that, what makes the ‘post ETL’ derivation process impossible?
For the ‘post ETL’ approach I see the requirements as:
Persist all relevant data in standard OMOP tables. If there isn’t already a concept or home for a piece of information, we make one. This clearly has the extra benefit of providing more evidence outside of the derivation process. If there is an argument that “this source data can’t fit in the standard OMOP tables”, I would ask why not enable it to?
Develop sharable definitions of episode derivations (a sketch of what such a definition could look like follows below). Again, this will by no stretch of the imagination be trivial, involving both concept relationships and likely complicated logic, but wouldn’t leveraging the community to come up with these definitions be more feasible than asking each individual ETL developer to do so?
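A hedged sketch of what one of those sharable definitions could look like: a declarative document (a Python dict here, but it would serialize to JSON) that travels between sites and is executed by a common interpreter, so the derivation criteria, rather than site-specific code, carry the logic. Every field name and concept ID below is hypothetical:

```python
from datetime import timedelta

# Hypothetical sharable definition: the derivation criteria are data.
progression_v2 = {
    "definition_id": "disease_progression",
    "version": "2.1.0",
    "anchor": {"domain": "condition_occurrence",
               "concept_ids": [0]},            # placeholder concept set
    "extend_with": [{"domain": "drug_exposure",
                     "concept_ids": [0]}],     # events that stretch the span
    "max_gap_days": 60,                        # assumption; tuned per definition
}

def run_definition(defn, cdm):
    """Common interpreter every site runs unchanged (sketch only;
    extend_with handling is elided here)."""
    gap = timedelta(days=defn["max_gap_days"])
    anchors = sorted(r["start_date"] for r in cdm[defn["anchor"]["domain"]]
                     if r["concept_id"] in defn["anchor"]["concept_ids"])
    episodes, span = [], None
    for d in anchors:
        if span and d - span[1] <= gap:
            span = (span[0], d)                # extend the current episode
        else:
            if span:
                episodes.append(span)
            span = (d, d)                      # start a new episode
    if span:
        episodes.append(span)
    return episodes                            # [(start_date, end_date), ...]
```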
This portion of the book of OHDSI seems relevant here:
Don’t pretend it doesn’t exist; write in your OMOP specification “not populated at this time”. Almost all CDM tables are optional. Let the use case drive the implementation. If you don’t have a use case, don’t populate the table.
And if/when you analyze your source episode data, I, and probably other Epic users, would be interested to know your findings. I was told our pregnancy episode data wasn’t everything our researchers desired. The hardest observation/condition to accurately identify is when a pregnancy ends in a miscarriage. But this is the nature of the data. Certain things (OTC drugs, miscarriage, minor injuries/illnesses) aren’t actively or accurately recorded in the data at the time of the event.
This is a great debate. A continuation of an issue debated during ‘Phenotype Phebruary’:
I would articulate the issue as:
Are there limits to what kind of clinical facts can be represented by standard OMOP, phenotypes and standard analytics?
It is no coincidence that in Phebruary this issue was most hotly debated in phenotypes involving cancer. Cancer research demands the representation of clinical facts like ‘histology’, ‘staging’, ‘disease progression’, and ‘treatment lines’. The OHDSI Oncology Working Group was formed out of difficulties representing these important cancer clinical facts in OMOP.
Thus were introduced:
Structures: EPISODE, EPISODE_EVENT, MEASUREMENT.measurement_event_id, MEASUREMENT.meas_event_field_concept_id
Vocabularies: ICDO3, NAACCR, Cancer Modifier, Hemonc.org, CAP electronic Cancer Checklists (College of American Pathologists)
Standardized ETLs: NAACCR ETL
Treatment Regimen Detection Algorithms: OncoRegimenFinder, Tracer (AJOU University)
It looks like the OHDSI community has formed various ‘factions’ around how to answer the above-articulated issue.
Faction 1 (the ‘Methodists’):
All clinical facts can be represented by current standard OMOP, phenotypes, and standard analytics.
We don’t need the EPISODE table.
ETLers should not interpret source data and derive new data.
Rote ETLs and statistics are enough.
Members: Patrick Ryan, Gowtham Rao
Faction 2 (the ‘Derivers’):
Not all clinical facts can be represented by standard OMOP, phenotypes, and standard analytics.
We need the EPISODE table to represent cancer disease phases and treatment lines.
Episodes belong in the standardized derived elements.
We need to come up with modifiers (e.g. 734306 = ‘Initial diagnosis’) and conventions on how to populate cancer events in the traditional OMOP standardized clinical event tables.
These modifiers and conventions will enable the development of a post-ETL algorithm to derive cancer disease and treatment episodes from the standardized clinical event tables.
Episode population should not be opaque, nor should it depend on data source context.
ETLers using modifiers and adhering to conventions, plus a promise to develop a post-ETL algorithm, are enough.
Members: Rimma Belenkaya, Robert Miller.
Faction 3 (the ‘Contextualists’):
Not all clinical facts can be represented by standard OMOP, phenotypes, and standard analytics.
We need the EPISODE table to represent cancer disease phases and treatment lines.
Episodes belong in the standardized clinical data tables.
We need to come up with simple semantic targets based on oncology research standards to support the population of disease phases and treatment lines in the EPISODE table.
Episode population can only be done in the context of source data by an advanced informatics infrastructure that supports an ETLer.
Such advanced informatics infrastructure will include generating abstractions from unstructured imaging and pathology lab reports and reconciling EHR/claims data with multiple sources: e.g., tumor registry, oncology analytic platforms, and oncology EMRs.
Clear semantic targets and institutions with advanced informatics infrastructure are enough.
Members: Christian Reich, Asieh Golozar, Michael Gurley
I know these ‘factions’ are overly simplistic. Please take my personal assignments to the factions as an attempt to sharpen the contours of the debate. Nothing more.
One last thing I will say is that the oncology world is already furiously engaged in the process of interpreting source data and deriving new data. There is a cottage industry of commercial and open-source solutions helping institutions generate cancer disease phases and treatment lines. Every academic medical cancer center in the country is engaged in such efforts. So fighting against that tide really comes down to a decision for the OHDSI community: whether to be open or closed to incorporating these emerging oncology data assets. One thing the cottage industry has not provided is an open-source, clear structure and semantics to represent cancer disease phases and treatment lines. mCODE has made significant contributions in this area but is more focused on data collection and transport than on open data analysis. I think OHDSI has a great opportunity to be that open-source structure and semantics for oncology. But we need to make the target simple enough that folks can achieve populating disease and treatment episodes within our lifetimes.
I encourage everyone who participated in this discussion to wait for the Episode convention documentation from the Oncology Workgroup, as it resulted from multiple comprehensive discussions like this one, weighing pros and cons, with input from @mgurley, @Christian_Reich, @rtmill, @agolozar, and many others. I have to admit, we never reached 100% consensus. However, to move forward, we used majority rule, which mostly agreed with @Gowtham_Rao’s positions stated above.