[2022 US Symposium] #66 - Episode and Episode Event Tables Documentation

Looks like my response was too long. :slight_smile: I am saying:

  1. It would be nice to derive it from core tables, but I don’t see it. It is an ETL job. Could be an algorithmic job that happens by the ETLer, while the context is still fresh, but not something that can happen completely independently like phenotypes.
  2. We are not 100% sure about that either. There is still the possibility that we have to attach Cancer Modifiers to Episodes, rather than to Conditions.
  3. Correct.

Yes, your use case is correct. Even though only somebody as industrious as @Gowtham_Rao can have a dental workup the same day that he is under the knife for knee surgery.

Thank you for the summary and discussion, @Gowtham_Rao @Christian_Reich etc.

Is @Christian_Reich saying there is information in the source data that is not preserved in the CDM that could have been used to create episodes if we just created them early enough? Including say a source registry telling us that this group of events is an episode.

If there is information about episodes, where would we put it if not the episode table (because that is derived)? In the type field, perhaps fact_relationship? I worry that those are not used consistently.

Speaking as an ETLer, I’ve been following this thread because with version 5.4, I either have to fill the Episode and Episode_Event tables or I do not. If I don’t, I’m happy as a clam (assuming clams are happy). But if I do, I have to figure out where this information is stored in my source. In my ETL persona, if it’s not in the source, it doesn’t exist. However, if I DO have it in my source, am I supposed to pretend it doesn’t exist?

My source system is a little EHR called Epic which does have Episode information, and I’m trying to figure out if their definition of episode is the same as OMOP’s.

“This table contains high-level information on the episodes recorded in the clinical system for your patients. When a provider sees a patient several times for an ongoing condition, such as prenatal care, these encounters can be linked to a single Episode of Care.”

It kind of sounds the same to me.

It appears that, in Epic, it started as an episode specifically for pregnancy, but it has been expanded to include others like transplants, radiation therapy, nephrology, home infusion, anti-coag, social care, and others. How well these and others have been implemented in my Epic system, I don’t know yet. Hence my interest in knowing whether or not I should bother.

So, if a physician determines that a set of conditions and treatments comprise an episode and defines it as such in the EHR, why is that not trust-worthy?

I agree it’s a nuisance that EHRs have differing degrees of reliability in their implementation of clinical data, but if we can’t trust the source, what are we doing here? Ultimately, it ALL comes down to the decisions that we ETLers have to make with regards to what and where the data goes into OMOP. Believe me, we don’t want the power to derive information, but we do it every day anyway.

Christian is always going on about use-cases, and it’s obvious there’s a pretty large use-case for oncology. But is that the only one? Can all use-cases use algorithms to derive episode data? Is it truly an either/or situation between source-derived or algorithm-derived? I do note that the Episode table includes the Episode_type_concept_id, which allows for differentiation between source and algorithm derived episodes.
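To make the Episode_type_concept_id point concrete, here is a minimal Python sketch of how an analyst might separate source-recorded episodes from algorithm-derived ones. The two type concept IDs are hypothetical placeholders in the local concept range (above 2,000,000,000), not real OMOP Type Concepts, and the `Episode` class only mirrors a few EPISODE columns.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical placeholders in the local concept range (> 2,000,000,000).
# A real ETL would use standard Type Concept IDs from the OMOP vocabulary.
TYPE_SOURCE_RECORDED = 2_000_000_001
TYPE_ALGORITHM_DERIVED = 2_000_000_002


@dataclass(frozen=True)
class Episode:
    """A small subset of the EPISODE table columns, for illustration only."""
    episode_id: int
    person_id: int
    episode_start_date: date
    episode_type_concept_id: int


def split_by_provenance(episodes):
    """Partition episodes into (source_recorded, algorithm_derived) lists."""
    source = [e for e in episodes
              if e.episode_type_concept_id == TYPE_SOURCE_RECORDED]
    derived = [e for e in episodes
               if e.episode_type_concept_id == TYPE_ALGORITHM_DERIVED]
    return source, derived


# Made-up rows: two episodes recorded in the source, one derived afterwards.
episodes = [
    Episode(1, 100, date(2021, 3, 1), TYPE_SOURCE_RECORDED),
    Episode(2, 100, date(2021, 3, 5), TYPE_ALGORITHM_DERIVED),
    Episode(3, 200, date(2022, 1, 10), TYPE_SOURCE_RECORDED),
]
source_recorded, algorithm_derived = split_by_provenance(episodes)
```

Under this reading, both kinds of episode can live in the same table, and a study can choose to include one kind, the other, or both.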

I’m just concerned that the focus on oncology is narrowing the use of episodes too much.

It is intriguing to see this debate pop back up. It is a subject we discussed at length in the Oncology working group over a year ago, and we came to a consensus to push forward with the assumption that episodes would be derived ‘post ETL’, or rather, defined by leveraging the ontologies and referencing concepts and records already in standard OMOP tables. @Christian_Reich It would seem that you now feel otherwise, and I’d be interested in what changed, but it is clear you are not alone in this stance.

There are several reasons why this approach was preferred but what hasn’t been mentioned here yet is that the idea of creating episodes during the ETL is fundamentally impractical if you are merging data from more than one source with overlapping patient populations. The most common example we’ve encountered is where sites have both EHR and tumor registry data - the same patient population, the same disease occurrence, but different and complementary data. The more comprehensive your representation of the patient journey is, the more accurate your episode derivation can be. If you are deriving episodes during the ETL, without the full context of information available, and then try to merge that data together, you have conflicting and overlapping episodes that would need to be rebuilt with each additional data source.
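A rough Python sketch of the merge problem, assuming each source's ETL has already produced its own episodes as `(source_name, person_id, start_date, end_date)` tuples. The rows and the `overlapping_pairs` helper are invented for illustration; the helper only flags episodes that would need to be reconciled, it does not rebuild them.

```python
from datetime import date
from itertools import combinations

# Made-up episodes: (source_name, person_id, start_date, end_date).
episodes = [
    ("ehr", 1, date(2020, 1, 10), date(2020, 4, 30)),
    ("tumor_registry", 1, date(2020, 1, 2), date(2020, 3, 15)),
    ("ehr", 1, date(2020, 9, 1), date(2020, 12, 1)),
    ("tumor_registry", 2, date(2021, 5, 1), date(2021, 6, 1)),
]


def overlapping_pairs(rows):
    """Return pairs of episodes for the same person, from different
    sources, whose date ranges intersect."""
    pairs = []
    for a, b in combinations(rows, 2):
        same_person = a[1] == b[1]
        different_source = a[0] != b[0]
        # Two closed date ranges intersect when each starts before the other ends.
        intersects = a[2] <= b[3] and b[2] <= a[3]
        if same_person and different_source and intersects:
            pairs.append((a, b))
    return pairs


conflicts = overlapping_pairs(episodes)
```

Here the EHR and registry episodes for person 1 in early 2020 describe the same period with different boundaries, which is the kind of conflict that would have to be resolved again with each additional source.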

Here is an image I created to illustrate the point (from an email in July of last year)


For context: clinical events are what bookend episodes

Outside of that, it boils down to a choice of the general approach of limiting bias and increasing interoperability by not requiring that the individual ETL developer make all of the judgement calls that would be required to infer these episodes. In other words, asking ETL development to only relay the data that exists in the source. As mentioned above, it is unclear how we can have any faith in episode analyses between multiple sites if there isn’t consistency in the manner in which they were created.

Not to say that this will be easy, in either approach, or even that the definition of these episodes will be consistent, say if one site defines progression in a different manner than another, but if we are able to keep track and codify these derivation methods, we enable the ability to run studies with confidence that the episodes were derived using the same criteria.

With the stated hurdle of the ‘during ETL’ approach illustrated above, unless there is some approach to work around that, what makes the ‘post ETL’ derivation process impossible?

For the ‘post ETL’ approach I see the requirements as:

  • Persist all relevant data in standard OMOP tables. If there isn’t already a concept or home for a piece of information, we make one. This clearly has the extra benefit of providing more evidence outside of the derivation process. If there is an argument that “this source data can’t fit in the standard OMOP tables”, I would ask why not enable it to?
  • Develop sharable definitions of episode derivations. Again, this in no stretch of the imagination will be trivial, involving both concept relationships and likely complicated logic, but wouldn’t leveraging the community to come up with these definitions be more feasible than asking each individual ETL developer to do so?
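As a sketch of what a sharable derivation definition could look like, the Python below treats the definition as plain data (a name, a version, a set of drug concepts, and a gap threshold) and applies it to one person's drug exposures. Everything here is an assumption for illustration: the `EpisodeDefinition` fields, the gap-based rule, and the concept IDs 111, 222 and 999 are invented and are not an Oncology WG convention.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EpisodeDefinition:
    """A shareable, versioned derivation rule (all fields hypothetical)."""
    name: str
    version: str
    drug_concept_ids: frozenset
    max_gap_days: int


def derive_episodes(definition, exposures):
    """Group one person's qualifying exposure dates into (start, end) episodes.

    `exposures` is a list of (drug_concept_id, exposure_date) tuples.
    A new episode starts whenever the gap to the previous qualifying
    exposure exceeds `max_gap_days`.
    """
    dates = sorted(d for cid, d in exposures
                   if cid in definition.drug_concept_ids)
    max_gap = timedelta(days=definition.max_gap_days)
    episodes = []
    for d in dates:
        if episodes and d - episodes[-1][1] <= max_gap:
            # Close enough to the previous exposure: extend the open episode.
            episodes[-1] = (episodes[-1][0], d)
        else:
            episodes.append((d, d))
    return episodes


# Concept IDs are invented for illustration.
definition = EpisodeDefinition("example_regimen", "0.1", frozenset({111, 222}), 30)
exposures = [
    (111, date(2021, 1, 1)),
    (222, date(2021, 1, 15)),
    (999, date(2021, 2, 1)),   # not part of the definition, ignored
    (111, date(2021, 2, 10)),
    (111, date(2021, 6, 1)),   # more than 30 days later: new episode
]
result = derive_episodes(definition, exposures)
```

Because the definition is data rather than site-specific ETL code, two sites running the same version against their standard tables would at least be applying the same criteria.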

This portion of the book of OHDSI seems relevant here:

Don’t pretend it doesn’t exist, write in your OMOP specification “not populated at this time”. Almost all CDM tables are optional. Let the use case drive the implementation. If you don’t have a use case, don’t populate the table.

And if/when you analyze your source episode data, I, and probably other Epic users, would be interested to know your findings. I was told our pregnancy episode data wasn’t everything our researchers desired. The hardest observation/condition to accurately identify is when a pregnancy ends with a miscarriage. But this is the nature of the data. Certain things (OTC drugs, miscarriage, minor injuries/illness) aren’t actively or accurately recorded in the data at time of event.

This is a great debate. A continuation of an issue debated during ‘Phenotype Phebruary’:

I would articulate the issue as:

Are there limits to what kind of clinical facts can be represented by standard OMOP, phenotypes and standard analytics?

It is no coincidence that in Phebruary this issue was most hotly debated in phenotypes involving cancer. Cancer research demands the representation of clinical facts like ‘histology’, ‘staging’, ‘disease progression’, and ‘treatment lines’. The OHDSI Oncology Working Group was formed out of difficulties representing these important cancer clinical facts in OMOP.

Thus was introduced:
Structures: EPISODE, EPISODE_EVENT, MEASUREMENT.measurement_event_id, MEASUREMENT.meas_event_field_concept_id.
Vocabularies: ICDO3, NAACCR, Cancer Modifier, Hemonc.org, CAP electronic Cancer Checklists (College of American Pathologists)
Standardized ETLs: NAACCR ETL
Treatment Regimen Detection Algorithms: OncoRegimenFinder, Tracer (AJOU University)

It looks like the OHDSI community has formed various ‘factions’ around how to answer the above-articulated issue.

  • Faction 1 (the ‘Methodists’):

    • All clinical facts can be represented by current standard OMOP, phenotypes, and standard analytics.
    • We don’t need the EPISODE table.
    • ETL’ers should not interpret source data and derive new data.
    • Rote ETLs and statistics are enough.
    • Members: Patrick Ryan, Gowtham Rao
  • Faction 2 (the ‘Derivers’):

    • Not all clinical facts can be represented by standard OMOP, phenotypes, and standard analytics.
    • We need the EPISODE table to represent cancer disease phases and treatment lines.
    • Episodes belong in the standardized derived elements.
    • We need to come up with modifiers (e.g. 734306 = ‘Initial diagnosis’) and conventions on how to populate cancer events in the traditional OMOP standardized clinical event tables.
    • These modifiers and conventions will enable the development of a post-ETL algorithm to derive cancer disease and treatment episodes from the standardized clinical event tables.
    • Episode population should not be opaque and depend on data source context.
    • ETL’ers using modifiers, adhering to conventions and a promise to develop a post-ETL algorithm are enough.
    • Members: Rimma Belenkaya, Robert Miller.
  • Faction 3 (the ‘Contextualists’):

    • Not all clinical facts can be represented by standard OMOP, phenotypes, and standard analytics.
    • We need the EPISODE table to represent cancer disease phases and treatment lines.
    • Episodes belong in the standardized clinical data tables.
    • We need to come up with simple semantic targets based on oncology research standards to support the population of disease phases and treatment lines in the EPISODE table.
    • Episode population can only be done in the context of source data by an advanced informatics infrastructure that supports an ETL’er.
    • Such advanced informatics infrastructure will include generating abstractions from unstructured imaging and pathology lab reports and reconciling EHR/claims data with multiple sources: e.g., tumor registry, oncology analytic platforms, and oncology EMRs.
    • Clear semantic targets and institutions with advanced informatics infrastructure are enough.
    • Members: Christian Reich, Asieh Golozar, Michael Gurley

I know these ‘factions’ are overly simplistic. Please take my personal assignments to the factions as an attempt to sharpen the contours of the debate. Nothing more.

One last thing that I will say is that the oncology world is already furiously engaged in the process of interpreting source data and deriving new data. There is a cottage industry of commercial and open-source solutions helping institutions to generate cancer disease phases and treatment lines. Every academic medical cancer center in the country is engaged in such efforts. So fighting against that tide is really a decision for the OHDSI community to be open or closed to incorporating these emerging oncology data assets. One thing the cottage industry has not provided is an open-source clear structure and semantics to represent cancer disease phases and treatment lines. mCODE has made significant contributions in this area but is more focused on data collection and transport, not open data analysis. I think OHDSI has a great opportunity to be that open-source structure and semantics for oncology. But we need to make the target simple enough that folks can achieve populating disease and treatment episodes within our lifetimes.

I hope everybody is enjoying themselves. I certainly am.

I encourage everyone who participated in this discussion to wait for the Episode convention documentation from the Oncology Workgroup as it resulted from multiple comprehensive discussions like this one with weighing pros, cons, and input from @mgurley, @Christian_Reich, @rtmill , @agolozar , and many others. I have to admit, we have never reached 100% consensus. However, to move forward, we used the majority rule which mostly agreed with @Gowtham_Rao’s positions stated above.

Ha, never knew I was in a faction:) Or that I’d be stuck in a faction with @Gowtham_Rao :slight_smile:

But just to be clear about my position, I am NOT opposed to the notion of an EPISODE and EPISODE_EVENT table, I just think that we need to clearly differentiate what we expect to go into these tables vs. other available derived tables, such as DRUG_ERA, CONDITION_ERA, and COHORT.

I remember an original (narrower) motivation for this was to capture oncology treatment regimens, where there was an expressed desire to capture the duration of time that a person was on some combination or sequence of different drugs (that duration would be in the EPISODE table), and there was also interest in maintaining provenance of the verbatim elements that comprised that regimen (those relationships were to be maintained in the EPISODE_EVENT table). I thought that these regimens may actually be coming from the verbatim data in some source systems, but whether it was sourced or derived doesn’t seem too material to me. In either case, it’s clear that DRUG_ERA isn’t the place for an oncology regimen, because currently it only represents duration of drugs at the single ingredient level. The EPISODE component of the regimen could be stored in the COHORT table, but the EPISODE_EVENT component would then have to be stored in the FACT_RELATIONSHIP table, which it seems most people aren’t actively or consistently using. It’s likely that ATLAS couldn’t readily model all complex treatment regimens without additional features (like having exit criteria based on stop dates of multiple different drugs), but I don’t think the capabilities of our standardized tools should influence data modeling decisions.
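The split between the two tables can be sketched in Python as follows: one EPISODE row carries the regimen's duration, and one EPISODE_EVENT row per drug exposure keeps the provenance link. The column names follow the CDM v5.4 EPISODE and EPISODE_EVENT tables (only a subset is shown); `build_regimen_episode`, the field concept constant, and all IDs are hypothetical placeholders.

```python
from datetime import date

# Hypothetical placeholder in the local concept range; the real value would be
# the concept that identifies the drug_exposure.drug_exposure_id field.
DRUG_EXPOSURE_ID_FIELD = 2_000_000_010


def build_regimen_episode(episode_id, person_id, episode_concept_id, drug_exposures):
    """Return (episode_row, episode_event_rows) for one regimen.

    `drug_exposures` is a non-empty list of dicts with drug_exposure_id,
    start_date and end_date keys.
    """
    # EPISODE holds the overall duration of the regimen.
    episode = {
        "episode_id": episode_id,
        "person_id": person_id,
        "episode_concept_id": episode_concept_id,
        "episode_start_date": min(d["start_date"] for d in drug_exposures),
        "episode_end_date": max(d["end_date"] for d in drug_exposures),
    }
    # EPISODE_EVENT links the episode back to each contributing record.
    events = [
        {
            "episode_id": episode_id,
            "event_id": d["drug_exposure_id"],
            "episode_event_field_concept_id": DRUG_EXPOSURE_ID_FIELD,
        }
        for d in drug_exposures
    ]
    return episode, events


# Made-up exposures and a made-up regimen concept ID.
exposures = [
    {"drug_exposure_id": 501, "start_date": date(2021, 1, 1), "end_date": date(2021, 1, 1)},
    {"drug_exposure_id": 502, "start_date": date(2021, 1, 22), "end_date": date(2021, 1, 23)},
]
episode, events = build_regimen_episode(1, 100, 2_000_000_020, exposures)
```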

When we start talking about more general use cases, such as any chronic disease states, then it becomes trickier for me to understand why we would use EPISODE and not COHORT, or vice versa, and from a data modeling perspective, if we end up with two viable options, then we have no good solution, because we’ll end up with different people using different conventions. COHORT is currently the table where we want to store the results of cohort definitions, which is our implementation of phenotype algorithms, where a phenotype is defined as: “a specification of an observable, potentially changing state of an organism”, and a phenotype algorithm is “algorithms that identify or characterize phenotypes, which may be generated by domain experts and knowledge engineers, or through diverse forms of machine learning to generate novel representations of the data” (per Hripcsak and Albers JAMIA 2017). Our OHDSI definition of cohort = a set of persons who satisfy one or more inclusion criteria for a duration of time. A cohort era (which is one record in the COHORT table) = one continuous period when a person satisfied the inclusion criteria. This cohort construct has formed the basis of how we develop our phenotypes, and also serves as the input to our various analysis packages.
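To illustrate the cohort era idea, here is a minimal Python sketch that collapses one person's qualifying periods into continuous eras. Treating periods that touch on consecutive days as one era is an assumption of this sketch, not a stated OHDSI rule, and `collapse_to_eras` is a hypothetical helper rather than anything ATLAS exposes.

```python
from datetime import date, timedelta


def collapse_to_eras(periods):
    """Merge overlapping or adjacent (start, end) periods for a single
    person into continuous eras, returned in chronological order."""
    eras = []
    for start, end in sorted(periods):
        if eras and start <= eras[-1][1] + timedelta(days=1):
            # Overlaps or directly follows the current era: extend it.
            eras[-1] = (eras[-1][0], max(eras[-1][1], end))
        else:
            eras.append((start, end))
    return eras


# Made-up periods during which the inclusion criteria were satisfied.
periods = [
    (date(2021, 1, 20), date(2021, 2, 15)),
    (date(2021, 1, 1), date(2021, 1, 31)),
    (date(2021, 2, 16), date(2021, 3, 1)),
    (date(2021, 5, 1), date(2021, 5, 10)),
]
eras = collapse_to_eras(periods)
```

Each resulting tuple would correspond to one record in the COHORT table for that person.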

So, the big open question in my mind is how we differentiate which content we expect to be in EPISODE and not in the COHORT table or elsewhere.

@Patrick_Ryan

Conceptually, Episodes are much closer to Eras than to Phenotypes/Cohorts. They are derived continuous periods of disease or treatments defined by formalisms, like treatment regimen, disease progression, metastatic disease, etc. Those formalisms are foundational for building cancer Phenotypes/Cohorts.

Having said this, it is a question of chicken and egg. To derive metastatic disease episodes, one will need a Phenotype definition of a metastatic disease that will adhere to all the rules of OMOP Cohort definition.

I believe the primary differences between Episode and Cohort are:

  1. Episodes are designed to formalize, through vocabulary, and represent temporally the widely accepted cancer phenotypes, while Cohorts will hold multiple phenotypes that are not formalized, not organized temporally, and fragmented. Examples of such formalized episodes: first line of treatment, second line of treatment, first episode of progression, second episode of progression.

  2. Episodes are designed to streamline further phenotype building and analytics based on those formalisms.

(The reason why we introduced Episodes and not used Eras was the limitations of Eras including: inability to annotate Eras with those formalisms, inability to group multiple ingredients as you stated, inability to nest and chain Eras, limited algorithm for the derivation of Eras, and more.)

I think of an era as something that is defined by a standard code, like an ingredient.

And a cohort has no formal name or code. Just an identifier and an implicit definition.

Will an episode have some standard code to identify its type? Or will it have just an identifier, so that you figure out what kind of episode it is indirectly from the definition that generated it?

George

@hripcsa:

Yes, Episodes are pre-defined and have a concept_id from their own domain. The definition is implicit. There are only a few cancer episodes, and their number is not expected to grow much. The non-cancer episodes (pregnancy etc.) have yet to be defined. That’s what this whole long debate started with.

Also, Episodes maintain their link to the facts that made them, and the link to everything that belongs to them. At least that is the idea. We don’t yet know from experience whether this is feasible.

But the problem is the operationalization. @rimma and the “Derivers” believe they can be built from the structured entities in the core OMOP tables using a deterministic algorithm. I hope that is true, but my hunch as a “Contextualist” tells me this will require more input from the source (such as NLP, background knowledge about availability of data, abstraction). But even if it were, the definitions are going to be drastically more complicated than the typical phenotypes. This is somebody’s PhD job.

But we don’t know until we actually have tried. So far, this is a theoretical discussion. We need data, and we need data including the ground truth, so we can figure out the algorithms. So, as much as I enjoy the discussion we need to focus on making progress.

Please. But it seems we now even need to clarify what that goal is we’re making progress towards.

My assumption: an international standard representation for oncology data in OMOP for the entire community to leverage

But…

This seems like something different.

Pragmatically, why don’t we just slap a new column on the episode table to provide context as to how it was created, algorithmically or something else, and push forward? Or is there more to it?

I am still hung up on whether this is a ‘derived’ table or a clinical table. It looks like the CDM workgroup decided it is a derived table. But @Patrick_Ryan said above, and @Christian_Reich is arguing, that it may be mapped from source.

Why is it categorized as a derived table?

@rimma and “Derivers” have introduced modifiers to provide explicit context for @Christian_Reich and “Contextualists” for episode derivation. There will be multiple approaches to episode derivation, including deterministic and probabilistic.

We already have it: episode_type_concept_id. If we add concept IDs to each algorithm and make it transparent and available for peer review and validation, what else do we need?

As much as we may disagree about conceptual definitions, we all seem to agree about this.

Because in the Oncology WG, we have developed a set of conventions that only supports derivation of episodes using regular OMOP tables. You may theoretically derive drug eras from the source tables. However, it will not change DRUG_ERA status as a derived table because of the accepted community conventions.

What data specifically? That illustration above was based on a real patient journey.

If certain examples from specific sources would help push some mutually acceptable solution I’m happy to try and dig some up. Though, I do know this guy who works for IQVIA…

Even if everything else around episodes is unsettled, if we can get to the point where we can continue to work on the representation of oncology data at the low level (in standard OMOP tables), that would allow us to push forward as a WG.

Up until now, the low-level representation has been considered a prerequisite of episode derivation, and it sounds like there is opposition to the idea of it being needed beforehand. But is there opposition to the need for a low-level representation of oncology data in general, @mgurley? If not, we can keep developing standards to accommodate the low-level representation, perhaps in parallel with episode development; otherwise we’re stuck until this is settled.

@rtmill, we should keep going!

This is definitely a great debate and I am having a lot of fun reading the thread. What is very clear from this conversation is that we are still in the theoretical mode, but I also see a clear enthusiasm on moving forward and coming up with a solution that supports the overarching goal of the oncology module: enabling observational cancer research.

I admit that there is still no clear answer and no clear path forward YET. We still do not know what approach is most suitable for each context we are trying to represent. The only way to get to the bottom of this is to start testing these out and rigorously evaluate the performance of each approach using real data. For this, we need to start working on use cases and bring the clinical researchers and data partners together.

So, let’s make progress.

@rtmill Definitely no opposition to the need to have a representation of oncology data at the low level. I just don’t want us to shelve documentation of EPISODE and EPISODE_EVENT and not encourage their population because we are waiting on the development of a post-ETL derivation algorithm, because it ultimately might prove too technically unwieldy. The Github/forum post is asking for documentation, not the promise of documentation.

I want to leave the population of EPISODE and EPISODE_EVENT as an open empirical question that can only be answered through implementation, not theorizing. To me, your patient journey example across multiple sources proves the exact opposite of the possibility of any process being able to sort things out once all local context disappears. But you might be right. But if you are wrong everyone interested in using episodes will have been stymied from making progress.

So, yes, I believe we should continue developing standards to accommodate low-level representation in parallel with episode development.
