
[2022 US Symposium] #66 - Episode and Episode Event Tables Documentation

@Christian_Reich and @Gowtham_Rao ,
As I described in this thread, we introduced a list of condition modifiers (e.g. 734306 = ‘Initial diagnosis’) that support preserving available source information related to episode definition. We have already added these modifiers to the ETL from the Tumor Registry. We are also recommending this for the ETL of any data that has insights about episodes. This data, along with the type_concept_id, will be used for post-ETL derivation of episodes. Therefore, you can and should build Episodes on the OMOP CDM standard tables. Moreover, before we develop any tools that use Episodes, these modifiers can also be surfaced by available tools.

@Christian_Reich and @Gowtham_Rao, we are developing algorithms that are based solely on regular OMOP tables (e.g. Regimen Finder is based on DRUG_EXPOSURE). Therefore, they can be peer-reviewed. I have proposed an idea very similar to @Gowtham_Rao’s: a repository of algorithms in which each algorithm is assigned a concept. This will allow the provenance of episodes to be preserved in Episode.Episode_Type_Concept_ID.
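
To make that concrete, here is a minimal sketch (not the workgroup’s actual algorithm) of what such a post-ETL derivation could look like: it reads only rows shaped like the standard CDM tables, uses the ‘Initial diagnosis’ modifier (734306) mentioned above to seed an episode, and stamps the deriving algorithm’s assigned concept into episode_type_concept_id for provenance. The record shapes, the assumption that the modifier arrives in condition_status_concept_id, and the ALGORITHM_TYPE_CONCEPT_ID value are all illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

INITIAL_DIAGNOSIS_MODIFIER = 734306   # 'Initial diagnosis', as cited above
ALGORITHM_TYPE_CONCEPT_ID = -1        # placeholder for the concept assigned to this algorithm

@dataclass
class ConditionRecord:
    """Simplified CONDITION_OCCURRENCE row; the modifier placement is an assumption."""
    person_id: int
    condition_concept_id: int
    condition_start_date: date
    condition_status_concept_id: Optional[int] = None

@dataclass
class Episode:
    """Simplified EPISODE row."""
    person_id: int
    episode_object_concept_id: int    # the disease the episode is about
    episode_start_date: date
    episode_type_concept_id: int      # provenance: which algorithm derived it

def derive_disease_episodes(conditions: List[ConditionRecord]) -> List[Episode]:
    """Seed one disease episode per 'Initial diagnosis' record found in the standard table."""
    return [
        Episode(
            person_id=c.person_id,
            episode_object_concept_id=c.condition_concept_id,
            episode_start_date=c.condition_start_date,
            episode_type_concept_id=ALGORITHM_TYPE_CONCEPT_ID,
        )
        for c in conditions
        if c.condition_status_concept_id == INITIAL_DIAGNOSIS_MODIFIER
    ]
```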

Looks like @rimma disagreed with you. It also sounds like a claim that is not substantiated. So let’s strike this off.

What are we talking about: changing data representation (i.e. converting to OMOP form) or deriving new events (i.e. using some intelligence to generate new data from other data using an algorithm)? I am only interested in the latter, i.e. the algorithm. I do not want to empower the ETL’er to do this, because then you are giving the ETL’er too much power to interpret source data and derive new data, and that is NOT good for reproducible research (e.g. ETL rules may be unknown or have errors). By using the cohort approach, we make them all available.

Let’s strike this off - same argument as above, i.e. the derivation is not from source data but from core OMOP CDM tables.

Incorrect, @Christian_Reich - I think a few years ago that was correct, but right now we (the OHDSI Phenotype Development and Evaluation workgroup) are thinking of a ‘target’, and cohort definitions are algorithms trying to identify the cohort that matches that target. Any deviation from the target is error. An individual study is not a component of Phenotype Development and Evaluation. Happy to discuss that - come join the workgroup :slight_smile:

I don’t think that’s correct either, @Christian_Reich.

Finally, the episodes keep the connection to the events they are built from, or they are related to (EPISODE_EVENT table). Cohorts do not.

All valid points, and use of the Episode tables is valid. My position is not about whether we need to populate/use the Episode table. I am arguing against the ETL’er making undocumented/unreproducible algorithmic choices to populate a table with derived content. If you want to just do an ETL of source to target - sure, go for it, but the T should be minimal and have record-level referential integrity to the source where possible.

If instead you have an algorithmically derived summary of multiple records in the source, especially if it is running on source tables - then I think that’s not good!

OK - re-reading all the posts from the top - these are derived tables and NOT clinical data tables.

and it has fields that are not in the cohort table, like the following
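
For reference, an abbreviated side-by-side of the column sets being contrasted, as listed in the v5.4 DDL (datetime columns omitted):

```python
# Abbreviated column sets from the CDM v5.4 DDL (datetime columns omitted).
COHORT_COLUMNS = [
    "cohort_definition_id", "subject_id", "cohort_start_date", "cohort_end_date",
]
EPISODE_COLUMNS = [
    "episode_id", "person_id", "episode_concept_id",
    "episode_start_date", "episode_end_date",
    "episode_parent_id", "episode_number",         # nesting and ordering, e.g. lines of therapy
    "episode_object_concept_id",                   # what the episode is about (disease, regimen)
    "episode_type_concept_id",                     # provenance of the record
    "episode_source_value", "episode_source_concept_id",
]
EPISODE_EVENT_COLUMNS = [
    "episode_id", "event_id", "episode_event_field_concept_id",  # link back to contributing records
]
```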

These tables are to be calculated from the OMOP core tables/clinical data tables and NOT source tables.

So - now some of the arguments make sense…

But we will have the age-old problem of not being able to fully trust these derived tables. E.g., in OHDSI network studies we rarely seem to use condition_era or drug_era, but use condition_occurrence and drug_exposure instead. This is the reason for THEMIS.

Nice debate here. But it is getting long. Let me see:

Correct. If the source tells us what the episodes are, we are all set. But the debate is about derivation when you don’t have that, à la @Gowtham_Rao’s phenotypes. He claims all you need is the regular OMOP tables and you can do it, reliably, from the Conditions and Modifiers.

That would be wonderful. But apart from the fact that we are far away from having the logic for such algorithms for disease episodes, the question remains: could they work without the context of the source data? Is this an ETL job, or a universal phenotype-like job?

You seem to be claiming that the Type concept will provide sufficient context for each Modifier (stage, grade, mets, nodes) telling us how much the algorithm should believe it. But there is more trouble lurking:

  • What about unstructured imaging and path lab reports?
  • What about contradictions between EHR and registries, or between different sources of information?
  • What about incomplete information? For example, some ambulatory clinic will record chemotherapy, but it will not record surgery in an adjuvant or neoadjuvant setting, or autologous stem cell transplantation. Similarly, administration of oral chemotherapy is often organized differently than parenteral.

In other words, all this is so messy that we need to give the ETLer some serious power to make the right choices. That is the point of the Episodes. The analyst using OMOP tables alone would be lost.

That’s a good thing. In our workgroup, we are actually debating things. :slight_smile:

:slight_smile: You realize that all OMOP databases are ETLed, don’t you? The ETLer has to make a ton of decisions about how to interpret the source data in such a way that it fits the intended representation of the CDM and vocabulary. And no, those decisions are not peer-reviewable. Plus: if anything, I do not mistrust the ETLer the way you do. But if algorithms can make her life easier, I am all for it.

What is that? And how is that not used for studies? Let me quote @Patrick_Ryan:

True. @Patrick_Ryan made a list and the community voted on them. Where did that wisdom come from? They are needed for the disease settings and outcomes in the studies folks are running all the time.

But you are right, I shouldn’t debate the phenotypes here, except whether or not they are the same thing as the episodes.

Back in Phebruary they certainly were. Do you have outcomes that are not conditions now?

You’d be surprised to hear that from me, but actually, if we could create standardized episodes purely from structured data in the OMOP tables, I don’t think we’d need them. We’d just use your phenotypes. Episodes only have a life if they need to be populated pre- or peri-OMOP.

That’s what it boils down to. I hope you and @rimma will be right. Till then, my hope is that we can arrive at some mixture: standard algorithms that use OMOP tables but are configured with information the ETLer has obtained from the source data or by asking folks in the institution.

Great end to this focused discussion, @Christian_Reich. The key insights I have learnt, which reinforced some of my positions:

  1. The Episode table is a derived table. It is derived, like condition_era and drug_era, from the core clinical CDM tables.
  2. Although built during pre-processing/set-up of the CDM, the episode table does not interact with source data in any form, i.e. it is a phenotype-like algorithm.
  3. The output of the phenotype-like algorithm is different from a cohort, i.e. it is more than subject_id, cohort_start_date, and cohort_end_date, and includes elements that a cohort algorithm would not support.

Truly, the use case I think it supports is to algorithmically separate care events that may be unrelated. E.g., if I am getting knee surgery but during the same days also get a dental workup, the episode table relates the events of the knee surgery together, i.e. a bundle, but does not link them to the unrelated events, i.e. the dental work.
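
A toy sketch of that bundling idea, assuming we already know (e.g. from vocabulary relationships) which concepts belong to the knee-surgery care pathway; the concept ids and record shapes below are made up for illustration:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set

@dataclass
class ClinicalEvent:
    event_id: int
    concept_id: int
    event_date: date

# Hypothetical concept ids, purely for illustration
KNEE_SURGERY_PATHWAY: Set[int] = {111, 222}   # 111 = knee arthroplasty, 222 = post-op physiotherapy
DENTAL_WORKUP = 333

def bundle(events: List[ClinicalEvent], pathway: Set[int]) -> List[ClinicalEvent]:
    """Keep only the events that belong to the pathway; unrelated events stay outside the episode."""
    return [e for e in events if e.concept_id in pathway]

same_day = [
    ClinicalEvent(event_id=1, concept_id=111, event_date=date(2022, 3, 1)),  # knee surgery
    ClinicalEvent(event_id=2, concept_id=333, event_date=date(2022, 3, 1)),  # dental workup, same day
]
knee_episode_events = bundle(same_day, KNEE_SURGERY_PATHWAY)
# -> only event_id 1; the dental record is not linked to the knee-surgery episode.
#    The kept events are what would be written to EPISODE_EVENT.
```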

It reminds me of an old discussion here, How to Capture pregnancy data? (EDC, gestation length, etc) - #5 by Gowtham_Rao, and the idea of an ‘Episode of care’. Health insurance companies have been trying to do this for many years.

Looks like my response was too long. :slight_smile: I am saying:

  1. It would be nice to derive it from core tables, but I don’t see it. It is an ETL job. It could be an algorithmic job that happens by the ETLer, while the context is still fresh, but not something that can happen completely independently like phenotypes.
  2. We are not 100% sure about that either. There is still the possibility that we have to attach Cancer Modifiers to Episodes, rather than Conditions.
  3. Correct.

Yes, your use case is correct. Even though only somebody as industrious as @Gowtham_Rao can have a dental workup the same day that he is under the knife for knee surgery.

Thank you for the summary and discussion, @Gowtham_Rao @Christian_Reich etc.

Is @Christian_Reich saying there is information in the source data, not preserved in the CDM, that could have been used to create episodes if we just created them early enough? Including, say, a source registry telling us that this group of events is an episode.

If there is information about episodes, where would we put it if not the episode table (because that is derived)? In the type field, perhaps fact_relationship? I worry that those are not used consistently.

Speaking as an ETLer, I’ve been following this thread because with version 5.4, I either have to fill the Episode and Episode_Event tables or I do not. If I don’t, I’m happy as a clam (assuming clams are happy). But if I do, I have to figure out where this information is stored in my source. In my ETL persona, if it’s not in the source, it doesn’t exist. However, if I DO have it in my source, am I supposed to pretend it doesn’t exist?

My source system is a little EHR called Epic which does have Episode information, and I’m trying to figure out if their definition of episode is the same as OMOP’s.

“This table contains high-level information on the episodes recorded in the clinical system for your patients. When a provider sees a patient several times for an ongoing condition, such as prenatal care, these encounters can be linked to a single Episode of Care.”

It kind of sounds the same to me.

It appears that, in Epic, it started as an episode specifically for pregnancy, but it has been expanded to include others like transplants, radiation therapy, nephrology, home infusion, anti-coag, social care, and more. How well these and others have been implemented in my Epic system, I don’t know yet. Hence my interest in knowing whether or not I should bother.

So, if a physician determines that a set of conditions and treatments comprise an episode and defines it as such in the EHR, why is that not trustworthy?

I agree it’s a nuisance that EHRs have differing degrees of reliability in their implementation of clinical data, but if we can’t trust the source, what are we doing here? Ultimately, it ALL comes down to the decisions that we ETLers have to make about what data goes into OMOP and where. Believe me, we don’t want the power to derive information, but we do it every day anyway.

Christian is always going on about use-cases, and it’s obvious there’s a pretty large use-case for oncology. But is that the only one? Can all use-cases use algorithms to derive episode data? Is it truly an either/or situation between source-derived and algorithm-derived? I do note that the Episode table includes Episode_type_concept_id, which allows for differentiation between source- and algorithm-derived episodes.

I’m just concerned that the focus on oncology is narrowing the use of episodes too much.


This debate popping back up is intriguing. It is a subject we discussed at length in the Oncology working group over a year ago, and we came to a consensus to push forward with the assumption that episodes would be derived ‘post ETL’, or rather, defined by leveraging the ontologies and referencing concepts and records already in the standard OMOP tables. @Christian_Reich, it would seem that you now feel otherwise, and I’d be interested in what changed, but it is clear you are not alone in this stance.

There are several reasons why this approach was preferred, but what hasn’t been mentioned here yet is that the idea of creating episodes during the ETL is fundamentally impractical if you are merging data from more than one source with overlapping patient populations. The most common example we’ve encountered is where sites have both EHR and tumor registry data - the same patient population, the same disease occurrence, but different and complementary data. The more comprehensive your representation of the patient journey is, the more accurate your episode derivation can be. If you derive episodes during the ETL, without the full context of information available, and then try to merge that data together, you have conflicting and overlapping episodes that would need to be rebuilt with each additional data source.

Here is an image I created to illustrate the point (from an email in July of last year)


For context: clinical events are what bookend episodes

Outside of that, it boils down to a choice of the general approach of limiting bias and increasing interoperability by not requiring that the individual ETL developer make all of the judgement calls that would be required to infer these episodes. In other words, asking ETL development to only relay the data that exists in the source. As mentioned above, it is unclear how we can have any faith in episode analyses across multiple sites if there isn’t consistency in the manner in which they were created.

Not to say that this will be easy in either approach, or even that the definition of these episodes will be consistent, say if one site defines progression in a different manner than another, but if we are able to keep track of and codify these derivation methods, we can run studies with confidence that the episodes were derived using the same criteria.

With the stated hurdle of the ‘during ETL’ approach illustrated above, unless there is some approach to work around that, what makes the ‘post ETL’ derivation process impossible?

For the ‘post ETL’ approach I see the requirements as:

  • Persist all relevant data in standard OMOP tables. If there isn’t already a concept or home for a piece of information, we make one. This clearly has the extra benefit of providing more evidence outside of the derivation process. If there is an argument that “this source data can’t fit in the standard OMOP tables”, I would ask why not enable it to?
  • Develop sharable definitions of episode derivations (a toy sketch of what such a definition could look like follows below). Again, this will by no stretch of the imagination be trivial, involving both concept relationships and likely complicated logic, but wouldn’t leveraging the community to come up with these definitions be more feasible than asking each individual ETL developer to do so?
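
As a thought experiment, such a shareable definition could be a declarative artifact that any site executes against its standard tables, rather than logic buried in ETL code. Everything below (names, concept sets, rules) is invented for illustration:

```python
# A declarative, shareable episode-derivation definition (all names and concept sets invented).
# The point is that the derivation logic lives in a versioned community artifact,
# not in each site's ETL code.
METASTATIC_DISEASE_EPISODE = {
    "definition_id": "oncology-wg/metastatic-disease/v0.1",
    "episode_object": "concept set: malignant neoplastic disease and descendants",
    "start_when_any_of": [
        {"table": "measurement", "concept_set": "Cancer Modifier: distant metastasis"},
        {"table": "condition_occurrence", "concept_set": "secondary malignant neoplasm"},
    ],
    "end_when": {"rule": "open-ended"},          # until remission/death logic is added
    "episode_type_concept_id": None,             # the concept registered for this definition
}
```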

This portion of the book of OHDSI seems relevant here:


Don’t pretend it doesn’t exist; write in your OMOP specification “not populated at this time”. Almost all CDM tables are optional. Let the use case drive the implementation. If you don’t have a use case, don’t populate the table.

And if/when you analyze your source episode data, I, and probably other Epic users, would be interested to know your findings. I was told our pregnancy episode data wasn’t everything our researchers desired. The hardest observation/condition to accurately identify is when a pregnancy ends with a miscarriage. But this is the nature of the data. Certain things (OTC drugs, miscarriage, minor injuries/illnesses) aren’t actively or accurately recorded in the data at the time of the event.

This is a great debate. A continuation of an issue debated during ‘Phenotype Phebruary’:

I would articulate the issue as:

Are there limits to what kind of clinical facts can be represented by standard OMOP, phenotypes and standard analytics?

It is no coincidence that in Phebruary this issue was most hotly debated in phenotypes involving cancer. Cancer research demands the representation of clinical facts like ‘histology’, ‘staging’, ‘disease progression’, and ‘treatment lines’. The OHDSI Oncology Working Group was formed out of difficulties representing these important cancer clinical facts in OMOP.

Thus was introduced:

  • Structures: EPISODE, EPISODE_EVENT, MEASUREMENT.measurement_event_id, MEASUREMENT.meas_event_field_concept_id
  • Vocabularies: ICDO3, NAACCR, Cancer Modifier, Hemonc.org, CAP electronic Cancer Checklists (College of American Pathologists)
  • Standardized ETLs: NAACCR ETL
  • Treatment Regimen Detection Algorithms: OncoRegimenFinder, Tracer (AJOU University)

It looks like the OHDSI community has formed various ‘factions’ around how to answer the above-articulated issue.

  • Faction 1 (the ‘Methodists’):

    • All clinical facts can be represented by current standard OMOP, phenotypes, and standard analytics.
    • We don’t need the EPISODE table.
    • ETL’ers should not interpret source data and derive new data.
    • Rote ETLs and statistics are enough.
    • Members: Patrick Ryan, Gowtham Rao
  • Faction 2 (the ‘Derivers’):

    • Not all clinical facts can be represented by standard OMOP, phenotypes, and standard analytics.
    • We need the EPISODE table to represent cancer disease phases and treatment lines.
    • Episodes belong in the standardized derived elements.
    • We need to come up with modifiers (e.g. 734306 = ‘Initial diagnosis’) and conventions on how to populate cancer events in the traditional OMOP standardized clinical event tables.
    • These modifiers and conventions will enable the development of a post-ETL algorithm to derive cancer disease and treatment episodes from the standardized clinical event tables.
    • Episode population should not be opaque or depend on data source context.
    • ETL’ers using modifiers and adhering to conventions, plus a promise to develop a post-ETL algorithm, are enough.
    • Members: Rimma Belenkaya, Robert Miller.
  • Faction 3 (the ‘Contextualists’):

    • Not all clinical facts can be represented by standard OMOP, phenotypes, and standard analytics.
    • We need the EPISODE table to represent cancer disease phases and treatment lines.
    • Episodes belong in the standardized clinical data tables.
    • We need to come up with simple semantic targets based on oncology research standards to support the population of disease phases and treatment lines in the EPISODE table.
    • Episode population can only be done in the context of source data by an advanced informatics infrastructure that supports an ETL’er.
    • Such advanced informatics infrastructure will include generating abstractions from unstructured imaging and pathology lab reports and reconciling EHR/claims data with multiple sources: e.g., tumor registry, oncology analytic platforms, and oncology EMRs.
    • Clear semantic targets and institutions with advanced informatics infrastructure are enough.
    • Members: Christian Reich, Asieh Golozar, Michael Gurley

I know these ‘factions’ are overly simplistic. Please take my personal assignments to the factions as an attempt to sharpen the contours of the debate. Nothing more.

One last thing I will say is that the oncology world is already furiously engaged in the process of interpreting source data and deriving new data. There is a cottage industry of commercial and open-source solutions helping institutions generate cancer disease phases and treatment lines. Every academic medical cancer center in the country is engaged in such efforts. So fighting against that tide is really a decision for the OHDSI community about whether to be open or closed to incorporating these emerging oncology data assets. One thing the cottage industry has not provided is an open-source, clear structure and semantics to represent cancer disease phases and treatment lines. mCODE has made significant contributions in this area but is more focused on data collection and transport, not open data analysis. I think OHDSI has a great opportunity to be that open-source structure and semantics for oncology. But we need to make the target simple enough that folks can achieve populating disease and treatment episodes within our lifetimes.

I hope everybody is enjoying themselves. I certainly am.


I encourage everyone who participated in this discussion to wait for the Episode convention documentation from the Oncology Workgroup, as it resulted from multiple comprehensive discussions like this one, weighing pros and cons, with input from @mgurley, @Christian_Reich, @rtmill, @agolozar, and many others. I have to admit, we never reached 100% consensus. However, to move forward, we used majority rule, which mostly agreed with @Gowtham_Rao’s positions stated above.

Ha, I never knew I was in a faction :slight_smile: Or that I’d be stuck in a faction with @Gowtham_Rao :slight_smile:

But just to be clear about my position, I am NOT opposed to the notion of an EPISODE and EPISODE_EVENT table; I just think that we need to clearly differentiate what we expect to go into these tables vs. other available derived tables, such as DRUG_ERA, CONDITION_ERA, and COHORT.

I remember an original (narrower) motivation for this was to capture oncology treatment regimens, where there was an expressed desire to capture the duration of time that a person was on some combination or sequence of different drugs (that duration would be in the EPISODE table), and there was also interest in maintaining the provenance of the verbatim elements that comprised that regimen (those relationships were to be maintained in the EPISODE_EVENT table). I thought that these regimens may actually be coming from the verbatim data in some source systems, but whether it was sourced or derived doesn’t seem too material to me. In either case, it’s clear that DRUG_ERA isn’t the place for an oncology regimen, because it currently only represents the duration of drugs at the single-ingredient level. The EPISODE component of the regimen could be stored in the COHORT table, but the EPISODE_EVENT component would then have to be stored in the FACT_RELATIONSHIP table, which it seems most people aren’t actively or consistently using. It’s likely that ATLAS couldn’t readily model all complex treatment regimens without additional features (like having exit criteria based on stop dates of multiple different drugs), but I don’t think the capabilities of our standardized tools should influence data modeling decisions.
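
For concreteness, the two pieces described above could look roughly like this; all ids are invented, and the concept ids are left as None rather than guessing real vocabulary values:

```python
# Sketch: one regimen episode (the duration) plus the provenance links to the
# verbatim DRUG_EXPOSURE records that comprise it. All ids invented.
episode = {
    "episode_id": 1,
    "person_id": 42,
    "episode_concept_id": None,         # e.g. a 'Treatment Regimen' episode concept
    "episode_object_concept_id": None,  # e.g. the HemOnc concept for the specific regimen
    "episode_start_date": "2022-01-03",
    "episode_end_date": "2022-04-11",
    "episode_type_concept_id": None,    # source-recorded vs. algorithm-derived
}
episode_events = [
    # event_id points at a drug_exposure_id; the field concept says which table/column it refers to.
    {"episode_id": 1, "event_id": 1001, "episode_event_field_concept_id": None},
    {"episode_id": 1, "event_id": 1002, "episode_event_field_concept_id": None},
]
```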

When we start talking about more general use cases, such as any chronic disease states, then it becomes trickier for me to understand why we would use EPISODE and not COHORT, or vice versa; and from a data modeling perspective, if we end up with two viable options, then we have no good solution, because we’ll end up with different people using different conventions. COHORT is currently the table where we want to store the results of cohort definitions, which are our implementation of phenotype algorithms, where a phenotype is defined as “a specification of an observable, potentially changing state of an organism”, and a phenotype algorithm is “algorithms that identify or characterize phenotypes, which may be generated by domain experts and knowledge engineers, or through diverse forms of machine learning to generate novel representations of the data” (per Hripcsak and Albers, JAMIA 2017). Our OHDSI definition of a cohort is a set of persons who satisfy one or more inclusion criteria for a duration of time. A cohort era (which is one record in the COHORT table) is one continuous period when a person satisfied the inclusion criteria. This cohort construct has formed the basis of how we develop our phenotypes, and it also serves as the input to our various analysis packages.

So, the big open question in my mind is: how do we differentiate which content we expect to be in EPISODE and not in the COHORT table or elsewhere?


@Patrick_Ryan

Conceptually, Episodes are much closer to Eras than to Phenotypes/Cohorts. They are derived continuous periods of disease or treatment defined by formalisms, like treatment regimen, disease progression, metastatic disease, etc. Those formalisms are foundational for building cancer Phenotypes/Cohorts.

Having said this, it is a chicken-and-egg question. To derive metastatic disease episodes, one will need a Phenotype definition of metastatic disease that adheres to all the rules of an OMOP Cohort definition.

I believe the primary differences between Episode and Cohort are:

  1. Episodes are designed to formalize, through the vocabulary, widely accepted cancer phenotypes and to represent them temporally, while Cohorts would hold multiple non-formalized, temporally fragmented, unorganized phenotypes. For example: first line of treatment, second line of treatment, first episode of progression, second episode of progression.

  2. Episodes are designed to streamline further phenotype building and analytics based on those formalisms.

(The reason we introduced Episodes rather than using Eras was the limitations of Eras, including: the inability to annotate Eras with those formalisms, the inability to group multiple ingredients as you stated, the inability to nest and chain Eras, the limited algorithm for the derivation of Eras, and more.)
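
A small sketch of the nesting and chaining that Eras cannot express: a parent disease episode with numbered treatment-line children (all ids invented):

```python
# Nesting via episode_parent_id, ordering via episode_number; all ids invented.
disease_episode = {"episode_id": 10, "person_id": 42,
                   "episode_parent_id": None, "episode_number": None}   # the disease itself
first_line  = {"episode_id": 11, "person_id": 42,
               "episode_parent_id": 10, "episode_number": 1}            # 1st line of treatment
second_line = {"episode_id": 12, "person_id": 42,
               "episode_parent_id": 10, "episode_number": 2}            # 2nd line of treatment
```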


I think of an era as something that is defined by a standard code, like an ingredient.

And a cohort has no formal name or code. Just an identifier and an implicit definition.

Will an episode have some standard code to identify its type? Or just an identifier, so that you figure out what kind of episode it is indirectly from the definition that generated it?

George

@hripcsak:

Yes, Episodes are pre-defined and have a concept_id from their own domain. The definition is implicit. There are only a few cancer episodes, and their number is not expected to grow much. The non-cancer episodes (pregnancy etc.) have yet to be defined. That’s what this whole long debate started with.

Also, Episodes maintain their link to the facts that made them, and the link to everything that belongs to them. At least that is the idea. We don’t have experience yet with whether this is feasible.

But the problem is the operationalization. @rimma and the “Derivers” believe they can be built from the structured entities in the core OMOP tables using a deterministic algorithm. I hope that is true, but my hunch as a “Contextualist” tells me this will require more input from the source (such as NLP, background knowledge about the availability of data, abstraction). But even if it were true, the definitions are going to be drastically more complicated than the typical phenotypes. This is somebody’s PhD job.

But we don’t know until we have actually tried. So far, this is a theoretical discussion. We need data, and we need data including the ground truth, so we can figure out the algorithms. So, as much as I enjoy the discussion, we need to focus on making progress.

Please. But it seems we now even need to clarify what the goal is that we’re making progress towards.

My assumption: an international standard representation for oncology data in OMOP for the entire community to leverage

But…

This seems like something different.

Pragmatically, why don’t we just slap a new column on the episode table to provide context as to how it was created, algorithmically or something else, and push forward? Or is there more to it?

I am still hung up on whether this is a ‘derived’ table or a clinical table. It looks like the CDM workgroup decided it is a derived table. But @Patrick_Ryan said above, and @Christian_Reich is arguing, that it may be mapped from the source.

Why is it categorized as a derived table?
