Fact relationships: Searching for an extensible approach

rtmill · February 24, 2023, 11:11pm

Disclaimer: This post advocates for increased use of the FACT_RELATIONSHIP table and may not be suitable for all viewers.

Problem: There is a growing list of disparate use cases that require persisting relationships between clinical events, for which the existing foreign keys are insufficient.

I view the overall goal simply as establishing a common convention. Not everyone will have this relational information explicitly in their data but for those that have use cases and wish to preserve it in OMOP, merely settling on a standard approach as to how to store it to enable interoperability is sufficient.

Some of the use cases:

details (modifiers) of conditions (staging, topography, subtype, findings, progression, etc.)
details (modifiers) of procedures (margins, complications, etc.)
linkage between NLP derivation and resulting clinical event records
providing context as to what procedure/test/device exposure led to an observation or measurement (most notably from derived data), e.g., an interpretation finding from a specific type of imaging; a genomic measurement from a specific type of panel, etc.
a condition resulting from another condition (e.g., metastasis)
a procedure targeting specific condition (e.g., therapy targeted to a specific lesion)
A drug or procedure as part of a regimen/treatment plan
Temporal, one:many relationships when the model constraints require 1:1, such as patient or care site locations over time, patient to provider relationships over time, etc.

This overall need has been addressed incrementally, one use case at a time, growing organically.

Current Approach: Two-part foreign key

To enable storing attributes (modifiers) of conditions and procedures, two fields were added to the MEASUREMENT table. This was initially implemented with the complexities of oncology data in mind.

This mechanism can be thought of as a two part foreign key { table identifier, row identifier } which allows the flexibility to attribute a MEASUREMENT, or modifier, to any row of any table.
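To make the two-part key concrete, here is a minimal sketch of how a reader of the data might resolve it. The column names `measurement_event_id` and `meas_event_field_concept_id` are assumed to match the CDM v5.4 naming; the field concept IDs (9001, 9002), the lookup dictionary, and the helper `resolve_event` are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical field concept IDs: stand-ins for the concepts that say which
# table and column the row identifier points at. Not real OMOP concept IDs.
FIELD_CONCEPT_TO_TABLE = {
    9001: ("condition_occurrence", "condition_occurrence_id"),
    9002: ("procedure_occurrence", "procedure_occurrence_id"),
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    measurement_concept_id INTEGER,
    measurement_event_id INTEGER,          -- row identifier
    meas_event_field_concept_id INTEGER    -- table identifier
);
INSERT INTO condition_occurrence VALUES (10, 1, 111), (11, 1, 111);
INSERT INTO measurement VALUES (500, 1, 222, 11, 9001);
""")

def resolve_event(conn, measurement_id):
    """Return (table_name, row) for the event a measurement modifies, or None."""
    row = conn.execute(
        "SELECT measurement_event_id, meas_event_field_concept_id "
        "FROM measurement WHERE measurement_id = ?", (measurement_id,)
    ).fetchone()
    if row is None or row[0] is None or row[1] is None:
        return None
    event_id, field_concept_id = row
    if field_concept_id not in FIELD_CONCEPT_TO_TABLE:
        return None
    table, key = FIELD_CONCEPT_TO_TABLE[field_concept_id]
    # Table and key names come from the fixed mapping above, never from the data.
    target = conn.execute(
        f"SELECT * FROM {table} WHERE {key} = ?", (event_id,)
    ).fetchone()
    return (table, target)

resolved = resolve_event(conn, 500)
```

Note that the join target is only known after reading the row, which is why a plain SQL foreign key constraint cannot enforce this kind of link.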

Additionally, the EPISODE_EVENT uses this same mechanism to link an EPISODE to any other clinical event.

And, as was recently ratified in the CDM WG, the same mechanism will be added to NOTE_NLP to provide provenance to NLP derived events in other tables.

This approach is useful but there are three notable deficiencies:

In terms of extensibility, to fully encapsulate the spectrum of relationships between clinical events from existing and emerging use cases, we would need to add these same two columns, redundantly, to nearly every table. I’m sure we could come up with an example for every table-to-table combination.
There is no context as to what type of relationship exists. It is simply a foreign key; “this is related to that”
There is the potential to run into recursive situations, which are difficult to account for functionally. E.g. “a measurement of a measurement of a measurement” – at what point do the functions stop checking for additional relationships?
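The recursion concern in the last point can be illustrated with a small sketch. It assumes the links have already been loaded into an in-memory index; the `MODIFIERS` dictionary, the `collect_modifiers` helper, and the `max_depth` cutoff are all hypothetical choices made for this example, not an existing convention.

```python
from collections import deque

# Hypothetical in-memory index: target event -> measurements attributed to it.
# Keys and values are (table_name, row_id) pairs.
MODIFIERS = {
    ("condition_occurrence", 11): [("measurement", 500)],
    ("measurement", 500): [("measurement", 501)],
    # 501 is modified by 502 and (bad data) points back at 500, forming a cycle.
    ("measurement", 501): [("measurement", 502), ("measurement", 500)],
}

def collect_modifiers(root, max_depth=3):
    """Breadth-first walk of modifier links, bounded by depth and a visited set."""
    seen = {root}
    found = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes that sit at the depth limit
        for child in MODIFIERS.get(node, []):
            if child in seen:
                continue  # guards against cycles
            seen.add(child)
            found.append((child, depth + 1))
            queue.append((child, depth + 1))
    return found

chain = collect_modifiers(("condition_occurrence", 11))
```

Any function that follows these links needs some stopping rule like this one, and the two-part key by itself gives no guidance on what the depth limit should be.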

After running into the issue of wanting to create a relationship from one condition to another condition (i.e. representing a metastasis with its own distinct attributes), I dug into the possibility of a more extensible and reliable approach, described below.

Approach 2: FACT_RELATIONSHIP table

Let’s look at the fact relationship table:

Currently, it seems to mostly be used for relationships from one person to another, but has flexibility for much more
Essentially, it has the ability to cover all of the above use cases without needing to append additional columns to the event tables.
It provides the missing “context” as to what the relationship is
From a downstream perspective, I would suppose this approach would be easier to integrate with existing OHDSI tooling vs. needing to check every table’s FKs
The presence of a concept to represent the type of relationship allows for mechanisms of validation and tooling

The “relationship” domain represents what can be stored in this “relationship_concept_id”. An example is “aunt of subject”.

If we were to leverage the FACT_RELATIONSHIP structure and add a set of robust concepts with functional relationships to contextual concepts, it could be an extensible approach applicable to the entire spectrum of use cases mentioned above.

For example, using a couple of the use cases above, adding both a “to” and “from” domain relationship for validation/DQ check purposes:

Domain ID	Concept Class ID	Concept Name	(Relation)Has concept_id_1 Domain	(Relation)Has concept_id_2 Domain
Relationship	Attribute	Attribute of Condition	MEASUREMENT	CONDITION_OCCURRENCE
Relationship	Attribute	Condition has attribute	CONDITION_OCCURRENCE	MEASUREMENT
Relationship	NLP derived	Observation from NLP	OBSERVATION	NOTE_NLP
Relationship	NLP derived	Note has observation	NOTE_NLP	OBSERVATION
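A minimal sketch of the validation/DQ check this table would enable is shown below. The `fact_relationship` columns are the existing CDM ones; the `relationship_rule` table stands in for the proposed "has concept_id_1/2 domain" concept relationships, and every concept ID used here is a hypothetical placeholder.

```python
import sqlite3

# Placeholder IDs: 1=MEASUREMENT, 2=CONDITION_OCCURRENCE, 3=OBSERVATION, 4=NOTE_NLP
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_relationship (
    domain_concept_id_1 INTEGER NOT NULL,
    fact_id_1 INTEGER NOT NULL,
    domain_concept_id_2 INTEGER NOT NULL,
    fact_id_2 INTEGER NOT NULL,
    relationship_concept_id INTEGER NOT NULL
);
-- Hypothetical stand-in for the proposed concept relationships.
CREATE TABLE relationship_rule (
    relationship_concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    domain_concept_id_1 INTEGER,
    domain_concept_id_2 INTEGER
);
""")
conn.executemany("INSERT INTO relationship_rule VALUES (?, ?, ?, ?)", [
    (101, "Attribute of Condition", 1, 2),
    (102, "Condition has attribute", 2, 1),
    (103, "Observation from NLP", 3, 4),
    (104, "Note has observation", 4, 3),
])
conn.executemany("INSERT INTO fact_relationship VALUES (?, ?, ?, ?, ?)", [
    (1, 500, 2, 11, 101),   # valid: measurement -> condition
    (2, 11, 1, 500, 102),   # valid inverse
    (3, 700, 2, 11, 101),   # invalid: relationship 101 requires a measurement on side 1
    (1, 500, 2, 11, 999),   # invalid: relationship concept not in the rule set
])

# Flag rows whose relationship is unknown or whose domains do not match the rule.
violations = conn.execute("""
SELECT fr.fact_id_1, fr.fact_id_2, fr.relationship_concept_id
FROM fact_relationship fr
LEFT JOIN relationship_rule r
  ON r.relationship_concept_id = fr.relationship_concept_id
WHERE r.relationship_concept_id IS NULL
   OR r.domain_concept_id_1 <> fr.domain_concept_id_1
   OR r.domain_concept_id_2 <> fr.domain_concept_id_2
ORDER BY fr.fact_id_1, fr.relationship_concept_id
""").fetchall()
```

Because the rule lives in reference data rather than in code, the same query would keep working as new relationship concepts are added.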

Temporality

To take things a step further, there are a handful of other use cases that have been discussed, requiring creation of new tables, that could be covered by the same structure if temporality is included. Notably, the representation of data where the model forces a 1:1 relationship and the source data is often one:many, such as:

Person to location; Care_site to location
Person/provider to care_site
Person to provider

If we wanted to kill a dozen or so birds with one stone, we could then add start and end date to FACT_RELATIONSHIP, as optional fields, which would additionally cover these use cases.

Something along the lines of:

(Revised) FACT_RELATIONSHIP

CDM Field	Datatype	Required
domain_concept_id_1	integer	Yes
fact_id_1	integer	Yes
domain_concept_id_2	integer	Yes
fact_id_2	integer	Yes
relationship_concept_id	integer	Yes
start_date	Date	No
end_date	Date	No
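As a rough sketch of how the optional dates might be used, the example below builds the revised table in SQLite and looks up where a person resided on a given date. The domain and relationship concept IDs, the ISO-8601 text storage for dates, and the `location_as_of` helper are assumptions made for illustration only.

```python
import sqlite3

PERSON, LOCATION = 1, 2      # hypothetical domain concept IDs
RESIDED_AT = 201             # hypothetical "Resided at location" concept ID

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_relationship (
    domain_concept_id_1 INTEGER NOT NULL,
    fact_id_1 INTEGER NOT NULL,
    domain_concept_id_2 INTEGER NOT NULL,
    fact_id_2 INTEGER NOT NULL,
    relationship_concept_id INTEGER NOT NULL,
    start_date TEXT,   -- optional, ISO 8601 (YYYY-MM-DD)
    end_date TEXT      -- optional, NULL means open-ended
)
""")
conn.executemany("INSERT INTO fact_relationship VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (PERSON, 1, LOCATION, 10, RESIDED_AT, "2015-01-01", "2019-06-30"),
    (PERSON, 1, LOCATION, 20, RESIDED_AT, "2019-07-01", None),
])

def location_as_of(conn, person_id, as_of):
    """Return location ids where the person resided on the given ISO date."""
    rows = conn.execute("""
        SELECT fact_id_2 FROM fact_relationship
        WHERE domain_concept_id_1 = ? AND fact_id_1 = ?
          AND domain_concept_id_2 = ? AND relationship_concept_id = ?
          AND start_date <= ?
          AND (end_date IS NULL OR end_date >= ?)
        ORDER BY fact_id_2
    """, (PERSON, person_id, LOCATION, RESIDED_AT, as_of, as_of)).fetchall()
    return [r[0] for r in rows]

before_move = location_as_of(conn, 1, "2018-03-15")
after_move = location_as_of(conn, 1, "2021-01-01")
```

Rows without a start date never match this query, so non-temporal relationships stored in the same table are left out of the as-of lookup.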

And perhaps for further validation/DQ purposes, the relationship concept could have a relationship to a context concept to indicate whether it is temporal in nature or not.

E.g.

Domain ID	Concept Class ID	Concept Name	Has concept_id_1 Domain	Has concept_id_2 Domain	Has relationship type
Relationship	Location History	Resided at location	PERSON	LOCATION	temporal
Relationship	Location History	Had resident	LOCATION	PERSON	temporal
Relationship	Provider History	Had PCP	PERSON	PROVIDER	temporal
Relationship	Provider History	PCP of patient	PROVIDER	PERSON	temporal

The above example is one approach but the “is temporal” context could be handled in other ways, such as determined by the domain or concept class.

Implications:

The altered approach seemingly simplifies and future-proofs the CDM for the existing and emerging use case demands
Can revert back to standard CDM and remove the additional columns from MEASUREMENT and NOTE_NLP
Eliminates the need for the tables: EPISODE_EVENT, LOCATION_HISTORY (as well as proposed others such as PROVIDER_HISTORY, CARE_SITE_HISTORY, etc.)
Adds ability to distinguish between different types of relationships within the same domain pairs (e.g. “patient lived at location” vs. “patient works at location”)
Not to open another large can of worms at the same time but, there have been developing discussions of a need to record the “derivation method” for certain types of data where the type concept alone is not sufficient. For example, the specific NLP, drug regimen derivation, episode derivation, geospatial algorithms used to derive the data. This is helpful to adequately preserve the provenance when comparing data across sites in studies. Should that become a reality, that we have something along the lines of a “derivation” table, it could seemingly easily slide into this same mechanism

Unknowns:

Regarding integration with existing tooling, what are the implications of each of the two approaches? As far as I know, none of the “two part foreign key” functionality has been built into any ETL tooling (e.g. Perseus), ATLAS/WebAPI, or HADES. Neither approach is entirely straightforward for ETLs given the complexity of inserting into identity tables and preserving the relationship using those keys, but I’d be curious to hear from the developers of the above if one or the other approach is viewed as more feasible. @schuemie @anthonysena @bradanton
Performance-wise, I would imagine that moving all of these relationships into a single table could be a concern, but it isn’t clear whether that concern is valid given that the FACT_RELATIONSHIP table would consist of indexed integers
Whether or not I will be burned at the stake for this FACT_RELATIONSHIP heresy

Thoughts?

rtmill · March 2, 2023, 12:55am

A post about fundamentally changing relationships in the CDM as well as a new concept structure to support fact relationships, and it STILL didn’t trigger @Christian_Reich ? Now I can’t decide if this is a good idea or a really, really bad one.

@clairblacketer You had mentioned reservations about FACT_RELATIONSHIP in the past. Are there some fundamental flaws you see in this approach?

Additionally, without any feedback here, especially without any valid pushbacks, I’m not sure on next steps other than formalizing it into a proposal for the CDM group to discuss?

Mark_Danese · March 2, 2023, 1:06am

FWIW, one of the problems we had with doing research using the OMOP CDM was that it wasn’t capturing relationships among data elements in a way that was useful to us for our research. I have not wrapped my head around your proposal but I understand the desire to try and address the issue.

elisa.henke · March 2, 2023, 6:54am

Hi @rtmill, we faced the same issues as you described during the semantic mapping of our clinical data from the University Hospital in Dresden (Germany) to OMOP CDM v5.3.1. We are currently using the fact_relationship table to link the following information:

condition and its site localization
condition and its severity
condition and its stage
primary and secondary ICD codes
medication order and medication administration
German procedures (OPS codes) and surgical procedures
history of travel (location with observation)
procedure and used device
tumor diagnosis with lesion/therapy/surgical procedures/anastomotic insufficiency/examined and affected lymph nodes/ECOG/grading/staging/L-, V- and R-value/histology/TNM

Nevertheless, we share your opinion that the current approach using the fact_relationship table to store relationships of clinical events is not the best. We had many problems implementing these complex mappings in our ETL process. In this context, we also identified performance issues regarding the mapping to fact_relationship.

We would also welcome another approach that makes it easier to store relationships in OMOP CDM.

Christian_Reich · March 2, 2023, 7:35am

Friends:

This is an ongoing debate. Here is the deal:

We should not use data to connect reference data, particularly of the same domain. In other words, the vocabularies should not be enhanced with FACT_RELATIONSHIP or other clinical fields of the OMOP CDM. If we need combinations of condition and its site localization, condition and its severity, condition and its stage, primary and secondary ICD codes or German procedures (OPS codes) and surgical procedures, to use @elisa.henke’s examples, we should create OMOP extensions combining these concepts, including the right parenthood. Why? The reference table should be sufficient to query any OMOP CDM instance. We should not have to dive into the actual data at analysis run-time to find out what is there.
There is debate over whether or not we need to be able to connect records. If it sounds strange to you that such a thing wouldn’t be an obvious requirement then here is why: The observational data observe what is happening. The associations are found (or falsely found because of bias or confounding) through analytics. The fact that some doc makes assertions about associations, like an allergic reaction due to some trigger, or an adverse reaction to some drug, is legitimate for patient care, but from an OHDSI perspective these are results, not input.
But we have exceptions where that general principle doesn’t work:

Usually, all data are patient-centric. There is no fact of one patient that would concern another. But this is not true for biological relationships, especially mother-child.
In cancer, we should pre-coordinate all the primary lesions, histology, lymph nodes, metastases, stages, grades, but that would cause a permutational explosion. So, we decided to post-coordinate cancer conditions with Cancer Modifiers in a standardized way.
There is Location, with the various types, which on first order approximation is static, but of course it is not.

To support these, we have FACT_RELATIONSHIP, but that is almost never used in analytics, even though some ETLers put a lot of effort to capture them. The reason is that most analytical use cases don’t care about those, they are more relevant to patient care. They are also very slow, and database performance is one of the challenges we have due to the nature of the data. Clinical trial data have it much better in that sense.

Bottom line: I am not sure we have a concrete use case-driven need to change anything. Folks who are keen to capture relationships can do that, but I don’t see a ton of use. Tell me if you have a different experience.

Andy_Kanter · March 5, 2023, 10:12pm

@Christian_Reich, perhaps I am not quite so willing to throw out associations made by clinicians to only be evaluated via statistical testing of large numbers of patients. If a provider tells me this is STAGE x disease, then that is important, regardless of whether there is sufficient other data in the patient record to calculate that phenotype. In just the same way, if we are told the metastases are from x primary, we should record that regardless of whether they have more than one primary (as there may be). More and more clinical documentation will rely on either newly pre-coordinated concepts, or perhaps post-coordinated expressions. Reducing them to their component elements without linking does not seem like a sustainable approach. I think the location discussion on its own deserves discussion related to spatial and temporal associations which we do not otherwise have. In the case of the OMOP extension compared to let’s say IMO’s clinical interface terminology, I don’t see how that would be sustainable either (without a partner like IMO). All of this costs money. I think we are coming around to a model where the interface terminology could be managed within the standard concept tables, and allow for more instance-based associations (such as location and time) to be associated via something like FACT_RELATIONSHIP.

rimma · March 7, 2023, 9:43pm

There are two aspects and related challenges we are facing. One (A) raised by @Christian_Reich and @Andy_Kanter is about use cases: is it indicated to use association between clinical events in OMOP, regardless of how they are implemented? The other (B), stated by @rtmill, is about conventions for implementation: MANY-TO-MANY (FACT_RELATIONSHIP) or/and TWO-PART-FOREIGN-KEY (ONE-TO-MANY) relationships.

This is my take on both aspects.

A.
@Christian_Reich’s opposition to associations between events other than analytically established is grounded in the traditional OHDSI approach. It may work in many clinical areas. However, in oncology, throwing away associations noted by physicians often causes loss of critical information. For example, when a physician points to the first line of treatment for a certain cancer, one cannot ignore it as it 1) gives 100% confidence that the treatment is indeed the first line and 2) indirectly points to the time around initial diagnosis. Both factors are critical to answering research questions and may be obscured in the source data because of the nature of oncology treatments performed at different institutions. @Andy_Kanter presents other use cases. We need the ability to preserve these valuable associations. Therefore, we have established condition and treatment modifiers.

B.
After so many years since introduction of FACT_RELATIONSHIP, it has not been integrated in any OHDSI tools. I don’t think that resurfacing it without commitment to incorporate it in the tools is practical. This statement is coming from a big proponent of using FACT_RELATIONSHIP for structurally and semantically meaningful and correct representation of complex health data relationships. TWO-PART-FOREIGN-KEY solution is not supported by any tools either. I think, first, we need consensus about A: do we need to persist relationships at all?

If we do agree about the need, there is a place for both solutions depending on the use case. My rule of thumb is: if one fact cannot exist without the other (e.g. initial diagnosis fact without condition fact) then, it’s a modifier, the relationship is ONE-TO-MANY, and it belongs to the TWO-PART-FOREIGN-KEY. If two facts can exist independently (e.g. cancer diagnosis and germline mutation), the relationship is MANY-TO-MANY and it belongs to FACT_RELATIONSHIP. I know, sometimes the line is blurry (e.g. metastases may exist without the condition representing primary tumor) but conventions can be established to cover these use cases.

In conclusion, I am for 1) having designated relationship, either derived or noted explicitly, preserved in the CDM; 2) for having clear semantic and structural rules for their representation; 3) for integrating them in the tools. I think it can be accomplished gradually by developing tools covering specific use cases and demonstrating value (e.g. Oncology Regimen Finder).

rtmill · March 8, 2023, 8:17pm

Admittedly that was quite a bit of content in the original post. I’ll try to boil things down. For the sake of tackling one problem at a time, I’ll focus on the non-temporal aspect to start.

Let’s try to break it down:

Assumption 1:
There are valid use cases for relations between event tables that are not currently feasible to store within standardized conventions.

Assumption 2:
We should consequently create a standard mechanism for preserving this data in the CDM.

Assumption 3:
Leveraging a single mechanism, in this case an altered FACT_RELATIONSHIP table and related concepts and concept relationships, is a simpler, more extensible and systematic solution than appending the two “foreign key” columns to every event table.

Question:
Do we agree or disagree with the above assumptions?

Additionally, on the NLP call today @JohnOsborne presented on the gap referenced in assumption #1, but also referenced that the current form of FACT_RELATIONSHIP also lacks the ability to store provenance and context. Perhaps you could weigh in here?

Chris_Knoll · March 8, 2023, 9:12pm

Hi, Everyone,

I don’t think those assumptions are invalid, but I will say that an assumption that there are valid use cases for relations that are not feasible to store with standardized conventions is in conflict with the notion of a standardized data model. I think it would be better served to extend the CDM to concretely define and standardize the data relationships in the way we have done so (people have drugs, visits, observable events, and each of these domains are concrete things).

My main concern with the discussion over the many years about leveraging FACT_RELATIONSHIP is that the notion of using a FACT_RELATIONSHIP function in the CDM is akin to introducing an Entity-attribute-value model into the CDM.

Why is this problematic? IMO, it is because an EAV model does not have any structure about what goes into a particular domain of information. One Entity could have 7 attribute-values assigned to it while another might have 14. With the domain tables in the CDM, you have standardized structures (for example DRUG_EXPOSURE) where you know what pieces of information are available to assign, and what parts are missing (by null column values, for example).

To bring it around to the FACT_RELATIONSHIP context: you want a way to define extensible and flexible relationships between CDM records, but there’s no way to standardize any of those relationships (ie: will every single record carry the exact same series of relationships?)

Just like the EAV example above, you could never describe an entity as having a ‘standardized’ set of attributes…all you could do is find out what attributes are in the universe by looking at the distinct attributes that a particular entity type contains. And any data source could define their own set of attributes that has no meaning to another data source. This is the same sort of thing that would happen with an ‘extensible approach to FACT_RELATIONSHIP’.

What I would ask back to the community is: what information are you trying to model? If there’s a specific use case in oncology for fact relationships to describe treatment regimen, then just make a well-defined data structure that can be used to describe treatment regimens.

There’s another use-case to link mother to child or family trees. Why not define a person_ancestor structure to model it?

FACT_RELATIONSHIP will just spawn a series of one-off solutions to one-off problems, again in my opinion. I hate to rain on anyone’s parade, but I would have expected that if there was a strong solution to this problem, it would have emerged by now and it would have been adopted.

rtmill · March 8, 2023, 10:56pm

Thanks for the thoughtful response @Chris_Knoll

Could you elaborate on two things?

Do you view this as an issue for both approaches mentioned in the initial post (two part foreign key vs. modified FACT_RELATIONSHIP)
Could you specifically give an example of what sort of issues this might cause?

I view it as - the structure and design of domain stays as designed but with the addition that the events between or within the same domain may be related to each other in the source data. The events within those domains are still valid events in their respective domain that can exist independently, you just lose context without persisting the relations; e.g. a measurement of tumor dimension when there are multiple tumors.

Specifically, your example of not knowing whether or not an entity has 7 or 14 relations confused me a bit. A visit_occurrence could have 7 or 14 drugs, procedures, etc. The difference would be how you would query it to validate that, which is where the discrete list of “relationship type” concepts come into play.

Re: Use cases - That we have in spades. Happy to provide more than the ones mentioned at the top should it be helpful. And yes, oncology is a big source of them, which makes sense given the CDM is person-centric, whereas much of oncology data is cancer-centric or tumor-centric.

Chris_Knoll · March 9, 2023, 1:58pm

Hi, @rtmill , thanks for the questions.

I think the approaches are implementation detail that supports the general function of records being related to other records. So, yes, both approaches are there to serve this many-to-many relationship between records, and so my concern would apply to both approaches.

I think the main specific example I can give is one where a datasource owner can decide to load their CDM with thousands of different relationship types between records (ie: one concept_id per relationship type, which could be thousands). These relationships preserve context in a very source-specific way that will not be standardized across the CDMs in a network, making it difficult or impossible to standardize the analytics applied to them. You probably have a series of ideas that would connect records from different domain tables in creative ways. I think if you asked 10 people, they’d give you 10 unique ideas about relating records together. How can we standardize that?

Sorry I wasn’t clear: your example says a visit_occurrence could have 7 or 14 drugs, procedures, etc, but if you complete that list, there are only 5 or 6 different things that are related: drugs, procedures, conditions, observations, measurements and…? It’s a very specific list that covers different use cases of ‘facts’. That was my point about not knowing: the CDM model has those 5 or 6 domains of information to capture about the patient, but in an EAV model, there’s really no limit to the number of attributes you can put to an entity (like a visit or a person), and likewise with FACT_RELATIONSHIP, there’s really no limit to the number of ways you can relate records to each other (only limited to the relationship_concept_ids). I just am not sure how that’s something you can standardize. You can standardize the mechanism where ‘records have limitless relationships with each other’, but I’m not sure that sort of standardization is useful. Need to see it in action!

I think beyond just the use-cases, it should be possible to demonstrate the application of FACT_RELATIONSHIP (in either approach provided). I think someone stated that it’s the tools’ fault that we don’t have more use of the relationship table. I don’t think that’s a fair statement: we don’t need tools to demonstrate how these relationships would be applied in a study. You can write the custom queries for FeatureExtraction to build a characterization report and models used in propensity scores without ‘tool support’. Or just write custom queries to fill an R dataframe and perform some summary statistic on it. In your example of the multiple tumors: what is the relationship you would want in that case? That the 3 records are related to each other such that you have a relationship that tumor 1 ‘happened with’ tumor 2, tumor 1 ‘happened with’ tumor 3, tumor 2 ‘happened with’ tumor 3? And will the inverse have to be defined: 2 happened with 1, 3 happened with 1, 3 happened with 2? So we can declare victory that we now have associated the 3 tumors together…now what? How is it used in an analysis?
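The row counts in the tumor example follow from simple combinatorics, sketched here. The tumor ids and the `rows_needed` helper are hypothetical; the point is only that relating n records pairwise takes n(n-1)/2 rows, or n(n-1) rows if every inverse is stored as well.

```python
from itertools import combinations, permutations

tumor_ids = [1, 2, 3]  # hypothetical condition_occurrence ids for three tumors

# One row per unordered pair: 1-2, 1-3, 2-3
undirected = list(combinations(tumor_ids, 2))

# One row per ordered pair, i.e. each pair plus its inverse
directed = list(permutations(tumor_ids, 2))

def rows_needed(n, with_inverse=True):
    """Rows required to relate n records to each other pairwise."""
    pairs = n * (n - 1) // 2
    return 2 * pairs if with_inverse else pairs
```

For three tumors that is 3 rows, or 6 with inverses; for ten mutually related records it would already be 90 rows with inverses, which is part of why the choice of relationship matters before anything is loaded.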

If we could have people demonstrate the application of fact-relationship information (beyond saying ‘we’ve represented a relationship between two events in data, hooray!’) it might surface more specific problems about how to extract this information out of the CDM for analytical use.

Andy_Kanter · March 9, 2023, 3:18pm

It is a shame if this conversation has happened before (perhaps several times over the history of the CDM), so I apologize. Perhaps it should be captured on the Wiki for posterity in summarized form.

I was making the distinction that @Christian_Reich made between knowledge linkages and instance data linkages. Coming from the perspective of a user interface terminology, I see numerous examples of clinical concepts documented in the EHR which require multiple reference/standard codes in a particular configuration. These then become a new reference term and the easiest solution would just be to put all of IMO into the standard terminology tables and use the relationships in Athena to link the standard codes. However, this is not really sustainable as there will soon be a lot more post-coordination happening (particularly with ICD-11), so you can’t always assume there will be a predefined, pre-coordinated concept in the dictionary. Jack-alope was getting to a constructive model where a SNOMED expression can be used to generate a hashed standard concept. This is not a bad idea, but it causes problems when trying to relate these non-curated compositions to anything else. Perhaps there is a middle ground where we do take common multi-code expressions and curate them as standard concepts, but that custom expressions live initially in another table to be later harvested and promoted to standards (like the CDM maturity model itself).

For instance-data relationships, like temporality or spatiality (location), it doesn’t make sense to create standard concepts, so we need a separate, governed process for capturing these. Coming from the OpenMRS world, I love the EAV data model+concept dictionary for capturing data, but agree, it is a nightmare for analysis. However, I don’t think we need to make the perfect the enemy of the good, and perhaps there is a maturity model approach here that allows for standards to evolve in a controlled manner.

rtmill · March 10, 2023, 10:56pm

To me that speaks to the community need of such a mechanism!

Agreed. It’s worth at least acknowledging that there is a dichotomy of implementers in the CDM - those with only claims data and those with other sources. Finding a balance between creating solutions to enable use cases only relevant to subsets of implementers of the CDM while still preserving the ability to query them together when the data overlaps seems key.

Standardization is exactly what I’m after here. A discrete (albeit likely evolving over time) set of concepts, let’s call it the ‘fact relationship vocabulary’ for discussion purposes, that defines the scope of relationships that are considered valid. Anything placed in the modified FACT_RELATIONSHIP table that does not leverage these concepts in an appropriate manner should be considered invalid and not included in any analysis. Additionally, if we require all of those relational concepts to include indications as to what are the appropriate “to” and “from” domains of that relationship, it opens up opportunities for validation/DQ checks.

A fair point, but I’d argue the source data would limit the amount of relations that could be possible between valid entities in the standard CDM. From my perspective, we would only be enabling relationships that are explicitly given to us in the data, and we just need to find a way to preserve that relationship.

The content of the current “Relationship” vocabulary may be causing confusion here and perhaps should be thought of as separate as it solely contains very specific relationship types (~200) listing out every iteration of how a person can be related to another person. What I’m referring to are much more high level concepts that do not contain that level of granularity. “Procedure given to treat condition”, “Observation from imaging procedure”, “drug as part of regimen”, etc. Point being, a much smaller list than folks might be thinking… at least to start.

If clarity on the proposal would still be helpful, I can try to mock something up this weekend. A few mock examples with standard concepts, relationships between them, and perhaps how it could be integrated with the tooling?

rtmill · March 13, 2023, 10:20pm

As alluded to above, I’ve attached a mock-up diagram of an example use case using the proposed approach. Admittedly, it is a bit confusing, and overly colorful, as it is an attempt to represent quite a bit of content within one graphic, but hopefully it can provide some clarity as to the mechanisms being discussed.

Specifically, the diagram attempts to illustrate the following example:

a procedure related to a specific condition record
that same condition record with a related modifier (measurement)
that same modifier with a relation to the imaging procedure it was derived from

Alexdavv · March 14, 2023, 12:50am

Hi @rtmill!

It’s a nice and very clear graph!

A couple of things:

Domains and tables are different. In the fact_relationship table we link the tables (events from them), but call them domains. That already creates confusion. Do you want to do the same in your newly created fact relationships? Or do you want to define the rules really for domains, not for the tables? If so, then you’d link them to the domain concepts, not the table concepts, right?
What’s the use case for these new concept_relationships? Is there an idea of any machine-readable way dealing with the domains they are linked from/to?
Condition modifiers are linked using the reference keys in the Measurement/Observation tables. Do you record these things twice or suggest revoking the previous way?
What are the examples for Procedures that have some Measurements? Don’t we claim the Procedures are the Measurements themselves if they measure something?

jmethot · March 14, 2023, 1:01am

@rtmill: Thanks for broaching this topic. It’s a hard problem, which is why it hasn’t yet been wrestled to the ground despite multiple cycles over years.

I agree that a scheme like this is preferable to two part foreign keys for the reasons you stated: more flexible, easier to validate, and just more elegant. You don’t specify how many relationship concepts you envision. I can imagine both “standardized” and “extension” approaches to relationships.

Standardized Approach

In your latest post and diagram I think you are proposing that there be a small set of standard fact relationships (you don’t show it in your concept table, but I think you mean for the 999999 and sibling relationship concepts to be Standard). If we limited the standard relationship concepts to those between event tables, that would make it much more manageable.

I tried mapping out the combinations that implies and immediately ran into semantic issues.

We can’t define relations such as “drug is indicated by condition”, or conversely, “condition treated by drug”. OMOP is only intended to record facts, not “knowledge”. We humans can infer that metformin is ordered for diabetes and a beta blocker is ordered for hypertension, but the source facts we have to work with don’t explicitly include those semantics (@Andy_Kanter tells me that’s changing, but I think widely true in the general OMOP world). In OMOP we infer those semantics at analysis time via temporal proximity of the events (e.g. drug exposure within time window after diagnosis).

I think your example relationships “procedure targeting condition” and “condition has targeted condition” are instances of this semantic “overreach”.

So I think all we can say in the general sense is “A relates to B” and its inverse. Expanding those to all combinations of event tables like in my diagram may be useful from a validation perspective (validation code can ensure that the ends of each relation are in the correct domain; although in reality all it can really do is make sure the keys actually exist in the two event tables, not that the relation itself is “valid”).
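The key-existence check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory SQLite database with two cut-down event tables, a hand-written `DOMAIN_TABLES` mapping (the domain concept ids 19 and 21 are assumed; the mapping itself is hypothetical), and made-up row ids. As noted, it can only show that both ends exist, not that the relation is meaningful.

```python
import sqlite3

# Assumed mapping from domain concept id to (table, primary key column).
# Table and column names come from this fixed dict, never from user input.
DOMAIN_TABLES = {
    19: ("condition_occurrence", "condition_occurrence_id"),
    21: ("measurement", "measurement_id"),
}


def key_exists(conn, domain_concept_id, fact_id):
    """True if fact_id is present in the table implied by the domain concept."""
    if domain_concept_id not in DOMAIN_TABLES:
        return False
    table, pk = DOMAIN_TABLES[domain_concept_id]
    row = conn.execute(f"SELECT 1 FROM {table} WHERE {pk} = ?", (fact_id,)).fetchone()
    return row is not None


def dangling_links(conn):
    """Return fact_relationship rows where either end points at a missing record."""
    rows = conn.execute(
        """
        SELECT domain_concept_id_1, fact_id_1, domain_concept_id_2, fact_id_2
        FROM fact_relationship
        ORDER BY fact_id_1
        """
    ).fetchall()
    return [
        (d1, f1, d2, f2)
        for d1, f1, d2, f2 in rows
        if not (key_exists(conn, d1, f1) and key_exists(conn, d2, f2))
    ]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE condition_occurrence (condition_occurrence_id INTEGER PRIMARY KEY);
    CREATE TABLE measurement (measurement_id INTEGER PRIMARY KEY);
    CREATE TABLE fact_relationship (
        domain_concept_id_1 INTEGER, fact_id_1 INTEGER,
        domain_concept_id_2 INTEGER, fact_id_2 INTEGER,
        relationship_concept_id INTEGER
    );
    INSERT INTO condition_occurrence VALUES (101);
    INSERT INTO measurement VALUES (301);
    -- 999998 is a hypothetical relationship concept id.
    INSERT INTO fact_relationship VALUES (21, 301, 19, 101, 999998);
    INSERT INTO fact_relationship VALUES (21, 302, 19, 101, 999998);
    """
)

bad_rows = dangling_links(conn)
print(bad_rows)
```

The second link refers to measurement 302, which does not exist, so it is the only row reported.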

I’m having trouble thinking about how much analytical value you lose by having only the most generic (but at least standard) relationships.

Extension Approach

Another more open approach would be to permit sets of relationship concepts be defined by different OHDSI communities, e.g. oncology, psychiatry, dentistry, health equity, etc.

Those communities have interest in more than just relationships. The area I’m most familiar with, oncology, produces coordinated additions to vocabularies (including new vocabularies), the CDM with the EPISODE table, and the “two part foreign key” convention for expressing relationships (which you are attempting to improve in this thread).

Reading this thread, I’ve thought of calling those “overlays”. Members of specific OHDSI communities would develop them, promulgate them, and promote standardization of them. Network study designs could specify that to participate a site must have “standard CDM plus the psychiatry overlay”, meaning the site needs to have added psychiatry-specific data adhering to the psychiatry overlay conventions. We will already do that in oncology: the set of useful network studies using only EHR data is small; for real power, studies will need to require at least that sites have mapped cancer registry data or its equivalent.

In a meeting today, @Andy_Kanter mentioned that the CDM WG is discussing a mechanism like this called “extensions”. The oncology work is sometimes referred to as the “Oncology Extension” but I didn’t realize there may be a wider discussion ongoing about a model like I describe here. @clairblacketer might be able to shed some light and tell us whether it resembles what I described.

Conclusion

I think both of these ideas address the standardization concerns expressed in this thread; the first by severely limiting standard relationship concepts and the second by defining how sets of relationship standards could be developed by OHDSI communities and eventually promoted to standard concepts.

rtmill · March 14, 2023, 1:40am

@Alexdavv & @jmethot Thank you both for the constructive and important feedback.

You are correct, I should have used the domain concepts and not the table concepts

The thought here was to have an easy means of querying the domains for validation and DQ purposes, perhaps also to be used in tooling (e.g. ATLAS cohort definitions to limit the options available). The alternative could be something like a consistently structured string of ‘concept class’ (e.g. ‘CONDITION | PROCEDURE’) but I figured a join is better than expecting a regex.
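To make the "join rather than regex" idea concrete, here is a rough sketch of a DQ check that looks up the domains a relationship concept permits and flags FACT_RELATIONSHIP rows that do not match. It assumes the permitted domains are stored as CONCEPT_RELATIONSHIP rows from the relationship concept to domain concepts; the `relationship_id` labels ('Links from domain', 'Links to domain') and the 999999 concept are hypothetical, and the domain concept ids (10, 19, 21) are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept_relationship (
        concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT
    );
    CREATE TABLE fact_relationship (
        domain_concept_id_1 INTEGER, fact_id_1 INTEGER,
        domain_concept_id_2 INTEGER, fact_id_2 INTEGER,
        relationship_concept_id INTEGER
    );

    -- 999999: hypothetical "procedure targeting condition" relationship concept,
    -- allowed to link from Procedure (10) to Condition (19).
    INSERT INTO concept_relationship VALUES (999999, 10, 'Links from domain');
    INSERT INTO concept_relationship VALUES (999999, 19, 'Links to domain');

    -- First row respects the rule; second row starts from a Measurement (21).
    INSERT INTO fact_relationship VALUES (10, 501, 19, 101, 999999);
    INSERT INTO fact_relationship VALUES (21, 301, 19, 101, 999999);
    """
)

DOMAIN_MISMATCH_SQL = """
SELECT fr.fact_id_1, fr.fact_id_2
FROM fact_relationship fr
JOIN concept_relationship src
  ON src.concept_id_1 = fr.relationship_concept_id
 AND src.relationship_id = 'Links from domain'
JOIN concept_relationship dst
  ON dst.concept_id_1 = fr.relationship_concept_id
 AND dst.relationship_id = 'Links to domain'
WHERE fr.domain_concept_id_1 <> src.concept_id_2
   OR fr.domain_concept_id_2 <> dst.concept_id_2
"""

mismatches = conn.execute(DOMAIN_MISMATCH_SQL).fetchall()
print(mismatches)
```

The same lookup could in principle feed tooling, for example to limit which relationship concepts are offered for a given pair of domains.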

The latter.

This would be most relevant when you want to preserve the specific procedure context and its relevance to the observation/measurement. Say for instance you wanted to query for a specific finding only when it’s from a path report, or using a specific type of imaging.
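As a sketch of that kind of query, the example below returns a finding only when it is explicitly linked to a particular type of imaging procedure. Everything here is illustrative: the tables are cut down to a few columns, the clinical concept ids (1001, 2001, 2002) and the 999997 relationship concept are hypothetical, and the domain concept ids (21, 10) are assumed.

```python
import sqlite3

MEASUREMENT, PROCEDURE = 21, 10        # assumed domain concept ids
DERIVED_FROM = 999997                  # hypothetical relationship concept
FINDING, MRI, CT = 1001, 2001, 2002    # hypothetical clinical concepts

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE measurement (
        measurement_id INTEGER PRIMARY KEY, person_id INTEGER,
        measurement_concept_id INTEGER
    );
    CREATE TABLE procedure_occurrence (
        procedure_occurrence_id INTEGER PRIMARY KEY, person_id INTEGER,
        procedure_concept_id INTEGER
    );
    CREATE TABLE fact_relationship (
        domain_concept_id_1 INTEGER, fact_id_1 INTEGER,
        domain_concept_id_2 INTEGER, fact_id_2 INTEGER,
        relationship_concept_id INTEGER
    );
    """
)
# Three people have the same finding; only person 1's came from an MRI.
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [(301, 1, FINDING), (302, 2, FINDING), (303, 3, FINDING)],
)
conn.executemany(
    "INSERT INTO procedure_occurrence VALUES (?, ?, ?)",
    [(501, 1, MRI), (502, 2, CT)],
)
conn.executemany(
    "INSERT INTO fact_relationship VALUES (?, ?, ?, ?, ?)",
    [
        (MEASUREMENT, 301, PROCEDURE, 501, DERIVED_FROM),
        (MEASUREMENT, 302, PROCEDURE, 502, DERIVED_FROM),
    ],
)

QUERY = """
SELECT m.person_id, m.measurement_id
FROM measurement m
JOIN fact_relationship fr
  ON fr.domain_concept_id_1 = ?
 AND fr.fact_id_1 = m.measurement_id
 AND fr.domain_concept_id_2 = ?
 AND fr.relationship_concept_id = ?
JOIN procedure_occurrence p
  ON p.procedure_occurrence_id = fr.fact_id_2
WHERE m.measurement_concept_id = ?
  AND p.procedure_concept_id = ?
ORDER BY m.person_id
"""

mri_findings = conn.execute(
    QUERY, (MEASUREMENT, PROCEDURE, DERIVED_FROM, FINDING, MRI)
).fetchall()
print(mri_findings)
```

Person 2's finding is linked to a CT and person 3's finding has no link at all, so neither is returned.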

A big consideration here is how much NLP derived content changes how we look at where and how we are generating data from the source. Theoretically there would also be connections to the NOTE from both the procedure and the observation/measurement.

A great point that should be clarified. In this initial proposal I’m only referring to examples where there are events explicitly related in the source data. It may seem rare, but I have talked to several sites that have these linkages in their data, so there is no “guessing” or other forms of inference. A care plan linked to a condition, for example, would explicitly tell you that these X interventions were intended for a given condition. @gkenno may be able to elaborate on other examples.

Regarding the rest of your response @jmethot - it’s worth noting that in order to push this forward I’ve intentionally excluded two key pieces to keep the initial scope of the conversation narrow (well, recently I have, but they are in this thread). The first being temporality, and how this approach, albeit amended, could be expanded to knock out a flock of birds with one stone. The second, and more relevant, is provenance, which @JohnOsborne spoke about on the NLP call.

In the scope of this first hacking out of the details, I’ve been trying to limit the conversation to those in which the provenance, likely represented as something like a relationship_type_concept_id, would always be something along the lines of “explicitly stated in the source data”. These other applications, where inference and/or algorithms may be used to establish these relationships, fall into a different category and consequently a different type of provenance. At least that’s how I’ve been thinking about it.

Lastly, in terms of your diagram @jmethot - are the “semantic issues” you referenced covered by my explanation above or is there more to it? To make it even more complicated, I would add to that diagram relationships to NOTE as well as circular relationships from the same domain back to itself - though the only use case I’m aware of is condition to condition.

Circling back to your suggestion - would a simple short list of concepts with “this domain to this domain” be comprehensive enough - albeit with proper provenance recorded? Seems like there could be many different types of relations between the same domains with use cases that would benefit from a higher granularity, though I’m not sure.

Chris_Knoll · March 14, 2023, 3:37am

So, the point of my statement here was that we need more than showing how records can be inserted into tables. We have several use-case categories in the OHDSI toolstack, ex: Cohort Definition and Characterization.

For Cohort Definition, how would the captured information be used in a cohort definition? Can we have an explicit example with Cohort Entry events using the information above? For characterization: what would be summarized at a population level?

I apologize if I jumped the gun and you had the demonstration of the above methods with the example data forthcoming.

rtmill · March 14, 2023, 4:08am

@Chris_Knoll Indeed I did fall short of the promised deliverables. Below is an example using the 3rd/4th row of the FACT_RELATIONSHIP mock-up above.

Here’s a fully operational and tested implementation of a forked ATLAS instance and definitely not a MS Paint edit:

I was going to tag some of the leading devs on both ATLAS and Circe but it looks like you fit the bill on both, @Chris_Knoll. If there are others that you think should weigh in, please tag them, but if the idea is clear enough here, does it seem feasible to implement? (In short: in the same vein in which nested criteria can be linked via the same visit, perhaps a link to records by using an explicit fact_relationship_concept_id in a concept set could achieve the desired result?)

Separately, not sure on the cohort characterization front, would need to look into it further.

jmethot · March 14, 2023, 2:35pm

@rtmill
We have to address the standardization concerns one way or another, either:

“Standardized approach” (probably “simple” or “fixed” would be better) in which just the fixed combinatorial set of generic relationships between events (and NOTES) is adopted as standard: “Condition relates to Measurement”, “Measurement relates to Condition”, etc. There would be a maximum of 56 of those relationships (8 event tables x 7 “to” connections to the others; some may not make sense), plus some that are “domain to same domain”.
“Extension approach” where a set of relationships is deliberately chosen to address analytically useful explicit linkage types in real data. New concepts can be added over time, either to augment an existing extension or as part of a new one.
“Free for all”, which is what the FACT_RELATIONSHIP table is today, and which can never support network studies.
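The arithmetic behind the "maximum of 56" figure in the first option can be checked with a quick enumeration. The eight domain names below are an assumed list for illustration only; the post does not say which eight event tables are meant.

```python
from itertools import permutations

# Assumed list of eight event domains; substitute the set you actually intend.
EVENT_DOMAINS = [
    "Condition", "Procedure", "Drug", "Device",
    "Measurement", "Observation", "Specimen", "Note",
]

# Every ordered pair of distinct domains: 8 x 7 directed generic relationships.
directed = [f"{a} relates to {b}" for a, b in permutations(EVENT_DOMAINS, 2)]

# Optional "domain to same domain" relationships, one per domain.
reflexive = [f"{d} relates to {d}" for d in EVENT_DOMAINS]

print(len(directed), len(reflexive), len(directed) + len(reflexive))
```

With eight domains this gives 56 directed relationships and up to 8 same-domain ones, so the fixed set would top out at 64 concepts if every combination were kept.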

I don’t think you’re really proposing “this domain to this domain” - that’s an example of what I call the generic relationships. You’ve said you want a linkage between drug and condition that has specific semantics of “drug ordered for condition”.

Settling on truly generic relationship concepts avoids complex concept growth (only ~56, as above, forever).

Allowing semantics invites complex concept relationship growth. It will run into problems when two communities want slightly different relationships between the same two domains. There will be lots of opportunities for divergence over the long haul.

But as I said above, I think generic relationships also diminish analytical value. The key question is how do we preserve analytical value but control/coordinate relationship concept growth over time?

So to answer your question: Yes, I think a specific complete set of relationship concepts you propose would help make the discussion more concrete.