I think the biggest problem is that we are using relational-style tables to represent graph data. Yes, it can be done, but it creates so many levels of complexity. At the risk of being redundant, we need our vocabulary stored in a graph database.
Perhaps for network studies, use a later version of Postgres, with the vocabulary stored as a graph structure. I realize this would create issues for SQL Server-only users, since it has no good way to represent graphs (XML is never a good idea), but the multi-database approach will always give inferior results.
I am attempting not to get on my lack-of-standards soapbox. (edit: I failed)
@Mark I agree with you that a graph database is a more natural fit, but adopting OMOP already requires so many areas of expertise that adding an entirely new database type that is foreign to 99% of the community is a bridge too far (at this point).
I was assuming if this got some form of a green light we could create an initial set of relationships based on the use cases we're currently aware of and then poll the community to see if there are suggested additions. That would at least be a starting point for the vocabulary.
Perhaps your "generic relationships", i.e. "this domain to this domain", can serve as the top levels of hierarchies, where any more specific relationship (your "extension" relations) that falls under the same "X domain to Y domain" relation is a child concept of the more generic one. That way, for tooling, we can use "include descendants" if there are broad use cases that don't care about how the two entities are related, just that they are.
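To make that concrete, here is a rough sketch of what such an "include descendants" query might look like, assuming the relationship concepts were organized under generic parents in CONCEPT_ANCESTOR; the parent concept id below is a made-up placeholder, not a real concept.

```sql
-- Sketch only: assumes relationship concepts sit in a hierarchy recorded in
-- CONCEPT_ANCESTOR, with a generic "Condition to Procedure" parent concept.
SELECT fr.*
FROM fact_relationship fr
JOIN concept_ancestor ca
  ON ca.descendant_concept_id = fr.relationship_concept_id
WHERE ca.ancestor_concept_id = 2000000101;  -- hypothetical "Condition to Procedure" parent
```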
If we are going to add this, can we do it with a unique relationship_id instead of "maps to", please? This is already a problem in the ETL world with all the "maps to" relationships spread over various domains.
@rtmill Thank you for this topic. I don't think there is one right approach. An important consideration is to look at how each proposal fares across various contexts: DDL (data definition), ETL (extract, transform, load), DSL (domain-specific languages, like Circe), and one-off SQL queries.
I am partial to the 56 table solution proposed by @jmethot. Naming conventions let us use meta-data to manage DDL and DSL contexts, while at the same time not burdening one-off SQL queries with indirection (at the cost of UNION ALL for combined cases). Would ETLs be easier or harder with EAV vs 56 tables?
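To illustrate the UNION ALL cost mentioned above, here is a rough sketch of the kind of combined query the 56-table approach would require; the table and column names are hypothetical stand-ins for two of the domain-pair tables, not part of any actual proposal.

```sql
-- Sketch only: condition_procedure_relationship and condition_drug_relationship
-- are hypothetical examples of two of the ~56 domain-pair tables.
SELECT condition_occurrence_id AS fact_id_1,
       procedure_occurrence_id AS fact_id_2,
       relationship_concept_id
FROM condition_procedure_relationship
UNION ALL
SELECT condition_occurrence_id,
       drug_exposure_id,
       relationship_concept_id
FROM condition_drug_relationship;
```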
~56 standard relationship concepts that represent all "minimally semantic" relationships between domains. (It wouldn't be exactly 56, because Robert proposes some domain-to-same-domain concepts, and some domain-to-other-domain concepts probably don't make sense.)
Some number of carefully chosen standard relationship concepts with specific semantics that have analytical use cases, such as those listed at the beginning of @rtmill's original post in this thread. The trick is finding the Goldilocks specificity that satisfies the analytical needs but is not so narrow that people eventually want thousands of them (that latter situation is what I imagine is making steam escape @Christian_Reich's ears reading this thread).
Neither of these is EAV, but using that metaphor, in these proposals the Es and Vs are fixed (the OMOP CDM domains) and we're proposing a small initial set of standardized As.
Note that I don't think @rtmill is proposing that existing (local, non-standard) uses of FACT_RELATIONSHIP would change, but that OHDSI tools would only recognize standard relationship concepts therein. If we wanted to make that explicit, we could propose a new DOMAIN_RELATIONSHIP table to house only standard domain relationships.
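Purely as a sketch of that last idea, a DOMAIN_RELATIONSHIP table could mirror the FACT_RELATIONSHIP columns while being reserved for standard relationship concepts; the DDL below is hypothetical, not a worked-out proposal.

```sql
-- Sketch only: same shape as FACT_RELATIONSHIP, but intended to hold only
-- standard domain-to-domain relationship concepts.
CREATE TABLE domain_relationship (
    domain_concept_id_1     INTEGER NOT NULL,  -- domain (table) of the first fact
    fact_id_1               BIGINT  NOT NULL,  -- primary key of the first fact
    domain_concept_id_2     INTEGER NOT NULL,  -- domain (table) of the second fact
    fact_id_2               BIGINT  NOT NULL,  -- primary key of the second fact
    relationship_concept_id INTEGER NOT NULL   -- standard relationship concept only
);
```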
@jmethot I'm even more supportive of using tables for specific kinds of data, driven by concrete use cases. Even 56 generic tables may be used in multiple ways, hindering their usefulness. Perhaps we should focus our energies on creating a schema/vocabulary module system and the community processes for ensuring that extensions are well designed and integrated. We could have community-owned continuous integration to ensure that we don't have conflicts, etc. For DDL and DSL contexts, we could drive schema creation and generic querying with meta-data.
Andy_Kanter (Andrew S. Kanter, MD MPH FACMI FAMIA):
Just wanted to connect the @Paul_Nagy presentation about medical imaging to this discussion: they are proposing a new table, image_feature, which allows for linking observations about an image, such as the size of a mass. Since imaging has a specific instance/series and a location (on the image), things can be linked together over time. This is not generalizable to all fact relationships, but it was interesting.
Also recent discussions about mapping HPO concepts to OMOP raised the specter of more granular concepts than SNOMED (either due to missing SNOMED primitives or post-coordination). This discussion might also be relevant. @mellybelly
Apologies for starting this conversation and going silent shortly after. I was put out of commission with pneumonia and have been playing catch up since. The source of the infection is still unknown but the leading theory is too much time on the OHDSI forums.
I've recently had the opportunity to devote some time to this and, as a result, believe this is more relevant than I had previously thought, especially with the revival of THEMIS.
This isn't exactly what I sat down to write, but it's what came out, and at worst I hope it can make the relevance and potential opportunity more vivid. It is worth noting that I am likely biased, given that a solution to this problem enables the oncology data jigsaw puzzle to more or less fall into place, but even so I believe there is substance here.
I'd like to emphasize that the following specifically concerns relationships between pieces of evidence that are persisted explicitly in the source data, not subjective associations.
I've provided a link to a slide deck (can't attach a ppt) that is more of a thought experiment on how this could work than a proposal for how it should work. It is likely best viewed last (if you make it that far), and there is a duplicate link towards the end.
Rabbit hole preamble (bear with me)
There is an influx of new data coming into OMOP at a seemingly accelerating rate: more sites, more sources, more types of sources, more variability in the detail and structure of sources. This growth is both in breadth and in depth, but let's focus on the latter.
As the depth of data increases, we are crossing the limit of the current scope of adequate OMOP conventions. By adequate I mean that, for a given piece of evidence in the source data, there is a single standard target representation defined in OMOP. Inadequate could be defined as either a) no convention for that evidence, or b) more than one possible standard convention, or representation in OMOP, for that evidence.
When a site faces a gap in conventions, one of three outcomes is likely: 1) they give up, 2) they implement an ad hoc solution outside of established conventions, or 3) they work with the community to create a new convention.
Interoperability, and specifically the feasibility of network research, depends on adequate conventions
There are "general conventions" in OMOP that are foundational (_TYPE_CONCEPT, _CONCEPT_ID, _SOURCE_CONCEPT_ID, etc.). They define underlying patterns for expanding conventions in a standardized, systematic way. For example, any new table will have provenance defined by _TYPE_CONCEPT and the standard concept by _CONCEPT_ID, etc.
The FACT_RELATIONSHIP table in its current form is insufficient to facilitate a "foundational" convention, but what if it was? What if there were a mechanism for defining relationships between tables that had the same extensibility as the other "general conventions"? Hypothetically, the most general implication would be that any use case requiring novel relations between tables would either a) be defined within the scope of this foundational convention, or b) follow the pattern, the standardized extension, that the foundational convention defines for creating new conventions.
Rabbit hole
Why isn't the FACT_RELATIONSHIP mechanism currently sufficient? What is it missing? (A rough sketch of what an extended table might look like follows this list.)
Provenance - Where did the evidence of this relationship come from?
Content - What type of relationship is it?
The field exists (RELATIONSHIP_CONCEPT_ID) but a sufficient vocabulary does not
Temporality - Is the relationship limited to a period of time? If so, what is it?
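To make those three gaps concrete, here is a rough sketch of what an extended table could look like. It is a thought experiment, not the proposal itself: the columns beyond the existing FACT_RELATIONSHIP fields are hypothetical, simply reusing the _TYPE_CONCEPT and start/end date patterns found elsewhere in the CDM.

```sql
-- Sketch only: existing FACT_RELATIONSHIP columns plus hypothetical fields
-- covering the three gaps above (provenance, content, temporality).
CREATE TABLE fact_relationship_extended (
    domain_concept_id_1          INTEGER NOT NULL,
    fact_id_1                    BIGINT  NOT NULL,
    domain_concept_id_2          INTEGER NOT NULL,
    fact_id_2                    BIGINT  NOT NULL,
    relationship_concept_id      INTEGER NOT NULL,  -- content: needs a richer vocabulary
    relationship_type_concept_id INTEGER,           -- provenance, following the _TYPE_CONCEPT pattern
    relationship_start_date      DATE,              -- temporality
    relationship_end_date        DATE               -- temporality
);
```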
Why does it matter?
There is an expanding number of valid, likely impactful use cases in which the relationships between entities, as they exist in the source data, cannot sufficiently be represented in OMOP within standardized conventions.
These use cases are either being implemented with ad hoc approaches outside of conventions, or by creating new conventions that each require modifications to the CDM.
If we think of the spectrum of data in OMOP as simply entities and relationships between them, and if we had a foundational mechanism that handled relationships, only the use cases that create new entities (e.g. specimen, image, etc.) or modify existing entities would require changes to the CDM
Greater stability of the CDM facilitates interoperability and eases the burden on the tooling ecosystem (less versioning to accommodate).
As mentioned above, the slides are more of a thought experiment as to how this could work instead of a proposal for how it should
There are a few question marks in there - most notably, the slides reference field_concept_id, as other conventions have leveraged, but it is unclear whether domain_concept_id is more appropriate.
Regarding the inevitable "use cases??", see the top of this thread for some examples. If helpful, I can try to curate the extent of what I've come across thus far into a list.
Thanks @rtmill for your very thoughtful post and for the concrete proposal for a revision to the FACT_RELATIONSHIP table. I think this proposal has merit, and I'd be eager to see you or others apply it to real data as a pilot to demonstrate its value, so that we could consider promoting it more broadly.
You mention Themis, but this will fall under the CDM WG domain since it would most likely require a change to the model. Themis will definitely be available to help give feedback on proposed conventions and then help solidify any conventions needed for these data.
Thanks @Patrick_Ryan. I'll plan to give a draft proposal to the CDM group to get feedback before any sort of development/pilot effort.
@MPhilofsky Apologies for the confusion - I only meant that the THEMIS Revival™ inspired me to think about the underlying conventions more substantially.
To append to the above proposal (per CDM group feedback): the field_concept_id references should instead be table_concept_id
I'd like to provide a nudge on this, as it came up during the EU Oncology workshop.
Given the level of detail provided above, I'm looking for a suggestion on the next step. Is it to put together a presentation for the CDM group to get initial feedback before developing a prototype? @clairblacketer And if so, are there any particular areas where we should be more specific than what was provided above?
You have to write up the changes to the CDM and the list of relationship_ids you have in mind, create agreement in the WG, submit it to Clair, and defend it in the CDM WG. When it is all sanctioned, blessed, and ratified, you submit the relationship IDs through @aostropolets's community contribution process and then wait for the next vocab release and CDM release.
Hi @rtmill, I echo the sentiments and thanks for the thoughtful post. From here, the next step would be to submit a written proposal as described here. You have more than enough content, so you should be fine; this just helps us keep track.
Some thoughts I had as I was reading through:
Use of either approach, i.e. two-part foreign keys or a FACT_RELATIONSHIP extension, relies on two key things:
You actually know the relationship between facts. This may sound silly and self-evident, but many data sources are just a collection of facts with no detailed relationship information. Earlier it was mentioned "metformin to treat diabetes...", but more often we simply see the diabetes diagnosis and the metformin exposure. There is no explicit link there, so we should not create one. (A sketch of what an explicit link would look like follows this list.)
You can use these relationships to inform observational research. This could be a chicken-or-the-egg argument, but I don't understand how these relationships will play out in covariate creation or cohort creation. In my opinion, this needs critical thought and methods development to fully flesh out. I am not saying that they are not informative, because there is a reason the relationship is there, but I think there is much academic exploration to be done.
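For the first point, here is a rough sketch of how an explicitly documented "metformin to treat diabetes" link might land in FACT_RELATIONSHIP. All ids below are placeholders, and such a row would only be written when the source data states the link, never inferred.

```sql
-- Sketch only: placeholder ids throughout; written only when the source
-- explicitly records the drug-condition link.
INSERT INTO fact_relationship
    (domain_concept_id_1, fact_id_1, domain_concept_id_2, fact_id_2, relationship_concept_id)
VALUES
    (13,        -- placeholder concept id for the Drug domain
     1001,      -- drug_exposure_id of the metformin record
     19,        -- placeholder concept id for the Condition domain
     2002,      -- condition_occurrence_id of the diabetes record
     9999999);  -- hypothetical "Drug indicated for condition" relationship concept
```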
Thanks again and I am looking forward to your proposal.
I sat down to begin formalizing this as a CDM proposal, but in researching the NLP use case to include as an example, I came across mention of what appears to be an identical proposal from 2018, Proposal 2 or 3 here:
I can't tell what became of that proposal. It was marked as In Progress in April 2019, but that's the end of the thread.
@rtmill: You commented on it back at that time. Do you agree it's the same proposal, perhaps just with specific relationship concepts needed?
I think this is an important topic as well. There is the orthogonal problem of user experience for building cohort queries in Circe and Atlas. I still prefer we stick with hundreds of smaller tables with concrete relationships. However, we may want associated meta-data so that tools can more easily work with those smaller tables.
A recent example from Tufts is that we want to know the admitting and discharge attendings for a hospitalization. This is something we need to query as part of our cohort queries. Hence, it's not simply about where to store this relationship, but how we could tell Circe to use it and how it could be shown in Atlas. So I propose concrete solutions as individual tables, along with a generic mechanism for declaring what those tables and relationships are. Nothing elaborate, just focused on simple query requirements.
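As a rough sketch of what a concrete, narrowly scoped table for this case might look like (the table and column names are hypothetical, not a proposal):

```sql
-- Sketch only: a dedicated table for the Tufts use case rather than routing it
-- through a generic relationship mechanism. Names are hypothetical.
CREATE TABLE visit_attending (
    visit_occurrence_id       BIGINT  NOT NULL,  -- the hospitalization
    provider_id               BIGINT  NOT NULL,  -- the attending physician
    attending_role_concept_id INTEGER NOT NULL   -- e.g. admitting vs. discharge attending
);
```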
This sounds like an Expansion to me, which is something you add on top of the OMOP CDM and vocabularies. As a standard, it would violate our principles: (i) it's not Closed World (you cannot list externally all attending physicians), (ii) it is not useful from the perspective of the network, and (iii) it is really about patient care rather than research. So, feel free to arrange for a solution in any way that seems fit.