I think the biggest problem is that we are using relational-style tables to represent graph data. Yes, it can be done, but it creates so many levels of complexity. At the risk of being redundant, we need our vocabulary stored in a graph database.
Perhaps for network studies, use a later version of Postgres, with the vocabulary stored as a graph structure. I realize this would create issues for SQL Server-only users, since it has no good way to represent graphs (XML is never a good idea), but the multi-database approach will always give inferior results.
I am attempting not to get on my lack-of-standards soapbox. (edit: I failed)
@Mark I agree with you that a graph database is a more natural fit, but adopting OMOP already requires so many areas of expertise that adding an entirely new database type that is foreign to 99% of the community is a bridge too far (at this point).
I was assuming if this got some form of a green light we could create an initial set of relationships based on the use cases we're currently aware of and then poll the community to see if there are suggested additions. That would at least be a starting point for the vocabulary.
Perhaps your "generic relationships", i.e. "this domain to this domain", can serve as the top levels of hierarchies, where any more specific relationship (your "extension" relations) that falls under the same "X domain to Y domain" relation is a child concept of the more generic one. That way, for tooling, we can use "include descendants" if there are broad use cases that don't care about how the two entities are related, just that they are.
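To make that concrete, here is a rough sketch of what such an "include descendants" query might look like, assuming the relationship concepts were organized under generic parents in CONCEPT_ANCESTOR; the parent concept id below is a made-up placeholder, not a real concept.

```sql
-- Sketch only: assumes relationship concepts sit in a hierarchy recorded in
-- CONCEPT_ANCESTOR, with a generic "Condition to Procedure" parent concept.
SELECT fr.*
FROM fact_relationship fr
JOIN concept_ancestor ca
  ON ca.descendant_concept_id = fr.relationship_concept_id
WHERE ca.ancestor_concept_id = 2000000101;  -- hypothetical "Condition to Procedure" parent
```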
If we are going to add this, can we do it with a unique relationship_id instead of "maps to", please? This is already a problem in the ETL world with all the "maps to" relationships spread over various domains.
@rtmill Thank you for this topic. I don't think there is one right approach. An important consideration is to look at how each proposal fares across various contexts: DDL (data definition), ETL (extract, transform, load), DSL (domain-specific languages, like Circe), and one-off SQL queries.
I am partial to the 56 table solution proposed by @jmethot. Naming conventions let us use meta-data to manage DDL and DSL contexts, while at the same time not burdening one-off SQL queries with indirection (at the cost of UNION ALL for combined cases). Would ETLs be easier or harder with EAV vs 56 tables?
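To illustrate the UNION ALL cost mentioned above, here is a rough sketch of the kind of combined query the 56-table approach would require; the table and column names are hypothetical stand-ins for two of the domain-pair tables, not part of any actual proposal.

```sql
-- Sketch only: condition_procedure_relationship and condition_drug_relationship
-- are hypothetical examples of two of the ~56 domain-pair tables.
SELECT condition_occurrence_id AS fact_id_1,
       procedure_occurrence_id AS fact_id_2,
       relationship_concept_id
FROM condition_procedure_relationship
UNION ALL
SELECT condition_occurrence_id,
       drug_exposure_id,
       relationship_concept_id
FROM condition_drug_relationship;
```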
~56 standard relationship concepts that represent all "minimally semantic" relationships between domains. (It wouldn't be exactly 56, because Robert proposes some domain-to-same-domain concepts, and some domain-to-other-domain concepts probably don't make sense.)
Some number of carefully chosen standard relationship concepts with specific semantics that have analytical use cases, such as those listed at the beginning of @rtmill's original post in this thread. The trick is finding the Goldilocks specificity that satisfies the analytical needs but is not so narrow that people eventually want thousands of them (that latter situation is what I imagine is making steam escape @Christian_Reich's ears reading this thread).
Neither of these is EAV, but using that metaphor, in these proposals the Es and Vs are fixed (the OMOP CDM domains) and we're proposing a small initial set of standardized As.
Note that I don't think @rtmill is proposing that existing (local, non-standard) uses of FACT_RELATIONSHIP would change, but that OHDSI tools would only recognize standard relationship concepts therein. If we wanted to make that explicit, we could propose a new DOMAIN_RELATIONSHIP table to house only standard domain relationships.
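Purely as a sketch of that last idea, a DOMAIN_RELATIONSHIP table could mirror the FACT_RELATIONSHIP columns while being reserved for standard relationship concepts; the DDL below is hypothetical, not a worked-out proposal.

```sql
-- Sketch only: same shape as FACT_RELATIONSHIP, but intended to hold only
-- standard domain-to-domain relationship concepts.
CREATE TABLE domain_relationship (
    domain_concept_id_1     INTEGER NOT NULL,  -- domain (table) of the first fact
    fact_id_1               BIGINT  NOT NULL,  -- primary key of the first fact
    domain_concept_id_2     INTEGER NOT NULL,  -- domain (table) of the second fact
    fact_id_2               BIGINT  NOT NULL,  -- primary key of the second fact
    relationship_concept_id INTEGER NOT NULL   -- standard relationship concept only
);
```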
@jmethot I'm even more supportive of using tables for specific kinds of data, driven by concrete use cases. Even 56 generic tables may be used in multiple ways, hindering their usefulness. Perhaps we should focus our energies on creating a schema/vocabulary module system and the community processes for ensuring that extensions are well designed and integrated. We could have community-owned continuous integration to ensure that we don't have conflicts, etc. For DDL and DSL contexts, we could drive schema creation and generic querying with meta-data.
Andy_Kanter (Andrew S. Kanter, MD MPH FACMI FAMIA):
Just wanted to connect the @Paul_Nagy presentation about medical imaging to this discussion: they are proposing a new table, image_feature, which allows for linking observations about an image, such as the size of a mass. Since imaging has a specific instance/series and a location (on the image), things can be linked together over time. This is not generalizable to all fact relationships, but it was interesting.
Also recent discussions about mapping HPO concepts to OMOP raised the specter of more granular concepts than SNOMED (either due to missing SNOMED primitives or post-coordination). This discussion might also be relevant. @mellybelly
Apologies for starting this conversation and going silent shortly after. I was put out of commission with pneumonia and have been playing catch up since. The source of the infection is still unknown but the leading theory is too much time on the OHDSI forums.
I've recently had the opportunity to devote some time to this and, as a result, believe this is more relevant than I had previously thought, especially with the revival of THEMIS.
This isn't exactly what I sat down to write, but it's what came out, and at worst I hope it can make the relevance and potential opportunity more vivid. It is worth noting that I am likely biased, given that a solution to this problem enables the oncology data jigsaw puzzle to more or less fall into place, but even so I believe there is substance here.
I'd like to emphasize that the following specifically concerns relationships between pieces of evidence that are persisted explicitly in the source data, not subjective associations.
I've provided a link to a slide deck (can't attach a ppt) that is more of a thought experiment on how this could work than a proposal for how it should work. It is likely best viewed last (if you make it that far), and there is a duplicate link towards the end.
Rabbit hole preamble (bear with me)
There is an influx of new data coming into OMOP at a seemingly accelerating rate: more sites, more sources, more types of sources, more variability in the detail and structure of sources. This growth is both in breadth and in depth, but let's focus on the latter.
As the depth of data increases, we are crossing the limit of the current scope of adequate OMOP conventions. By adequate I mean that, for a given piece of evidence in the source data, there is a single standard target representation defined in OMOP. Inadequate could be defined as either a) no convention for that evidence, or b) more than one possible standard convention, or representation in OMOP, for that evidence.
When a site faces a gap in conventions, one of three outcomes is likely: 1) they give up, 2) they implement an ad hoc solution outside of established conventions, or 3) they work with the community to create a new convention.
Interoperability, and specifically the feasibility of network research, depends on adequate conventions
There are "general conventions" in OMOP that are foundational (_TYPE_CONCEPT, _CONCEPT_ID, _SOURCE_CONCEPT_ID, etc.). They define underlying patterns for expanding conventions in a standardized, systematic way. For example, any new table will have provenance defined by _TYPE_CONCEPT and the standard concept by _CONCEPT_ID, etc.
The FACT_RELATIONSHIP table in its current form is insufficient to facilitate a "foundational" convention, but what if it was? What if there were a mechanism for defining relationships between tables that had the same extensibility as the other "general conventions"? Hypothetically, the most general implication would be that any use case requiring novel relations between tables would either a) be defined within the scope of this foundational convention, or b) follow the pattern, the standardized extension, that the foundational convention defines for creating new conventions.
Rabbit hole
Why isn't the FACT_RELATIONSHIP mechanism currently sufficient? What is it missing? (A rough sketch of what an extended table might look like follows this list.)
Provenance - Where did the evidence of this relationship come from?
Content - What type of relationship is it?
The field exists (RELATIONSHIP_CONCEPT_ID) but a sufficient vocabulary does not
Temporality - Is the relationship limited to a period of time? If so, what is it?
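To make those three gaps concrete, here is a rough sketch of what an extended table could look like. It is a thought experiment, not the proposal itself: the columns beyond the existing FACT_RELATIONSHIP fields are hypothetical, simply reusing the _TYPE_CONCEPT and start/end date patterns found elsewhere in the CDM.

```sql
-- Sketch only: existing FACT_RELATIONSHIP columns plus hypothetical fields
-- covering the three gaps above (provenance, content, temporality).
CREATE TABLE fact_relationship_extended (
    domain_concept_id_1          INTEGER NOT NULL,
    fact_id_1                    BIGINT  NOT NULL,
    domain_concept_id_2          INTEGER NOT NULL,
    fact_id_2                    BIGINT  NOT NULL,
    relationship_concept_id      INTEGER NOT NULL,  -- content: needs a richer vocabulary
    relationship_type_concept_id INTEGER,           -- provenance, following the _TYPE_CONCEPT pattern
    relationship_start_date      DATE,              -- temporality
    relationship_end_date        DATE               -- temporality
);
```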
Why does it matter?
There is an expanding number of valid, likely impactful use cases in which the relationships between entities, as they exist in the source data, cannot sufficiently be represented in OMOP within standardized conventions.
These use cases are either being implemented with ad hoc approaches outside of conventions, or by creating new conventions that each require modifications to the CDM.
If we think of the spectrum of data in OMOP as simply entities and relationships between them, and if we had a foundational mechanism that handled relationships, only the use cases that create new entities (e.g. specimen, image, etc.) or modify existing entities would require changes to the CDM
Greater stability of the CDM facilitates interoperability and eases the burden on the tooling ecosystem (less versioning to accommodate).
As mentioned above, the slides are more of a thought experiment as to how this could work instead of a proposal for how it should
There are a few question marks in there - most notably, the slides reference field_concept_id, as other conventions have leveraged, but it is unclear whether domain_concept_id is more appropriate.
Regarding the inevitable "use cases??", see the top of this thread for some examples. If helpful, I can try to curate the extent of what I've come across thus far into a list.
Thanks @rtmill for your very thoughtful post and for the concrete proposal for a revision to the FACT_RELATIONSHIP table. I think this proposal has merit, and I'd be eager to see you or others apply it to real data as a pilot to demonstrate its value, so that we could consider promoting it more broadly.
You mention Themis, but this will fall under the CDM WG domain since it would most likely require a change to the model. Themis will definitely be available to help give feedback on proposed conventions and then help solidify any conventions needed for these data.
Thanks @Patrick_Ryan. I'll plan to give a draft proposal to the CDM group to get feedback before any sort of development/pilot effort.
@MPhilofsky Apologies for the confusion - I only meant that the THEMIS Revival™ inspired me to think about the underlying conventions more substantially.
To append to the above proposal (per CDM group feedback): the field_concept_id references should instead be table_concept_id
I'd like to provide a nudge on this, as it came up during the EU Oncology workshop.
Given the level of detail provided above, I'm looking for a suggestion on the next step. Is it to put together a presentation for the CDM group to get initial feedback before developing a prototype? @clairblacketer And if so, are there any particular areas where we should be more specific than what was provided above?
You have to write up the changes to the CDM and the list of relationship_ids you have in mind, create agreement in the WG, submit it to Clair, and defend it in the CDM WG. When it is all sanctioned, blessed, and ratified, you submit the relationship IDs through @aostropolets's community contribution process and then wait for the next vocab release and CDM release.
Hi @rtmill, I echo the sentiments and thanks for the thoughtful post. From here, the next step would be to submit a written proposal as described here. You have more than enough content, so you should be fine; this just helps us keep track.
Some thoughts I had as I was reading through:
Use of either approach, i.e. two-part foreign keys or a FACT_RELATIONSHIP extension, relies on two key things:
You actually know the relationship between facts. This may sound silly and self-evident, but many data sources are just a collection of facts with no detailed relationship information. Earlier it was mentioned "metformin to treat diabetes...", but more often we simply see the diabetes diagnosis and the metformin exposure. There is no explicit link there, so we should not create one. (A sketch of what an explicit link would look like follows this list.)
You can use these relationships to inform observational research. This could be a chicken-or-the-egg argument, but I don't understand how these relationships will play out in covariate creation or cohort creation. In my opinion, this needs critical thought and methods development to fully flesh out. I am not saying that they are not informative, because there is a reason the relationship is there, but I think there is much academic exploration to be done.
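For the first point, here is a rough sketch of how an explicitly documented "metformin to treat diabetes" link might land in FACT_RELATIONSHIP. All ids below are placeholders, and such a row would only be written when the source data states the link, never inferred.

```sql
-- Sketch only: placeholder ids throughout; written only when the source
-- explicitly records the drug-condition link.
INSERT INTO fact_relationship
    (domain_concept_id_1, fact_id_1, domain_concept_id_2, fact_id_2, relationship_concept_id)
VALUES
    (13,        -- placeholder concept id for the Drug domain
     1001,      -- drug_exposure_id of the metformin record
     19,        -- placeholder concept id for the Condition domain
     2002,      -- condition_occurrence_id of the diabetes record
     9999999);  -- hypothetical "Drug indicated for condition" relationship concept
```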
Thanks again and I am looking forward to your proposal.
I sat down to begin formalizing this as a CDM proposal, but in researching the NLP use case to include as an example, I came across mention of what appears to be an identical proposal from 2018, Proposal 2 or 3 here:
I can't tell what became of that proposal. It was marked as In Progress in April 2019, but that's the end of the thread.
@rtmill: You commented on it back at that time. Do you agree it's the same proposal, perhaps just with specific relationship concepts needed?
I think this is an important topic as well. There is the orthogonal problem of user experience for building cohort queries in Circe and Atlas. I still prefer we stick with hundreds of smaller tables with concrete relationships. However, we may want associated meta-data so that tools can more easily work with those smaller tables.
A recent example from Tufts is that we want to know the admitting and discharge attendings for a hospitalization. This is something we need to query as part of our cohort queries. Hence, it's not simply about where to store this relationship, but how we could tell Circe to use it and how it could be shown in Atlas. So I propose concrete solutions as individual tables, along with a generic mechanism for declaring what those tables and relationships are. Nothing elaborate, just focused on simple query requirements.
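As a rough sketch of what a concrete, narrowly scoped table for this case might look like (the table and column names are hypothetical, not a proposal):

```sql
-- Sketch only: a dedicated table for the Tufts use case rather than routing it
-- through a generic relationship mechanism. Names are hypothetical.
CREATE TABLE visit_attending (
    visit_occurrence_id       BIGINT  NOT NULL,  -- the hospitalization
    provider_id               BIGINT  NOT NULL,  -- the attending physician
    attending_role_concept_id INTEGER NOT NULL   -- e.g. admitting vs. discharge attending
);
```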
This sounds like an Expansion to me, which is something you add on top of the OMOP CDM and vocabularies. As a standard, it would violate our principles: (i) it's not Closed World (you cannot list externally all attending physicians), (ii) it is not useful from the perspective of the network, and (iii) it is really about patient care rather than research. So, feel free to arrange for a solution in any way that seems fit.