Representing OMOP in a graph database

lrasmussen · April 30, 2020, 2:41pm

I am collaborating on a project that is looking at representing OMOP patient data in Neo4j, and we were hoping to learn from the experiences of others in the OHDSI community who have already done this. In particular, we are interested in lessons learned with different models for nodes/edges/attributes.

I came across a 2017 software demonstration abstract (https://www.ohdsi.org/web/wiki/lib/exe/fetch.php?media=resources:jose_alvarado_rd2gd_ohdsi_submission_2017.pdf) that described this general approach, but after looking at the RD2GD repository on GitHub (https://github.com/Sapphirine/RD2GD), it seemed that while this was demonstrated for clinical data, it is not in the OMOP format.

I’ve seen a few other posts where graph databases are mentioned, and was hoping to see what experiences the community might have and be willing to share.

keesvanbochove · April 30, 2020, 6:40pm

Dear Luke,

If you have a lot of OMOP-type data, I’m not sure a graph database would add much value. In fact, you are probably better off with an RDBMS or possibly an OLAP database, as these are better optimized for the types of queries you would run on OMOP data. But of course it’s technically possible to express the OMOP model in Neo4j.

That being said, if you also have other types of data to integrate, this could be very interesting. Please check out the webinar from my colleague Ilaria Maresi last week, where she discussed a graph model for clinical trials data and also answered a question about the relation to OMOP - basically, I think you can keep high-volume medical history data in OMOP but interlink it with a broader context in a graph database.

The webinar recording is at https://vimeo.com/411366798 and slides are here: https://www.slideshare.net/pistoiaalliance/knowledge-graphs-ilaria-maresi-the-hyve-23apr2020.

Greetings,

Kees

Mark · April 30, 2020, 7:17pm

Thank you for looking into this. I have personally found frustration after frustration in mapping non-relational data into a relational data structure (if all one has is a hammer, everything looks like a nail).

Are you wedded to Neo4j? The reason I ask is that OrientDB can run in two different modes: pure graph or a graph/document hybrid. The OMOP data looks like it would map more easily into a graph/document hybrid structure, allowing for, theoretically, faster lookups as well. I am not sure how this would affect graph traversals, though.

Christian_Reich · May 2, 2020, 6:27am

@Mark and @lrasmussen:

Do you have a particular use case you want to develop by using graph databases? I understand your frustration since this subject has been brought up over and over again and nothing seems to move forward, but you must have some purpose. You mentioned “theoretically, faster lookup”. Lookup of what?

Mark · May 4, 2020, 2:01pm

TLDR: The vocabulary is graph data and should be stored as such. The current recursive lookups are slow, and it is complex to write general-purpose queries that find all matches.

Details:
My job is to build an automatic ETL from our EHR to an OMOP instance. My frustration is that EHR data, outside of the actual demographic information, is not relational data; it is document data. The vocabulary data is not relational either; it is graph data.
Whilst all of the EHR and vocabulary data can be modeled in either a graph structure or a relational structure, both take more steps and add complexity. This is why I suggested OrientDB, as it allows one to use a graph-document hybrid model. This would mean that the data can live in its natural environment, which would allow more logical searches of the data and an easier ETL.
We (Cherokee Health Systems) are one of the primary grant recipients for AOU, which means that we have to run the entire ETL process every 6 weeks (it appears that this is going to move to a faster iteration). This is a tremendous amount of data to reprocess on a regular basis.

As to the speed comment I made: I was referring to OrientDB, which, according to benchmarks, should be faster than Neo4j because it allows both graph traversals from one node to another and direct lookups using an index, much like standard relational systems. @Christian_Reich I am sorry, I was not clear on that point.

I realize that our pain point is not the same that most will have; I do not expect the entire process to change to fit into our needs. I am encouraged that others are interested in cleaning up the process.
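
To make the point about the vocabulary being graph data concrete, here is a minimal sketch in Python of an ancestor lookup written as a plain graph traversal instead of a recursive SQL query. The concept IDs and the edge list are made up for illustration. In a real instance the edges would presumably be selected from the "Is a" rows of CONCEPT_RELATIONSHIP, and it is worth noting that the CDM's CONCEPT_ANCESTOR table already stores a precomputed version of this hierarchy.

```python
from collections import deque

# Hypothetical "Is a" edges (child -> parent). The IDs are placeholders,
# not real OMOP concept IDs.
IS_A = [
    (101, 100),
    (102, 100),
    (103, 101),
    (103, 102),
    (104, 103),
]


def build_parents(edges):
    """Index each child concept to the set of its direct parents."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    return parents


def ancestors(concept_id, parents):
    """Return every ancestor with its minimum number of hops (breadth-first)."""
    distance = {}
    queue = deque([(concept_id, 0)])
    while queue:
        node, hops = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in distance:  # first visit is the shortest path
                distance[parent] = hops + 1
                queue.append((parent, hops + 1))
    return distance


PARENTS = build_parents(IS_A)
RESULT = ancestors(104, PARENTS)
```

A graph database performs this kind of traversal natively; the sketch only shows the shape of the query, not how any particular engine would execute it.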

lrasmussen · May 5, 2020, 2:21pm

Thanks all for your replies and helpful resources!!

@keesvanbochove - I will definitely check out that presentation, thank you! And we are integrating other types of data along with the OMOP, sorry for not clarifying that.

@Mark - the team I’m working with has past experience working with Neo4J so I think we will be progressing with that. But thank you for the suggestion of OrientDB. I’ll mention it to the team and investigate some myself.

@Christian_Reich - we’re hoping to be able to traverse and discover more complex relationships/patterns across a graph structure, but I certainly appreciate your question “why?” We are tracking performance and fit to purpose of this as we progress with the project, but getting started with some hands-on experience will help us validate it further.

Andrew · May 5, 2020, 2:34pm

I suspect that the current work in N3C to link COVID data in OMOP form to the knowledge bases and ontologies used in NIH Translator projects will build some of what’s needed to do the work you are interested in. Whether or not OMOPed vocabs get represented in Neo4j, this work will enable analyses that require a graph representation to utilize clinical data in OMOP form. It’s evolving work, but I think it has a variety of important use cases of interest to the OHDSI community. Others at your shop are involved. It would be great to keep on top of its implications and value for OHDSI, and to help this community decide whether and how to push mature solutions developed in N3C and Translator toward new or extended OHDSI tools, CDM modules, etc.

mpreusse · July 16, 2020, 8:49pm

@lrasmussen: I’m fairly new to OMOP CDM but I have extensive experience with Neo4j and I build Neo4j applications for all kinds of use cases in medical research. My company is currently on track to become a certified SME in the EHDEN project (http://ehden.eu/).

I started loading the standardized vocabularies into Neo4j; this really is a fantastic resource. I plan to continue with other parts of the CDM.

@Christian_Reich: I think there are multiple reasons to use graph databases: i) once you want to integrate with other data sources, a graph database is easier to handle and extend; ii) the ancestry in vocabularies is a graph, and exploring it in a graph database is much more intuitive; iii) there is a really good ecosystem for data visualization and exploration around Neo4j.

My understanding of the CDM is still limited, I’m just getting started. But from what I have seen so far it should be feasible to develop a ‘standard mapping’ of OMOP CDM to Neo4j.
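
As a rough idea of what such a mapping could look like for the vocabulary tables, here is a minimal Python sketch that turns CONCEPT and CONCEPT_RELATIONSHIP rows into parameterized Cypher statements. The column names follow the OMOP CDM; the `Concept` label, the `RELATED_TO` relationship type, and the sample rows are assumptions made up for this example, not an agreed standard mapping.

```python
# Label, relationship type and sample values are assumptions for illustration.
# Column names follow the CONCEPT and CONCEPT_RELATIONSHIP tables.
CONCEPT_QUERY = (
    "MERGE (c:Concept {concept_id: $concept_id}) "
    "SET c.concept_name = $concept_name, "
    "c.vocabulary_id = $vocabulary_id, "
    "c.domain_id = $domain_id"
)

RELATIONSHIP_QUERY = (
    "MATCH (a:Concept {concept_id: $concept_id_1}) "
    "MATCH (b:Concept {concept_id: $concept_id_2}) "
    "MERGE (a)-[r:RELATED_TO {relationship_id: $relationship_id}]->(b)"
)


def concept_statement(row):
    """Turn one CONCEPT row (a dict) into a (query, parameters) pair."""
    keys = ("concept_id", "concept_name", "vocabulary_id", "domain_id")
    return CONCEPT_QUERY, {key: row[key] for key in keys}


def relationship_statement(row):
    """Turn one CONCEPT_RELATIONSHIP row into a (query, parameters) pair."""
    keys = ("concept_id_1", "concept_id_2", "relationship_id")
    return RELATIONSHIP_QUERY, {key: row[key] for key in keys}


# Placeholder rows; the IDs and names are invented.
concept_rows = [
    {"concept_id": 1, "concept_name": "Example parent",
     "vocabulary_id": "Example", "domain_id": "Condition", "extra": "ignored"},
    {"concept_id": 2, "concept_name": "Example child",
     "vocabulary_id": "Example", "domain_id": "Condition", "extra": "ignored"},
]
relationship_rows = [
    {"concept_id_1": 2, "concept_id_2": 1, "relationship_id": "Is a"},
]

statements = [concept_statement(row) for row in concept_rows]
statements += [relationship_statement(row) for row in relationship_rows]
```

Each pair could then be handed to a Neo4j client; that step is left out so the sketch runs without a database. Keeping `relationship_id` as a property on a single relationship type is one possible design choice that lets the statement stay fully parameterized; a dedicated relationship type per `relationship_id` would be another.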

Christian_Reich · July 20, 2020, 3:23am

@mpreusse:

I am all ears and really want to understand. I get the browsing and navigation. And I get the integration. The former is not really relevant in a remote network setting (you cannot see patient data, you can only query them with the intent of creating summary statistics). The latter is clear, even though we already have a pretty good non-graph approach.

But our main use cases currently create insights on how to treat patients:

Characterization of what’s going on (stratifications of patient populations, timing, sequence of things over time)
Population-level estimation (measure the effect of one factor or intervention on an outcome by using a population with that intervention and comparing it to another one, with all the tricky stats necessary)
Patient-level prediction (measure the effect of all factors in all members of a population on an outcome)

The graphs don’t seem to help with this. I think. Is there another strategic application?

mpreusse · July 20, 2020, 11:42am

@Christian_Reich:

You are right of course: In the context of a fully federated remote network setting, there is no immediate use case for a graph database. It would not make sense to replace the relational database (because of the ‘common’ in CDM) and without direct access to all data it is not possible to pull everything into a graph.

I would argue the following: Different data storage paradigms and their associated ecosystems open routes for different kinds of applications. Let’s assume, for the sake of the argument, that it is possible to pull all data into a graph database. This would allow us to apply graph algorithms and analysis to all three of the main use cases you describe: clustering for stratification, weighted ranking algorithms for population-level estimates, and graph neural networks for patient-level prediction. In theory, it is possible to get all the data from OMOP CDM, create networks with some R package and run the analysis. From a practical perspective, it’s much easier to do it with data in a graph database.
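
As a small illustration of the "create networks and run the analysis" step, here is a Python sketch that projects patient-to-concept records onto a patient-to-patient similarity network, the kind of structure a clustering algorithm could then work on. Everything in it is an assumption for illustration: the records stand in for (person_id, condition_concept_id) pairs from CONDITION_OCCURRENCE, the IDs are placeholders, and Jaccard similarity with a 0.5 threshold is just one arbitrary way to weight edges.

```python
from itertools import combinations

# Hypothetical (person_id, condition_concept_id) pairs; all IDs are placeholders.
records = [
    (1, 100), (1, 101), (1, 102),
    (2, 100), (2, 101),
    (3, 200),
]


def concepts_by_person(rows):
    """Group the concept IDs recorded for each person into a set."""
    index = {}
    for person_id, concept_id in rows:
        index.setdefault(person_id, set()).add(concept_id)
    return index


def similarity_edges(index, threshold=0.5):
    """Link two patients when the Jaccard similarity of their concept sets
    reaches the (arbitrary) threshold."""
    edges = []
    for a, b in combinations(sorted(index), 2):
        shared = len(index[a] & index[b])
        union = len(index[a] | index[b])
        score = shared / union
        if score >= threshold:
            edges.append((a, b, round(score, 3)))
    return edges


INDEX = concepts_by_person(records)
PATIENT_EDGES = similarity_edges(INDEX)
```

The pairwise comparison grows quadratically with the number of patients, so this only shows the idea; it is not a claim about how a graph database or a graph library would compute it.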

I don’t really know how a graph database fits into the overall architecture of OHDSI/OMOP CDM. It’s mostly my general curiosity and my personal preference for graphs.

I actively work on modeling the standardized vocabularies in Neo4j. I’ve been using different types of medical terminology as part of Neo4j applications for years. I have always struggled with versioning and mappings: outdated mapping projects, incomplete mappings and all those problems. The vocabularies of OMOP CDM are a fantastic resource that somehow flies under the radar, i.e. you have to read into the documentation to find out about them.

I’m not sure if it’s possible to use them (licensing, terms etc) but they would definitely benefit many graph based data integration projects.

trberg · December 11, 2020, 8:35pm

@Christian_Reich @mpreusse @lrasmussen

Has there been any progress or more thought put into a graph database for OMOP concepts? We have a use case for N3C where we’d like to have a graph representation vs a relational database, specifically we need to apply a version of the minimum spanning tree algorithm on the concept network. Before I start attempting to build out the graph representation, I wanted to see whether other folks have already done some or all of this work since July that we can leverage.
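
For reference, here is a minimal Python sketch of a minimum spanning tree computed over a small concept network, using Kruskal's algorithm with a union-find structure. The post does not say which variant of the algorithm is needed or how edges would be weighted, so the algorithm choice, the concept IDs and the weights are all assumptions for illustration.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm: returns (total_weight, chosen_edges)."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the set representative, halving the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    total = 0
    chosen = []
    # Consider edges from lightest to heaviest.
    for weight, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            total += weight
            chosen.append((a, b, weight))
    return total, chosen


# Hypothetical concept network given as (weight, concept_a, concept_b).
CONCEPTS = [1, 2, 3, 4]
EDGES = [
    (1, 1, 2),
    (4, 1, 3),
    (2, 2, 3),
    (5, 2, 4),
    (3, 3, 4),
]
TOTAL, TREE = minimum_spanning_tree(CONCEPTS, EDGES)
```

If the concept network is not connected, the same function returns a minimum spanning forest, one tree per connected component.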

gregk · December 12, 2020, 3:27pm

@trberg Odysseus worked with MarkLogic a few years back and successfully re-created OMOP CDM as a graph. It is actually not that challenging to do, and it was a very interesting POC. The real challenge is that once you have done it, none of the existing OHDSI methods and tools - which all rely on SQL - would work. Even though MarkLogic experimented with enabling SQL on top of their solution at the time, the SQL support was a bit more limited than what OHDSI requires.

Re-creating OMOP in a document style was very interesting - PERSON as a document was the concept we experimented with. Document format does offer an interesting combination of data harmonization plus some flexibility for not yet harmonized domains.

trberg · December 14, 2020, 4:35pm

@gregk Thank you for the information, this is very helpful!

kmoralesml · December 15, 2020, 2:32pm

To be precise, we successfully modeled concepts as documents and, beyond that, expressed the linkages between concepts, both intra- and inter-vocabulary, through the use of semantic triples.
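
The general shape of that approach can be sketched in Python as follows: one document per concept, plus (subject, predicate, object) triples for the linkages. This is not the MarkLogic implementation, which is not described in this thread. The field names follow the CONCEPT table, "Is a" and "Maps to" are OMOP relationship IDs, and all values and IDs are invented placeholders.

```python
import json

# One concept as a self-contained document. Field names follow the CONCEPT
# table; the values are invented placeholders.
concept_document = {
    "concept_id": 1,
    "concept_name": "Example concept",
    "vocabulary_id": "ExampleVocabA",
    "domain_id": "Condition",
}

# Linkages as (subject, predicate, object) triples keyed by concept_id.
# Concepts 1 and 2 are imagined to share a vocabulary; 3 and 4 belong to another.
triples = [
    (1, "Is a", 2),      # intra-vocabulary hierarchy
    (1, "Maps to", 3),   # inter-vocabulary mapping
    (3, "Is a", 4),
]


def objects_of(subject, predicate, store):
    """Return the objects of all triples matching subject and predicate."""
    return [o for s, p, o in store if s == subject and p == predicate]


serialized = json.dumps(concept_document, sort_keys=True)
maps_to = objects_of(1, "Maps to", triples)
```

Keeping the descriptive fields in the document and the relationships in triples means either side can be extended without touching the other.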

lrasmussen · December 15, 2020, 4:03pm

@trberg - we have made some progress since July. We’re still working on finalizing scripts, etc., but happy to share and collaborate on what’s out there:

@gregk - thanks for sharing about the previous work in this space. Is any of this work available? We’d love to build upon any lessons learned.

mpreusse · December 16, 2020, 2:52pm

Thanks @lrasmussen for the github link!

I continued loading concepts into Neo4j; maybe I can share something at the beginning of next year.

@trberg would be nice to catch up next year to discuss ideas. Maybe a call with everyone here? There seems to be some interest in OMOP<->graph.

Dermot_Doyle · December 17, 2020, 9:23pm

“I have always struggled with versioning and mappings. Outdated mapping projects, incomplete mappings and all those problems.”

That’s something I can help with, contact me!

Mkang1204 · December 18, 2020, 3:46am

Thanks @mpreusse and everyone for the nice comments!

I’m working with @lrasmussen on this project. We have loaded about 400 patients’ OMOP data into Neo4j and are at the beginning stage of applying graph algorithms. I’m happy to join the call and share some lessons learned too. We also have some problems generating an accurate visit_occurrence sequence to build the patient journey, and would love to hear some good practices for applying graph algorithms and graph embeddings.
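
One possible way to derive a visit sequence, sketched in Python under stated assumptions: visits are sorted per person by `visit_start_date`, ties on the same day are broken by `visit_occurrence_id`, and consecutive visits are linked by a "next visit" edge. The column names come from the VISIT_OCCURRENCE table; the rows and the tie-breaking rule are assumptions for illustration, and `visit_start_datetime` may be a better sort key where it is populated.

```python
from datetime import date
from itertools import groupby

# Hypothetical VISIT_OCCURRENCE rows; all IDs and dates are invented.
visits = [
    {"visit_occurrence_id": 12, "person_id": 1, "visit_start_date": date(2020, 3, 1)},
    {"visit_occurrence_id": 10, "person_id": 1, "visit_start_date": date(2020, 1, 15)},
    {"visit_occurrence_id": 11, "person_id": 1, "visit_start_date": date(2020, 1, 15)},
    {"visit_occurrence_id": 20, "person_id": 2, "visit_start_date": date(2019, 6, 1)},
]


def sort_key(visit):
    # Person first, then start date, then ID as an arbitrary tie-breaker.
    return (visit["person_id"], visit["visit_start_date"], visit["visit_occurrence_id"])


def next_visit_edges(rows):
    """Return (earlier_visit_id, later_visit_id) pairs within each person."""
    edges = []
    ordered = sorted(rows, key=sort_key)
    for _, group in groupby(ordered, key=lambda visit: visit["person_id"]):
        journey = [visit["visit_occurrence_id"] for visit in group]
        edges.extend(zip(journey, journey[1:]))
    return edges


VISIT_EDGES = next_visit_edges(visits)
```

Overlapping visits, such as an outpatient visit during an inpatient stay, are not handled here and would need an explicit rule.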

Ben_Hughes · January 8, 2021, 11:04am

We do some work with ontologies from OHDSI/OMOP within graph and NLP engines.

It tends to be for use cases where we are using OMOP in conjunction with data types that are not in OMOP or not in a structured format. We don’t have scaled client use at this point in time, though.

We do, however, do a lot of work with NLP to load knowledge graphs and provide semantic resources for that - which I saw was a need on one of the threads. Contact the Linguamatics team.

gabriellopes · June 7, 2022, 4:03pm

Hey, hello!

@lrasmussen, did your group come away with any lessons learned, benefits, etc., from using graphs to represent OMOP concepts?

I’m trying to automate some of the configuration steps needed to get WebAPI and other OHDSI tools working properly, without having to follow several tutorials that are sometimes inconsistent with one another. Having the patient-level data also represented as a knowledge graph would add a lot of value to the analysis ecosystem.

Thank you in advance!