TLDR: The vocabulary is graph data and should be stored as such. The current recursive lookups are both slow and complex to write general purpose queries to find all matches.
My job is to build an automatic ETL of our EHR to OMOP instance. My frustration is that EHR data, outside of the actual demographic information, is not relational data, it is document data. The vocabulary data is not relational either, it is graph.
Whist all of the EHR and Vocabulary data can be modeled in a graph structure or a relational structure, both take more steps and add complexity, this is why I suggested OrientDB as it allows one to use a Graph-Document hybrid model. This would mean that data can live in it’s natural environment. This would allow more logical searches of data and easier ETL.
We( Cherokee Health Systems) are one of the primary grant receipts for AOU, this means that we have to run the entire ETL process every 6 weeks ( it appears that this is going to move to a faster iteration). This is a tremendous amount of data to reprocess on a regular basis.
As to the speed comment I made, I was referring to OrientDB, according to benchmarks, should be faster than NEO4J as to it allows Graph transversals from one node to another or direct lookups using a an index much like standard relational systems. @ Christian_Reich I am sorry, I was not clear on that point.
I realize that our pain point is not the same that most will have; I do not expect the entire process to change to fit into our needs. I am encouraged that others are interested in cleaning up the process.