As you might have heard, we are introducing community contributions that enable community members to inject their local vocabularies and mappings into the OHDSI Standardized Vocabularies.
Apart from source vocabularies, there seems to be interest in creating/having a knowledge base of all sorts of local coding schema/free-text entry mappings to standard Vocabulary terms. I know @MPhilofsky is especially interested in it and has received some requests and potential contributions. While I don’t want to hijack this initiative, there is a question I’ve been trying to wrap my head around:
If you have such a knowledge base on your side (even a small one for your own ETL, like a source_to_concept_map), do you keep any metadata? Versions, for example? Reasons for mapping changes? Time stamps? Something else?
From an Ontology and NLP perspective, I find the topic incredibly interesting.
I’d like to share some things I have come across:
Some time ago, I initiated a discussion analyzing collaborative tools alongside some conventional metadata and versioning: Link to Discussion. Specifically, I regard Gra.fo as an exceptional tool for developing mappings in a collaborative environment, with Stanford Protégé being a good non-web-based option.
@lee_evans has recently developed a branch of Broadsea incorporating the Data Build Tool (DBT), known as Broadsea-Dbt. DBT is a widely used data engineering tool that aims to streamline data transformation processes, making them akin to software development in terms of version control, automation, and documentation.
This blog post by Emily Riederer significantly piqued my interest: Link to Blog Post. Riederer has also released R packages like dbtplyr and convo that effectively put these concepts into practice.
Based on my experience, RDF is a widely accepted standard data model, with Turtle and JSON-LD as common serialization syntaxes and SPARQL as the query language. From a readability perspective, I have found LinkML and JSON-LD preferable.
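To make that concrete, here is a minimal sketch using Python's rdflib: a single local code linked to a standard concept via a SKOS mapping property, serialized to both Turtle and JSON-LD. The namespaces, local code, and target concept URI are all made up for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Hypothetical namespaces and identifiers, for illustration only.
LOCAL = Namespace("https://example.org/local-codes/")
OMOP = Namespace("https://example.org/omop-concepts/")

g = Graph()
g.bind("skos", SKOS)
g.bind("local", LOCAL)
g.bind("omop", OMOP)

source = LOCAL["LAB-GLU-POC"]   # made-up local lab code
target = OMOP["3004501"]        # made-up standard concept URI

g.add((source, RDF.type, SKOS.Concept))
g.add((source, SKOS.prefLabel, Literal("Point-of-care glucose")))
g.add((source, SKOS.exactMatch, target))

# Same graph, two common serializations.
print(g.serialize(format="turtle"))
print(g.serialize(format="json-ld"))
```

The same triples can then be queried with SPARQL or loaded into a triple store, which is part of what makes the model attractive for sharing mappings.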
There’s an active group known as Ontolog which could be highly informative for anyone interested in delving deeper into Ontology Engineering.
The Monarch Initiative is currently developing OntoGPT, which could be beneficial for developing mappings (thank you to @Andrew for introducing me to their work).
At the University of Colorado, we have been mapping our source values to standard concept_ids for 6 years. During this time, I have learned a lot about our source system, Epic Caboodle, the pros/cons of different ways to map (STCM vs. inserts into the Concept/Concept Relationship tables), and the OHDSI-supported vocabularies. I’ll share my experiences here and hope others do the same. The semantic mapping process requires a LOT of human resources and creates mapping & knowledge debt for healthcare systems. I was quite naive to this when we first started on the custom mapping journey. Simpler is much better!
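For readers less familiar with the STCM approach, here is a sketch of what a single source_to_concept_map row looks like, using the OMOP CDM column names; the source code, local vocabulary id, and target concept_id are invented for this example.

```python
from datetime import date

# One illustrative source_to_concept_map (STCM) row using OMOP CDM v5
# column names; the source code, vocabulary id, and target_concept_id
# are made up for this example.
stcm_row = {
    "source_code": "EPIC#12345",           # hypothetical Caboodle source value
    "source_concept_id": 0,                # 0 when no source concept is created
    "source_vocabulary_id": "UCD_LOCAL",   # hypothetical local vocabulary id
    "source_code_description": "Example local lab result value",
    "target_concept_id": 3004501,          # illustrative standard concept_id
    "target_vocabulary_id": "LOINC",
    "valid_start_date": date(2017, 1, 1),
    "valid_end_date": date(2099, 12, 31),
    "invalid_reason": None,
}
```

The alternative mentioned above, inserting local concepts (conventionally with concept_ids of 2 billion and above) plus "Maps to" rows into the Concept and Concept Relationship tables, makes local codes behave like any other concept in queries, at the cost of maintaining those tables across vocabulary refreshes.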
Versions: Every time I add mappings to our semantic mapping file, I copy the file, insert the new mappings, and increment the version. We use numbers (currently on v121), but if I were starting over, I would use dates because they carry more meaning than sequential numbers. That said, it hasn’t made enough of a difference to change my versioning process; I haven’t found a need for the dates, and the files are stored with the date as metadata anyway.
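A rough sketch of that copy-and-version step, with hypothetical file names, using a date stamp instead of a sequential number:

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical file names; sketches date-based versioning of a mapping
# file instead of sequential numbers like v121.
current = Path("semantic_mapping_v121.csv")
dated_copy = current.with_name(f"semantic_mapping_{date.today():%Y-%m-%d}.csv")
shutil.copy(current, dated_copy)  # copy first, then insert the new mappings into the copy
```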
Reasons for mapping changes: I had thought about replicating the OHDSI Concept table and including an invalid_reason field. I don’t know why I didn’t add this field to the semantic mapping file, but I didn’t, and it isn’t missed. For observational research purposes, does it matter whether the source changed the value set, SNOMED deprecated a code, or the OHDSI Vocabulary team de-standardized a concept? What matters is updating the mapping to a standard concept_id so the data are available for all researchers.
We do have a ModifyDate field, which contains the date the mapping was last modified. This is helpful for end users who receive regular extracts of data from us.
Time stamps: Yes, we have a CreateDate field, which contains the date the mapping was created. Like ModifyDate, it is helpful for end users who receive extracts of data from us.
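Pulling the pieces above together, a single mapping row with this metadata might look roughly like the sketch below. The layout and values are hypothetical, not our actual file, and the optional invalid_reason field is the one discussed (and omitted) above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical layout for one row of a semantic mapping file, carrying the
# metadata fields discussed above; not an actual file structure.
@dataclass
class MappingRow:
    source_value: str
    target_concept_id: int
    version: str                          # e.g. "v121", or a date-based tag
    CreateDate: date                      # when the mapping was first created
    ModifyDate: date                      # when the mapping was last changed
    invalid_reason: Optional[str] = None  # considered above but ultimately omitted

row = MappingRow(
    source_value="EPIC#12345",    # made-up source value
    target_concept_id=3004501,    # illustrative standard concept_id
    version="v121",
    CreateDate=date(2018, 3, 15),
    ModifyDate=date(2024, 6, 1),
)
```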