I’m developing a new tool designed to dramatically accelerate the conversion of source databases to the OMOP CDM, leveraging the power of large language models (LLMs). The goal is to minimize the time from gaining access to a source database to generating the first OMOPified result—typically within 30 minutes, including an initial Data Quality Dashboard (DQD) run.
Key Features:
LLM-First Approach:
We use an LLM trained to understand our custom syntax for structural mapping. This allows users to describe mappings using natural language prompts.
Mapping-as-Code Approach:
We developed a declarative, domain-specific language optimized for structural mapping (similar to Terraform) to get all the benefits of code: collaboration, version control, code as documentation, tests, and the ability to use LLMs.
Declarative Mapping Syntax:
Each OMOP field is defined with an elementary SQL expression for data extraction, and fields can also be flagged for semantic mapping (a hypothetical sketch of what such a mapping could look like follows this list).
Automated Semantic Mapping:
The system performs initial vocabulary (concept) mapping automatically. Users can then review and fine-tune the mappings as needed.
Security by Design:
Only statistical information is sent to the backend (similar to a WhiteRabbit scan report; the kind of statistics involved is sketched below). The tool can also work against a synthetic version of the database and produce an ETL that is then applied to the real data.
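To make the declarative idea concrete, here is a minimal, purely hypothetical sketch of what a Terraform-style table mapping could look like, expressed as a Python structure. This is not the tool's actual syntax; the source table and column names (src.patients, pat_id, birth_dt, sex_code, race_text) are invented for illustration.

```python
# Hypothetical illustration of a declarative, mapping-as-code structural mapping.
# Target field names follow the OMOP "person" table; the source side is invented.

PERSON_MAPPING = {
    "target_table": "person",
    "source_table": "src.patients",
    "fields": {
        # elementary SQL expression per OMOP field
        "person_id":         {"sql": "pat_id"},
        "year_of_birth":     {"sql": "EXTRACT(YEAR FROM birth_dt)"},
        "birth_datetime":    {"sql": "birth_dt"},
        # fields flagged for semantic (vocabulary) mapping are resolved to standard concepts later
        "gender_concept_id": {"sql": "sex_code", "semantic": "Gender"},
        "race_concept_id":   {"sql": "race_text", "semantic": "Race"},
    },
}


def render_sql(mapping: dict) -> str:
    """Render a declarative mapping into a plain INSERT ... SELECT statement."""
    cols = list(mapping["fields"])
    exprs = [spec["sql"] for spec in mapping["fields"].values()]
    return (
        f"INSERT INTO {mapping['target_table']} ({', '.join(cols)})\n"
        f"SELECT {', '.join(exprs)}\n"
        f"FROM {mapping['source_table']};"
    )


if __name__ == "__main__":
    print(render_sql(PERSON_MAPPING))
```

Because the mapping is just text, it can be versioned, reviewed, tested, and generated or edited by an LLM like any other code.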
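And to illustrate the "statistics only" point: a minimal sketch, assuming a local SQLite source, of the kind of WhiteRabbit-style scan report that could be sent to a backend instead of row-level data. The demo table and columns are placeholders, not part of the product.

```python
# Sketch of a WhiteRabbit-style scan: only aggregate statistics leave the source database.
# Uses an in-memory SQLite table as a stand-in for the real source.
import sqlite3


def scan_table(conn: sqlite3.Connection, table: str, top_n: int = 10) -> dict:
    """Collect per-column summary statistics (row counts, null rates, top frequencies)."""
    cur = conn.execute(f"SELECT * FROM {table} LIMIT 0")
    columns = [d[0] for d in cur.description]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    report = {"table": table, "row_count": total, "columns": {}}
    for col in columns:
        distinct = conn.execute(f"SELECT COUNT(DISTINCT {col}) FROM {table}").fetchone()[0]
        nulls = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL").fetchone()[0]
        top = conn.execute(
            f"SELECT {col}, COUNT(*) FROM {table} GROUP BY {col} "
            f"ORDER BY COUNT(*) DESC LIMIT {top_n}"
        ).fetchall()
        report["columns"][col] = {
            "distinct": distinct,
            "null_fraction": nulls / total if total else 0.0,
            "top_values": top,  # frequent values only, not individual records
        }
    return report


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (pat_id INTEGER, sex_code TEXT, birth_dt TEXT)")
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?, ?)",
        [(1, "M", "1980-01-01"), (2, "F", "1975-06-30"), (3, "F", None)],
    )
    print(scan_table(conn, "patients"))
```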
Outcome:
Rapid onboarding: From database connection to initial OMOP CDM + DQD results in ~30 minutes.
Iterative refinement: Quickly adjust mappings and improve output over time.
I’m currently looking for test users who are willing to try the product for free and provide feedback. If you’re working on an OMOP conversion project and interested in reducing the time and effort required, I’d love to hear from you!
Yes, it maps end-to-end. The full process, from scanning the source database through structural mapping of a few tables and semantic mapping to finally running DQD, typically takes 30 minutes. Each subsequent iteration then adds more tables until the conversion is complete.
Just wanted to add that we had a fantastic meeting with Artem, during which I explored the functionality of the tool. From what I learned, the platform provides automated schema mapping between datasets, with a supervisor mode for manual review and corrections where needed. This can reduce the manual overhead involved in harmonizing heterogeneous biomedical datasets. The AI tools under the hood work great; they sometimes need course correction, but they handle 90+% of the work, letting us focus on those corrections instead of starting from scratch. It was amazing.
Looking forward to the new release so I can test it. @ents great work!
My experience using ChatGPT 4.0 to map ICD codes to SNOMED was really a dead end. Most of the recommended mappings were wrong, sometimes ridiculously so, as I recall. I gave up after a few tries. Granted, this was not schema mapping as described here, but I would check semantic mapping carefully if you are using an LLM to help with OMOP conversion.
We have a fairly complex semantic-mapping algorithm that combines an LLM with classic search (like Usagi or Athena); a rough sketch of the general idea is below.
I’d be happy to show you our suggester, and we can check its accuracy on your examples.
And we use the LLM not only for semantic mapping but also for structural mapping, which is novel: the LLM analyses the source tables and fields and relates them to the CDM tables and fields.
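Not our actual code, but a rough sketch of the hybrid semantic-mapping idea: a classic lexical search (in the spirit of Usagi or Athena) proposes candidate standard concepts, and the LLM only chooses among or rejects those candidates, so it cannot invent a concept_id on its own. The tiny vocabulary slice and the llm_rerank stub are hypothetical.

```python
# Hypothetical sketch of a hybrid semantic-mapping suggester:
# classic string search proposes candidates, an LLM only picks among (or rejects) them.
from difflib import SequenceMatcher

# Tiny stand-in for an OMOP vocabulary slice (concept_id, concept_name).
VOCABULARY = [
    (316866, "Hypertensive disorder"),
    (201826, "Type 2 diabetes mellitus"),
    (4329847, "Myocardial infarction"),
]


def lexical_candidates(source_term: str, top_n: int = 3) -> list[tuple[int, str, float]]:
    """Usagi/Athena-style step: rank concepts by simple string similarity."""
    scored = [
        (cid, name, SequenceMatcher(None, source_term.lower(), name.lower()).ratio())
        for cid, name in VOCABULARY
    ]
    return sorted(scored, key=lambda x: x[2], reverse=True)[:top_n]


def llm_rerank(source_term: str, candidates: list[tuple[int, str, float]]) -> tuple[int, str]:
    """Placeholder for the LLM step: in a real system a model would read the source
    term plus the candidate list and return the best match (or 'no match')."""
    return candidates[0][:2]  # stub: fall back to the top lexical hit


if __name__ == "__main__":
    term = "diabetes type II"
    cands = lexical_candidates(term)
    print("candidates:", cands)
    print("suggested:", llm_rerank(term, cands))
```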
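Similarly, a hedged sketch of the structural side: the model sees only aggregate scan statistics for a source table plus the CDM schema and proposes a source-column to CDM-field relation as JSON, which a human then reviews. The call_llm function here is a placeholder returning a canned answer, not a real API.

```python
# Hypothetical sketch of LLM-assisted structural mapping on top of scan statistics.
import json

CDM_FIELDS = {
    "person": ["person_id", "gender_concept_id", "year_of_birth", "birth_datetime"],
    "condition_occurrence": ["condition_concept_id", "condition_start_date"],
}


def build_prompt(table_stats: dict) -> str:
    """Compose a prompt from aggregate column statistics (never raw rows) and the CDM schema."""
    return (
        "Relate each source column to the most likely OMOP CDM table.field.\n"
        f"Source table statistics: {json.dumps(table_stats)}\n"
        f"CDM fields: {json.dumps(CDM_FIELDS)}\n"
        'Answer as JSON: {"<source_column>": "<cdm_table>.<cdm_field>", ...}'
    )


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned answer for the demo."""
    return json.dumps({"pat_id": "person.person_id", "birth_dt": "person.birth_datetime"})


if __name__ == "__main__":
    stats = {"table": "patients", "columns": {"pat_id": {"distinct": 3}, "birth_dt": {"distinct": 3}}}
    proposal = json.loads(call_llm(build_prompt(stats)))
    print(proposal)  # reviewed and corrected by a human before any ETL is generated
```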
Quite interesting, happy to have a look and provide feedback: https://www.linkedin.com/in/albertolabarga/. We have several ongoing projects to transform data to OMOP-CDM in federated settings.