Seeking Tool for Simplified Data Extraction and Transformation from OMOP CDM

We are encountering challenges in providing OMOP datasets to physician researchers. The feedback consistently highlights the complexity of the OMOP Common Data Model (CDM) structure as a barrier. Our aim is to find a solution that enables the extraction of data, specifically for a subset of patients identified via ATLAS, into a more user-friendly, flat file format.


We seek a tool or method capable of:

  • Simplifying the extraction process from the OMOP CDM.
  • Allowing for the input of a patient list from ATLAS.
  • Enabling selection of specific dimensions and measures, in compliance with IRB approvals.
  • Generating streamlined, flat files (e.g., CSV, Excel) with incorporated vocabulary translations for ease of interpretation by non-technical users.

Key Requirements:

  • The solution should cater to physician researchers with limited informatics background.
  • It must uphold data privacy and de-identification standards to ensure regulatory compliance.


  • Does anyone have experience with or know of tools or methods that meet these criteria?
  • Are there recommendations for resources or approaches within the OHDSI ecosystem or elsewhere that could facilitate this transformation?

Any insights, advice, or guidance from the community would be greatly appreciated. We aim to make OMOP data more accessible for our physician researchers, enhancing their ability to conduct meaningful statistical studies without the need for deep technical expertise.

Thank you for your support and collaboration.

Hello @pflugg,

This is a common lament I hear from healthcare systems that provide datasets to their physician researchers. Many health systems have found that employing a dozen BI developers or report writers to pull one-off datasets is not scalable, so they “OMOP” their data. But this also brings the challenges you describe and a few others.

The following also trip up the non-informatics professional:

  • Some source codes map to a different domain than expected. Not all ICD codes are found in the Condition table.

  • A CPT4 code that was in the Procedure table last year can be in the Observation table this year.

  • Some source codes map to more than one standard concept.

  • There is the perennial problem of US researchers who speak in ICD9CM and ICD10CM language, which is not standard in the OMOP CDM.
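To make the domain and multiple-mapping points concrete, here is a minimal sketch that audits a source vocabulary for codes that map to more than one standard concept or land outside the Condition domain. It assumes the standard OMOP vocabulary tables `concept` and `concept_relationship` and the 'Maps to' relationship. The toy rows, the invented codes (TOY.A and so on), and the helper names `build_toy_vocabulary` and `audit_source_codes` are hypothetical, and the example runs against an in-memory SQLite database rather than a real CDM.

```python
import sqlite3

# Toy vocabulary rows. Every concept_id and code here is invented for illustration.
# Columns: concept_id, concept_name, domain_id, vocabulary_id, concept_code, standard_concept
TOY_CONCEPTS = [
    (1, "Toy source diagnosis A", "Condition", "ICD10CM", "TOY.A", None),
    (2, "Toy source diagnosis B", "Observation", "ICD10CM", "TOY.B", None),
    (3, "Toy source diagnosis C", "Condition", "ICD10CM", "TOY.C", None),
    (10, "Toy standard condition", "Condition", "SNOMED", "S-10", "S"),
    (11, "Toy standard finding", "Observation", "SNOMED", "S-11", "S"),
]
# Columns: concept_id_1 (source), concept_id_2 (target), relationship_id
TOY_RELATIONSHIPS = [
    (1, 10, "Maps to"),
    (1, 11, "Maps to"),  # one source code, two standard targets
    (2, 11, "Maps to"),  # an ICD-style code whose target is an Observation
    (3, 10, "Maps to"),  # the simple case: one target, in Condition
]

MAPPING_SQL = """
SELECT src.concept_code, std.concept_id, std.domain_id
FROM concept AS src
JOIN concept_relationship AS cr
  ON cr.concept_id_1 = src.concept_id AND cr.relationship_id = 'Maps to'
JOIN concept AS std
  ON std.concept_id = cr.concept_id_2
WHERE src.vocabulary_id = ?
"""


def build_toy_vocabulary():
    """Create an in-memory database holding the toy vocabulary rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concept (
            concept_id INTEGER PRIMARY KEY, concept_name TEXT, domain_id TEXT,
            vocabulary_id TEXT, concept_code TEXT, standard_concept TEXT);
        CREATE TABLE concept_relationship (
            concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT);
    """)
    conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?, ?, ?)", TOY_CONCEPTS)
    conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", TOY_RELATIONSHIPS)
    return conn


def audit_source_codes(conn, vocabulary_id, expected_domain="Condition"):
    """Return {source_code: [reasons]} for codes a naive user would trip over."""
    targets = {}
    for code, target_id, domain in conn.execute(MAPPING_SQL, (vocabulary_id,)):
        targets.setdefault(code, []).append((target_id, domain))

    flagged = {}
    for code, mapped in targets.items():
        reasons = []
        if len(mapped) > 1:
            reasons.append("maps to more than one standard concept")
        if any(domain != expected_domain for _, domain in mapped):
            reasons.append("has a target outside the " + expected_domain + " domain")
        if reasons:
            flagged[code] = reasons
    return flagged
```

Calling `audit_source_codes(build_toy_vocabulary(), "ICD10CM")` flags TOY.A for both reasons and TOY.B for the domain reason, while TOY.C passes cleanly. A list like this can be handed to investigators alongside a dataset so they know which of their familiar codes need a second look.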

The University of Colorado is actively working on a program comprising education, dataset derivatives, processes and tools to alleviate the pain of transitioning from delivering Epic datasets to OMOP datasets. However, it will still require BI developers to take a customer’s data request, transform it to SQL, pull the data, review the dataset to ensure it meets the customer’s and compliance requirements, then deliver the dataset in an approved format. We envision this process to be less labor-intensive than our current process of writing SQL for every new request.

We have tried using Leaf, but it doesn’t meet our needs. Atlas is good, but it doesn’t allow download.

Dataset derivatives

  • Our dataset derivatives start with a full PHI dataset and then we employ different methods to create LDS and De-Identified views. We have also extended our standard OMOP tables to include the code, code type (aka vocabulary), and description for every concept_id. We also create data marts derived from the OMOP CDM for some projects. We’ve also changed column names to be more investigator friendly.
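A minimal sketch of the extended-table idea, not the production implementation: join each `concept_id` back to the `concept` table so that the code, vocabulary and description travel with the row, rename columns to friendlier names, restrict to a cohort, and write a CSV. It assumes the OMOP `condition_occurrence`, `concept` and `cohort` tables (with ATLAS-style `cohort_definition_id` and `subject_id` columns); the friendly column names, the toy rows, and the helpers `build_toy_cdm` and `flat_conditions_csv` are hypothetical.

```python
import csv
import io
import sqlite3

# Friendly column aliases are illustrative; pick names your investigators recognise.
FLAT_CONDITIONS_SQL = """
SELECT co.person_id            AS patient_id,
       co.condition_start_date AS diagnosis_date,
       std.concept_name        AS diagnosis,
       std.vocabulary_id       AS standard_vocabulary,
       std.concept_code        AS standard_code,
       src.vocabulary_id       AS source_vocabulary,
       src.concept_code        AS source_code
FROM condition_occurrence AS co
JOIN concept AS std
  ON std.concept_id = co.condition_concept_id
LEFT JOIN concept AS src
  ON src.concept_id = co.condition_source_concept_id
WHERE co.person_id IN (
    SELECT subject_id FROM cohort WHERE cohort_definition_id = ?)
ORDER BY co.person_id, co.condition_start_date
"""


def build_toy_cdm():
    """In-memory stand-in for a CDM; all ids, codes and dates are invented."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT,
                              vocabulary_id TEXT, concept_code TEXT);
        CREATE TABLE cohort (cohort_definition_id INTEGER, subject_id INTEGER);
        CREATE TABLE condition_occurrence (
            person_id INTEGER, condition_concept_id INTEGER,
            condition_start_date TEXT, condition_source_concept_id INTEGER);
        INSERT INTO concept VALUES (1, 'Toy source diagnosis', 'ICD10CM', 'TOY.A'),
                                   (10, 'Toy standard condition', 'SNOMED', 'S-10');
        INSERT INTO cohort VALUES (1, 100);
        INSERT INTO condition_occurrence VALUES (100, 10, '2023-01-05', 1),
                                                (200, 10, '2023-02-01', 1);
    """)
    return conn


def flat_conditions_csv(conn, cohort_definition_id):
    """Return a CSV string of readable condition rows for one cohort."""
    cursor = conn.execute(FLAT_CONDITIONS_SQL, (cohort_definition_id,))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])  # header from aliases
    writer.writerows(cursor)
    return buffer.getvalue()
```

With the toy data, `flat_conditions_csv(build_toy_cdm(), 1)` returns a header plus one row for patient 100; patient 200 is left out because they are not in the cohort. The same pattern would need to run against the LDS or De-Identified views rather than the full PHI tables, depending on what the IRB approval allows.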

Processes & tools

  • We have created an Excel workbook for customers to define cohort criteria including inclusion & exclusion criteria; time frames; table and field variables of interest; table and field criteria. These are based upon the most commonly requested variables. We are putting the customer in charge of their data request and trying to gain back time spent in 1:1 meetings with customers defining which data elements they need.

  • We have created a data mart derived from the OMOP CDM which matches the standard project requirement workbook, to ease the developers’ translation from the customer’s Excel input to pulling the data from the tables.

  • We have created two dynamic mapping tools. The stored procedure is a SQL function that takes a concept_id or a code/code type combination as input and maps it to a standard concept_id. It also allows the inclusion of child concepts, and it returns the domain for each concept. The stored procedure has a companion process for recurring datasets that flags concepts which have changed domains or been mapped to another concept_id since the last time the dataset was pulled.

  • We employ a Tableau-based data exploration tool which is similar to Atlas in its code search function. The user can search for codes or concepts of interest, review hierarchies, and view patient record counts and descendant record counts. Data can’t be downloaded from the tool, but codes and concepts of interest can be. This tool sits on top of our De-Id OMOP, and patient counts < 10 are masked as ‘< 10’.
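The mapping procedure and its companion change check described above could look roughly like the following. This is a sketch under stated assumptions, not the actual stored procedure: it is written in Python against an in-memory SQLite database, it only handles the code/code type input (not a raw concept_id), and it relies on the standard OMOP `concept`, `concept_relationship` and `concept_ancestor` tables. The function names `map_to_standard` and `flag_changes`, the toy concept_ids, and the TOY-1 code are all hypothetical.

```python
import sqlite3


def build_toy_vocabulary():
    """In-memory stand-in for the vocabulary tables; every id and code is invented."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, domain_id TEXT,
                              vocabulary_id TEXT, concept_code TEXT);
        CREATE TABLE concept_relationship (concept_id_1 INTEGER, concept_id_2 INTEGER,
                                           relationship_id TEXT);
        CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER,
                                       descendant_concept_id INTEGER);
        INSERT INTO concept VALUES (1, 'Procedure', 'CPT4', 'TOY-1'),
                                   (20, 'Procedure', 'SNOMED', 'S-20'),
                                   (21, 'Procedure', 'SNOMED', 'S-21');
        INSERT INTO concept_relationship VALUES (1, 20, 'Maps to');
        INSERT INTO concept_ancestor VALUES (20, 20), (20, 21), (21, 21);
    """)
    return conn


def map_to_standard(conn, code, vocabulary_id, include_descendants=False):
    """Map a source code to {standard_concept_id: domain_id}, optionally with children."""
    rows = conn.execute("""
        SELECT std.concept_id, std.domain_id
        FROM concept AS src
        JOIN concept_relationship AS cr
          ON cr.concept_id_1 = src.concept_id AND cr.relationship_id = 'Maps to'
        JOIN concept AS std ON std.concept_id = cr.concept_id_2
        WHERE src.concept_code = ? AND src.vocabulary_id = ?
    """, (code, vocabulary_id)).fetchall()
    mapped = dict(rows)

    if include_descendants and mapped:
        # One "?" placeholder per mapped concept_id; values are still bound safely.
        placeholders = ",".join("?" * len(mapped))
        child_rows = conn.execute(f"""
            SELECT c.concept_id, c.domain_id
            FROM concept_ancestor AS ca
            JOIN concept AS c ON c.concept_id = ca.descendant_concept_id
            WHERE ca.ancestor_concept_id IN ({placeholders})
        """, list(mapped)).fetchall()
        mapped.update(dict(child_rows))
    return mapped


def flag_changes(previous, current):
    """Compare two {concept_id: domain_id} snapshots taken at different pulls."""
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "domain_changed": sorted(c for c in shared if previous[c] != current[c]),
    }
```

For a recurring dataset, the idea is to store the result of `map_to_standard` with each delivery and pass the stored and fresh results to `flag_changes` at the next pull. A non-empty "domain_changed" list means some rows have moved to a different table, and "added" or "removed" means the code now maps to a different set of concepts.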


  • This is a two-part process and a work in progress. We have to educate the BI developers who have been working in Epic for a long time, and we also have to educate the users of the datasets. We are creating documentation and educational videos, and will be holding “office hours”.

The Healthcare Systems Interest Group meets every other Monday at 10am Eastern time to support health systems on their OHDSI journey. Please join us. Today is one of the Mondays we meet. I would be happy to host this topic during one of our future calls.

^I agree with this; I am a programmer, report writer, and ETL’er, in that order. Much of the custom analysis that I do could not be done inside the OMOP data structure.

EDIT: Some could not be done; the rest would be much harder to do inside the OMOP structure.