OHDSI Home | Forums | Wiki | Github

ETL tools for static clinical studies

Hello. I’m working on a set of mappings and related implementation for representing clinical study data in OHDSI. I’m interested in what tools people have used to implement the mapping. Since I’ll be doing ETL for a number of studies, I’d like to represent the mappings in a table rather than code. I hope to win both in terms of documentation and reduced duplication of effort. My background is in software development, though this is the closest I’ve gotten to ETL work. I’m toying with a little bit of Python and SQL to get my arm around this. At the same time I’m reading about tools like Kettle to see if there is some great leverage to be found there. I’ve also had a look at some of the documentation for the Janssen CDM v5 sample ETL. While my respect for the logical mapping is growing, what are people using to actually move the data?

thanks
-Chris

@CRoeder:

Maybe it’s a good thing that “My background is in software development, though this is the closest I’ve gotten to ETL work”. Because many people have talked about it, but nobody has actually done anything. This should probably be a subject for the CDM Working Group. Why don’t you write up a simple proposal, add it to the list and collect like-minded people to help you? And you will have a lot of overlap with the folks working on surveys, and you may want to steal a lot of the ideas from ADaM.

As far as I can tell, you would need to model aspects of clinical trials that the current OMOP CDM doesn’t have:

  • Study design
  • Randomization
  • Vocabulary for experimental treatments (most such drugs are not on the market yet, and therefore have no concept)
  • CRFs and the fact that data are recorded even when they are missing or negative. The current CDM model doesn’t do that.

(Still not scared?) :smile:

Christian,

Thanks for the detailed reply.

Are you aware of ETL implementations using available tools like Kettle, or is it often locally developed SQL?

Thanks for the pointers to more fully developing the standards. I’ve heard it said being a little scared is a good sign. :wink: I’ll pursue the proposal.
-Chris

I have seen people try another ETL software package like Kettle, but it was as much work as hand coding program in SQL or some other language. Hence, I believe most people are writing custom code. The OHDSI ETL of synthetic Medicare public use data used Python. I am sure that @DTorok, @Dymshyts, and @donohara can comment broadly on their approaches.

I will also mention that we use R and the data.table package, and we create a specification sheet in a set of CSV files that automates the ETL. We have used it for OMOP version 4 and for our own data model and to convert this to Sentinel. Given that R is perhaps the least likely approach, it should be possible to do ETL in almost any language you are comfortable with.

We do custom coding for each ETL. We have some overall approaches used for
each ETL such as how we break up the ETL into separate procedures, log the
steps, and QC routines. We often work on the customers site using their
database systems. For a database that does not support procedures we might
use something like Kettle to run different modules, but it is used only as
a scheduler, not used for any transformations.

Thanks Mark.

I’m new to R. I’m not surprised it’s useful here. Your code would be an interesting read.

My reading so far about Kettle has given me the impression that it has a very wide focus that includes making building the process available to non-programmers. There may not be a magic silver bullet hiding there for dealing with value transformations and the melted or pivoted observation and measurement tables. My approach so far is to have a table describing the mapping from source column/table to OMOP concept/column/table. This might be similar to your CSV specification sheets. I include mapping to a python function for value transformations.

My surprise that I don’t see much said or written about ETL code may evolve in to surprise at the data modeling challenge, as Christian has hinted.

thanks again
-Chris

I should also ask a few questions:

What is a CRF? (conditional random field???) I’ve bumped into missing data and Nulls. Death has a date, but still-living doesn’t. I’ve seen more simple cases like height only being recorded in the first visit and left null in later visits.

-Chris

@CRoeder:

CRF: Case Report Form. That is the standard definition of what data have to be collected for patients participating in a clinical trial (aka study subjects). So, if an ordinary patient will get his or her blood pressure measured when the nurse or doctor feels it’s necessary, the CRF prescribes when and what happens during a visit. There might be missing data though.

Hi everyone. I am volunteer at OpenEMR and, along with colleagues, we are helping in the analytics module. I believe we could help here, as well to exchange knowledge for the feature we are developing

My team is working on vocabularies ETL. And we make custom SQL-scripts for any new vocabulary. Nothing special, you know:)
Maybe it would be nice to add @TBanokina, @IYabbarova to this discussion, who are also the part of Odysseus team and work with CDM ETL

t