I’m considering moving and storing OMOP data in Hadoop. Has anyone done this type of work before? I’d love to hear from your experience and approach. What schema has worked for you?
Do you have your data in an OMOP data model in an RDBMS? If so, you might look at leveraging [Apache Sqoop] to extract this data from your RDBMS and load it to Hadoop.
I think your suggestion is a great idea. As it turns out, my goal and situation is bit different. I already have imported the data into Hadoop. Any tools or ideas on how I can extract an OMOP Common Data Framework from that data that already is stored in Hadoop? Thanks.
Hi @Pghavami! Currently, our whole architecture is built on RDBMS (mainly SQL Server, Oracle and PostgreSQL, but unofficially also supporting Amazon Redshift, Microsoft PDW, and hopefully soon IBM Netezza).
We (or at least I) haven’t looked yet at Hadoop, simply because we had no use case where it made sense. Would you be able to tell a bit more about why you’re using Hadoop? What is the task you’re trying to achieve?
@Pghavami, you can use Sqoop both ways - in and out of a hadoop cluster to extract from as well as load into a hadoop cluster.
From the Sqoop User Guide
Table 17. Export control arguments:
Argument Description --direct Use direct export fast path --export-dir <dir> HDFS source path for the export -m,--num-mappers <n> Use n map tasks to export in parallel --table <table-name> Table to populate --update-key <col-name> Anchor column to use for updates. Use a comma separated list of columns if there are more than one column. --update-mode <mode> Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for mode include updateonly (default) and allowinsert. --input-null-string <null-string> The string to be interpreted as null for string columns --input-null-non-string <null-string> The string to be interpreted as null for non-string columns --staging-table <staging-table-name> The table in which data will be staged before being inserted into the destination table. --clear-staging-table Indicates that any data present in the staging table can be deleted. --batch Use batch mode for underlying statement execution.
Example from Instant Apache Sqoop:
$ bin/sqoop export --connect jdbc:oracle:thin:@localhost:1521:db1 --username username -password password --table tableName --export-dir /user/username/tableName
Key parameters are:
–export-dir is the location where the source data files are on hadoop HDFS.
–table-name is the name of the target table in your database.
–update-key is the compound key that defines unique rows.
–update-mode allowinsert If you want to insert missing rows in target table.
For Hive you would export the data from Hive using a query appropriate for populating the specific Oracle table. Then the above from where your Hive export files are located.
You can Also use Oozie to manage this process.
We have a modest hadoop cluster and are looking at loading our OMOP data into it next month. We plan on mostly loading it via Sqoop + Oozie into our HBase cluster. We have used that solution with an NLP / SOLR platform we have developed and very pleased with the performance and flexibility. In the case of the OMOP data the key use cases will be more of a challenge in setting the tables and compound keys. With scaled / distributed data, the primary balance is duplicating data in flat record structures versus referencing data from multiple keys. Space being unlimited you almost always have your best performance with duplicated data, but space isn’t very often unlimited, even in a hadoop cluster, so there is a balance to reduce those indirect lookups. With the queries like determining cohort matches, etc. it can be a challenge to hit the majority of use cases / queries and accept the performance hit for the more rare cases and where needed duplicate data with different compound keys.
Hive is a central repository that is very actively being developed for from both Cloudera and Hortonworks camps. Either way you have gone with your distribution the performance is improving steadily so that is also an option either natively or if accessing HBase or other stores through it.
I believe this might be another project appropriate for this group to approach with a common set of scripts / tables / compound keys. The the imports and exports from / to CDM, and primary queries should be fairly common in hadoop as they have been with the various RDBMs.
That’s an awesome advice. Thank you. I like Oozie as a workstream tool a lot. Regarding performance options, rather than duplicating data of key-value pair, I’m considering columnar storage of data since I’ll be extracting the data from a well-structured RDBMS. In fact, I might try both initially and compare performance before full out deployment. Thank you for excellent reply.
VA is doing a project looking at OMOP in Hadoop. I’m looping in Lewis Frey, who is leading that effort.
You can invite Lewis to the OHDSI Forum thread by clicking the Invite button below. We are definitely interested in hearing what you guys are doing!
I’d be happy to talk with you about the work we are doing on OMOP in Hadoop.
If you’d like to chat, would you like to talk by phone? I’d be looking forward to it. I’m in the Eastern timezone. My phone number is: 425-985-1369 if you wish to call me, or I can call you. Either way that’s convenient for you. Thank you, Scott.
Peter Ghavami, PhD
Would love to have updates posted back as to what you guys discuss in terms of pitfalls / lessons learns / next steps here.
We use the Cloudera CDH5 ecosystem exclusively in conjunction with our FAERS / OMOP CDM hybrid system in postgres.
Specifically, we use Sqoop2 jobs to import data from CDM and our proprietary data into HDFS and then we use Hive to run “vanilla” SQL queries that are otherwise insanely resource / time consuming for things like time-series based disproportionality analyses for hierarchical concept groups.
We have Oozie workflows which are just plain XML files that can do anything from execute shell or RDBMS commands, etc. and can all be scheduled / versioned as data pipelines.
If at all possible a public conference call would be ideal. There are many
I see this topic is old, but I’d love to get in touch with other folks interested in using Hadoop to store and query OMOP data. Has this discussion moved to another forum? I know there are some unofficial support for things like Redshift, does anyone know if there might be undocumented support for Hadoop?
Talk to Amgen. They did that. If you give me your email I can make an intro.
We are very interested in working on this at Georgia Tech. Cloudera is
also doing some work in this direction so we should all join forces.
Perhaps not yet at WG level but perhaps interest group?
@malcolmm83: We working on a poster submission for the 2016 OHDSI symposium that discusses our cloud-based, distributed ETL transformation process. It does not use Hadoop, but our internally-developed process for running the transformation from raw observational data to CDM format (250m+ person years in a few hours).
Apologies for joining this party so late, but I am interested in discussing OMOP on Hadoop (CDH). Will be happy to share my email with any like minds.
Can share any guide, how to implement WebAPI and CDM using Impala and Hadoop?
We are new to OHDSI, currently our CDM implemented in SQL Server. Want to migrate to Hadoop and Impala.
Thanks in advance