I’ve been working through a couple of ETL jobs recently using a combination of OHDSI and custom ETL tools. One sources from an RDBMS and the other from files. In both cases I try to maintain a separation between the syntactic and semantic steps of the process, allowing different staff to perform activities based on their skill sets.
RDBMS to OMOP is pretty well supported…
File source scenario:
Here, we’re often dealing with heavily coded files (TCGA, SEER, MEDICARE). I’ve been working on a process that leverages schema-free databases to provide a source system for the data.
In this example, I load the Medicare PUF data into MongoDB. The load takes about 1 minute and requires no schema definition.
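To make that concrete, here’s a minimal sketch of the kind of load I mean, assuming a CSV-formatted PUF file and a local mongod. The file, database, and collection names are illustrative, not the actual ones from my run:

```python
# Minimal sketch of a schema-free load: stream a CSV-formatted
# Medicare PUF extract into MongoDB. File/database/collection names
# are illustrative.
import csv
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["medicare"]["puf"]

with open("DE1_0_Beneficiary_Summary_Sample_1.csv", newline="") as f:
    reader = csv.DictReader(f)      # the header row becomes the keys
    batch = []
    for row in reader:
        batch.append(row)           # each row loads as-is; no schema needed
        if len(batch) == 10_000:
            coll.insert_many(batch)
            batch = []
    if batch:
        coll.insert_many(batch)

print(coll.estimated_document_count(), "documents loaded")
```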
Thanks Bill - I have all 20 samples downloaded and unzipped, and am looking forward to loading them all into MongoDB and working with the data there. It also gives me new ideas for working on some of the processing questions I have.
I’ll report back stats (run time, size, etc.), running on a small EC2 instance.
What is your infrastructure? Do you plan to load all 20 samples?
I’m working through the ETL details locally right now while I put together a plan for making this a little more generalizable and reproducible.
For the TCGA data I was able to create one process that pulled the files from the FTP site and loaded them into Mongo, and a second that extracts the data from Mongo for loading into OMOP. I’m working on a similar approach here.
Lately I’ve become very interested in NoSQL, schema-free ETL and analytics.
Thanks for sharing your example using MongoDB. It works well for self-describing files (e.g., CSV or JSON), and I was wondering if you’ve also tried loading any legacy fixed-field or other delimited file formats without a header? Have you found any best practices for handling those kinds of file formats with NoSQL?
You might be interested in Apache Drill; it has a plug-in for MongoDB that lets you run SQL queries against the data in MongoDB, no RDBMS required.
Fixed-width file formats, like NCI SEER and NAACCR, are a special kind of pain. My POC for both of these data sets leverages the SAS file format definitions to extract each row into a hash that can be loaded into Mongo as key-value pairs. Eventually, this will be converted out of my POC language into something more consumable by users.
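The shape of the approach is roughly this sketch. The field names and positions below are invented for illustration; in my POC they come from the SAS input format definitions:

```python
# Sketch of the fixed-width approach: slice each row into a dict
# ("hash") using column positions, then load the dicts into Mongo.
# Field names and positions are made up for illustration.
from pymongo import MongoClient

# (name, start, length) with 1-based start positions, SAS-style
FIELDS = [("patient_id", 1, 8), ("reg_code", 9, 10), ("dx_year", 19, 4)]

def row_to_hash(line):
    return {name: line[start - 1:start - 1 + length].strip()
            for name, start, length in FIELDS}

coll = MongoClient()["seer"]["incidence"]
with open("SEER_sample.txt") as f:
    coll.insert_many(row_to_hash(line) for line in f if line.strip())
```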
Thanks for the pointer to Drill. I’ll check that out.
I’m wondering if we could use a format like the one in this repo: https://github.com/onyxfish/ffs to store information about fixed-width formats.
Maybe we could write a simple tool that takes a format file and applies it to a fixed-width data file, transforming it into a CSV?
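Something along these lines, just to make the idea concrete. I’m assuming a headerless name,start,length format file with 1-based starts, similar to the layouts in that repo:

```python
# Rough cut at the tool: read a "name,start,length" format file and
# use it to convert a fixed-width data file to CSV. The headerless
# layout and 1-based starts are assumptions about the format file.
import csv
import sys

def load_format(path):
    with open(path, newline="") as f:
        return [(name, int(start), int(length))
                for name, start, length in csv.reader(f)]

def fixed_to_csv(format_path, data_path, out_path):
    fields = load_format(format_path)
    with open(data_path) as data, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow([name for name, _, _ in fields])
        for line in data:
            writer.writerow([line[s - 1:s - 1 + l].strip()
                             for _, s, l in fields])

if __name__ == "__main__":
    fixed_to_csv(sys.argv[1], sys.argv[2], sys.argv[3])
```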
If we added a fourth column that contained simple type information, e.g., date or integer, we could probably spit out a DDL file as well and generate a script to load the resulting CSV into an RDBMS.
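Extending the same sketch, the DDL generation could look something like this. The type names, the mapping, and the pedsf_format.csv file are placeholders, not a settled convention:

```python
# With a fourth "type" column in the format file, emit a CREATE TABLE
# statement. Type names and mapping are placeholders; anything not
# recognized falls back to VARCHAR sized by the field length.
import csv

SQL_TYPES = {"integer": "INTEGER", "date": "DATE"}

def format_to_ddl(format_path, table_name):
    cols = []
    with open(format_path, newline="") as f:
        for name, start, length, ftype in csv.reader(f):
            sql_type = SQL_TYPES.get(ftype.lower(), f"VARCHAR({int(length)})")
            cols.append(f"    {name} {sql_type}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(cols) + "\n);"

print(format_to_ddl("pedsf_format.csv", "pedsf"))  # hypothetical inputs
```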
Of course, there are problems: SEER’s pedsf has 2,200+ columns and won’t fit into most RDBMSs, but that’s perhaps another topic.
Ryan’s idea is better and more generalized than what I’m about to mention, but it would be simple to do this in R. The LaF package is very fast at reading fixed-width files like the SEER data into memory. Then, from R, the data could be output as a .csv, or we could write directly to a database via the dplyr package. But of course, since I only know R, that is my solution for everything.
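For anyone not working in R, a rough Python analogue of that same read-then-export workflow, using pandas in place of LaF. The column positions, names, and the SQLite URL are illustrative only:

```python
# Python analogue of the R/LaF-then-dplyr idea: read a fixed-width
# file into memory, then write a CSV or push straight to a database.
# Column specs and connection string are illustrative.
import pandas as pd
from sqlalchemy import create_engine

colspecs = [(0, 8), (8, 18), (18, 22)]          # 0-based, half-open intervals
names = ["patient_id", "reg_code", "dx_year"]

df = pd.read_fwf("SEER_sample.txt", colspecs=colspecs, names=names)
df.to_csv("seer_sample.csv", index=False)        # option 1: CSV out

engine = create_engine("sqlite:///seer.db")      # option 2: straight to a DB
df.to_sql("incidence", engine, if_exists="replace", index=False)
```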
Thanks, guys, for sharing these three great solutions to the problem of handling fixed-width files.
Ryan, I like the simple-tool idea. Would we share the common fixed-format file definitions (and maybe the generated DDL) in a new OHDSI GitHub repo? One additional benefit would be using GitHub diffs to track file changes over time for maintaining ETL scripts.
Is there an obvious common set of (non-proprietary) file definitions/DDL that we would all benefit from maintaining together in a repo?
I’m unable to think of any DDLs that fit the criteria. I suspect that as we write more ETLs as a community, we’ll produce a set of DDLs, but it is probably better to keep those DDLs in the same repos as their ETLs.
I have a functioning positional filespec for SEER. I’ve been using it to evaluate Kettle, Jasper, Talend, and Clover for reading positional files and dumping them into a DB. The same filespec works with csvkit’s in2csv.
I haven’t run into any issues pulling these files anonymously from any host. It appears that they are provided without restriction.
I’ve seen people complain about this on EC2 before, indicating that they had to bypass a proxy. I do not have this problem on my US-East EC2 instance using either wget or curl.
Can you resolve the IP address of the CMS server? It should resolve to 146.123.140.205.
A little time with tcpdump has helped explain the problem. The response headers from www.cms.gov are being lost or not sent for some reason. Will talk to AWS about it.
As you can see below, www.cms.gov resolves correctly. I’ve also tried to avoid any proxies. The response headers just never arrive. Guess I’ll keep digging…