I’ve been working through a couple of ETL jobs recently using a combination of OHDSI and custom ETL tools. One sources from an RDBMS and the other from files. In both cases I try to maintain a separation between the syntactic and semantic steps of the process, allowing different staff to perform activities based on their skill sets.
RDBMS to OMOP is pretty well supported…
File source scenario:
Here, we’re often dealing with heavily coded files (TCGA, SEER, MEDICARE). I’ve been working on a process that leverages schema-free databases to provide a source system for the data.
In this example, I load the Medicare PUF data into MongoDB. The load takes about 1 minute and requires no schema definition.
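To make that concrete, here’s a minimal sketch of the kind of load I mean, assuming a CSV-formatted PUF file and a local mongod. The file, database, and collection names are illustrative, not the actual ones from my run:

```python
# Minimal sketch of a schema-free load: stream a CSV-formatted
# Medicare PUF extract into MongoDB. File/database/collection names
# are illustrative.
import csv
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["medicare"]["puf"]

with open("DE1_0_Beneficiary_Summary_Sample_1.csv", newline="") as f:
    reader = csv.DictReader(f)      # the header row becomes the keys
    batch = []
    for row in reader:
        batch.append(row)           # each row loads as-is; no schema needed
        if len(batch) == 10_000:
            coll.insert_many(batch)
            batch = []
    if batch:
        coll.insert_many(batch)

print(coll.estimated_document_count(), "documents loaded")
```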
Thanks Bill - I have all 20 samples downloaded and unzipped, and am looking forward to loading them all into MongoDB and working with the data there. It also gives me new ideas for working on some of the processing questions I have.
I’ll report back stats (run time, size, etc.), running on a small EC2 instance.
What is your infrastructure? Do you plan to load all 20 samples?
I’m working through the ETL details locally right now while I put together a plan for making this a little more generalizable and reproducible.
For the TCGA data I was able to create one process that pulled the files from the FTP site and loaded them into Mongo, and a second that extracts the data from Mongo for loading into OMOP. I’m working on a similar approach here.
Lately I’ve become very interested in NoSQL, schema-free ETL and analytics.
Thanks for sharing your example using MongoDB. It works well for self-describing files (e.g., CSV or JSON), and I was wondering if you’ve also tried loading any legacy fixed-field or other delimited file formats without a header? Have you found any best practices for handling those kinds of file formats with NoSQL?
You might be interested in Apache Drill; it has a plug-in for MongoDB that lets you run SQL queries against the data in MongoDB, no RDBMS required.
Fixed-width file formats, like NCI SEER and NAACCR, are a special kind of pain. My POC for both of these data sets leverages the SAS file format definitions to extract each row into a hash that can be loaded into Mongo as key-value pairs. Eventually, this will be converted out of my POC language into something more consumable by users.
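The shape of the approach is roughly this sketch. The field names and positions below are invented for illustration; in my POC they come from the SAS input format definitions:

```python
# Sketch of the fixed-width approach: slice each row into a dict
# ("hash") using column positions, then load the dicts into Mongo.
# Field names and positions are made up for illustration.
from pymongo import MongoClient

# (name, start, length) with 1-based start positions, SAS-style
FIELDS = [("patient_id", 1, 8), ("reg_code", 9, 10), ("dx_year", 19, 4)]

def row_to_hash(line):
    return {name: line[start - 1:start - 1 + length].strip()
            for name, start, length in FIELDS}

coll = MongoClient()["seer"]["incidence"]
with open("SEER_sample.txt") as f:
    coll.insert_many(row_to_hash(line) for line in f if line.strip())
```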
Thanks for the pointer to Drill. I’ll check that out.
I’m wondering if we could use a format like the one in this repo: https://github.com/onyxfish/ffs to store information about fixed-width formats.
Maybe we could write a simple tool that takes a format file and applies it to a fixed-width data file, transforming it into a CSV?
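Something along these lines, just to make the idea concrete. I’m assuming a headerless name,start,length format file with 1-based starts, similar to the layouts in that repo:

```python
# Rough cut at the tool: read a "name,start,length" format file and
# use it to convert a fixed-width data file to CSV. The headerless
# layout and 1-based starts are assumptions about the format file.
import csv
import sys

def load_format(path):
    with open(path, newline="") as f:
        return [(name, int(start), int(length))
                for name, start, length in csv.reader(f)]

def fixed_to_csv(format_path, data_path, out_path):
    fields = load_format(format_path)
    with open(data_path) as data, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow([name for name, _, _ in fields])
        for line in data:
            writer.writerow([line[s - 1:s - 1 + l].strip()
                             for _, s, l in fields])

if __name__ == "__main__":
    fixed_to_csv(sys.argv[1], sys.argv[2], sys.argv[3])
```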
If we added a fourth column that contained simple type information, e.g., date or integer, we could probably spit out a DDL file as well and generate a script to load the resulting CSV into an RDBMS.
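Extending the same sketch, the DDL generation could look something like this. The type names, the mapping, and the pedsf_format.csv file are placeholders, not a settled convention:

```python
# With a fourth "type" column in the format file, emit a CREATE TABLE
# statement. Type names and mapping are placeholders; anything not
# recognized falls back to VARCHAR sized by the field length.
import csv

SQL_TYPES = {"integer": "INTEGER", "date": "DATE"}

def format_to_ddl(format_path, table_name):
    cols = []
    with open(format_path, newline="") as f:
        for name, start, length, ftype in csv.reader(f):
            sql_type = SQL_TYPES.get(ftype.lower(), f"VARCHAR({int(length)})")
            cols.append(f"    {name} {sql_type}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(cols) + "\n);"

print(format_to_ddl("pedsf_format.csv", "pedsf"))  # hypothetical inputs
```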
Of course, there are problems: SEER’s pedsf has 2,200+ columns and won’t fit into most RDBMSs, but that’s perhaps another topic.
Ryan’s idea is better and more generalized than what I’m about to mention, but it would be simple to do this in R. The LaF package is very fast at reading fixed-width files like the SEER data into memory. Then, from R, the data could be output as a .csv, or we could write directly to a database via the dplyr package. But of course, since I only know R, that is my solution for everything.
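For anyone not working in R, a rough Python analogue of that same read-then-export workflow, using pandas in place of LaF. The column positions, names, and the SQLite URL are illustrative only:

```python
# Python analogue of the R/LaF-then-dplyr idea: read a fixed-width
# file into memory, then write a CSV or push straight to a database.
# Column specs and connection string are illustrative.
import pandas as pd
from sqlalchemy import create_engine

colspecs = [(0, 8), (8, 18), (18, 22)]          # 0-based, half-open intervals
names = ["patient_id", "reg_code", "dx_year"]

df = pd.read_fwf("SEER_sample.txt", colspecs=colspecs, names=names)
df.to_csv("seer_sample.csv", index=False)        # option 1: CSV out

engine = create_engine("sqlite:///seer.db")      # option 2: straight to a DB
df.to_sql("incidence", engine, if_exists="replace", index=False)
```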
Thanks, guys, for sharing these three great solutions to the problem of handling fixed-width files.
Ryan, I like the simple-tool idea. Would we share the common fixed-format file definitions (and maybe the generated DDL) in a new OHDSI GitHub repo? One additional benefit would be using GitHub diffs to track file changes over time for maintaining ETL scripts.
Is there an obvious common set of (non-proprietary) file definitions/DDL that we would all benefit from maintaining together in a repo?
I’m unable to think of any DDLs that fit the criteria. I suspect that as we write more ETLs as a community, we’ll produce a set of DDLs, but it is probably better to keep those DDLs in the same repos as their ETLs.
I have a functioning positional filespec for SEER. I’ve been using it to evaluate Kettle, Jasper, Talend, and Clover for reading positional files and dumping them into a DB. The same filespec works with csvkit’s in2csv.
I haven’t run into any issues pulling these files anonymously from any host. It appears that they are provided without restriction.
I’ve seen people complain about this on EC2 before, indicating that they had to bypass a proxy. I do not have this problem on my US-East EC2 instance using either wget or curl.
Can you resolve the IP address of the CMS server? It should resolve to 146.123.140.205.
A little time with tcpdump has helped explain the problem. The response headers from www.cms.gov are being lost or not sent for some reason. Will talk to AWS about it.
As you can see below, www.cms.gov resolves correctly. I’ve also tried to avoid any proxies. The response headers just never arrive. Guess I’ll keep digging…