The new Working Group for Hadoop

shawndolley · October 20, 2016, 8:57pm

Guys, what a great and wonderful call we had today! Thanks Frank for hosting this. I will find out where Frank’s notes are and provide a link or work on making them available if otherwise not. Also watch for links to the wiki. I will send out a time for a Hadoop-specific Architecture Call by end of day tomorrow.

shawndolley · October 20, 2016, 9:24pm

Hello people. Here is a link to the OHDSI wiki where I started a Hadoop Working Group page. Not sure how OHDSI does it, but if this is how they do it, please feel free to make suggested changes to the “Objective” text I added (assuming you are broadening it rather than cutting pieces out which I heard mentioned today, imho)…

cahilton · October 20, 2016, 11:42pm

Sorry I missed the call today. I’m at Georgia Tech now and just resubscribing to all my feeds. I’m interested in being added @Frank. If there’s an invite, can you send it to Charity.Hilton@gtri.gatech.edu. Thanks.

chad_dau · October 21, 2016, 4:14am

Shawn - Sorry I missed the call I am interested in participating.

herrcerd · October 21, 2016, 2:35pm

would like to participate as well. we are actually using CDH5 (Impala specifically) with a few pieces of the vocab - not a full working implementation by any means though.

shawndolley · October 21, 2016, 9:08pm

Everyone, I have updated the wiki page for Hadoop Working Group here: http://www.ohdsi.org/web/wiki/doku.php?id=projects:workgroups:hadoop-wg
If you go there, I have set meetings as every other Friday (we can decrease frequency as time/need/desire progresses). I am pasting the dial in below, however please just use the wiki as the single source of truth on dial in and schedule.

Schedule: Every other Friday at 8am US PST/11am US EST/4pm UK time zone

Next Meeting: (Date and Time) October 28, 2016, 11am US EST

Call in Number: 1-650-479-3208

Attendee access code: 624188217

WebEx: http://cloudera.webex.com/meet/sdolley

Al_Pivonka · October 24, 2016, 12:36pm

good day all,
I am Al Pivonka and our team has working with HL7 CCD to OMOP V4 and Hadoop for over two years now.
One pathway I’d like to explore is the use of Jupiter (http://jupyter.org/) and the existing Achilles (R Scripts) to create a standard Jupiter/Achilles notebook that is able to read from as many formats as possible and storage locations.

Please add me to the list
Al Pivonka (QuintilesIMS)

shawndolley · October 24, 2016, 2:45pm

Folks,
Here are the links to the Impala code generated and uploaded by Tom White. There is no recommendation to use it or value judgment on Impala here, other than multiple parties are using it either free download Apache open source, or via their Cloudera customer relationship.

You can view, comment on, or merge this pull request online at:

Add scripts for creating a schema and importing data into Impala. by tomwhite · Pull Request #37 · OHDSI/CommonDataModel · GitHub

Commit Summary

Add scripts for creating a
schema and importing data into Impala.

File Changes

A Impala/DataImport/OMOP_CDM_synpuf_load_Impala.sql (15)
A Impala/OMOP_CDM_ddl_Impala.sql (763) A Impala/README.md (51) A Impala/VocabImport/OMOP_CDM_vocabulary_load_Impala.sql
(9)

Patch Links:

https://github.com/OHDSI/CommonDataModel/pull/37.patchhttps://github.com/OHDSI/CommonDataModel/pull/37.diff

—

Brack_Coalson · October 24, 2016, 3:45pm

Shawn, I’m interested…

Tom, thanks for the contribution…

Chris_Knoll · October 24, 2016, 5:49pm

Shawn, Tom,
I’ve been looking at the HiveDB capabilities (and I’m not experienced enough in the Hadoop space to know if HiveDB is something separate from Impala), but from what I read, it seems quite feasible to support the OHDSI tool stack on top of this structure for the following reasons:

They are adhering to the ANSI-SQL 92 standard.
They support temp tables via CREATE TEMPORARY TABLE…
They support window functions (via over (partition by … order by)) see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
There is a JDBC driver available to connect to the cluster and execute commands.

These are the key elements that we need in order to support sql translation (via SQLRender) from OHDSI-SQL to the other RDBMS. It would seem that (from what I have read in the documentation) that the closest approximation of dialect in SQLRender is ‘postgresql’ so a start of translation could be to just copy those translation rules into a new dialect (called ‘impala’ or ‘hivedb’ or whatever makes sense as a 'dialect identifier).

So, I was going to try to try something out offline, but it seems like you guys are way ahead, so instead of doing and showing, I have a suggestion:

With the DDL you provided to create the schema for a CDM V5 and the load scripts you have, maybe you’d like to point the load scripts at the CSV files provided by @lee_evans for the SynPUF 1% set and pull those files into a Impala-based CDM instance. Then, to do a test, you can run the Achilles scripts on it, and compare results from the PostgreSQL execution to your own to find out of there’s any differences/problems.

I couldn’t find the 1% synpuf set for download (@lee_evans, could you point us to that)? But this link will get you to the 1k set which is a good starting place:

-Chris

admin · October 24, 2016, 5:57pm

The 1% synpuf and the full synpuf data sets are here:

ftp://ftp.ohdsi.org/synpuf/

Chris_Knoll · October 24, 2016, 6:49pm

Thanks, @admin!

@shawndolley, @tomwhite:
I did a little more digging to understand Hive vs. Impala. They’re different things, I guess I’d think of them as different query engines sitting on top of the same persistence engine. There are some difference between them that could be troublesome: (from http://www.cloudera.com/documentation/enterprise/latest/topics/impala_langref_unsupported.html)

The biggest one here I think is a lack of DATE types (which you have used varchar(8) in your DDLs above). We do tons of things where we do a dateDiff(cola, colb) in code, so not being able to have dates I think is going to be a wrinkle in an Impala implementation. However, there is a TIMESTAMP data type:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_timestamp.html#timestamp

Could those columns be defined as Timestamp? I believe this is what we use in Oracle and Postgesql.

But, again, I’m not an expert, just giving you my perspective from an OHDSI tools dev.

-Chris

herrcerd · October 24, 2016, 10:04pm

To add some color here on Hive vs. Impala, based on having used them both for several non-OHDSI specific use cases (but still applicable lessons learned):

I forget what we did to handle DATE vs. TIMESTAMP but it was not any more painful than some pre-processing at ETL or perhaps even just in the query – basically for an 8-digit integer/varchar date it was a total non-issue. Will follow-up.

In terms of architecture - they are both very similar in some respects:

Both are intended for “SQL-on-Hadoop” (obviously)
Size of data needs to match the use case for both: Don’t expect sub-second response times. 2-3 sec is more like it for Impala.
Both use a metastore database in either MySQL or Postgres (the same as Hive metastore) to store metadata / statistics / table definitions etc.
Both allow you to physically and logically partition your tables: CLUSTER BY, etc.
Impala can query existing Hive tables etc without problem – no real difference AFAIK unless using a columnar storage format… easy to use CREATE TABLE ... AS SELECT... etc.

In terms of performance - they are two very very very different animals:

Hive runs more “tradtionally” through the hadoop job process - so unless you have a big job, you will eat anywhere from 10-60 seconds (YMMV) just for the job to be queued, sent out via the “normal” hadoop flow of things. After that there isn’t much real difference between a traditional map-reduce job and your SQL it was translated from.
Impala on the other hand, IMO, is blazingly fast in comparison. Query response times (again, in our experience) approaches a traditional RDBMS, and for us, the amount of data it can chew down and spit out in < 10 sec, is pretty impressive.

Again, these are non-production workloads in a small cluster (5 datanode + 2 namenodes), but with decent specs and a 20GB dataset with 20-30 columns. IMO, Impala would be the best choice for anything where latency is a concern.

Chris_Knoll · October 25, 2016, 12:47am

Cool, excellent info! I just wanted to call out the minimum required functionality needed for the OHDSI stack, otherwise I don’t really have a preference either way (tho I would certainly prefer the best performance possible ).

If anyone sets up something with some real or simulated data but they want to see how it executes inside OHDSI, I can help with that.

Thanks again.

-Chris

Mark_Danese · October 25, 2016, 6:40am

I am going to throw this out there, having worked on some large (5-6 TB) data using Impala. Some complicated queries with temporal logic could take some time to finish. Things we found that made a large difference:

Query planner may decide it doesn’t have enough memory to complete and Impala will simply fail. Set mem_limit to something and it will try to run within that limit. It is on a per-node basis. We generally got good results with 25 GB (and often less).
Use Impala version 2.6 or later. Our most complex queries were much faster (7 hours -> 40 minutes for the worst offender)
Union all may be much faster than Union and may reduce memory requirements by a lot. Not sure if this is still an issue in 2.6

tomwhite · October 25, 2016, 1:13pm

@Chris_Knoll - thanks for the suggestions about next steps. The scripts I wrote include loading the 1k set to get started. I agree that running against Achilles is something we should try soon. Derek is starting by manually trying a few sample Achilles queries against Impala (see this thread).

Regarding Hive and Impala, they are very closely related in that Impala uses the Hive Metastore for storing metadata, which means that you can use either Hive or Impala to query the same tables. The batch/latency tradeoff between Hive and Impala is well described by @herrcerd; I started with Impala to get the best performance. There are some syntax differences, but my hope is that we can minimize these, but we’ll learn more by trying out the actual queries that Achilles makes.

Chris_Knoll · October 25, 2016, 5:12pm

Thanks, @tomwhite! I’ve posted a few example queries in that other thread that should work against cdmV5 and should be standard ansi (except possibly for some date functions?). But it’s a good examplar of the type of activity we’d see when querying a CDM.

-Chris

dficenec · October 27, 2016, 11:07am

Shawn,

I’m interested

Thanks,
Dave

Al_Pivonka · October 27, 2016, 2:25pm

Chris,
The both Hive and Impala utilize the Hive Metastore, and they are both different engines.
Both can read different file formats, Avro, Parquet, txt.

The datatypes can be enforced by the file formats, like Parquet and Avro which each contain meta-data about the types.

shawndolley · October 27, 2016, 2:37pm

hey guys, just a reminder tomorrow is our first con call for Hadoop Working Group! Also, not being a technologist, I hear a lot about Hbase as a data store. I did not hear that come up in the forum as Hive vs Impala. Can someone tell me, is Hbase sort of not optimal for these kinds of analytics? If someone asks me ‘why not use Hbase Shawn, why keep talking only Hive or Impala’, then what should I say to them?