
The new Working Group for Hadoop

The 1% synpuf and the full synpuf data sets are here:

ftp://ftp.ohdsi.org/synpuf/

Thanks, @admin!

@shawndolley, @tomwhite:
I did a little more digging to understand Hive vs. Impala. They're different things; I guess I'd think of them as different query engines sitting on top of the same persistence engine. There are some differences between them that could be troublesome (from http://www.cloudera.com/documentation/enterprise/latest/topics/impala_langref_unsupported.html):

The biggest one here, I think, is the lack of a DATE type (for which you have used varchar(8) in your DDLs above). We do tons of things in code that rely on a dateDiff(cola, colb), so not having dates is going to be a wrinkle in an Impala implementation. However, there is a TIMESTAMP data type:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_timestamp.html#timestamp

Could those columns be defined as TIMESTAMP? I believe this is what we use in Oracle and PostgreSQL.
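
As a minimal sketch (using a made-up table; the names below are illustrative, not from the DDLs above), TIMESTAMP columns would let the usual date arithmetic work in Impala:

    -- Illustrative Impala DDL using TIMESTAMP in place of DATE
    CREATE TABLE condition_era_demo (
      condition_era_id BIGINT,
      era_start_date   TIMESTAMP,
      era_end_date     TIMESTAMP
    ) STORED AS PARQUET;

    -- Impala's datediff() accepts TIMESTAMPs and returns whole days
    SELECT condition_era_id,
           datediff(era_end_date, era_start_date) AS era_days
    FROM condition_era_demo;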

But, again, I’m not an expert, just giving you my perspective from an OHDSI tools dev.

-Chris

To add some color here on Hive vs. Impala, based on having used them both for several non-OHDSI-specific use cases (the lessons learned still apply):

  • I forget exactly what we did to handle DATE vs. TIMESTAMP, but it was not any more painful than some pre-processing at ETL time, or perhaps even just in the query (see the sketch below) – basically, for an 8-digit integer/varchar date it was a total non-issue. Will follow up.
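
For the in-query option, here is a hedged sketch of turning an 8-character yyyyMMdd string into a TIMESTAMP in Impala (the table and column names are hypothetical):

    -- Rewrite 'yyyyMMdd' as 'yyyy-MM-dd', which Impala can cast to TIMESTAMP
    SELECT cast(concat_ws('-',
                          substr(observation_date, 1, 4),
                          substr(observation_date, 5, 2),
                          substr(observation_date, 7, 2)) AS TIMESTAMP) AS obs_ts
    FROM observation_demo;

If I remember right, newer Impala releases also have a to_timestamp(string, pattern) built-in that makes this a one-liner, but verify that against your version.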

In terms of architecture - they are both very similar in some respects:

  • Both are intended for “SQL-on-Hadoop” (obviously)
  • The size of the data needs to match the use case for both: don't expect sub-second response times; 2-3 seconds is more like it for Impala.
  • Both use a metastore database (the Hive Metastore, backed by either MySQL or Postgres) to store metadata, statistics, table definitions, etc.
  • Both allow you to physically and logically partition your tables: PARTITIONED BY, CLUSTER BY, etc.
  • Impala can query existing Hive tables without a problem – no real difference AFAIK unless you are using a columnar storage format… it is easy to use CREATE TABLE ... AS SELECT ..., etc. (see the sketch after this list).
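
To make that last point concrete, here is a rough sketch of a partitioned Parquet CTAS in Impala (the table and column names are made up; note that in Impala's CTAS the partition key column goes last in the SELECT list):

    -- Create a partitioned, Parquet-backed copy of an existing table
    CREATE TABLE person_by_year
    PARTITIONED BY (year_of_birth)
    STORED AS PARQUET
    AS SELECT person_id, gender_concept_id, year_of_birth
       FROM person;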

In terms of performance - they are two very, very different animals:

  • Hive runs more “traditionally” through the Hadoop job process - so unless you have a big job, you will eat anywhere from 10-60 seconds (YMMV) just for the job to be queued and sent out via the “normal” Hadoop flow of things. After that, there isn't much real difference between a hand-written map-reduce job and the one your SQL was translated into.

  • Impala, on the other hand, IMO, is blazingly fast in comparison. Query response times (again, in our experience) approach those of a traditional RDBMS, and for us, the amount of data it can chew down and spit out in < 10 sec is pretty impressive.

Again, these are non-production workloads in a small cluster (5 datanodes + 2 namenodes), but with decent specs and a 20 GB dataset with 20-30 columns. IMO, Impala would be the best choice for anything where latency is a concern.


Cool, excellent info! I just wanted to call out the minimum required functionality for the OHDSI stack; otherwise I don't really have a preference either way (though I would certainly prefer the best performance possible :) ).

If anyone sets up something with some real or simulated data and wants to see how it executes inside OHDSI, I can help with that.

Thanks again.

-Chris

I am going to throw this out there, having worked on some large (5-6 TB) data sets using Impala. Some complicated queries with temporal logic could take some time to finish. Things we found that made a large difference:

  • The query planner may decide it doesn't have enough memory to complete and Impala will simply fail the query. Set mem_limit to something and it will try to run within that limit; it applies on a per-node basis. We generally got good results with 25 GB (and often less). (See the sketch after this list.)
  • Use Impala version 2.6 or later. Our most complex queries were much faster (7 hours -> 40 minutes for the worst offender).
  • UNION ALL may be much faster than UNION (which has to deduplicate rows) and may reduce memory requirements by a lot. Not sure whether this is still an issue in 2.6.
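
To make those two tips concrete, here is a minimal sketch as you might type it in impala-shell; the 25gb value and the table names are just placeholders, so adjust for your cluster:

    -- Cap per-node memory so the planner tries to stay within a known limit
    SET MEM_LIMIT=25gb;

    -- Prefer UNION ALL when duplicate rows are acceptable; plain UNION has to
    -- deduplicate, which costs extra memory and time
    SELECT person_id FROM drug_exposure
    UNION ALL
    SELECT person_id FROM condition_occurrence;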

@Chris_Knoll - thanks for the suggestions about next steps. The scripts I wrote include loading the 1k set to get started. I agree that running against Achilles is something we should try soon. Derek is starting by manually trying a few sample Achilles queries against Impala (see this thread).

Regarding Hive and Impala, they are very closely related in that Impala uses the Hive Metastore for storing metadata, which means that you can use either Hive or Impala to query the same tables. The batch/latency tradeoff between Hive and Impala is well described by @herrcerd; I started with Impala to get the best performance. There are some syntax differences, but my hope is that we can minimize these; we'll learn more by trying out the actual queries that Achilles makes.
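
One practical wrinkle of the shared metastore, sketched below with a hypothetical staging table: a table created or loaded from Hive is not visible to Impala until you refresh Impala's cached metadata:

    -- (in Hive) create and populate a table as usual
    CREATE TABLE person_stage (person_id BIGINT, year_of_birth INT);

    -- (in impala-shell) pick up the new table, then query it from Impala
    INVALIDATE METADATA person_stage;
    SELECT count(*) FROM person_stage;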

Thanks, @tomwhite! I've posted a few example queries in that other thread that should work against CDM v5 and should be standard ANSI (except possibly for some date functions?). But it's a good exemplar of the type of activity we'd see when querying a CDM.

-Chris

Shawn,

I’m interested

Thanks,
Dave

Chris,
Both Hive and Impala utilize the Hive Metastore, but they are different engines.
Both can read various file formats: Avro, Parquet, plain text.

Datatypes can be enforced by the file format: Parquet and Avro each embed metadata describing the column types.
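
As an illustration (a sketch only; the HDFS paths are hypothetical), Impala can even derive a table definition straight from the type metadata embedded in a Parquet file:

    -- Column names and types come from the Parquet file's own metadata
    CREATE EXTERNAL TABLE person_parquet
    LIKE PARQUET '/user/hive/warehouse/person/part-00000.parquet'
    STORED AS PARQUET
    LOCATION '/user/hive/warehouse/person';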

Hey guys, just a reminder that tomorrow is our first conference call for the Hadoop Working Group! Also, not being a technologist, I hear a lot about HBase as a data store, but I did not see it come up in the forum alongside Hive vs. Impala. Can someone tell me: is HBase just not optimal for these kinds of analytics? If someone asks me 'why not use HBase, Shawn? Why keep talking only about Hive or Impala?', what should I say to them?

Hi Shawn, how/where can we get info about the Hadoop Working Group conference call details?

Hi Naga - go to this page on the Wiki and scroll down and you will see it:
http://www.ohdsi.org/web/wiki/doku.php/?id=projects:workgroups:hadoop-wg
Or here is the info
Schedule: Every other Friday at 8am US PST/11am US EST/4pm UK time zone

Next Meeting: October 28, 2016, 11am US EST

Call in Number: 1-650-479-3208

Attendee access code: 624188217

WebEx: http://cloudera.webex.com/meet/sdolley

Thank you, Shawn.

Great first meeting, guys and gals. Here is a link to the raw and original meeting log: http://www.ohdsi.org/web/wiki/doku.php?id=projects:workgroups:hadoopwgmeetingloglandingpage . To find this page yourself, or if the link is broken:

  1. Go to the OHDSI Wiki by clicking on the word 'Wiki' at the top of any page in the forum.
  2. Click on the Projects and Working Groups link at the left side of that Wiki home page.
  3. Click on Hadoop Working Group.
    Wow! Look at all this great info, I better bookmark this page!
  4. Scroll down until you see the Hadoop Working Group Meeting Logs link (or something like that) and click it.
  5. Now you should see the page where I will post the logs. For now, they will be in MSFT Word format (sorry if that is painful).

I wanted to follow up on something from the other day's meeting and to share a bit of information.
I believe that targeting SQL as the standard language for integration with the Hadoop platform is ideal.

The underlying architectures (Hadoop/YARN/Mesos for compute, HDFS/S3 for storage) should not be the focus. These are all configurable.

I would suggest the team look at Hadoop as an application/data container; the container provides services on top of the data.

Our focus should be on standards for accessing the data utilizing the tools within the container.

Here is an ever-growing list of projects and their classifications within the Hadoop container.

In the context of Hadoop being an application/data container, one can also build out a container with specific tools so that the container only contains the tools needed.
Examples:

  1. (Open source) Ambari (management and provisioning) + HDFS + Spark + Spark SQL + Sqoop + Oozie + Hive
  2. Both Cloudera CDH and Hortonworks HDP (full stack, or deploy only the tools needed)

So if we stay focused on the standards and our use cases, we will be able to work within any mixture/stack someone puts together.

Just my thoughts

Folks, I have put the formatted notes from Friday’s architecture call on the Wiki, FYI.

Hello. As promised, here is a 'voting doc', or better named a standard doc, for entering your feedback on the use cases and the next things to build or work out. If you all rank-order your priorities (which can be hard!), I will tally the results (or someone else can). The benefit of this is that we can ensure the agenda time and people-hours we can get hold of are spent on the things benefitting the most members. I don't know how to attach the doc to the forum, but I will figure it out. For now, the doc is on the Wiki at Projects & Workgroups… Hadoop WG… Meeting Logs: Hadoop WG Meeting Logs. When you go to that page, you will see Other Docs down below, and the 'voting doc', for lack of a better name. Please either email it back to sdolley@cloudera.com or post it somewhere with a pointer to it so I can tally it.

Link to page with Priorities Voting Doc

Hi

This is Chetan (@QuintilesIMS). Please add me to the list.

Regards

Here is a link to a survey format of the questions on the table. No one responded to the Word doc approach, so I am going to keep hounding you all: https://www.surveymonkey.com/r/D6C3W32 . I have never done a SurveyMonkey survey, so if this link doesn't work, asks for money, or does some other weird thing, let me know. It would be great to get lots of data back and analyze it! (Have you heard that before…)

Hadoop Interested People! Here is a prospective agenda below for our Hadoop Working Group call tomorrow. Please consider this post a call for additional agenda items: post them in the forum, or if you don't want to, you can email me at sdolley@cloudera.com.

  1. Intros of first-time attendees to the call/working group
  2. Old Business: the questionnaire/survey. I want to bring the survey up on screen during the meeting and get commitments from people to fill it out!
  3. New Business
    a) Do we want or need a reference architecture?
    b) What can we be doing in parallel over the next few weeks?
    c) I will navigate on screen to what Tom White has coded so everyone knows where it is (or Tom will), and Tom can describe what you can and can't do with it and its limitations (assuming Tom is on).
    d) Can we get a checklist of things we agree should be developed? (That's what the survey is for, in part.)
    e) Throw it open to the audience for topics, questions, and lines of discussion.

The call will be recorded, as per OHDSI standard practice.
