Guys, what a great and wonderful call we had today! Thanks Frank for hosting this. I will find out where Frank's notes are and provide a link, or work on making them available if they aren't already. Also watch for links to the wiki. I will send out a time for a Hadoop-specific Architecture Call by end of day tomorrow.
Hello people. Here is a link to the OHDSI wiki where I started a Hadoop Working Group page. Not sure how OHDSI does it, but if this is how they do it, please feel free to suggest changes to the "Objective" text I added (assuming you are broadening it rather than cutting pieces out, which I heard mentioned today, imho)…
Sorry I missed the call today. I'm at Georgia Tech now and just resubscribing to all my feeds. I'm interested in being added @Frank. If there's an invite, can you send it to Charity.Hilton@gtri.gatech.edu? Thanks.
Shawn - Sorry I missed the call. I am interested in participating.
Would like to participate as well. We are actually using CDH5 (Impala specifically) with a few pieces of the vocab - not a full working implementation by any means, though.
Everyone, I have updated the wiki page for Hadoop Working Group here: http://www.ohdsi.org/web/wiki/doku.php?id=projects:workgroups:hadoop-wg
If you go there, I have set meetings as every other Friday (we can decrease frequency as time/need/desire dictates). I am pasting the dial-in below, however please just use the wiki as the single source of truth on dial-in and schedule.
Schedule: Every other Friday at 8am US PST/11am US EST/4pm UK time zone
Next Meeting: (Date and Time) October 28, 2016, 11am US EST
Call in Number: 1-650-479-3208
Attendee access code: 624188217
Good day all,
I am Al Pivonka and our team has been working with HL7 CCD to OMOP V4 and Hadoop for over two years now.
One pathway I'd like to explore is the use of Jupyter (http://jupyter.org/) and the existing Achilles R scripts to create a standard Jupyter/Achilles notebook that is able to read from as many formats and storage locations as possible.
Please add me to the list
Al Pivonka (QuintilesIMS)
Folks,
Here are the links to the Impala code generated and uploaded by Tom White. There is no recommendation to use it or value judgment on Impala here, other than to note that multiple parties are using it, either as a free Apache open-source download or via their Cloudera customer relationship.
You can view, comment on, or merge this pull request online at:
Commit Summary
Add scripts for creating a schema and importing data into Impala.
File Changes
A Impala/DataImport/OMOP_CDM_synpuf_load_Impala.sql (15)
A Impala/OMOP_CDM_ddl_Impala.sql (763)
A Impala/README.md (51)
A Impala/VocabImport/OMOP_CDM_vocabulary_load_Impala.sql (9)
Patch Links:
Shawn, I'm interested…
Tom, thanks for the contribution…
Shawn, Tom,
I've been looking at the HiveDB capabilities (and I'm not experienced enough in the Hadoop space to know if HiveDB is something separate from Impala), but from what I read, it seems quite feasible to support the OHDSI tool stack on top of this structure for the following reasons:
- They are adhering to the ANSI-SQL 92 standard.
- They support temp tables via CREATE TEMPORARY TABLE…
- They support window functions (via OVER (PARTITION BY … ORDER BY)); see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
- There is a JDBC driver available to connect to the cluster and execute commands.
These are the key elements that we need in order to support SQL translation (via SQLRender) from OHDSI-SQL to the other RDBMSs. It would seem (from what I have read in the documentation) that the closest approximation of dialect in SQLRender is 'postgresql', so a start of translation could be to just copy those translation rules into a new dialect (called 'impala' or 'hivedb' or whatever makes sense as a dialect identifier).
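As a concrete illustration of the temp table + window function pieces, here is a minimal sketch in Hive SQL. The table and columns are standard CDM names, but the query itself is a made-up example for illustration, not an actual Achilles query:

-- Hypothetical example only: first recorded condition per person.
CREATE TEMPORARY TABLE first_condition AS
SELECT person_id,
       condition_concept_id,
       condition_start_date,
       ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY condition_start_date) AS rn
FROM condition_occurrence;

SELECT person_id, condition_concept_id
FROM first_condition
WHERE rn = 1;

If constructs like that run cleanly, the SQLRender dialect work should mostly come down to function-name and data-type substitutions.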
So, I was going to try something out offline, but it seems like you guys are way ahead, so instead of doing and showing, I have a suggestion:
With the DDL you provided to create the schema for a CDM V5 and the load scripts you have, maybe you'd like to point the load scripts at the CSV files provided by @lee_evans for the SynPUF 1% set and pull those files into an Impala-based CDM instance. Then, to do a test, you can run the Achilles scripts on it and compare results from the PostgreSQL execution to your own to find out if there are any differences/problems.
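(In case it helps, the general shape of loading those CSVs once they are on HDFS would be something like the below. This is only a sketch with made-up paths; the real load statements are in the Impala/DataImport scripts.)

-- Hypothetical sketch: load a SynPUF CSV extract that already sits on HDFS
-- into the person table created by the Impala DDL. Paths are made up.
LOAD DATA INPATH '/user/ohdsi/synpuf_1k/person.csv' INTO TABLE person;

-- Quick sanity check before pointing Achilles at the instance.
SELECT COUNT(*) FROM person;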
I couldn't find the 1% SynPUF set for download (@lee_evans, could you point us to that?), but this link will get you to the 1k set, which is a good starting place:
-Chris
Thanks, @admin!
@shawndolley, @tomwhite:
I did a little more digging to understand Hive vs. Impala. They're different things; I guess I'd think of them as different query engines sitting on top of the same persistence engine. There are some differences between them that could be troublesome (from http://www.cloudera.com/documentation/enterprise/latest/topics/impala_langref_unsupported.html):
The biggest one here, I think, is the lack of a DATE type (for which you have used varchar(8) in your DDLs above). We do tons of things where we do a dateDiff(cola, colb) in code, so not being able to have dates is going to be a wrinkle in an Impala implementation. However, there is a TIMESTAMP data type:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_timestamp.html#timestamp
Could those columns be defined as TIMESTAMP? I believe this is what we use in Oracle and PostgreSQL.
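To illustrate what I mean, here is a rough sketch (standard CDM column names, and Impala functions as I understand them from the docs, so treat it as an assumption rather than tested code):

-- Hypothetical sketch only. If the date columns were TIMESTAMP, the usual
-- date arithmetic would work directly:
SELECT datediff(condition_end_date, condition_start_date) AS duration_days
FROM condition_occurrence;

-- With the current varchar(8) 'yyyyMMdd' columns, a conversion along these
-- lines would be needed instead:
SELECT (unix_timestamp(condition_end_date, 'yyyyMMdd')
      - unix_timestamp(condition_start_date, 'yyyyMMdd')) / 86400 AS duration_days
FROM condition_occurrence;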
But, again, I'm not an expert, just giving you my perspective as an OHDSI tools dev.
-Chris
To add some color here on Hive vs. Impala, based on having used them both for several non-OHDSI-specific use cases (but still applicable lessons learned):
- I forget what we did to handle DATE vs. TIMESTAMP, but it was not any more painful than some pre-processing at ETL, or perhaps even just in the query; basically, for an 8-digit integer/varchar date it was a total non-issue. Will follow up.
In terms of architecture - they are both very similar in some respects:
- Both are intended for "SQL-on-Hadoop" (obviously)
- Size of data needs to match the use case for both: don't expect sub-second response times. 2-3 sec is more like it for Impala.
- Both use a metastore database in either MySQL or Postgres (the same Hive metastore) to store metadata / statistics / table definitions etc.
- Both allow you to physically and logically partition your tables: CLUSTER BY, etc.
- Impala can query existing Hive tables etc. without problem; no real difference AFAIK unless using a columnar storage format… it's easy to use
CREATE TABLE ... AS SELECT...
etc. (small sketch after this list)
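For example, roughly this kind of thing. This is a hedged sketch: the table layout and partition column are made up for illustration, and I'd double-check the exact Impala CTAS syntax on your version:

-- Hypothetical sketch: materialize a text-format CDM table into a partitioned
-- Parquet table that both Hive and Impala can then query via the shared metastore.
CREATE TABLE observation_period_parquet
PARTITIONED BY (period_year)
STORED AS PARQUET
AS
SELECT person_id,
       observation_period_start_date,
       observation_period_end_date,
       CAST(substr(observation_period_start_date, 1, 4) AS INT) AS period_year
FROM observation_period;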
In terms of performance - they are two very, very different animals:
- Hive runs more "traditionally" through the Hadoop job process - so unless you have a big job, you will eat anywhere from 10-60 seconds (YMMV) just for the job to be queued and sent out via the "normal" Hadoop flow of things. After that there isn't much real difference between a traditional map-reduce job and the SQL it was translated from.
- Impala, on the other hand, is blazingly fast in comparison, IMO. Query response times (again, in our experience) approach those of a traditional RDBMS, and for us, the amount of data it can chew down and spit out in < 10 sec is pretty impressive.
Again, these are non-production workloads in a small cluster (5 datanodes + 2 namenodes), but with decent specs and a 20GB dataset with 20-30 columns. IMO, Impala would be the best choice for anything where latency is a concern.
Cool, excellent info! I just wanted to call out the minimum required functionality needed for the OHDSI stack; otherwise I don't really have a preference either way (though I would certainly prefer the best performance possible).
If anyone sets up something with some real or simulated data but they want to see how it executes inside OHDSI, I can help with that.
Thanks again.
-Chris
I am going to throw this out there, having worked on some large (5-6 TB) data using Impala. Some complicated queries with temporal logic could take some time to finish. Things we found that made a large difference (a small sketch follows the list):
- Query planner may decide it doesn't have enough memory to complete, and Impala will simply fail. Set mem_limit to something and it will try to run within that limit. It is on a per-node basis. We generally got good results with 25 GB (and often less).
- Use Impala version 2.6 or later. Our most complex queries were much faster (7 hours -> 40 minutes for the worst offender).
- UNION ALL may be much faster than UNION and may reduce memory requirements by a lot. Not sure if this is still an issue in 2.6.
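Something like the following is what I mean by the first and third points (sketch only; the table names are just standard CDM tables used for illustration, and 25g is simply the value that worked for us):

-- Hypothetical sketch of the session-level setting described above (impala-shell).
SET MEM_LIMIT=25g;  -- cap per-node memory so the planner runs within it instead of failing

-- Prefer UNION ALL over UNION when duplicates don't matter; it skips the
-- dedup step and was noticeably cheaper for us.
SELECT person_id FROM condition_occurrence
UNION ALL
SELECT person_id FROM drug_exposure;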
@Chris_Knoll - thanks for the suggestions about next steps. The scripts I wrote include loading the 1k set to get started. I agree that running against Achilles is something we should try soon. Derek is starting by manually trying a few sample Achilles queries against Impala (see this thread).
Regarding Hive and Impala, they are very closely related in that Impala uses the Hive Metastore for storing metadata, which means that you can use either Hive or Impala to query the same tables. The batch/latency tradeoff between Hive and Impala is well described by @herrcerd; I started with Impala to get the best performance. There are some syntax differences, but my hope is that we can minimize these; we'll learn more by trying out the actual queries that Achilles makes.
Thanks, @tomwhite! I've posted a few example queries in that other thread that should work against cdmV5 and should be standard ANSI (except possibly for some date functions?). But they're a good exemplar of the type of activity we'd see when querying a CDM.
-Chris
Shawn,
I'm interested
Thanks,
Dave
Chris,
Both Hive and Impala utilize the Hive Metastore, but they are different engines.
Both can read different file formats: Avro, Parquet, text.
The data types can be enforced by the file formats; Parquet and Avro each contain metadata about the types.
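For what it's worth, a rough sketch of what that looks like in practice (file paths are made up, and this assumes the Parquet files already carry the CDM schema in their metadata):

-- Hypothetical sketch: define an Impala table over existing Parquet files,
-- letting the file format's embedded schema supply the column types.
CREATE EXTERNAL TABLE person_parquet
LIKE PARQUET '/user/ohdsi/cdm_v5/person/part-00000.parquet'
STORED AS PARQUET
LOCATION '/user/ohdsi/cdm_v5/person/';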
Hey guys, just a reminder that tomorrow is our first conference call for the Hadoop Working Group! Also, not being a technologist, I hear a lot about HBase as a data store. I did not hear it come up in the Hive vs. Impala discussion on the forum. Can someone tell me, is HBase sort of not optimal for these kinds of analytics? If someone asks me "why not use HBase, Shawn, why keep talking only Hive or Impala", then what should I say to them?