
ATLAS on Impala

Hi everyone,

Following last week’s successful Hadoop Hackathon (https://www.ohdsi.org/photos-from-2017-hadoop-hack-a-thon/), I’ve been looking at ATLAS on Impala a bit more. I’ve made some progress, but have hit a blocker (probably in my understanding of the database schemas).

The progress is that I’ve got ATLAS running on a combination of Postgres (for the OHDSI tables) and Impala (for the CDM and Achilles tables), at least for some parts. In particular, Data sources and Vocabulary work. Cohort generation is working better than it was at the Hackathon, in that the generated Impala SQL will now execute (I’ve fixed the bugs we hit; see https://github.com/OHDSI/Atlas/issues/418).

However, I’m having some trouble getting the table mapping right, in particular for the cohort_inclusion table: which schema/database should it be in, and how is it managed? This may be a problem with having two databases. If someone could explain how it is meant to work, that would be very helpful. I’ve added a comment with some more detail here: https://github.com/OHDSI/Atlas/issues/418#issuecomment-313396510

Thanks,
Tom

Just to repost my response from GitHub:

When more than one database is involved, the CDM results tables must be created manually. This article describes which tables need to be created in the CDM results schema: http://www.ohdsi.org/web/wiki/doku.php?id=documentation:software:webapi:multiple_datasets_configuration
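
For illustration, here is a rough sketch of what creating the core cohort table in an Impala results database might look like. The ‘results’ database name and Parquet storage are assumptions on my part; the wiki article above is the authoritative list of required tables and columns.

    -- Sketch only: the core cohort results table on Impala.
    -- TIMESTAMP is used for the date columns because older
    -- Impala versions have no DATE type.
    CREATE TABLE results.cohort (
      cohort_definition_id INT,
      subject_id BIGINT,
      cohort_start_date TIMESTAMP,
      cohort_end_date TIMESTAMP
    )
    STORED AS PARQUET;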

-Chris

Thanks for the help Chris. I have now managed to get cohort generation in ATLAS working with Impala.

Tom

Except for that pesky detail of Impala ignoring delete statements. Can this be addressed? The tools won’t work properly if they can’t clear the prior results.

Dear @tomwhite,

We are excited to hear that ATLAS is running on Impala.

Once we have converted our CDM data to Impala, is it possible to run the analysis code (R code) generated by ATLAS without errors?

If so, could you let us know what we should watch out for when running the analysis code?

@tomwhite @Chris_Knoll @gregk Do you have any updates to share on the current status of Atlas on Impala? Does Atlas cohort generation work completely, or is there still an open issue with the deletion of prior results due to the HDFS append-only write limitation?

If HDFS deletes are still an issue, one solution could be to implement a separate cohort ‘soft delete’ key table, to which the deleted cohort row keys could be appended.
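
As a minimal sketch (the table and column names are illustrative, not an existing ATLAS convention, and I’m assuming deletes are keyed by cohort_definition_id, since that is how prior results are cleared):

    -- Hypothetical append-only log of logically deleted cohorts;
    -- rows are only ever inserted, which fits HDFS semantics.
    CREATE TABLE results.cohort_soft_delete (
      cohort_definition_id INT,
      deleted_at TIMESTAMP
    )
    STORED AS PARQUET;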

For Impala, SqlRender could translate cohort table deletes into inserts of the deleted cohort row keys into the ‘soft delete’ table.
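
For example, the translation might look like this (hypothetical; as far as I know SqlRender does not do this today):

    -- What other dialects would run:
    DELETE FROM results.cohort WHERE cohort_definition_id = 123;

    -- What the Impala translation could emit instead:
    INSERT INTO results.cohort_soft_delete VALUES (123, now());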

A cohort view, created just for Impala deployments, could be used to join the cohort table to the ‘soft delete’ table as a way to transparently ignore the ‘soft deleted’ cohort rows.
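
Something like the following, using Impala’s LEFT ANTI JOIN. In a real deployment the view would presumably need to take over the cohort table’s name, with the physical table renamed underneath it; the names here are illustrative.

    -- View that transparently hides soft-deleted cohort rows.
    CREATE VIEW results.cohort_v AS
    SELECT c.*
    FROM results.cohort c
    LEFT ANTI JOIN results.cohort_soft_delete d
      ON c.cohort_definition_id = d.cohort_definition_id;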

A separate batch SQL process could be scheduled to re-create the cohort table on a periodic (nightly/weekly) basis, minus the soft-deleted rows. It would be similar to running a table ‘vacuum’ process in Postgres/Netezza.
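
A rough sketch of that compaction step, again with illustrative names:

    -- Rebuild the cohort table without the soft-deleted rows.
    CREATE TABLE results.cohort_compacted STORED AS PARQUET AS
    SELECT c.*
    FROM results.cohort c
    LEFT ANTI JOIN results.cohort_soft_delete d
      ON c.cohort_definition_id = d.cohort_definition_id;

    -- Swap the tables, then reset the soft-delete log
    -- (TRUNCATE is available in Impala 2.3 and later).
    ALTER TABLE results.cohort RENAME TO results.cohort_old;
    ALTER TABLE results.cohort_compacted RENAME TO results.cohort;
    DROP TABLE results.cohort_old;
    TRUNCATE TABLE results.cohort_soft_delete;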

Are there any other solutions currently being investigated?

@admin - hey Lee, ATLAS should fully support Impala at this point; we cleaned up Impala support quite a bit over the past few months. I am not aware of any outstanding issues.

@gregk Thanks. That’s great news!

How did you end up handling the Impala HDFS deletes of previous results?
