
First Achilles queries working with Hadoop

Thanks @tomwhite. That's great! As one of the other developers on Achilles, I enjoyed looking this over.

A few things:

  1. This is very awesome.
  2. I wish I had a Hadoop cluster with Impala :frowning: I know my kung-fu, but I'm not sure I have the time to set up YAP (yet another platform) :wink: Maybe some of the internal IT folks have the bandwidth to set it up.
  3. I'd put some priority around finding a workaround (for lack of a better word?) for deleting from tables. For Achilles you might be able to get away with disabling deletes for a given execution (and then being careful to only insert new analysis results into the table, or using the cleanup command from Impala to reset the tables), but if you want to get into any of the other OHDSI tools that are hosted in WebAPI, we definitely start every analysis with DELETE FROM {resultTable} WHERE analysis_id = {analysis_id}, then execute the analysis and insert new data for {analysis_id} (see the sketch after this list). It would suck to get all of this working and then have that limitation block you from being able to use any of the other tools.
  4. This is very awesome.
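For concreteness, here is roughly that refresh pattern (a sketch only; the table and column names are made up for illustration, not the actual WebAPI DDL):

-- Clear any prior results for the analysis being (re)generated...
DELETE FROM results_schema.cohort_results WHERE analysis_id = 42;

-- ...then re-run the analysis and insert the fresh rows for the same analysis_id.
INSERT INTO results_schema.cohort_results (analysis_id, stratum_1, count_value)
SELECT 42 AS analysis_id,
       gender_concept_id AS stratum_1,
       COUNT(*) AS count_value
FROM cdm_schema.person
GROUP BY gender_concept_id;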

If I can get a cluster running, I can help out with some of the query work; I know the queries pretty well. I think the most complicated ones will involve calculating a distribution of values (as in: given a series of values, what are the p10, p25, median, p75, and p90 values?). It's the slowest operation in SQL due to the massive amount of row sorting. I've added an optimization to the ANSI SQL that limits the amount of sorting, but I'm not sure how a Hadoop-based engine handles it. I'm not sure we can ever have a one-size-fits-all.
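For reference, the distribution analyses boil down to something like the following row-number approach (a simplified sketch with made-up table and column names, not the actual Achilles SQL):

-- Order the values once, then pick the value at each percentile position.
WITH ordered AS (
  SELECT count_value,
         ROW_NUMBER() OVER (ORDER BY count_value) AS rn,
         COUNT(*) OVER () AS total_rows
  FROM my_values  -- hypothetical source of the values being summarized
)
SELECT
  MIN(CASE WHEN rn >= 0.10 * total_rows THEN count_value END) AS p10_value,
  MIN(CASE WHEN rn >= 0.25 * total_rows THEN count_value END) AS p25_value,
  MIN(CASE WHEN rn >= 0.50 * total_rows THEN count_value END) AS median_value,
  MIN(CASE WHEN rn >= 0.75 * total_rows THEN count_value END) AS p75_value,
  MIN(CASE WHEN rn >= 0.90 * total_rows THEN count_value END) AS p90_value
FROM ordered;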

-Chris

One other thing, I noticed you called out something that doesn’t work yet:
longAnalyses1=c(1500:1699)

The 1500-1699 analysis_ids are related to the old procedure_cost and drug_cost tables. We haven't updated Achilles to run on the new cost tables (help wanted), so don't even worry about those. To skip the cost analyses when running Achilles, you can execute achilles() with the parameter runCostAnalysis = FALSE. It defaults to FALSE, btw, so you actually don't need to set it, but I assume you went through the queries and were running them a batch at a time (100, 200, 300, etc.), and then you got to 1500 and it blew up. You'll be happy to know, this is by design! Not really. But it's not Hadoop's fault; it's just outdated code.

-Chris

Can't you rewrite every delete as a create-table with the clause inverted, drop the original, and rename the new table to the original?
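Something along these lines, roughly (a sketch in Impala syntax; the table name and predicate are just illustrative):

-- 1. Copy everything except the rows you want "deleted".
CREATE TABLE achilles_results_tmp AS
SELECT * FROM achilles_results WHERE analysis_id <> 42;

-- 2. Swap the rewritten table in for the original.
DROP TABLE achilles_results;
ALTER TABLE achilles_results_tmp RENAME TO achilles_results;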

Yes, and we can also all drink sand milkshakes.

@Chris_Knoll – Cloudera has a VM that you can run on your computer for testing, so you don’t need to power up a whole cluster. Might be good enough for a first look. Cloudera uses it for training.

Thanks for all the comments!

The Cloudera QuickStart VM is very useful for this kind of thing, as @Mark_Danese points out. It now comes in a Docker version too. See http://www.cloudera.com/downloads/quickstart_vms/5-8.html

@Chris_Knoll, DELETE FROM {resultTable} WHERE … is a problem on Hadoop in general, as you can't delete individual rows. If there isn't a workaround in SQL, then we could try using Kudu for storage, since that does support deletion of arbitrary rows. I noticed that DELETE FROM is only used in a handful of places, though, so it may not be necessary.

Also, I didn’t know that runCostAnalysis existed (or defaults to FALSE). Good to know.

Finally, what is Achilles Heel (apart from being a bad pun)?

Cheers,
Tom

Achilles Heel is a set of QC queries that checks for things in the achilles_results table (such as people with age < 0, or events from prior to 1800, something like that). The output of Heel is just a simple table with a message, rule_id, and row_count. So if we saw a certain number of people with negative age, the message would be something like 'Error: People with negative age' and row_count would be the number of people we found. It's just a way to see how many rows in the data are violating a check. We also have some warnings, which are 'soft errors': not failures, but maybe something to look at in your ETL process and correct.
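To give a flavor of what a Heel check looks like, here is a hypothetical sketch (not the actual Heel SQL; the analysis_id and rule_id are made up):

SELECT 'ERROR: people with negative age' AS message,
       1 AS rule_id,          -- made-up rule id
       SUM(count_value) AS row_count
FROM achilles_results
WHERE analysis_id = 101       -- assuming an age-distribution analysis with the age in stratum_1
  AND CAST(stratum_1 AS INT) < 0;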

Re: the delete from…

You can think of Achilles as a one-shot deal: you execute it to get the stats, then exportToJson() will produce the report-ready data.

But pretty much everything else we do involves updating a results table. There could be hundreds and hundreds of cohort definitions whose results are stored in the {result_schema}.cohort table (with billions of rows), and between generations we just want to remove the prior results and generate new ones. The trick of create-table-as-select, selecting everything except what you're trying to delete, has an efficiency problem (you're basically copying 99% of a table just to remove the other 1%), and I don't even know how it would handle simultaneous generations (where you could have, say, 5 generations happening at the same time).

-Chris

This is a very nice milestone to reach.

For the delete-some-rows problem, if I speak only for the achilles() call (from R), not WebAPI calls, another option would be to always run Achilles as one big block, with createTable = TRUE, as the usage scenario for Hadoop.
In that case it always drops all the Achilles tables and never deletes just some rows.

@Chris_Knoll @Vojtech_Huser Running Achilles in “batch” mode fits the Hadoop use case a lot better, so that’s the best place to start (which is what I’m doing).

@Chris_Knoll Thanks for the explanation of what Heel does. So it sounds like we should try to get that running too, once the main Achilles queries are working OK.

@Chris_Knoll @tomwhite I agree re prioritizing Heel.

I’ve got all of the Achilles queries working now. See https://github.com/OHDSI/Achilles/compare/master...tomwhite:impala.

Most of the changes add explicit casts. As mentioned in the SqlRender documentation, SQL Server relies on implicit conversions more than other dialects, so casts often aren't needed there. Impala, however, is quite strict, so they need to be added.
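For example (a made-up illustration, not one of the actual Achilles queries), a comparison that SQL Server resolves with an implicit conversion needs the cast spelled out for Impala:

-- stratum_1 is a VARCHAR; SQL Server converts it implicitly, Impala wants the cast.
SELECT analysis_id, stratum_1, count_value
FROM achilles_results
WHERE analysis_id = 101
  AND CAST(stratum_1 AS INT) BETWEEN 0 AND 10;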

Cheers,
Tom

That's great @tomwhite! I have been looking over your implementation and it looks great so far.

Thanks @t_abdul_basser!

Hi. We're excited to try out your Impala version of SqlRender. I'm having some issues installing it; maybe someone has a suggestion. It seems to install OK on R 3.2.3, but I'm getting an error on R 3.3.2:

install_github("tomwhite/SqlRender", ref="impala")
Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Failure when receiving data from the peer

A web search indicates a proxy problem, but that didn't seem to help. I'm able to download the GitHub repo, so I guess I can just build the package there and install it locally?

Hey @tomwhite, I was just following up on this point. After seeing some of the application demos written on top of a CDM hosted on Hadoop, I was thinking that there's a set of OHDSI tools you may want to see run on Cloudera, but the lack of deletes is a significant problem. You mentioned Kudu, and that actually does seem like the solution. I was wondering if it would be difficult to take the Achilles build you have, create the ACHILLES_RESULTS and ACHILLES_RESULTS_DIST tables (these are the tables that get inserted into and deleted from) as Kudu tables, and then change the SqlRender rule so it no longer comments out the DELETE statement, letting the deletes run through the Achilles process as normal.

If you can show this as a working proof of concept, then you could have Atlas running (hosted on PostgreSQL, but pointing to a Cloudera-based CDM that the analytics are executed on). For the Atlas tools (the service layer is called 'WebAPI'), WebAPI expects a set of tables to be created in a 'results schema', which is created on a database-vendor-by-database-vendor basis (each vendor has their own DDL to set it up). So, if we know Kudu tables can live side-by-side (but in a different schema) with the other read-only Hadoop tables, then we can create Kudu tables in the results schema, and the changes you made to SqlRender should let those tools query against the Impala DB.
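Something like this is what I have in mind for the results tables (a sketch only: the column list is abbreviated, the partitioning is a guess, and since Kudu primary key columns can't be NULL, the NULL strata would need handling, e.g. storing them as empty strings):

CREATE TABLE results_schema.achilles_results (
  analysis_id BIGINT,
  stratum_1   STRING,
  stratum_2   STRING,
  count_value BIGINT,
  PRIMARY KEY (analysis_id, stratum_1, stratum_2)
)
PARTITION BY HASH (analysis_id) PARTITIONS 8
STORED AS KUDU;

-- With Kudu storage, the row-level deletes the tools rely on work as-is:
DELETE FROM results_schema.achilles_results WHERE analysis_id = 42;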

I’m happy to work with you getting this running in your own environment if you are interested.

-Chris

Hi @Chris_Knoll - that’s a great idea! You certainly can have Kudu tables side-by-side with HDFS-backed tables, so this should be very feasible.

It would be great to have some help on this. I’ve been working on a private cluster, so it’s not easy to share access to that. Let’s see if we can come up with a way to have shared resources on this.

Cheers,
Tom

Private clusters are no problem. Many times I've hosted a WebEx where participants share their screen so that I can dig into what they're doing and troubleshoot setup issues. So, if that's a modality that works for you, I can do that.

-Chris

Tom,

I recently joined the company and was asked to look into Achilles. I have very little insight into it; I saw your GitHub and got a little more knowledge, but I still need your help to make it work. Is there any way to contact you?

Thanks
Somesh
