The biggest use case for Achilles

I am implementing Achilles at Merck (known as MSD outside of the US and Canada), and we are working with electronic medical records coming from GE Centricity (Quintiles EMR). I am unable to finish the query from R that builds the summary for Achilles Web: after four days of running, the query repeatedly failed. In R, the progress bar is stuck at 51%, and on the Oracle database side I see no activity. The patient table has 18,000,000 rows, and I am concerned about the implications of this size.

My questions:

  1. What is a typical size of the databases people here are working with (in terms of the number of rows in the patient table)?
  2. Has anyone worked with data of this size?

Thank you for sharing
/ Vojtech

Hi @vojtech.filipec, your data is not especially large; many databases throughout the community are much larger and have successfully run ACHILLES. For example, our instance of the Truven CCAE database has ~120m records in the PERSON table. I seem to recall one person running into this issue who found they hadn't indexed their data and the vocabulary tables; once they did, it ran better. @t_abdul_basser or @Chris_Knoll might be able to provide better insights if you can share the specific query you're getting hung up on.

I can add a debug flag to Achilles that will generate the Achilles statistics generation script, but not execute it. This way you can run it with the debug flag, open up the generated script file, and then execute parts of the script to determine where the bottleneck is for you.

Would that be helpful?

Yes, please do. I'm running on RedShift and cannot get beyond it saying 58% complete; it always hangs on the same insert. I think the connection is timing out, but being able to run the individual scripts may help to isolate and resolve the problem.

@DTorok,
I’ve just pushed to master in OHDSI/Achilles. There is now a sqlOnly parameter you can pass in the achilles() call that will just write the achilles.sql to an output/ folder. It will NOT execute the sql, just write it and exit.

Using that sql, you should be able to execute the block of sql that is giving you a problem. I’m sorry I can’t say which one is the 58% query, but if you killed it in RStudio you might have gotten an error report to tell you which sql failed.
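As a minimal sketch of how such a call might look: the `sqlOnly` parameter is the one described above, but the connection arguments and schema names here are illustrative assumptions, not the exact signature from the thread.

```r
library(Achilles)

# Hypothetical connection details; adjust dbms/server/credentials to your
# environment. DatabaseConnector is the OHDSI connection library Achilles uses.
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms     = "redshift",
  server   = "my-cluster/mydb",            # placeholder
  user     = "analyst",                    # placeholder
  password = Sys.getenv("DB_PASSWORD")
)

# sqlOnly = TRUE writes the generated achilles.sql to the output folder
# WITHOUT executing it, so you can run the statements one block at a time
# and find the one that hangs.
achilles(
  connectionDetails,
  cdmDatabaseSchema     = "cdm",       # assumed schema name
  resultsDatabaseSchema = "results",   # assumed schema name
  sqlOnly               = TRUE
)
```

Once the file is written, you can paste individual INSERT blocks into your SQL client and time each one.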

-Chris

Thank you for your hint with indexing, @Patrick_Ryan; and for your adjustment in the code, @Chris_Knoll.

As my R session got stuck at 51% several times, our DB admin set up a traced run to isolate the problematic query. We now know which query is causing the issue, but we have not yet resolved it. I am almost certain the data is not indexed and will work on that now. (FYI: it was the query that joins the PERSON table with a sub-select of OBSERVATION_PERIOD and inserts the results into ACHILLES_results_dist with analysis_id = 105; if anyone is interested in the details, please send me a note and I'll post the entire SELECT.)

Right, analysis 105 is calculating the distribution of observation period durations:
https://github.com/OHDSI/Achilles/blob/master/inst/sql/sql_server/Achilles_v5.sql#L1498

The distribution calculations are very sensitive to indexing because the order of the data matters: to calculate the medians/quartiles, we need all the data in the right order so we can find the value at the 50% position, the 25% position, etc.

I've invested a lot of time into finding the most efficient strategy. In this case, the query computes the distinct observation period durations that exist in the data, counts how often each distinct value occurs, and then uses the distinct values with their counts to determine the value at the 10/25/50/75/90 percentiles. The idea is that it's much faster to sort distinct duration values than millions of individual duration values.
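The distinct-values-with-counts approach described above can be sketched as follows. This is an illustrative reconstruction, not the actual Achilles SQL (which lives at the linked `Achilles_v5.sql`); the `DATEDIFF` syntax assumes the SQL Server dialect Achilles sources are written in.

```sql
-- Sketch: percentiles of observation period duration computed from
-- distinct values plus counts, rather than sorting every row.
WITH durations AS (
  SELECT DATEDIFF(day, observation_period_start_date,
                       observation_period_end_date) AS duration_days
  FROM observation_period
),
distinct_counts AS (
  -- one row per distinct duration, far fewer rows to sort
  SELECT duration_days, COUNT(*) AS cnt
  FROM durations
  GROUP BY duration_days
),
ordered AS (
  SELECT duration_days,
         SUM(cnt) OVER (ORDER BY duration_days) AS cum_cnt,  -- running total
         SUM(cnt) OVER ()                       AS total_cnt
  FROM distinct_counts
)
-- the p-th percentile is the smallest distinct value whose cumulative
-- count reaches p * total row count
SELECT
  MIN(CASE WHEN cum_cnt >= 0.25 * total_cnt THEN duration_days END) AS p25_value,
  MIN(CASE WHEN cum_cnt >= 0.50 * total_cnt THEN duration_days END) AS median_value,
  MIN(CASE WHEN cum_cnt >= 0.75 * total_cnt THEN duration_days END) AS p75_value
FROM ordered;
```

Sorting only the distinct durations (typically a few thousand values) instead of one duration per person is what makes this tractable on large PERSON tables.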

-Chris

Also, for the above, I’m assuming you are working in the CDMv5 mode of Achilles.

Thanks for noting this, Chris. Our data is in v4, but the trouble-making SQL is almost identical.

Gentlemen, I consider this thread resolved. There are established use cases with data larger than mine, and as a convenient by-product of this discussion, @Chris_Knoll added new functionality that generates the summarization scripts without executing them. Thank you for your help.

Indexing helped me reduce the run-time; specifically, I looked at which columns are used for joining in the trouble-making query and indexed them. If you use the original DDL, your indexes are likely already in place. Our datasets come from an external vendor, and they were not.
I still don't see Achilles running to completion, but I do see progress. I have another idea that might explain the failing query; it is posed in another thread: Missing table / Empty table
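As an illustration of the kind of indexes involved: the column choices below are assumptions based on the PERSON-to-OBSERVATION_PERIOD join described earlier in this thread, not the official OHDSI CDM index DDL (which ships its own index scripts).

```sql
-- Example indexes on the join columns used by analysis 105.
-- Vendor-delivered CDM copies may be missing these entirely.
CREATE INDEX idx_person_person_id
  ON person (person_id);

CREATE INDEX idx_obs_period_person
  ON observation_period (person_id,
                         observation_period_start_date,
                         observation_period_end_date);
```

Covering the date columns in the OBSERVATION_PERIOD index lets the duration calculation read from the index alone rather than fetching every table row.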