MIssing table / empty table

As described in my previous thread (link: The biggest use case for Achilles), I am having struggles to finish the Achilles calculation on a database with the PATIENT table containing 18 million rows. The DB is missing four tables from the OMOP CDM specification, namely CARE_SITE, ORGANIZATION, OBSERVATION, COHORT. We purchase the data from the United BioSource Corporation (UBC) and they explained to me that due to anonymisation, these tables cannot be filled in.

Achilles can’t run without these tables. Therefore, to work around this, I created four empty tables using the OMOP DDL script. The query is not finishing anyway. Do you think the empty tables matter?

No, empty tables should not matter, in fact it should run very fast. I wonder if there’s a silent failure that is causing the program to hang. Could you post the script that is leading tot he hang? Can you narrow it down to a specific query?

For another problem, I used a trick that you re-program the achilles() function in R into a custom function.

Change the code from executeSql(xxx) to executeSql(xxx,profile=TRUE) and this will tell you how long each query takes. (and point you to where the process hangs).

I wish this would be baked in into the current function (have profile=T vs F option in the package function)

Also, it is not unusual for achilles to take 8-26 hours to execute.

restrict the set of analyses to make it go faster like so: https://github.com/OHDSI/Achilles/blob/master/notes.md#how-to-save-time-on-running-full-achilles---make-it-finish-much-earlier

UPDATE: Below I describe two cases: an empty table, and a missing table. I overcame the issue with an empty table (it took 4 hours to finish that query), only the comment with missing table applies. (I contacted @Chris_Knoll separately to let him know.)

original post:
Thank you for your comment, @Chris_Knoll. I have to distinguish two things here:

An Empty Table:

Once I performed the step “A Missing Table” below, i.e. I created an empty table in the database I am getting stuck at 54% (when tables are indexed) with this query running and never finishing:

insert into CDM_GE.ACHILLES_results_dist (analysis_id, count_value, min_value, max_value, avg_value, stdev_value, median_value, p10_value, p25_value, p75_value, p90_value) 
SELECT 105 as analysis_id, 
	COUNT(count_value) as count_value, 
	min(count_value) as min_value, 
	max(count_value) as max_value, 
	avg(1.0*count_value) as avg_value, 
	STDDEV(count_value) as stdev_value, 
	max(case when p1<=0.50 then count_value else -9999 end) as median_value, 
	max(case when p1<=0.10 then count_value else -9999 end) as p10_value, 
	max(case when p1<=0.25 then count_value else -9999 end) as p25_value, 
	max(case when p1<=0.75 then count_value else -9999 end) as p75_value,
	max(case when p1<=0.90 then count_value else -9999 end) as p90_value 
FROM (SELECT ( op1.observation_period_end_date - op1.observation_period_start_date) as count_value, 
          1.0*(row_number() over (order by ( op1.observation_period_end_date -op1.observation_period_start_date)))
          /(COUNT(*) over() + 1) as p1 
  FROM CDM_GE.PERSON p1
       inner join 
        (SELECT person_id, 
        OBSERVATION_PERIOD_START_DATE, 
        OBSERVATION_PERIOD_END_DATE, 
                ROW_NUMBER() over (PARTITION by person_id order by observation_period_start_date asc) as rn1 
        FROM CDM_GE.OBSERVATION_PERIOD ) op1 
     on p1.PERSON_ID = op1.PERSON_ID 

This is the query that is hanging in the air - in our settings, it joins an 18 million rows of one table with a 20 millions in the other, and I indexed them both as you advised me in the other thread. As none of the empty tables is reference here I would intuitively think the “emptiness” of those four tables does not matter. However, I do not know this for sure.

A Missing Table

I started with a databse with four missing (non-existing) tables. As long as they were not there, I was getting the following error, which disappeared once I created an empty table CARE_SITE in the database:

oracle

Error: execute JDBC update query failed in dbSendUpdate (ORA-00942: table or view does not exist)
SQL:
insert into CDM_GE.ACHILLES_results (analysis_id, count_value)SELECT     9 as analysis_id,  
COUNT(p1.person_id) as count_value  FROM   CDM_GE.PERSON p1	
left join CDM_GE.care_site cs1	on p1.care_site_id = cs1.care_site_id  
WHERE  p1.care_site_id is not null	and cs1.care_site_id is null 

R version:
R version 3.2.4 Revised (2016-03-16 r70336)

Platform:
x86_64-w64-mingw32

Attached base packages:
- stats
- graphics
- grDevices
- utils
- datasets
- methods
- base

Other attached packages:
- Achilles (1.0)
- rjson (0.2.15)
- DatabaseConnector (1.4.0)
- RJDBC (0.2-5)
- DBI (0.4-1)
- SqlRender (1.1.3)
- httr (1.1.0)
- devtools (1.11.1)
- rJava (0.9-8)

I am sorry my answer took so long - I was initially trying to replicate this error, but in the end I found a stored error for a missing table message at my computer. The above message was obtained on May 10 and I’ve updated the R packages since then. For this reason you see old versions of packages in the log. I do not think it is very useful to replicate this with the newest version of Achilles - if I am wrong, let me know please.

Hey, Vojtech,
That query you listed seems to be using an older style of distribution calculations. It works, but for very large data sets might be slow. I thought I updated all the dist calculations but maybe that one never came on our radar.

I’ll prepare an update to that specific query to determine if the alternative technique I’ve come up with is going to help your query time. I’ll let you know when I’ve pushed it up to master on github. I don’t think it will be tomorrow, however, most likely thursday.

Thank you for your hints, @Vojtech_Huser. Using the profiling, I am able to see how long each query takes (which lets me subsequently enhance the data - we get them without indexes and un-sorted), and I know exactly what went wrong before a crash. Based on that, I’m going through and resolving each crashing query and that is how I move on forward.