OHDSI Home | Forums | Wiki | Github

Post ETL Statistics

I have a couple of scripts that 1) capture various statistics about a CDM dataset; 2) compares the statistics of a CDM to the prior set of statistics and displays the difference. In combination they are used to compare one ETL run to the next. My question is where to put this information? I found a git repository that looked like a potential spot, https://github.com/OHDSI/sandbox, but do not have permission to upload.

@DTorok:

What’s your Github name?

don-torok

I put 3 scripts into https://github.com/OHDSI/sandbox/tree/master/CDM_QC. We use them to collect statistics from CDMv5 after an ETL has been completed and to compare the results from one ETL to the next.

post_etl_stats_v5.sql – writes statistics to a file named cdm_stats. Examples of stats are mix/max dates, number of row per table and percent of source values mapped to standard concepts.

post_etl_review.sql – compares statistics from the current ETL to the prior. This is used as a quick check of the ETL results.
Example output
table_name description last_date prior_date last_count prior_count diff perc
care_site Rows 8/25/2017 2/14/2017 718 714 4 0.56
condition_occurrence Rows 8/25/2017 2/14/2017 303,365,816 299,297,921 4,067,895 1.34

f_formatNumber.sql – helper function to add commas to numeric values

t