Calculating percentiles in large datasets

I wanted to provide an update to this code that is more concise and leverages a window function to avoid the self-join. Feel free to try this form:

with rawData (count_value) as
(
  -- the set of values to find percentiles over: drug era duration in days
  select datediff(d, drug_era_start_date, drug_era_end_date) as count_value
  from dbo.drug_era
  where drug_concept_id = 1125315
),
overallStats (avg_value, stdev_value, min_value, max_value, total) as
(
  -- overall summary statistics across all values
  select
    CAST(avg(1.0 * count_value) AS FLOAT) as avg_value,
    CAST(stdev(count_value) AS FLOAT) as stdev_value,
    min(count_value) as min_value,
    max(count_value) as max_value,
    count_big(*) as total
  from rawData
),
priorStats (count_value, total, accumulated) as
(
  -- frequency of each distinct value, plus a running total ordered by value;
  -- the window function replaces the self-join that accumulated counts
  select count_value,
    count_big(*) as total,
    sum(count_big(*)) over (order by count_value) as accumulated
  from rawData
  group by count_value
)
select
  o.total as count_value,
  o.min_value,
  o.max_value,
  o.avg_value,
  o.stdev_value,
  -- each percentile is the smallest value whose cumulative count reaches the threshold
  MIN(case when p.accumulated >= .50 * o.total then p.count_value else o.max_value end) as median_value,
  MIN(case when p.accumulated >= .10 * o.total then p.count_value else o.max_value end) as p10_value,
  MIN(case when p.accumulated >= .25 * o.total then p.count_value else o.max_value end) as p25_value,
  MIN(case when p.accumulated >= .75 * o.total then p.count_value else o.max_value end) as p75_value,
  MIN(case when p.accumulated >= .90 * o.total then p.count_value else o.max_value end) as p90_value
from priorStats p
cross join overallStats o
GROUP BY o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
;

Note: the original query produced median duration by visit_concept_id, but this form just displays the median (and percentile ranges) of drug era durations. The way the visit duration was interwoven with the actual median calculation was a little clunky, so I changed it here. To produce the median of any set of values, just change the rawData CTE to produce the list of values you want the median for, as in the sketch below. This style of query has been shown to run very efficiently on very large datasets.
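
For example, a minimal sketch of a rawData CTE that restores the visit-duration flavor of the original query might look like the following (assuming an OMOP CDM visit_occurrence table; the visit_concept_id value of 9201 is just an illustrative placeholder). Everything after rawData stays exactly the same:

with rawData (count_value) as
(
  -- visit duration in days; substitute any list of values you want percentiles for
  select datediff(d, visit_start_date, visit_end_date) as count_value
  from dbo.visit_occurrence
  where visit_concept_id = 9201 -- hypothetical filter; use your own concept id
),
-- overallStats, priorStats, and the final select are unchanged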
