I wanted to provide an update to this code that is more concise, and leverages a window function to limit the use of a self-join. Feel free to try this form:
with rawData (count_value) as
(
select datediff(d,drug_era_start_date, drug_era_end_date) as count_value
from dbo.drug_era
where drug_concept_id = 1125315
),
overallStats (avg_value, stdev_value, min_value, max_value, total) as
(
select
CAST(avg(1.0 * count_value) AS FLOAT) as avg_value,
CAST(stdev(count_value) AS FLOAT) as stdev_value,
min(count_value) as min_value,
max(count_value) as max_value,
count_big(*) as total
from rawData
),
priorStats (count_value, total, accumulated) as
(
select count_value,
count_big(*) as total,
sum(count_big(*)) over (order by count_value) as accumulated
FROM rawData
group by count_value
)
select
o.total as count_value,
o.min_value,
o.max_value,
o.avg_value,
o.stdev_value,
MIN(case when p.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
MIN(case when p.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
MIN(case when p.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
MIN(case when p.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
MIN(case when p.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value
from priorStats p, overallStats o
GROUP BY o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
;
Note, in the original query, it was producing median duration by visit_concept_id, but in this form it’s just displaying median (and perecntile ranges) of duration of drug eras). It was a little clunky how the visit duration was interweaved with the actual median calculation, so I changed it here, but to produce any median from a set of values, just change the rawData
cte to produce a list of values that you want to find the median for. This style of query has shown to run very efficiently on very large datasets.