Calculating Percentiles in large datasets

Chris_Knoll · June 15, 2015, 6:36pm

Hi, community,
I’m looking for some validation on an approach to calculating the p10, p25, median, p75, p90 of extremely large datasets. Our current approach consisted of row_number() each individual row (which could be 10 bil.) which leads to a very large in-database sort, such that we calculate the percentile position of EVERY row, and then collspse that result set into the p10, 25, p50 (median), p90 values. This takes several hours to compute on some of our larger datasets.

Alternative approach:
It turns out the values that we are trying to find distributions of is a very small distinct number (compared to the billons of rows). Example: age at first diagnosis…could be between 0 and 100. Or Visit Occurrence duration. Mabye between 0 and 780 days? So, the idea is to take advantage of the relatively few distinct values, and group-by-count them together, sort these aggregates, and use a little ‘prior row accumulated’ magic to figure out which row in the aggregated set would fall within the desired result percentile bucket. Here’s a concrete example:

Assume 100 rows (to make percentile calculation simple) with the following distint aggregate values:

value   count
1         2
2         4
3         7
4         12
5         19
6         23
7         16
8         11
9         5
10       1

We know that we have 100 rows, so the above counts sum up to 100. We can determine that the 25th percentile is whatever row would have been at the 25th position in the un-aggregated result, but the point is that we don’t want to sort 100 items, we just want to sort these 10 distinct values. So the idea add an ‘accumulated’ row to this result set that contains the sum of all the prior values:

value   count prior_sum
1       2     2
2       4     6
3       7     13
4       12    25
5       19    44
6       23    67
7       16    83
8       11    94
9       5     99
10     1      100

So, To find the 25th percentile value, we want the 25% row position of our total, which I made this easy with our 100 row example, so that’s the 25th row. We can see the value 4 accumulates up to the 25th row, so our 25th percentile is
4 in this sample set. How about 50th percentile? We don’t have an exact 50 match, but we do see 6 was from row 45 to 67 (the 5s ended at position 44). So it looks like the logic that would work is to find the MIN() of the VALUE column where the ROW_POSITION (Percentile/100 * TotalRows) is >= the accumulated value. Using this logic, the values 6,7,8,9.10 satisfy this, so the min value from this is 6. So 6 is the value of the 50th percentile (median) in this set. And so on and so on.

Do any of our Math experts out there have any opinions on this approach? Anything I am not seeing that would cause this logic to fail?

Here’s the SQL to make this work for finding the disribution of visit duration from the visit_occurrence table using mssql dialect (but this can be translated via SQLTranslate):

with overallStats (visit_concept_id, avg_value, stdev_value, min_value, max_value, total) as
(
	select visit_concept_id, 
		avg(datediff(dd,visit_start_date,visit_end_date)), 
		stdev(datediff(dd,visit_start_date,visit_end_date)), 
		min(datediff(dd,visit_start_date,visit_end_date)) as min_value,
    max(datediff(dd,visit_start_date,visit_end_date)) as max_value,
		COUNT_BIG(*) as total 
	from dbo.visit_occurrence 
	group by visit_concept_id
),
durationStats (visit_concept_id, duration, total, rn) as
(
	select visit_concept_id, datediff(dd,visit_start_date,visit_end_date) as duration, count(*) as total, row_number() over (partition by visit_concept_id order by datediff(dd,visit_start_date,visit_end_date)) as rn
  from dbo.visit_occurrence 
	GROUP by visit_concept_id, datediff(dd,visit_start_date,visit_end_date)
),
durationStatsWithPrior (visit_concept_id, duration, total, accumulated) as
(
	select s.visit_concept_id, s.duration, s.total, sum(p.total)
	from durationStats s
	join durationStats p on s.visit_concept_id = p.visit_concept_id and p.rn <= s.rn
	group by s.visit_concept_id, s.duration, s.total, s.rn
)
select p.visit_concept_id as stratum_1,
	o.total as count_value,
	o.min_value,
	o.max_value,
	o.avg_value,
	o.stdev_value,
	MIN(case when p.accumulated >= .50 * o.total then duration end) as median_value,
	MIN(case when p.accumulated >= .10 * o.total then duration end) as p10_value,
	MIN(case when p.accumulated >= .25 * o.total then duration end) as p25_value,
	MIN(case when p.accumulated >= .75 * o.total then duration end) as p75_value,
	MIN(case when p.accumulated >= .90 * o.total then duration end) as p90_value
from durationStatsWithPrior p
JOIN overallStats o on p.visit_concept_id = o.visit_concept_id
GROUP BY p.visit_concept_id, o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
order by p.visit_concept_id

Thanks.

schuemie · June 18, 2015, 6:22am

I’m sure I don’t count as a ‘Math expert’, but I know that some of the low-mem implementations for finding percentiles work this way (sometimes discretizing continuous values to some minimum required level of precision). I think this is an excellent idea, and guess the performance would be way better than what we currently have!

Chris_Knoll · June 18, 2015, 6:38am

Great, thanks, I just wasn’t sure if there was a sort of ‘average of averages’ force at play here that I might have overlooked, but yes, you hit it right on the head: it’s making a long series a shorter series of numbers (with counts) so sorts and memory consumption is reduced. Still experiencing some challenges in certain cases, but I feel this gets me in the right direction.

-Chris

Chris_Knoll · July 2, 2020, 11:35pm

I wanted to provide an update to this code that is more concise, and leverages a window function to limit the use of a self-join. Feel free to try this form:

with rawData (count_value) as
(
	select datediff(d,drug_era_start_date, drug_era_end_date) as count_value
	from dbo.drug_era
	where drug_concept_id = 1125315
),
overallStats (avg_value, stdev_value, min_value, max_value, total) as
(
  select
    CAST(avg(1.0 * count_value) AS FLOAT) as avg_value,
    CAST(stdev(count_value) AS FLOAT) as stdev_value,
    min(count_value) as min_value,
    max(count_value) as max_value,
    count_big(*) as total
  from rawData
),
priorStats (count_value, total, accumulated) as
(
  select count_value, 
	count_big(*) as total,
	sum(count_big(*)) over (order by count_value) as accumulated
  FROM rawData
  group by count_value
)
select 
	o.total as count_value,
	o.min_value,
	o.max_value,
	o.avg_value,
	o.stdev_value,
	MIN(case when p.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
	MIN(case when p.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
	MIN(case when p.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
	MIN(case when p.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
	MIN(case when p.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value
from priorStats p, overallStats o 
GROUP BY o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
;

Note, in the original query, it was producing median duration by visit_concept_id, but in this form it’s just displaying median (and perecntile ranges) of duration of drug eras). It was a little clunky how the visit duration was interweaved with the actual median calculation, so I changed it here, but to produce any median from a set of values, just change the rawData cte to produce a list of values that you want to find the median for. This style of query has shown to run very efficiently on very large datasets.

Vojtech_Huser · July 7, 2020, 4:38pm

Cris, would this also work for lab results (numeric value). (not whole integers)

For our DQD threshold study, I used the old SQL approach and this new trick should theoretically work too. What do you think?

Chris_Knoll · July 7, 2020, 6:35pm

The key to its performance is a uniqueness in values in the entire population of the values you are trying to find the median for. Meaning, if you have 100 million people, the expected unique values for age within this population is the integer values between 0 and 100. So, why sort 100 million age values when you can group up counts by the age value and then use the above approach to find the value of a median (or any other precentile) within the population by only sorting on the unique values? That’s the key.

So, to your question on numeric values, let’s say you have HBA1C values which typically range 5% to 12%, or in mmol/L 5.4 to 16.5. provided your values are rounded to a certain level of precision, you can assume a few distinct values within that population where this approach would be effective. If you saw 5.40001 5.40002 5.4003 etc in your range this may not be effective if there are a large number of unique values (you don’t get the sorting benefit). You could round the set to 2 decimals, and then your unique values range from 5.40 - 12.00 which is 660 unique values, which is much easier to sort than 100 million distinct lab values (assuming you’re dealing with 100 million lab values).

But the algorithm will work on decimal as well as whole integer values…it could also work on categorical values if you could sort the categories in a certain way such that you the ‘order’ of those categories and can arrange the population ordered on a category (but I’m not sure this would be appropriate use of catagorical data).

-Chris