I’m looking for some validation on an approach to calculating the p10, p25, median, p75, p90 of extremely large datasets. Our current approach consisted of row_number() each individual row (which could be 10 bil.) which leads to a very large in-database sort, such that we calculate the percentile position of EVERY row, and then collspse that result set into the p10, 25, p50 (median), p90 values. This takes several hours to compute on some of our larger datasets.
It turns out the values that we are trying to find distributions of is a very small distinct number (compared to the billons of rows). Example: age at first diagnosis…could be between 0 and 100. Or Visit Occurrence duration. Mabye between 0 and 780 days? So, the idea is to take advantage of the relatively few distinct values, and group-by-count them together, sort these aggregates, and use a little ‘prior row accumulated’ magic to figure out which row in the aggregated set would fall within the desired result percentile bucket. Here’s a concrete example:
Assume 100 rows (to make percentile calculation simple) with the following distint aggregate values:
value count 1 2 2 4 3 7 4 12 5 19 6 23 7 16 8 11 9 5 10 1
We know that we have 100 rows, so the above counts sum up to 100. We can determine that the 25th percentile is whatever row would have been at the 25th position in the un-aggregated result, but the point is that we don’t want to sort 100 items, we just want to sort these 10 distinct values. So the idea add an ‘accumulated’ row to this result set that contains the sum of all the prior values:
value count prior_sum 1 2 2 2 4 6 3 7 13 4 12 25 5 19 44 6 23 67 7 16 83 8 11 94 9 5 99 10 1 100
So, To find the 25th percentile value, we want the 25% row position of our total, which I made this easy with our 100 row example, so that’s the 25th row. We can see the value 4 accumulates up to the 25th row, so our 25th percentile is
4 in this sample set. How about 50th percentile? We don’t have an exact 50 match, but we do see 6 was from row 45 to 67 (the 5s ended at position 44). So it looks like the logic that would work is to find the MIN() of the VALUE column where the ROW_POSITION (Percentile/100 * TotalRows) is >= the accumulated value. Using this logic, the values 6,7,8,9.10 satisfy this, so the min value from this is 6. So 6 is the value of the 50th percentile (median) in this set. And so on and so on.
Do any of our Math experts out there have any opinions on this approach? Anything I am not seeing that would cause this logic to fail?
Here’s the SQL to make this work for finding the disribution of visit duration from the visit_occurrence table using mssql dialect (but this can be translated via SQLTranslate):
with overallStats (visit_concept_id, avg_value, stdev_value, min_value, max_value, total) as ( select visit_concept_id, avg(datediff(dd,visit_start_date,visit_end_date)), stdev(datediff(dd,visit_start_date,visit_end_date)), min(datediff(dd,visit_start_date,visit_end_date)) as min_value, max(datediff(dd,visit_start_date,visit_end_date)) as max_value, COUNT_BIG(*) as total from dbo.visit_occurrence group by visit_concept_id ), durationStats (visit_concept_id, duration, total, rn) as ( select visit_concept_id, datediff(dd,visit_start_date,visit_end_date) as duration, count(*) as total, row_number() over (partition by visit_concept_id order by datediff(dd,visit_start_date,visit_end_date)) as rn from dbo.visit_occurrence GROUP by visit_concept_id, datediff(dd,visit_start_date,visit_end_date) ), durationStatsWithPrior (visit_concept_id, duration, total, accumulated) as ( select s.visit_concept_id, s.duration, s.total, sum(p.total) from durationStats s join durationStats p on s.visit_concept_id = p.visit_concept_id and p.rn <= s.rn group by s.visit_concept_id, s.duration, s.total, s.rn ) select p.visit_concept_id as stratum_1, o.total as count_value, o.min_value, o.max_value, o.avg_value, o.stdev_value, MIN(case when p.accumulated >= .50 * o.total then duration end) as median_value, MIN(case when p.accumulated >= .10 * o.total then duration end) as p10_value, MIN(case when p.accumulated >= .25 * o.total then duration end) as p25_value, MIN(case when p.accumulated >= .75 * o.total then duration end) as p75_value, MIN(case when p.accumulated >= .90 * o.total then duration end) as p90_value from durationStatsWithPrior p JOIN overallStats o on p.visit_concept_id = o.visit_concept_id GROUP BY p.visit_concept_id, o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value order by p.visit_concept_id