OHDSI Home | Forums | Wiki | Github

Writing queries that return a set of random results

After checking the documentation, I have noticed that SQLRender does not provide a translation from MS SQL Server queries to select N random rows from the result set (https://github.com/OHDSI/SqlRender/blob/master/inst/csv/replacementPatterns.csv), in the shape of

SELECT TOP 1 column FROM table ORDER BY NEWID()

Any particular way that people are getting around this issue?

As background, I need to select N random patients from a subset of patients that is sometimes in the millions. I don’t want to bring the whole list of patient_id’s to R and then do a random sample from there. Is there a recommended way of doing this?

Thanks!

Hi @Juan_Banda . Sorry for not responding earlier. Even though SqlRender currently doesn’t support this doesn’t mean we can’t add it. You’d need to change your syntax to something that can be translated to Oracle (which doesn’t have an equivalent of TOP):

SELECT column FROM 
   (SELECT column, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM table) tmp 
WHERE rn <= 1

with some rules to translate NEWID() we’d be good to go on PostgreSQL:

SELECT column FROM 
   (SELECT column, ROW_NUMBER() OVER (ORDER BY RANDOM()) AS rn FROM table) tmp 
WHERE rn <= 1

and also on Oracle:

SELECT column FROM 
   (SELECT column, ROW_NUMBER() OVER (ORDER BY DBMS_RANDOM.VALUE) AS rn FROM table) tmp 
WHERE rn <= 1

Let me know if that will work for you, and I’ll implement it in SqlRender.

Hello,

This looks great and would work for any random case. Thanks for responding
and adding the functionality to the package.

Unfortunately, the NEWID() / RANDOM() / DBMS_RANDOM.VALUE doesnt currently
have an analog in Microsoft APS.

We’ll probably want to think about when a truly ‘random’ selection needs to
be made, vs. when the idea is to simply select a sample of any sort (in
which case, you could use the same trick as you’ve outlined here to create
a row_number and then select from that, but instead of ordering by a random
number, you could order by something else (person_id, _id, date, etc.)

Yes, another approach would be to use a hash function. This would make the selection ‘random’ (ie you wouldn’t select the persons with the lowest person_ids), but 100% repeatable. So for example in SQL Server:

SELECT column from (SELECT column, ROW_NUMBER() OVER (ORDER BY HASHBYTES('MD5',CAST(person_id AS varchar))) RN FROM table) tmp WHERE rn <= 1

Postgres and Oracle have the equivalent MD5() and DBMS_CRYPTO.HASH() functions respectively.

I added translation for both the RAND() and HASHBYTES() functions. You now have the choice between two evils:

Evil 1:

SELECT column FROM
(SELECT column, ROW_NUMBER() OVER (ORDER BY RAND()) AS rn FROM table) tmp
WHERE rn <= 1

will use true random selection. This code will work fine on most platforms, but cannot be translated to Microsoft PDW which does not support random number generation.

Evil 2:

SELECT column FROM
(SELECT column,
ROW_NUMBER() OVER (ORDER BY HASHBYTES('MD5',CAST(person_id AS varchar))) AS rn
FROM table) tmp
WHERE rn <= 1

This is pseudo-random selection. It will run on all platforms including PDW. However, on Oracle the user will need to be granted special privileges to use the hash function, and this has to be done by the DBA.

I added a section to the vignette describing all of this.

Great! Thanks for the options! I will make good use of them, and test them
in the process.

Hi ,

I have used Evil 2 Query however sometimes it does not display random rows from PDW table. Please let me is only the way to populate random rows from existing PDW table.

Thanks & Regards,

This just in: https://blogs.technet.microsoft.com/dataplatforminsider/2016/11/28/microsoft-releases-the-latest-update-to-analytics-platform-system/

Looks like PDW has support for RAND() and NEWID() now.

-Chris

Awesome! I will add this to SqlRender. Thanks @Chris_Knoll!

t