
Data Quality Dashboard for big databases

Executing DQD for partial loads of the database was quite fast. However, executing it for a full database load (e.g. 1 billion measurements) takes a very long time: it has been stuck for more than a week after the second “Connecting using postgresql driver” message. The database has all indexes created, but it seems the joins over a billion rows are taking a while.

Is there anything I can tweak in the DQD configuration to get at least some results in a reasonable time? (I’m assuming that I’ll have to run DQD several times to fix the errors, but I’m worried it will take months just to verify every little quality improvement.)

Have you optimized the Postgres server for data analytics? I highly recommend using PGTune to optimize the performance of your PostgreSQL database. Choose ‘Data warehouse’ as the DB type. Once you’ve entered your system specifications, PGTune provides two ways to update your configuration: by manually editing your postgresql.conf file, or by issuing ALTER SYSTEM commands via a SQL client (like pgAdmin). I personally find the latter easier to use. Don’t forget to restart your server for the changes to take effect.
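To give an idea of what that looks like, here is a sketch of the kind of ALTER SYSTEM statements PGTune generates. The settings are real Postgres parameters, but the values below are placeholders; use the numbers PGTune produces for your own hardware.

```sql
-- Illustrative ALTER SYSTEM output in the style of PGTune
-- (values are placeholders; take the ones generated for your machine).
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET effective_cache_size = '24GB';
ALTER SYSTEM SET maintenance_work_mem = '2GB';
ALTER SYSTEM SET work_mem = '64MB';
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
-- Restart the Postgres server afterwards so the changes take effect.
```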

Of course, all indices defined in the DDL files must be created, and on Postgres you should certainly call ANALYZE on all tables to improve performance.
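For example, a plain ANALYZE refreshes the planner statistics for every table in the database, or you can target the largest CDM tables individually. The schema name cdm below is an assumption; substitute your own.

```sql
-- Refresh planner statistics for the whole database...
ANALYZE;
-- ...or for the biggest CDM tables individually (schema 'cdm' is a placeholder).
ANALYZE cdm.measurement;
ANALYZE cdm.observation;
ANALYZE cdm.condition_occurrence;
```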

In the end, there are limits to how performant Postgres is for large databases. That is why commercial alternatives exist.

Thanks, I will try PGTune. I’m not the one writing the Data Quality Dashboard queries, so I cannot really tell what can be improved performance-wise (DQD analyzes most of the fields in the OMOP database).

If you have applied database constraints for primary keys, foreign keys, and NOT NULL columns, you can exclude the corresponding DQD checks; see the sketch below.
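A minimal sketch of how that could look from R, assuming the checkNames argument of DataQualityDashboard::executeDqChecks, which restricts the run to the checks you name (so simply leave out isPrimaryKey, isForeignKey, and isRequired). The connection details, schema names, and the particular checks listed are placeholders; verify the argument names against the DQD version you have installed.

```r
# Sketch: run only a named subset of DQD checks, skipping the constraint
# checks that the database already enforces (isPrimaryKey, isForeignKey,
# isRequired). All connection details and schema names are placeholders.
library(DataQualityDashboard)

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms     = "postgresql",
  server   = "myserver/ohdsi",  # placeholder
  user     = "user",            # placeholder
  password = "secret"           # placeholder
)

executeDqChecks(
  connectionDetails     = connectionDetails,
  cdmDatabaseSchema     = "cdm",      # placeholder
  resultsDatabaseSchema = "results",  # placeholder
  cdmSourceName         = "My CDM",
  outputFolder          = "output",
  checkLevels           = c("TABLE", "FIELD"),
  checkNames            = c(  # illustrative subset; add the checks you need
    "measurePersonCompleteness",
    "measureValueCompleteness",
    "plausibleValueLow",
    "plausibleValueHigh"
  )
)
```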
