ISO Best Practices of CDM Indexing

krfeeney · May 21, 2020, 6:17pm

Hi All!

As we run these COVID-19 studies, we’ve got a lot of collaborators looking for ways to gain efficiency in the use of their local CDMs. I am looking for any community tips on best practices for CDM indexing across all RDMS. Putting a shout out to my friends: @clairblacketer @Frank @fdimartini @anthonysena @gregk @mgkahn @rtmill and anyone else who may have advice on this topic.

I think this would make for a great topic in the Book of OHDSI v2.0

Best,
Kristin

mgkahn · May 27, 2020, 1:14am

Kristin: Be it a blessing or a curse is for others to decide, but Google BigQuery has no indexing. It does massive parallel full table scans and intensive caching, all under the hood without us having to manage anything. So if you don’t want to bother with indices, move to GBQ.

Michael

gregk · May 27, 2020, 2:31pm

Well, as @mgkahn said in the message above - some database do require indexes, and some are not utilizing indexes. So, the answer on optimization is rather database specific.

Even though you specifically mentioned RDBMS, here some more general best practices:

Optimize your hardware first, where needed e.g. use SSD vs. HDD, have enough RAM (fine tune your database to take advantage RAM for caching and temp processing). In RedShift, there are choices of dense storage and compute storage. Chose more expensive but much faster compute storage if speed is required and can afford it.
Size your storage properly, leave enough temp space. For example, AWS RedShift is recommending at least 40% free space
If databases require indexes, create indexes after you load your data. This will not only allow you get better performance but also ensure the referencial constraints are in place and data is valid
Compute stats on your data
in AWS RedShift, make sure your data is properly distributed and sorted across nodes (as per RedShift DDL v5.3.1). Compression on certain columns also helps.
Just a note - as Michael said it - Google BigQuery does not require manual optimization and sizing. it is a database that was designed to automatically scale depending on a load and size of data (and both)

Hope this helps

mgkahn · May 27, 2020, 2:53pm

In Point #3, @gregk highlights at one aspect where others would consider GBQ’s lack of indices to be a “bad thing” – you cannot use indices to automatically enforce referential integrity. Those checks would have to occur as a separate step as part of your post-ETL data quality checks.

fdimartini · May 27, 2020, 4:47pm

Interesting topic @krfeeney. I would like to start off by saying that I feel like there should be indexing based on your needs. For example, the indexes on the github are valuable for the etl conversion and referential integrity. Where they fall short is considering the intentions ATLAS has with the data. In short, two sets of indexes (one for ETL, one for research) would be ideal; assuming you can have two copies of your data. There will definitely be some overlap in the indexes, but the main point is that what you intend to do with the data (i.e. process or perform analytics) will depend on which set of indexes you use.

The bigger problem is writing the best query for your use case while using the analytics index correctly. This is an entirely separate issue, and would have to be hashed out before you created the index because of many different variables (i.e. join vs. where statement vs. exists, vs subquery etc.). On top of that, it all has to work on many different platforms that may not play nice with syntax from another platform. I don’t think that everyone has the ability to go to BigQuery, so I could see the value in creating database platform specific indexes, but it all comes down to prioritization.

Ultimately I could potentially see a query performance review committee when studies are about to run to ensure performance and optimization are factored in based on the query used. Would be ideal for someone on that committee to have access to multiple database platforms with omop data to perform some testing. As they start to see commonalities, a standard group of indexes are formed for analytics, while special use cases could require “1-time” indexes and/or query improvements for example. This is just one idea, would love to hear some other input on this.

clairblacketer · May 29, 2020, 2:30pm

Hi @krfeeney

This is a great question and one I have seen come up on the CDM github a couple times in the past weeks. @mgkahn’s point is well taken that this often has to be done at the discretion of the site and is often dependent on their infrastructure. At J&J, for instance, we have a PDW appliance that stores CDM tables in a clustered columnstore index and we apply some additional indices for vocabulary and some results tables based on our experience and needs.

What I have tried to do on the CDM github is provide a sample set of indices for each dbms that we support to offer some guidance for those folks who may be new to OHDSI. It can be difficult to know up front which indices you may need to enhance query performance if you are unfamiliar with the columns and how data are stored in the model. One of the problems I have run into is that it can be difficult to write an index for a dbms I do not have access to for testing. Right now I can test Oracle, Postgres, Sql Server using OHDSI resources and Redshift, and PDW using J&J resources. The two I do not have access to are Impala and Netezza and I rely on sql render to create these once I have tested and fixed syntax for the others.

With that being said, here is my suggestion: at the very least we know that columns containing concept ids and other ids (think person_id, visit_occurrence_id) need to be indexed along with some of the vocabulary tables. It has been an action item on my list for a while to address the indices available on the github so I can take a first pass but if there are any existing indices in use by the community that I can leverage, that would be great. I would also appreciate any volunteers to test the indices or review them for errors.

I could also see adding a article to the CDM website detailing which indices should be created and how they enhance performance. I am not a db admin so if anyone has the knowledge to provide some text on that, I think it would go a long way.

Andrew · May 29, 2020, 2:51pm

My naive understanding is that the advantages of clustered columnstore indexing for vanilla SQL Sever are most likely to apply fact tables, because they tend to require full table scans rather than table seeks. If so, the vocab portions of the CDM seem like they might be the kind of fact tables clustered columnstore indexes are meant to optimize. The current OHDSI DDL and indexing for SQL server doesnt create these or any other OMOP tables as clustered columnstore indexes.

This page covers some of the relevant info on columnstore indexing alone and in combination with rowstore indexing for data warehousing scenarios for SQL Server .