What is the general order of creating CDM tables?

mccullen_j · August 7, 2023, 8:06pm

I read in the Book of OHDSI that the “ETL Flow” is generally to create the person table first followed by observation_period, dimensional tables (provider, care_site, etc.) then visit and clinical event tables.

However, the person table has a provider_id and location_id field, so wouldn’t you need to create those first? I’m using rabbit in a hat to document the ETL and was starting with person but am unsure what to fill in for provider_id given that it is not sourced for the source database, but is contingent upon the cdm.provider table being created.

Mark · August 7, 2023, 8:23pm

Build person, provider and care site tables, as temp tables first, then combine them into proper tables. The temp tables do not require all the data and are a great way to sort the data without doing very expensive update statements.

Of course, the size of your ETL, and the underlying DB, will dictate how expensive updates to tables are.

mccullen_j · August 8, 2023, 3:00pm

So you would create them as temp tables, update them, and then make them real tables?

Is there a particular reason why you wouldn’t just create the provider/care_site/location first?

Mark · August 8, 2023, 3:16pm

This may be an artifact of the EHR that we have, I need to create all three independent and then find the links between the said three tables. We have many locations across the state of Tennessee and our patients are not assigned to either providers nor to locations. Not all locations, nor all providers, will even be loaded into the OMOP tables. Yes, I could do deletes after, but delete statements are a code smell.

If you have a simpler ETL, perhaps what you are doing will be fine; on the other hand, building everything first into temp tables should work regardless the complexity of the ETL.

EDIT: to add context, we are building against an AOU grant, therefore we cannot do a complete OMOP load as this would be a privacy violation.