OHDSI Home | Forums | Wiki | Github

Extension columns for ETL auditing & lineage

(Tim Quinn) #1

We have an interest in using the CDM as the basis for a research data warehouse (as perhaps others do also). Therefore, we are “extending” the CDM to include columns that pertain specifically to the ETL processes that load and update records in our tables. They include columns for transparent data lineage (back to records in the source) and version history/change data capture (CDC). As a contribution to the community, I’m sharing our data warehouse ETL columns (see attachment). Please note that the data types are specific to SQL Server.
etl_audit_columns_OMOP.xlsx (10.3 KB)

(Tim Quinn) #2

Not apparent from this simple list of ETL columns is our approach for capturing the version history of records. As an example: We have a person_version table that mirrors the person table, except that person_id is a foreign key and etl_person_version_id is the primary key. The current version of each record is stored in the person table, while every historical (ie, superseded) version of a record is stored in the person_version table.

We detect a record version change by computing the hash value of a newly arrived record from the source and comparing its hash value to the one currently in the person table. Sometimes new source records do not result in a new OMOP record version. Why? Because we load only a subset of columns from the source record and we care to capture the change history of only a subset of those columns.

The rationale for this approach is apparent: Most users only care about the “latest and greatest” version of each record in the person table. Only in rare cases do we need to query the change history. Consequently, we save on storage by not saving the latest version in person_version, at the expense of needing a UNION operation when we do need the full history.

(Frank D) #3

Hi Tim,

Looks like you have a good handle on the ETL process. Was curious how often you run your refresh process and you how handle updates to the vocabulary? Also, how do you handle fact/event tables, do you use a hash key as well?

(Tim Quinn) #4

Thanks for the note, @fdimartini. Full transparency: we are only just starting our OMOP project, so we haven’t yet worked out many of the necessary details. However, several of our project team’s members have years of data warehousing experience, so we’re doing our best to leverage our hard-won lessons from the past when designing our OMOP ETL pipeline.

Our refresh process will be daily and, as you can perceive from my prior post, it will consist of an incremental or “delta” load, rather than a full truncate-and-reload every time.

My plan was to use the same CDC approach to the event tables (eg, visit_occurrence, measurement) as we’ve designed for the “master file” tables (eg, person, provider, location), even though we expect much lower volumes of record updates in these tables. The rationale is as follows: we’re building a metadata-driven process that makes it easy to apply the same ETL steps uniformly across all tables in the model.

We haven’t yet tackled vocabulary refreshes. We might decide not to apply the CDC approach to these tables, simply because the content is published by OHDSI on ATHENA and, therefore, is more controlled than other, more volatile sources.

(Frank D) #5

I’ve always been a fan of a metadata driven process - makes everyone’s lives easier.

If you are taking suggestions for updating the vocab, there are a few approaches - you can just keep the existing vocab and update on a monthly/quarterly basis, or scan new vocab and compare each domain xxx_concept_id to see if the upgrade impacts you.

The only issue with refreshing the full dataset with a new vocab is generating new identity columns (domainname_id), which may impact your research team if they reference them. Something to consider.

But ultimately, it is nice to hear how community members are handling their incremental processes, it isn’t a regular topic on the forums.