Extension columns for ETL auditing & lineage

quinnt · August 25, 2020, 8:44pm

We have an interest in using the CDM as the basis for a research data warehouse (as perhaps others do also). Therefore, we are “extending” the CDM to include columns that pertain specifically to the ETL processes that load and update records in our tables. They include columns for transparent data lineage (back to records in the source) and version history/change data capture (CDC). As a contribution to the community, I’m sharing our data warehouse ETL columns (see attachment). Please note that the data types are specific to SQL Server.
etl_audit_columns_OMOP.xlsx (10.3 KB)

quinnt · August 25, 2020, 8:58pm

Not apparent from this simple list of ETL columns is our approach for capturing the version history of records. As an example: We have a person_version table that mirrors the person table, except that person_id is a foreign key and etl_person_version_id is the primary key. The current version of each record is stored in the person table, while every historical (ie, superseded) version of a record is stored in the person_version table.

We detect a record version change by computing the hash value of a newly arrived record from the source and comparing its hash value to the one currently in the person table. Sometimes new source records do not result in a new OMOP record version. Why? Because we load only a subset of columns from the source record and we care to capture the change history of only a subset of those columns.

The rationale for this approach is apparent: Most users only care about the “latest and greatest” version of each record in the person table. Only in rare cases do we need to query the change history. Consequently, we save on storage by not saving the latest version in person_version, at the expense of needing a UNION operation when we do need the full history.

fdimartini · September 8, 2020, 6:45pm

Hi Tim,

Looks like you have a good handle on the ETL process. Was curious how often you run your refresh process and you how handle updates to the vocabulary? Also, how do you handle fact/event tables, do you use a hash key as well?

quinnt · September 8, 2020, 7:08pm

Thanks for the note, @fdimartini. Full transparency: we are only just starting our OMOP project, so we haven’t yet worked out many of the necessary details. However, several of our project team’s members have years of data warehousing experience, so we’re doing our best to leverage our hard-won lessons from the past when designing our OMOP ETL pipeline.

Our refresh process will be daily and, as you can perceive from my prior post, it will consist of an incremental or “delta” load, rather than a full truncate-and-reload every time.

My plan was to use the same CDC approach to the event tables (eg, visit_occurrence, measurement) as we’ve designed for the “master file” tables (eg, person, provider, location), even though we expect much lower volumes of record updates in these tables. The rationale is as follows: we’re building a metadata-driven process that makes it easy to apply the same ETL steps uniformly across all tables in the model.

We haven’t yet tackled vocabulary refreshes. We might decide not to apply the CDC approach to these tables, simply because the content is published by OHDSI on ATHENA and, therefore, is more controlled than other, more volatile sources.

fdimartini · September 9, 2020, 6:59pm

I’ve always been a fan of a metadata driven process - makes everyone’s lives easier.

If you are taking suggestions for updating the vocab, there are a few approaches - you can just keep the existing vocab and update on a monthly/quarterly basis, or scan new vocab and compare each domain xxx_concept_id to see if the upgrade impacts you.

The only issue with refreshing the full dataset with a new vocab is generating new identity columns (domainname_id), which may impact your research team if they reference them. Something to consider.

But ultimately, it is nice to hear how community members are handling their incremental processes, it isn’t a regular topic on the forums.

KJDonovan · February 4, 2021, 7:07pm

@quinnt – I am new to this forum and the OMOP CDM, but not to warehousing. I am in the process of evaluating the CDM for a client and I too am looking for ways to extend the base model with lineage attributes and other things.
I would guess that adding attributes to the tail end of some of the data table like Tim suggests is the way to go.
What don’t know is if I make such modifications to the core model will they break downstream integrations with ATLAS?
Thanks,
Kevin.

quinnt · February 4, 2021, 7:50pm

Adding extension columns to the “tail end” of the standard OMOP tables is what we’re doing, and I know of a few other organizations who are doing the same. Other organizations are creating “sibling” tables (with the same primary keys) in order to keep their standard OMOP tables “pure”.

AFAIK, extension columns don’t present a problem for ATLAS. If they did, you could always create a set of database views in a different schema to present “pure” OMOP tables to ATLAS.

KJDonovan · February 4, 2021, 7:58pm

Tim,

Thanks for quick response, very helpful - I think the route we may take is some form of shadow schema based on the CMD with the additional attribution (lineage tags, warehouse attributes) and physical constraints (namely alternate keys) I expect in a typical warehouse environment and then feed the CMD proper from that schema.

Thanks.

dlaxar · September 18, 2021, 1:24pm

I just stumbled upon this thread because we’re trying to do the same thing.

I am wondering however, not actually deleting data (but marking it as deleted) is prone to have consequences downstream (ATLAS, etc.). The way I had initially planned on solving this is by having a ‘sibling’ table and a view selecting only those with sibling.deleted != 1.

Surely, we can’t be the only one with this issue?

Chris_Knoll · September 18, 2021, 2:01pm

Standard tools will definitely not honor those ‘deleted’ columns, but you can make the approach where you have your base tables with the ‘deleted’ column and then create views per the standard CDM tables which only show non-deleted columns. Materialized views is good for this (for performance) but you would be creating copies of data.

quinnt · September 19, 2021, 8:34pm

Hello Daniel (@dlaxar),

We have adopted the approach that @Chris_Knoll described: We have a Boolean column in our physical tables akin to your proposed is_deleted column. We created a separate database schema containing only OMOP-compliant views for ATLAS that include the WHERE clause condition is_deleted = 0.

We implemented our OMOP database on Microsoft SQL Server. We configured all of our physical tables as clustered columnstore indexes (CCIs), which gives us query performance so good that we did not need to materialize our views. (Side note: Our EHR vendor, Epic Systems Corporation, does exactly the same thing in their Caboodle data warehouse.)

In our OMOP-compliant views, we excluded all our extra ETL audit and data lineage columns. ATLAS works just fine.

dlaxar · September 20, 2021, 6:51am

@Chris_Knoll and @quinnt
Thank you so much!