Using HADES for in-house curation of registry data

Marko_Cavlina · August 11, 2025, 10:36am

Hi, I’d like to tag the provenance of data in Atlas, so that the different registry heads can easily tell their own data apart from everyone else’s.
Even before we start thinking about networks and federated analyses, my data stewards will demand to check and verify the data under their care. That comes after the basic data quality checks during ETL and Darwin’s onboarding checks.

So the head of the Birth registry will want to see all Observations that were ETL-ed into the CDM from her registry, without having to sift through the Diabetes, Invalids or Hospitalizations registries, all of which could hold overlapping data such as BMI measurements, marital status, etc.

I know I could go the custom concepts route, but I’d like to know whether there is a more elegant solution. I guess I’m not the first with this use case.

Christian_Reich · August 11, 2025, 12:25pm

Hi Marko:

I have full empathy for what you want to do. But you should realize that it is somewhat the opposite of what the CDM and Atlas are designed for. They present a unified representation of medical data irrespective of the context. Or rather, the context is already baked into the ETL. So, instead of tracking where data come from so you can deal with them separately, the CDM and Atlas treat them all in the same source-independent way.

So, feel free to add fields to your tables or tweak Atlas, but OHDSI is about analyzing what happens to the patient, not the institution or data capture process.

MPhilofsky · August 12, 2025, 5:10pm

Add each registry as a separate data source in your Atlas instance.

Javier · August 13, 2025, 5:46am

We went the custom concepts route you mention and find it the most elegant.

We have joined more than 10 registries into one OMOP instance: cancer registry, primary health care registry, secondary health registry, etc.
We created our own non-standard visit vocabulary with a code (or more) for each registry.
All the medical events from a registry are linked to a visit occurrence that carries the registry's non-standard visit code in visit_source_concept_id and the standard visit concept in visit_concept_id.
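
A minimal sketch of that tagging step, not taken from the FinnGen ETL itself. The registry names, the custom concept IDs and the choice of 9202 (Outpatient Visit) as the standard visit concept are assumptions made for illustration; the column that holds the source visit concept is visit_source_concept_id in CDM v5.x.

```python
from dataclasses import dataclass

# Hypothetical custom (non-standard) visit concepts, one per registry.
# The IDs are made up; they only follow the "> 2 billion" convention.
REGISTRY_VISIT_CONCEPTS = {
    "cancer": 2_000_000_001,
    "primary_care": 2_000_000_002,
}

# Assumed standard visit concept for each registry (9202 = Outpatient Visit).
STANDARD_VISIT_CONCEPTS = {
    "cancer": 9202,
    "primary_care": 9202,
}


@dataclass
class VisitOccurrence:
    """Only the columns relevant to provenance tagging are modelled."""
    visit_occurrence_id: int
    person_id: int
    visit_concept_id: int          # standard concept, used by shared cohorts
    visit_source_concept_id: int   # custom concept identifying the registry
    visit_source_value: str


def make_visit(visit_occurrence_id: int, person_id: int, registry: str) -> VisitOccurrence:
    """Build a visit row that records which registry an event came from."""
    return VisitOccurrence(
        visit_occurrence_id=visit_occurrence_id,
        person_id=person_id,
        visit_concept_id=STANDARD_VISIT_CONCEPTS[registry],
        visit_source_concept_id=REGISTRY_VISIT_CONCEPTS[registry],
        visit_source_value=registry,
    )
```

Every clinical event loaded from that registry would then point at such a visit through its visit_occurrence_id, so the standard fields stay standard and the registry is still recoverable.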

This way any standard cohort definition, for example from the Phenotype Library, will gather patients from all the registries.
But we can still use Atlas to create a local cohort definition that considers only a few registries.
For example, we may want to find only the people who have a particular cancer diagnosis in the cancer registry. We create a cohort definition and filter by visit occurrence with that visit_source_concept_id.
It can be more versatile still: for example, a cohort definition that finds patients with a first chronic disease diagnosis in any registry, with inclusion criteria that keep only those who appear in the kidney registry at some later point.
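
To illustrate the logic of such a registry filter outside Atlas, here is a small sketch using an in-memory SQLite database. Atlas generates its own SQL, so this only shows the idea. The tables are cut down to a few columns, and all IDs (the custom visit concepts and the condition concept 12345) are placeholders, not real vocabulary entries.

```python
import sqlite3

CANCER_REGISTRY_VISIT = 2_000_000_001  # hypothetical custom visit concept
CONDITION_OF_INTEREST = 12345          # placeholder condition concept ID

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit_occurrence (
    visit_occurrence_id     INTEGER PRIMARY KEY,
    person_id               INTEGER,
    visit_concept_id        INTEGER,
    visit_source_concept_id INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id               INTEGER,
    condition_concept_id    INTEGER,
    visit_occurrence_id     INTEGER
);
-- Person 100 comes from the cancer registry, person 200 from another one.
INSERT INTO visit_occurrence VALUES
    (1, 100, 9202, 2000000001),
    (2, 200, 9202, 2000000002);
INSERT INTO condition_occurrence VALUES
    (10, 100, 12345, 1),
    (11, 200, 12345, 2);
""")


def persons_with_condition(conn, condition_concept_id, registry_visit_concept_id=None):
    """Return person_ids with the condition, optionally limited to one registry."""
    sql = """
        SELECT DISTINCT co.person_id
        FROM condition_occurrence co
        JOIN visit_occurrence vo
          ON vo.visit_occurrence_id = co.visit_occurrence_id
        WHERE co.condition_concept_id = ?
    """
    params = [condition_concept_id]
    if registry_visit_concept_id is not None:
        # The registry filter: match on the custom source visit concept.
        sql += " AND vo.visit_source_concept_id = ?"
        params.append(registry_visit_concept_id)
    sql += " ORDER BY co.person_id"
    return [row[0] for row in conn.execute(sql, params)]
```

Without the registry argument the query behaves like a standard cohort definition and returns both people; with CANCER_REGISTRY_VISIT it returns only the person recorded in the cancer registry.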

I hope sharing this experience helps.
Here is our ETL: FinnGen ETL to OMOP CDM

MPhilofsky · August 14, 2025, 3:39pm

Let me highlight this:

We MUST keep standard concept_ids in the standard concept_id field of every table in order to achieve reproducible RWE.

Also, for the above to work, you have to put those custom (> 2 billion) concepts and their relationships to standard concept_ids in the Concept and Concept Relationship tables. The source_to_concept_map table is not used by Atlas. See my poster here, which explains the pros and cons of the two methods for mapping non-OHDSI-supported codes to standard concepts.
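
As a rough sketch of what those rows could look like, the helper below builds one Concept row and one "Maps to" Concept Relationship row for a custom visit concept. The helper name, the vocabulary_id, the concept code, the domain and class values and the validity dates are all assumptions for illustration; only the column names follow the CDM vocabulary tables.

```python
from datetime import date

CUSTOM_CONCEPT_FLOOR = 2_000_000_000  # custom concept IDs must be above this


def custom_concept_rows(concept_id, name, code, vocabulary_id, standard_concept_id):
    """Return (concept_row, relationship_row) for one custom concept.

    Hypothetical helper: it only assembles dictionaries and does not
    write to a database.
    """
    if concept_id <= CUSTOM_CONCEPT_FLOOR:
        raise ValueError("custom concept_id must be greater than 2 billion")

    valid_start, valid_end = date(1970, 1, 1), date(2099, 12, 31)
    concept_row = {
        "concept_id": concept_id,
        "concept_name": name,
        "domain_id": "Visit",            # assumed domain for a visit code
        "vocabulary_id": vocabulary_id,  # your own vocabulary, also made up
        "concept_class_id": "Visit",     # assumed class
        "standard_concept": None,        # NULL marks it as non-standard
        "concept_code": code,
        "valid_start_date": valid_start,
        "valid_end_date": valid_end,
        "invalid_reason": None,
    }
    relationship_row = {
        "concept_id_1": concept_id,
        "concept_id_2": standard_concept_id,
        "relationship_id": "Maps to",
        "valid_start_date": valid_start,
        "valid_end_date": valid_end,
        "invalid_reason": None,
    }
    return concept_row, relationship_row
```

For example, custom_concept_rows(2_000_000_001, "Cancer registry", "CANCER_REG", "MyRegistryVisits", 9202) would describe a custom visit concept mapped to Outpatient Visit. A real load would likely also need a matching row in the Vocabulary table and possibly the reverse "Mapped from" relationship.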