Storing Source Concept Mappings / Optimal Workflow

Hello All,

I am wondering if anyone else has had a similar experience managing source concept mappings. I have historically handled this with a source_concept.csv and a source_concept_relationship.csv file, appended to over time.

I have avoided the source-to-concept-map table for the reasons laid out in other forum posts (e.g. it being a hangover/legacy table).

Using CSV files with git looks like a good solution at first, until someone opens one in Excel (often introducing type/formatting/rounding issues) or adds a batch of concepts with overlapping concept_ids, and so on.

I think the approach is doable by one person, but scales poorly.

I have mocked up a lightweight CRUD RESTful API in front of a Postgres instance. It allows easy submission and deletion of mappings, and lets others contribute without the issues that can occur with git and CSV: for example, automatic ID generation and constraints that prevent duplicate mappings (enforcing many-to-one relationships).
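To illustrate the kind of constraint I mean, here is a minimal sketch (using an in-memory SQLite database as a stand-in for Postgres; the table and column names are illustrative, not my actual schema):

```python
import sqlite3

# In-memory SQLite standing in for Postgres; the real thing would use an
# IDENTITY column and the same UNIQUE constraint.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE source_concept_map (
        id INTEGER PRIMARY KEY AUTOINCREMENT,   -- automatic ID generation
        source_code TEXT NOT NULL,
        target_concept_id INTEGER NOT NULL,
        UNIQUE (source_code)  -- each source code maps to at most one target
    )
""")

conn.execute(
    "INSERT INTO source_concept_map (source_code, target_concept_id) "
    "VALUES (?, ?)",
    ("ICD10:E11.9", 201826),  # illustrative mapping
)

# A second mapping for the same source code is rejected by the database,
# rather than slipping silently into a CSV file.
try:
    conn.execute(
        "INSERT INTO source_concept_map (source_code, target_concept_id) "
        "VALUES (?, ?)",
        ("ICD10:E11.9", 4193704),
    )
except sqlite3.IntegrityError as e:
    print("rejected duplicate:", e)
```

The point being: the database does the bookkeeping, instead of every contributor having to remember the rules.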

My question is: what are other people’s experiences with this? I have heard of people using Git for the most part, but I want to avoid it for the reasons above. The only way around them is to use pre-commit hooks or GitHub Workflows to check for duplicate entries and so on, at which point one has recreated a database on top of GitHub, and that feels wrong!
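For completeness, the sort of pre-commit check I mean is only a few lines (file path and column name are illustrative):

```python
import csv
from collections import Counter

def find_duplicate_ids(path: str, id_column: str = "source_concept_id") -> list[str]:
    """Return the values in id_column that appear more than once in the CSV."""
    with open(path, newline="") as f:
        counts = Counter(row[id_column] for row in csv.DictReader(f))
    return [value for value, n in counts.items() if n > 1]
```

A pre-commit hook would call this and fail the commit if the returned list is non-empty, which works, but it is exactly the kind of integrity check a database gives you for free.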

Am I overthinking this? Have others used a database to handle this?

Many thanks!

Hello @lawrenceadams and welcome to OHDSI!

You are NOT overthinking this. I have experienced most of your concerns. We have been using one giant, unwieldy CSV with an overly complex method to manage it for years, and we are currently transitioning the mappings to a table in our GBQ environment. The transition is still in progress, so results are not in. However, I’m hopeful it will be easier to manage, that others will be able to contribute more easily, and that QA checks can be automated.

Thank you @MPhilofsky - it is great to be here!

I am very glad to hear this - I thought I was going mad! I think the CSV approach is fine when you have a relatively small number of mappings for a small cohort, but it is brittle and breaks down when scaling to multiple hospital systems and the like.

Would love to hear how you get on!
