Survey vocabularies in OMOP

Alexdavv · March 17, 2021, 1:37pm

To some extent, like every other source vocabulary we’ve tackled.
But the real reason why we can’t…

…is the thousands of duplicates: every distinct piece of valuable information is recorded in many different ways.

It’s even worse than keeping all general (non-EAV-type) vocabularies Standard: there’s no way to build a cohort out of UKB, PPI, NAACCR, and CAP. You simply can’t know everything that lies underneath, whereas that’s quite possible with ICD, MedDRA, or Read, agree?

Not a real option, since pre-coordination is only a way to represent the combinations. You still have the same number of combinations to deal with.

Not as simple as that. Half of the EAV-type data carry the real meaning in the answer, while the other half carry it in the question.
So we have to consolidate twice, and then dedup on the Q-A vs A-Q principle.
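
A minimal sketch of that two-step consolidation, using made-up question and answer strings rather than real UKB or PPI content. The `meaning_in` flag, the two lookup tables, and the `canonical_key` helper are hypothetical names introduced only for illustration; in practice the consolidation tables would be curated by hand.

```python
from collections import defaultdict

# Hypothetical survey rows: (source, question, answer, where the meaning lives).
ROWS = [
    ("survey_a", "Have you ever had asthma?", "Yes", "question"),
    ("survey_b", "Which conditions have you had?", "Asthma", "answer"),
    ("survey_b", "Which conditions have you had?", "Diabetes", "answer"),
    ("survey_c", "Ever diagnosed with diabetes?", "Yes", "question"),
]

# Illustrative consolidation tables: one for question-driven rows,
# one for answer-driven rows (the "consolidate twice" step).
QUESTION_MEANING = {
    "Have you ever had asthma?": "asthma",
    "Ever diagnosed with diabetes?": "diabetes",
}
ANSWER_MEANING = {"Asthma": "asthma", "Diabetes": "diabetes"}


def canonical_key(question, answer, meaning_in):
    """Return the consolidated meaning, wherever the source put it."""
    if meaning_in == "question":
        return QUESTION_MEANING[question]
    return ANSWER_MEANING[answer]


def dedup(rows):
    """Group Q-A and A-Q style rows that share one meaning."""
    groups = defaultdict(list)
    for source, question, answer, meaning_in in rows:
        groups[canonical_key(question, answer, meaning_in)].append(source)
    return dict(groups)


groups = dedup(ROWS)
```

Here `groups` ends up with one entry per meaning, each listing the surveys that recorded it, regardless of whether the meaning sat in the question or in the answer.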

And even if we did this huge job, how would these very specific “ever in your life / never ever / past history / at what age / in childhood / within the last x years, etc.” be useful?

Ok, we can dedup them against each other, but they’re still counterparts of the real OMOP “History of… / Condition” instances.

What we really need to do is to map them to the general OMOP-like instances.

Sounds good, but it requires us to take the actions mentioned above (new fields, new survey data convention), and not only for the Observation table. I think we might want to follow the link from the source EAV-type record to the target condition_occurrence record.

2B might be the case for vocabularies that are not used across multiple institutions, while others may become a part of OMOP once that is useful for the community.

Well, what would the CDM part be then for many EAV-type sources? The Person table?

How about this:

1. Introduce the _value_source_concept_id and _value_source_value fields in the required tables. Make these fields visible to ATLAS.
2. Synchronize the field list between the wide mapping tables and the event tables. Get rid of concatenated codes/values in ETL and make JOINs on multiple fields instead.
3. Change the convention and start treating EAV data (including Survey data) as non-Standard source data within a new Domain. Consistently de-Standardize it in the vocabularies (including LOINC, PPI, UKB, NAACCR, etc.). Ask users to target the _source fields in their specific queries, using the concept_ids rather than the source_values.
4. On a use-case basis, start mapping EAV data to OMOP Standard instances. Introduce new concepts (as for Smoking or Lab tests) or new vocabularies (like Cancer Modifier) if needed.
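
The point about JOINs on multiple fields could look roughly like this, sketched with Python's built-in `sqlite3` module. The table and column names (`source_event`, `wide_mapping`, `value_source_code`, and so on) are hypothetical and not part of any released CDM version, and the concept_ids are made-up values in the 2-billion local range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical raw EAV rows: question code and answer code kept apart.
    CREATE TABLE source_event (
        person_id INTEGER,
        source_code TEXT,
        value_source_code TEXT
    );
    -- Hypothetical wide mapping table carrying the same two key fields.
    CREATE TABLE wide_mapping (
        source_code TEXT,
        value_source_code TEXT,
        source_concept_id INTEGER,
        value_source_concept_id INTEGER
    );
    INSERT INTO source_event VALUES
        (1, 'Q_SMOKE', 'A_CURRENT'),
        (2, 'Q_SMOKE', 'A_NEVER');
    INSERT INTO wide_mapping VALUES
        ('Q_SMOKE', 'A_CURRENT', 2000000001, 2000000101),
        ('Q_SMOKE', 'A_NEVER', 2000000001, 2000000102);
    """
)

# Join on both fields instead of building a 'Q_SMOKE|A_CURRENT' style key.
rows = conn.execute(
    """
    SELECT e.person_id, m.source_concept_id, m.value_source_concept_id
    FROM source_event AS e
    JOIN wide_mapping AS m
      ON m.source_code = e.source_code
     AND m.value_source_code = e.value_source_code
    ORDER BY e.person_id
    """
).fetchall()
conn.close()
```

Keeping the question and the answer in separate columns means each side gets its own concept_id, so a query can filter on the answer concept without parsing a concatenated string.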

That is a tricky question. But what if we change the concept of the “source data” a little bit:

- if the data is dataset-specific, we can make it OMOPed and queryable by using the source_concept_ids (2B or in Athena - see the rule above);
- once we identify the same data in multiple sources and a need to query it, it becomes a Standard within a well-curated vocabulary or OMOP-generated stuff.
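
As a rough illustration of that rule, the sketch below flags which consolidated meanings appear in more than one dataset. The dataset names, the meanings, and the `min_sources` threshold are all assumptions made for the example; the "need to query it" half of the rule is a community decision and is not modeled here.

```python
from collections import defaultdict

# Hypothetical observations: (dataset, consolidated meaning).
SEEN = [
    ("dataset_a", "smoking status"),
    ("dataset_b", "smoking status"),
    ("dataset_a", "site-specific consent form version"),
]


def classify(seen, min_sources=2):
    """Split meanings into dataset-specific ones and promotion candidates."""
    sources = defaultdict(set)
    for dataset, meaning in seen:
        sources[meaning].add(dataset)
    return {
        meaning: "standard_candidate" if len(ds) >= min_sources else "source_only"
        for meaning, ds in sources.items()
    }


status = classify(SEEN)
```

Anything marked `source_only` would stay queryable through its source_concept_id, while a `standard_candidate` is only a prompt for review, not an automatic promotion.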