Surveys (and other EAV-type data, like registries and clinical trials) are defined as question-answer (sometimes variable-value) pairs. We have 30k questions and 40k answers today in the Vocabularies. But that very construct causes trouble for the way we represent facts in a standardized form and query them, because:
- They are spread over Measurement and Observation domains (questions) and Drug, Meas Value, Observation, Procedure domains (answers).
- They are mostly defined locally rather than independently, although there are some standard ones.
- The separation into question and answer is arbitrary. For example, among the existing survey concepts there are 91 questions and 102 answers containing the word “Diabetes”, usually in the context of current, historic, family, treatment or complication.
- Questions, and especially answers, are highly repetitive. For example, there are 2941 questions with one possible answer “No”, to be picked from three different “No” concepts. There are 12 different answer concepts for “White” and “Black”. Do these all mean the same thing, or does each have a different meaning?
- There is a ton of the usual junk like flavors of null, “Other”, “Obsolete”, etc.
It’s a mess. ATLAS really cannot use them. So the question now is what we want to do with them:
1) Keep adding questions and answers as standard concepts (hopefully with some domain cleanup)
2) Clean them up, consolidate identical answer concepts and pre-coordinate them with the questions
3) Create non-standard or 2B concepts, so they can be used for local querying
4) Kick them out of the Vocabularies completely and have local non-CDM tables for them
I don’t have a strong preference, but if I had to pick, I would go with the second option. That way, at least we don’t have ambiguity in the meaning of things. The third option would require adding a source_value_concept_id field to the Observation table.
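To make the second option more concrete, here is a rough sketch of what "consolidate identical answers and pre-coordinate them with the questions" could look like. Everything in it is an assumption for illustration: the concept IDs and question texts are made up, the helper names `consolidate` and `precoordinate` are hypothetical, and it treats answers as identical when their names match after trimming and lowercasing. Whether same-named answers really mean the same thing is exactly the open question above, so a real cleanup would need manual review.

```python
from collections import defaultdict

# Hypothetical toy data; none of these IDs are real Vocabulary concept_ids.
answers = [
    (9001, "No"),
    (9002, "no"),
    (9003, "No "),
    (9004, "Yes"),
]
# (question_id, question_name, answer_concept_id), also made up.
question_answers = [
    (8001, "Ever had diabetes", 9001),
    (8001, "Ever had diabetes", 9004),
    (8002, "Family history of diabetes", 9002),
]


def normalize(name):
    """Lowercase and collapse whitespace so trivially different names compare equal."""
    return " ".join(name.lower().split())


def consolidate(answers):
    """Map each answer concept_id to one surviving id per normalized name (the smallest id)."""
    groups = defaultdict(list)
    for concept_id, name in answers:
        groups[normalize(name)].append(concept_id)
    canonical = {}
    for ids in groups.values():
        keep = min(ids)
        for concept_id in ids:
            canonical[concept_id] = keep
    return canonical


def precoordinate(question_answers, answers, canonical):
    """Build one combined 'question: answer' entry per question and surviving answer."""
    names = {concept_id: name.strip() for concept_id, name in answers}
    combined = set()
    for question_id, question_name, answer_id in question_answers:
        keep = canonical[answer_id]
        combined.add((question_id, keep, f"{question_name}: {names[keep]}"))
    return sorted(combined)


canonical = consolidate(answers)
for row in precoordinate(question_answers, answers, canonical):
    print(row)
```

In this toy data the three “No” variants collapse into one surviving concept, and each question-answer combination becomes a single pre-coordinated entry.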
But what I would really recommend is a good mapping to proper domain concepts. For example, resolve the LOINC question-answer pair “Other health condition- Med: Diabetes mellitus” to the Condition concept “Diabetes mellitus”. Only then does the information in the surveys become queryable in a standardized way, such as in ATLAS.
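As a minimal sketch of that mapping idea, the snippet below resolves a question-answer pair to a domain concept through a curated lookup table. The table `PAIR_MAP`, the `StandardConcept` class, the functions `resolve` and `route`, and the concept ID `1` are all hypothetical placeholders, not real Vocabulary content; only the diabetes pair itself comes from the example above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class StandardConcept:
    domain_id: str
    concept_id: int  # placeholder value, not a real Vocabulary concept_id
    concept_name: str


# Hypothetical curated mapping: (question, answer) -> standard domain concept.
PAIR_MAP: Dict[Tuple[str, str], StandardConcept] = {
    ("Other health condition- Med", "Diabetes mellitus"):
        StandardConcept("Condition", 1, "Diabetes mellitus"),
}


def resolve(question: str, answer: str) -> Optional[StandardConcept]:
    """Return the mapped domain concept, or None if the pair has no mapping yet."""
    return PAIR_MAP.get((question.strip(), answer.strip()))


def route(pairs: List[Tuple[str, str]]):
    """Split survey pairs into mapped domain concepts and still-unmapped pairs."""
    mapped, unmapped = [], []
    for question, answer in pairs:
        target = resolve(question, answer)
        if target is None:
            unmapped.append((question, answer))
        else:
            mapped.append(target)
    return mapped, unmapped


mapped, unmapped = route([
    ("Other health condition- Med", "Diabetes mellitus"),
    ("Some local question", "Other"),  # made-up pair with no mapping
])
print(mapped)
print(unmapped)
```

The `unmapped` list is the part that stays behind: pairs for which nobody has curated a mapping, or for which no suitable standard concept exists.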
However, there is a but: it’s a lot of work, and who is going to do it? And what if we cannot find an existing standard concept, as with the ones @mcantor2 wants? Are we going to create new ones?
Thoughts?