
Survey vocabularies in OMOP

In the UK Biobank WG we discussed the problem of how to deal with Survey vocabularies. Here, we want to summarize the issues and discuss the possible solutions.

Looking at the initial proposal: while we are definitely moving in the right direction, we recognize that there are still more questions that need to be addressed in the long term.

In short, the plan was:

  • Across the different vocabularies, de-standardize duplicate questions and remap them to the identical question (to avoid duplication).

  • Across the different vocabularies, de-standardize duplicate answers and remap them to the identical answer (to avoid duplication).

  • Out of each set of identical questions, leave one distinct question as a Standard concept, to be used for constructing detailed survey-like cohorts by users who need the full granularity of the survey source data.

  • Use the synthetic concatenated concepts or the new MAPPING table to deliver mappings to the Standard entities recognized as useful in OMOP, so that classic OMOP cohorts provide the value they’re designed for.

  • Introduce a new relationship_id ‘Has Standard answer’ to represent the list of the Standard answers for the entire Question.

The issues we found:

  • The questions from the various vocabularies carry very different context and peculiarities, so the mapping rate (questions to questions) is really low. Most of them are gonna be unique Standard concepts that will eventually overload the Standard survey vocabulary.

  • Every specific answer list shades the meaning of the question concept, so the boundaries have to be defined for every single concept.

  • The proposed modeling doesn’t address the problem that survey-like cohorts still require referring to the source survey concepts.

Here is an example (all the concepts below never existed in the mentioned vocabularies, they are synthetic and only created to reproduce the problem):

The concerns on the slide are:

  1. The ‘Has Standard answer’ link is not really useful for non-Standard Question concepts, since they’re not supposed to be used in cohorts. For Standard ones it may be useful, but only in the Measurement Domain.
  2. If you don’t know the whole list of the possible answers used with a particular question concept, you can’t build universal survey-like cohorts anyway. Every time, you still need to jump back to the source vocabulary to check the answer list used (and the mappings of the non-standard answers).
  3. If we attach the whole list of answers derived from many source vocabularies to the Question concept they are mapped to, we’re gonna deal with ugly mixtures: year ranges (including conflicting ones) together with categories of different sorts/axes. Technically it may work on the vocabulary side (even though it requires substantial effort to maintain), but it is not going to be user-friendly. Moreover, for many of the Questions you get the full semantics of the concept only once you have looked into the list of its answers. Do we want to mix such things?

From here the vocabulary team started thinking about an alternative solution:

The proposal we came up with is to turn all the Survey Question/Answer concepts into non-standard ones. This means we’re still able to build the detailed survey-like cohorts, but using the source_concept_id only. Mapping to the valuable Standard facts will be provided as described above.

And here we come to the CDM limitations:

  1. To build question/answer pair-like cohorts in the _source_concept_id area, we would need the _value_source_concept_id field as well as the value_source_value field. The former doesn’t exist in OMOP; the latter was introduced in the Measurement table only.
    The addition of these fields would resolve the problem. Otherwise, we would need to create these ugly pre-coordinated question-answer pairs forever. From my point of view, pre-coordination in survey data is even worse (it doesn’t represent the survey source structure and is difficult to understand and maintain).

  2. When v6.0 was released (I checked this pdf), the survey_conduct convention described how the survey data should be stored. Now I can’t find any of it on cdm60.utf8
    Is it outdated, placed somewhere else or just lost?

It said the following:

  1. Patient responses to survey questions are stored in the OBSERVATION table. Each record in the OBSERVATION table represents a single question/response pair and is linked to a specific SURVEY/questionnaire using OBSERVATION.DOMAIN_OCCURRENCE_ID and SURVEY.SURVEY_OCCURRENCE_ID.

  2. Each response record is the response to a specific question identified by the OBSERVATION_CONCEPT_ID. This concept ID is a unique question contained in the CONCEPT table.

  3. An individual survey question can have multiple responses to a question (e.g. which of these items relate to you, a, b, c). Each response is stored as a separate record in the OBSERVATION table. The name (question) is stored as OBSERVATION_CONCEPT_ID and the value (answer) is stored as OBSERVATION_AS_CONCEPT_ID where the answer is categorical and is defined as a concept in the concept table, OBSERVATION_AS_NUMBER where the answer is numeric, OBSERVATION_AS_STRING where the answer is a free text string or OBSERVATION_AS_DATETIME.

  4. The question / answer observation record is linked to the patient questionnaire used for collecting the data using two new fields in the OBSERVATION table; DOMAIN_ID and DOMAIN_OCCURRENCE_ID.
    DOMAIN_ID for any survey related observations contains the text Survey and DOMAIN_OCCURRENCE_ID contains the SURVEY_OCCURRENCE_ID of the specific survey. This domain construct can be used for other observation groupings.

  5. The OBSERVATION table can also store survey scoring results. Many validated PRO questionnaires have scoring algorithms (many of which proprietary) that return an overall patient score based on the answers provided.
    Survey scores are identified by their OBSERVATION_CONCEPT_ID and are linked back to the scored survey using the same DOMAIN construct described.
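To make the quoted convention concrete, here is a minimal sketch of how an ETL step could turn the responses to one question into OBSERVATION rows. It is an illustration only, not part of the v6.0 spec: the `ObservationRow` class, the `to_observation_rows` function and all concept IDs are hypothetical, and the value fields use the `value_as_*` naming rather than the `OBSERVATION_AS_*` names in the quoted text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Union

Answer = Union[int, float, str, datetime]


@dataclass
class ObservationRow:
    person_id: int
    observation_concept_id: int                  # the question concept
    value_as_concept_id: Optional[int] = None    # categorical answer
    value_as_number: Optional[float] = None      # numeric answer
    value_as_string: Optional[str] = None        # free-text answer
    value_as_datetime: Optional[datetime] = None
    domain_id: str = "Survey"                    # linking construct from the quoted text
    domain_occurrence_id: Optional[int] = None   # SURVEY_OCCURRENCE_ID of the survey


def to_observation_rows(
    person_id: int,
    question_concept_id: int,
    answers: List[Answer],
    survey_occurrence_id: int,
    answer_concepts: Dict[str, int],
) -> List[ObservationRow]:
    """Emit one row per response; a multi-select question yields several rows."""
    rows = []
    for answer in answers:
        row = ObservationRow(
            person_id=person_id,
            observation_concept_id=question_concept_id,
            domain_occurrence_id=survey_occurrence_id,
        )
        if isinstance(answer, datetime):
            row.value_as_datetime = answer
        elif isinstance(answer, (int, float)):
            row.value_as_number = float(answer)
        elif answer in answer_concepts:
            # Categorical answer code that has a concept (hypothetical lookup).
            row.value_as_concept_id = answer_concepts[answer]
        else:
            row.value_as_string = answer
        rows.append(row)
    return rows


# "Which of these items relate to you: a, b, c" answered with a and c.
multi = to_observation_rows(1, 1001, ["a", "c"], 77, {"a": 2001, "b": 2002, "c": 2003})
print([r.value_as_concept_id for r in multi])  # [2001, 2003]
```

Each selected item becomes its own record, and all of them point back to the same survey through domain_id and domain_occurrence_id.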

If we move the survey vocabularies into the non-Standard area, it requires an updated convention:

  • Question-answer pairs (being non-Standard concepts) will be presented in the source_concept_id and source_value_concept_id fields only. The observation_concept_id and value_as_concept_id fields will be used for Standard target concepts only.

  • The spec says that the possible answers in Survey data include, among others, numeric values, free text and dates. So the observation_as_number, observation_as_string and observation_as_datetime fields are still “occupied” by the “source” Survey data and we can’t use them in the context of the observation_concept_id.
    We drafted a couple of examples to better understand the two options:
    (a): store all types of survey answers in the value_source_value field only (concept_code, numeric value, datetime or string). Populate the source_value_concept_id field when the answer is a concept. The survey answers will not be sorted into the observation_as_number, observation_as_string and observation_as_datetime fields. These fields will be used in the context of the observation_concept_id only.
    (b): always intentionally create a separate CDM record for the target Standard Observation concept, so that the Survey “source” data is always captured as a separate CDM Observation record and its answers can be sorted into the observation_as_number, observation_as_string and observation_as_datetime fields. It’s a little bit against the basic principle of data transformation in OMOP. Would it be easy to implement on the ETL side? What would indicate that such a transformation should be applied, or distinguish Survey source records from Standard records? A new ‘Survey’ Domain, being not a new table but a characteristic of the concept, may be a choice in the long run.
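For comparison, here is a rough sketch of the rows the two options would produce for one numeric survey answer. Everything in it is an assumption made for illustration: the function names, the concept IDs, the value_source_concept_id and value_source_value fields (which do not exist on the Observation table today), the value_as_* field naming, and the choice of observation_concept_id = 0 for the source-only record in option (b).

```python
from datetime import date
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def source_part(question_src_id: int, answer_raw: Any, answer_src_id: Optional[int]) -> Row:
    """Source-side fields shared by both options (two of them are hypothetical)."""
    return {
        "observation_source_concept_id": question_src_id,
        "value_source_concept_id": answer_src_id,  # hypothetical field
        "value_source_value": str(answer_raw),     # hypothetical on Observation
    }


def option_a(question_src_id, answer_raw, answer_src_id,
             target_concept_id, target_value_concept_id) -> List[Row]:
    """One record: source answer kept in *_source fields, value_as_* left to the target."""
    row = {
        "observation_concept_id": target_concept_id or 0,
        "value_as_concept_id": target_value_concept_id,
        **source_part(question_src_id, answer_raw, answer_src_id),
    }
    return [row]


def option_b(question_src_id, answer_raw, answer_src_id,
             target_concept_id, target_value_concept_id) -> List[Row]:
    """Two records: a source record with typed answers, plus a Standard target record."""
    source_row = {
        "observation_concept_id": 0,  # assumption: source record has no Standard concept
        **source_part(question_src_id, answer_raw, answer_src_id),
    }
    if isinstance(answer_raw, date):  # datetime is a subclass of date
        source_row["value_as_datetime"] = answer_raw
    elif isinstance(answer_raw, (int, float)):
        source_row["value_as_number"] = answer_raw
    elif answer_src_id is None:
        source_row["value_as_string"] = answer_raw
    rows = [source_row]
    if target_concept_id is not None:
        rows.append({
            "observation_concept_id": target_concept_id,
            "value_as_concept_id": target_value_concept_id,
        })
    return rows


# Synthetic question 900001 answered with the year 1987, mapped to synthetic target 800001.
rows_a = option_a(900001, 1987, None, 800001, None)
rows_b = option_b(900001, 1987, None, 800001, None)
print(len(rows_a), len(rows_b))  # 1 2
```

The sketch shows the trade-off in row counts: option (a) keeps a single record per fact, while option (b) doubles the records whenever a Standard target exists and needs some marker to tell the two kinds apart.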

We would like to hear other ideas and get feedback from the community.
Tagging the working group and people involved in the survey data discussions before:
@Alexandra_Orlova @Andrew @anna_corning @aostropolets @Chris_Knoll @Christian_Reich @clairblacketer @cmkerr @ColinOrr @Daniel_Prieto @Dave.Barman @Dymshyts @ellayoung @ericaVoss @gregk @Josh_R @kyriakosschwarz @lee_evans @linikujp @MaximMoinat @mcantor2 @mik @mmandal @MPhilofsky @mvanzandt


Well, since there are almost 4000 of us, let me tag some more than 25
@nick @nlw @parisni @Patrick_Ryan @QI_omop @rookie_crewkie @schillil @SCYou @SELVA_MUTHU_KUMARAN @spiros @Vimala_Jacob @Vojtech_Huser @zhuk

That was a great read, @Alexdavv. I appreciate your attention to standardization by finding something that will work across networks. I realize that the example in question is about smoking, but I wonder if the OMOP CDM should have a standard set of age categories to which all those ‘source’ responses about age could be mapped (when did you stop drinking, when did you first have sex, when did you get married, when did you retire, etc.). Digging into this, I found a report on how age data is collected in certain circumstances, and it looks pretty scattered, but you can see it here. So maybe there’s no one-size-fits-all for all age-data collection situations, but I do think it’s worth trying to get to a standard set (possibly a standard per context). I’m not sure I have a great answer for you, but I wonder if we should standardize on an age classification (infant, child, young adult, adult, older adult) so that local cultures can map their interpretation of each of these classes to what it means in their local healthcare society. What I mean is: 18 might be treated (medically) as an adult in one nation, but 21 is the threshold in another. I’m not sure whether the exact biological age is the important factor when understanding medical care, or how the patient was treated based on an age classification (I would love to hear from experts about how that’s done).

So, I’m sorry, I don’t have much to add here, but I wanted to thank you for your clear explanation and hard work.


Thanks @Alexdavv! Great effort to set up this proposal and outline all the open issues.

I do want to advocate for a pre-coordination approach. Agreed, it is not as nice and well thought through as your proposal, but it does not require any CDM or vocabulary modifications. In the example below, we capture the source values and allow the user to use the source vocabularies to find a particular observation.

Let’s say we want to capture the question ‘Alcohol use’ with answer ‘Frequent drinker’; in the source data this is represented as field 2001 and value 3.

field                          value
observation_concept_id         4052351 - Alcohol Intake
observation_source_value       “2001|3”
observation_source_concept_id  9876543 - “Alcohol Use, Frequent drinker”
value_as_concept_id            4322298 - Frequent

For the numeric, free-text and date values, we would use the respective value_as_ fields. The observation_source_value in these cases would just be the field, because we can’t pre-coordinate all the values.
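A minimal sketch of that pre-coordination logic could look like this. The `PAIR_CONCEPTS` entry reproduces the alcohol example above; the `FIELD_CONCEPTS` lookup, field 3001, its concept IDs and the `build_observation` function are hypothetical and only there to show the non-categorical branch.

```python
from datetime import date
from typing import Any, Dict, Tuple, Union

# "field|value" -> (source concept, Standard question concept, Standard answer concept).
# The single entry is the alcohol example; a real lookup would come from the vocabulary.
PAIR_CONCEPTS: Dict[str, Tuple[int, int, int]] = {
    "2001|3": (9876543, 4052351, 4322298),
}

# field -> (source concept, Standard question concept) for non-categorical answers.
# Hypothetical field and IDs, for illustration only.
FIELD_CONCEPTS: Dict[str, Tuple[int, int]] = {
    "3001": (9000001, 9000002),
}


def build_observation(field: str, value: Union[int, float, str, date]) -> Dict[str, Any]:
    key = f"{field}|{value}"
    if key in PAIR_CONCEPTS:
        # Categorical answer: the source value is the pre-coordinated pair.
        source_id, question_id, answer_id = PAIR_CONCEPTS[key]
        return {
            "observation_concept_id": question_id,
            "observation_source_value": key,
            "observation_source_concept_id": source_id,
            "value_as_concept_id": answer_id,
        }
    # Numeric, free-text or date answer: the source value is just the field.
    source_id, question_id = FIELD_CONCEPTS.get(field, (0, 0))
    row: Dict[str, Any] = {
        "observation_concept_id": question_id,
        "observation_source_value": field,
        "observation_source_concept_id": source_id,
    }
    if isinstance(value, date):
        row["value_as_datetime"] = value
    elif isinstance(value, (int, float)):
        row["value_as_number"] = value
    else:
        row["value_as_string"] = value
    return row


print(build_observation("2001", 3)["observation_source_value"])     # 2001|3
print(build_observation("3001", 12.5)["observation_source_value"])  # 3001
```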


Well, you pre-coordinate for the source (even though it’s a question-answer pair), while you post-coordinate in the Standard target (even though OMOP wants the facts).

What if you have a fancy set of pre-coordinated alcohol consumption facts/concepts? The same as we proposed for smoking. You don’t want to move all the extra garbage to the target/Standard side, right? So why would you need to keep the survey Standard?
It’s already a problem when you try to map alcohol consumption to a proper target: look how many Athena has. What if we add a couple more Survey vocabs soon?

The survey is not the only case where these additional fields are useful. Complex oncology, survey and lab test data, and all the stuff around the MAPPING table, require us to re-evaluate the definition of the source data fact:

  • Is it still a single entity or a concatenated combination of 2-3 of them?
  • Or is it always a set of specific entities, so that we do not concatenate but JOIN using several fields at once? Here is an example enriched with UKB mappings (mappings of questions to themselves are in red).

So I feel we’re due for some changes around it anyway.


We should wrap this up and get it to the CDM WG discussion. However, there are a bunch of issues intertwined here that we may wish to separate:

Issue 1: Question-Answer pairs, consolidation of questions and answers, and pre-coordination
Issue 2: Domains of survey data, where do they go, and are they worth standardizing

And there are the issues that we have been discussing, for which I would suggest we continue those Forum debates.

Issue 3: The new wide mapping table for complex mappings.
Issue 4: Common issue standards like smoking and drinking.
Issue 5: History of facts and what’s important to represent and standardize.

So, let’s continue with issues 1 and 2 here.

As followup to the UKB working group meeting today, a good test category is “mental health” (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=136). The results of these self-assessments are of high interest to researchers.

Surveys (and other EAV-type data, like registries and clinical trials) are defined as question-answer (sometimes variable-value) pairs. We have 30k questions and 40k answers today in the Vocabularies. But that very construct causes trouble for our way of standardized representation of facts and querying them, because:

  • They are spread over Measurement and Observation domains (questions) and Drug, Meas Value, Observation, Procedure domains (answers).
  • They mostly are defined locally, and not independently, but there are some standard ones.
  • The separation into question and answer is arbitrary. For example, among the existing survey concepts there are 91 questions and 102 answers containing the word “Diabetes”, usually in the context of current, historic, family, treatment or complication.
  • Questions, but especially answers are highly repetitive. For example, there are 2941 questions with one possible answer “No”, to be picked from three different “No” concepts. There are 12 different answer concepts for “White” and “Black”. Do these mean the same or are they all of a different meaning?
  • There is a ton of the usual junk like flavors of null, “Other”, “Obsolete”, etc.
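Counts like the ones above can be reproduced with a simple grouping over concept names. Here is a minimal sketch on synthetic data; the concept IDs, names and the `find_duplicate_names` helper are made up for illustration, and a real check would read from the CONCEPT table instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_names(concepts: Iterable[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group concept IDs by normalized name and keep names used more than once."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for concept_id, name in concepts:
        groups[name.strip().casefold()].append(concept_id)
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


# Synthetic answer concepts, not real concept IDs.
answers = [
    (1, "No"),
    (2, "no"),
    (3, "No "),
    (4, "Yes"),
    (5, "White"),
    (6, "white"),
]
print(find_duplicate_names(answers))  # {'no': [1, 2, 3], 'white': [5, 6]}
```

Matching on the name only finds candidates; whether two “No” concepts really mean the same still depends on the question they belong to.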

It’s a mess. ATLAS really cannot use them. So, the question is now what do we want to do with them:

  1. Keep adding questions and answers as standard concepts (hopefully with some domain cleanup)
  2. Clean them up, consolidate identical answer concepts and pre-coordinate them with the questions
  3. Create non-standard or 2B concepts, so they can be used for local querying
  4. Kick them out of the Vocabularies completely and have local non-CDM tables for them

I don’t have a strong preference, but if I had to pick I would do the second. That way, at least we don’t have ambiguity in the meaning of things. 3) would require the addition of a source_value_concept_id to the Observation table.

But what I really would recommend is a good mapping to proper domain concepts. For example, resolve the LOINC question-answer pair Other health condition- Med: Diabetes mellitus to the Condition concept Diabetes mellitus. Only then, the information in the surveys becomes queryable in a standardized way, like in ATLAS.

However, there is a but: It’s a lot of work, and who is going to do it? And what if we cannot find an existing standard concept, like for the ones @mcantor2 wants? Are we going to create new ones?


To some extent, as with every other source vocabulary we’ve tackled.
But the real reason why we can’t…

…is thousands of duplicates - when every distinct piece of valuable information is recorded in many various ways.

It’s even worse than keeping the general (non-EAV-type) vocabularies all Standard: there’s no way to build a cohort out of UKB, PPI, NAACCR, and CAP. You simply can’t know all the underlying stuff. While it’s quite possible with ICD, MedDRA, or Read, agree?

Not really an option, since pre-coordination is only a way to represent the combinations. You still have the same number of combinations to deal with.

Not as simple as that. Half of the EAV-type data carries the real meaning in the answer, while the other half carries it in the question.
So we have to consolidate twice. And then dedup on the Q-A vs A-Q principle.

And even if we did this huge job, how would these very specific “ever in your life / never ever / past history / at what age / in childhood / within the last x years, etc.” be useful?

Ok, we can dedup them against each other, but they’re still counterparts of the real OMOP “History of… / Condition” instances.

What we really need to do is to map them to the general OMOP-like instances.

Sounds good, but requires us to take the actions mentioned above (new fields, new survey data convention). And not only for the Observation table. I think we might want to follow the link from the source EAV-type to the target condition_occurrence record.

2B might be the case for vocabularies that are not used across multiple institutions, while others may become part of OMOP once that is useful for the community.

Well. What would be the CDM part then for many EAV-type sources? The Person table?

How about this:

  1. Introduce the _value_source_concept_id and _value_source_value fields to the required tables. Make these fields visible to ATLAS.
  2. Synchronize the field list between the wide mapping tables and the event tables. Get rid of concatenated codes/values in ETL and make JOINs on multiple fields instead.
  3. Change the convention and start treating EAV data (including Survey data) as non-Standard source data within a new Domain. Consistently de-standardize it in the vocabularies (including LOINC, PPI, UKB, NAACCR, etc.). Ask users to target the _source fields in their specific queries, but using the concept_ids, not the source_values.
  4. On a use-case basis, start mapping EAV data to OMOP Standard instances. Introduce new concepts, as for Smoking or Lab tests (or vocabularies, like Cancer Modifier), if needed.
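As a rough illustration of step 2, here is a sketch of a JOIN on two fields instead of a concatenated code, run against an in-memory SQLite database. The table layouts, the value_source_concept_id column, the shape of the mapping table and all concept IDs are assumptions for the example; none of them are an agreed CDM design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation (
    observation_id                INTEGER PRIMARY KEY,
    person_id                     INTEGER,
    observation_source_concept_id INTEGER,  -- question (source concept)
    value_source_concept_id       INTEGER   -- answer (hypothetical field)
);
CREATE TABLE mapping (
    source_concept_id       INTEGER,  -- question
    value_source_concept_id INTEGER,  -- answer
    target_concept_id       INTEGER,
    target_domain_id        TEXT
);
-- Synthetic IDs: question 900001 with answer 900011 means a diabetes diagnosis.
INSERT INTO observation VALUES (1, 10, 900001, 900011);
INSERT INTO observation VALUES (2, 11, 900001, 900012);
INSERT INTO observation VALUES (3, 12, 900002, 900011);
INSERT INTO mapping VALUES (900001, 900011, 800001, 'Condition');
""")


def map_survey_rows(connection: sqlite3.Connection):
    """Resolve question-answer pairs to Standard targets by joining on both fields."""
    return connection.execute("""
        SELECT o.person_id, m.target_concept_id, m.target_domain_id
        FROM observation AS o
        JOIN mapping AS m
          ON m.source_concept_id = o.observation_source_concept_id
         AND m.value_source_concept_id = o.value_source_concept_id
        ORDER BY o.person_id
    """).fetchall()


print(map_survey_rows(conn))  # [(10, 800001, 'Condition')]
```

Person 12 gave the same answer concept to a different question and is not mapped, which is why the join needs both fields and why the answer concept alone is not enough.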

That is a tricky question. But what if we change the concept of the “source data” a little bit:

  • if the data is dataset-specific, we can make it OMOPed and queryable by using the source_concept_ids (2B or in Athena, see the rule above);
  • once we identify the same data in multiple sources and a need to query it, it becomes a Standard within a well-curated vocabulary or OMOP-generated stuff.