
Survey vocabularies in OMOP


(Alexander Davydov) #1

In the UK Biobank WG we discussed the problem of how to deal with Survey vocabularies. Here, we want to summarize the issues and discuss the possible solutions.

Looking at the initial proposal: while we are definitely moving in the right direction, we recognize that there are still more questions that need to be addressed in the long term.

In short, the plan was:

  • Across the different vocabularies, de-standardize identical questions and remap them to a single question (to avoid duplication).

  • Across the different vocabularies, de-standardize identical answers and remap them to a single answer (to avoid duplication).

  • Among a variety of others, leave one distinct question as a Standard concept to be used for constructing survey-like detailed cohorts for users that need to research with all the granularity of survey source data.

  • Use synthetic concatenated concepts or the new MAPPING table to deliver mappings to the Standard entities recognized as useful in OMOP, so that classic OMOP cohorts provide the value they’re designed for.

  • Introduce a new relationship_id ‘Has Standard answer’ to represent the list of the Standard answers for the entire Question.

The issues we found:

  • The questions from the various vocabularies provide very different context/peculiarities, so the mapping rate (questions to questions) is really low. Most of them are gonna be unique Standard concepts that will eventually overload the Standard survey vocabulary.

  • Every specific answer list shades the meaning of the question concept, so the boundaries have to be defined for every single concept.

  • The proposed modeling doesn’t address the problem that survey-like cohorts still require referring to the source survey concepts.

Here is an example (all the concepts below never existed in the mentioned vocabularies, they are synthetic and only created to reproduce the problem):

The concerns on the slide are:

  1. The ‘Has standard answer’ link is not really useful for non-Standard Question concepts since they’re not supposed to be used in cohorts. For Standard ones, it may be useful, but only in the Measurement Domain.
  2. If you don’t know the whole list of the possible answers used with a particular question concept, you can’t build universal survey-like cohorts anyway. Every time, you still need to jump back to the source vocabulary to check the answer list used (and the mappings of the non-standard answers).
  3. If we attach the whole list of answers derived from the many source vocabularies mapped to a particular Question concept, we’re gonna deal with ugly mixtures: year ranges (including conflicting ones) together with categories of different sorts/axes. Technically it may work on the vocabulary side (even though it requires a substantial effort to maintain), but it won’t be user-friendly. Moreover, for many of the Questions you get the full semantics of the concept only once you have looked into the list of its answers. Do we want to mix such things?

From here the vocabulary team has started thinking about an alternative solution:

The proposal we came up with is turning all the Survey Question/Answer concepts into non-standard ones. This means we’re still able to build the survey-like detailed cohorts, but using the source_concept_id only. Mapping to the Standard valuable facts will be provided as described above.

And here we come to the CDM limitations:

  1. To build question/answer pair-like cohorts in the _source_concept_id area, we would need a _value_source_concept_id field as well as a value_source_value field. The former doesn’t exist in OMOP; the latter was introduced in the Measurement table only.
    The addition of these fields will resolve the problem. Otherwise, we would need to create these ugly pre-coordinated question-answer pairs forever. From my point of view, the pre-coordination in the survey data is even worse (doesn’t represent the survey source structure, difficult to understand and maintain).

  2. When v6.0 was released (I checked this pdf), the survey_conduct convention described how the survey data should be stored. Now I can’t find anything about it on https://ohdsi.github.io/CommonDataModel/cdm60.html
    Is it outdated, placed somewhere else or just lost?

It said the following:

  1. Patient responses to survey questions are stored in the OBSERVATION table. Each record in the OBSERVATION table represents a single question/response pair and is linked to a specific SURVEY/questionnaire using OBSERVATION.DOMAIN_OCCURRENCE_ID and SURVEY.SURVEY_OCCURRENCE_ID.

  2. Each response record is the response to a specific question identified by the OBSERVATION_CONCEPT_ID. This concept ID is a unique question contained in the CONCEPT table.

  3. An individual survey question can have multiple responses to a question (e.g. which of these items relate to you, a, b, c). Each response is stored as a separate record in the OBSERVATION table. The name (question) is stored as OBSERVATION_CONCEPT_ID and the value (answer) is stored as OBSERVATION_AS_CONCEPT_ID where the answer is categorical and is defined as a concept in the concept table, OBSERVATION_AS_NUMBER where the answer is numeric, OBSERVATION_AS_STRING where the answer is a free text string or OBSERVATION_AS_DATETIME.

  4. The question / answer observation record is linked to the patient questionnaire used for collecting the data using two new fields in the OBSERVATION table; DOMAIN_ID and DOMAIN_OCCURRENCE_ID.
    DOMAIN_ID for any survey related observations contains the text Survey and DOMAIN_OCCURRENCE_ID contains the SURVEY_OCCURRENCE_ID of the specific survey. This domain construct can be used for other observation groupings.

  5. The OBSERVATION table can also store survey scoring results. Many validated PRO questionnaires have scoring algorithms (many of which proprietary) that return an overall patient score based on the answers provided.
    Survey scores are identified by their OBSERVATION_CONCEPT_ID and are linked back to the scored survey using the same DOMAIN construct described.
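
The quoted convention can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names follow the quoted text (OBSERVATION_AS_*), the class and helper names are hypothetical, and all IDs are placeholders above 2,000,000,000 (the range OMOP sets aside for local concepts), not real concepts.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SurveyObservation:
    observation_concept_id: int                       # the question
    domain_id: str                                    # "Survey" for survey responses
    domain_occurrence_id: int                         # SURVEY_OCCURRENCE_ID of the survey
    observation_as_concept_id: Optional[int] = None   # categorical answer
    observation_as_number: Optional[float] = None     # numeric answer
    observation_as_string: Optional[str] = None       # free-text answer
    observation_as_datetime: Optional[datetime] = None


def make_response(question_id: int, survey_occurrence_id: int,
                  answer, categorical: bool = False) -> SurveyObservation:
    """Build one question/response record, routing the answer by its type."""
    rec = SurveyObservation(question_id, "Survey", survey_occurrence_id)
    if categorical:
        rec.observation_as_concept_id = int(answer)
    elif isinstance(answer, datetime):
        rec.observation_as_datetime = answer
    elif isinstance(answer, (int, float)):
        rec.observation_as_number = float(answer)
    else:
        rec.observation_as_string = str(answer)
    return rec


def make_responses(question_id: int, survey_occurrence_id: int,
                   answers, categorical: bool = False) -> List[SurveyObservation]:
    """A multi-select question yields one record per selected answer."""
    return [make_response(question_id, survey_occurrence_id, a, categorical)
            for a in answers]


# Placeholder IDs only; none of these are real concepts.
multi = make_responses(2000000001, 77, [2000000011, 2000000012], categorical=True)
numeric = make_response(2000000002, 77, 12)
```

Both records in `multi` share the same question and the same survey occurrence, which is how point 3 of the convention represents a multi-select question.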

If we move the survey vocabularies into the non-Standard area, it requires an updated convention:

  • Question-answer pairs (being non-Standard concepts) will be stored in the source_concept_id and value_source_concept_id fields only. The observation_concept_id and value_as_concept_id fields will be used for Standard target concepts only.

  • The spec says that, among others, the possible answers in Survey data are numeric, free text or dates. So the observation_as_number, observation_as_string and observation_as_datetime fields are still “occupied” by the “source” Survey data and we can’t use them in the context of the entire observation_concept_id.
    We drafted a couple of examples to better understand two options:
    (a): store all the types of the survey answers in the value_source_value field only (concept_code, numeric value, datetime or string). Populate the value_source_concept_id field when the answer is a concept. The survey answers will not be sorted out into the observation_as_number, observation_as_string and observation_as_datetime fields. These fields will be used in the context of the observation_concept_id only.
    (b): always intentionally create a separate CDM record for the target standard Observation concept, so that the Survey “source” data is always captured as a separate CDM Observation record and answers can be sorted out into the observation_as_number, observation_as_string and observation_as_datetime fields. It’s a little bit against the basic principle of data transformation in OMOP. Would it be easy to implement on the ETL side? What would be the indicator for applying such a transformation, or for telling Survey source records from Standard records? A new ‘Survey’ Domain, being not a new table but a characteristic of the concept, may be a choice in the long run.
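
To make the difference between the two options concrete, here is a rough Python sketch of the records each would produce for one numeric answer. It is not an agreed convention: value_source_concept_id is the proposed field (it does not exist in the CDM), the function names and all IDs are hypothetical placeholders, and setting observation_concept_id to 0 on the source record in option (b) is my assumption about how an unmapped record would be marked.

```python
from datetime import date


def option_a(question_src_id, answer, target_id, target_value_id=None, answer_src_id=None):
    """Option (a): one record; the raw answer lives in value_source_value only."""
    return [{
        "observation_concept_id": target_id,             # Standard target
        "value_as_concept_id": target_value_id,          # Standard target value
        "observation_source_concept_id": question_src_id,
        "value_source_value": str(answer),
        "value_source_concept_id": answer_src_id,        # proposed field, not in the CDM
    }]


def option_b(question_src_id, answer, target_id, target_value_id=None, answer_src_id=None):
    """Option (b): a source record with typed answers plus a separate Standard record."""
    source = {
        "observation_concept_id": 0,                     # assumption: no Standard concept
        "observation_source_concept_id": question_src_id,
        "value_source_value": str(answer),
        "value_source_concept_id": answer_src_id,        # proposed field, not in the CDM
        "observation_as_number": None,
        "observation_as_string": None,
        "observation_as_datetime": None,
    }
    if answer_src_id is None:                            # not a coded answer: sort by type
        if isinstance(answer, date):                     # also covers datetime
            source["observation_as_datetime"] = answer
        elif isinstance(answer, (int, float)):
            source["observation_as_number"] = answer
        else:
            source["observation_as_string"] = str(answer)
    target = {
        "observation_concept_id": target_id,
        "value_as_concept_id": target_value_id,
    }
    return [source, target]


# Hypothetical question "year you stopped smoking" answered with 1975.
records_a = option_a(2000000001, 1975, 2000000101)
records_b = option_b(2000000001, 1975, 2000000101)
```

Option (a) keeps one record per fact but pushes every answer type through a string field; option (b) keeps typed answers but doubles the record count and needs some indicator to tell the two kinds of records apart.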

We would like to hear other ideas and get feedback from the community.
Tagging the working group and people involved in the survey data discussions before:
@Alexandra_Orlova @Andrew @anna_corning @aostropolets @Chris_Knoll @Christian_Reich @clairblacketer @cmkerr @ColinOrr @Daniel_Prieto @Dave.Barman @Dymshyts @ellayoung @ericaVoss @gregk @Josh_R @kyriakosschwarz @lee_evans @linikujp @MaximMoinat @mcantor2 @mik @mmandal @MPhilofsky @mvanzandt


(Alexander Davydov) #2

Well, since there are almost 4000 of us, let me tag some more than 25
@nick @nlw @parisni @Patrick_Ryan @QI_omop @rookie_crewkie @schillil @SCYou @SELVA_MUTHU_KUMARAN @spiros @Vimala_Jacob @Vojtech_Huser @zhuk


(Chris Knoll) #3

That was a great read, @Alexdavv. I appreciate your attention to standardization by finding something that will work across networks. I realize that the example in question is about smoking, but I wonder if the OMOP CDM should have a standard set of age categories to which all those ‘source’ responses about age could be mapped (when did you stop drinking, when did you first have sex, when did you get married, when did you retire, etc.). Digging into this, I found a report on how age data is collected in certain circumstances, and it looks pretty scattered, but you can see it here. So maybe there’s no one-size-fits-all for all age-data collection situations, but I do think it’s worth trying to get to a standard set (possibly a standard per context). I’m not sure I have a great answer for you, but I wonder if we should standardize on age classification (infant, child, young adult, adult, older adult) so that local cultures can map their interpretation of each of these classifications to what it means in their local healthcare society. What I mean is: 18 might be treated (medically) as an adult in one nation, but 21 is the threshold in another. I’m not sure whether the exact biological age is the important factor when understanding medical care, or how the patient was treated based on an age classification (would love to hear from experts about how that’s done).

So, I’m sorry, I don’t have much to add here, but I wanted to thank you for your clear explanation and hard work.


(Maxim Moinat) #4

Thanks @Alexdavv! Great effort to set up this proposal and outline all the open issues.

I do want to advocate for a pre-coordination approach. Agreed, it is not as nice and well thought through as your proposal, but it does not require any CDM or vocabulary modifications. In the example below, we capture the source values and allow the user to use the source vocabularies to find a particular observation.

Let’s say we want to capture the question ‘Alcohol use’ with answer ‘Frequent drinker’. In the source data this is represented as field 2001 and value 3.

field                          value
observation_concept_id         4052351 - Alcohol Intake
observation_source_value       “2001|3”
observation_source_concept_id  9876543 - “Alcohol Use, Frequent drinker”
value_as_concept_id            4322298 - Frequent

For the numeric, free-text and date values, we would use the respective value_as_ fields. The observation_source_value in these cases would just be the field, because we can’t pre-coordinate all the values in this case.
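
As a rough illustration of this approach, the Python sketch below builds the observation fields for one source (field, value) pair. The lookup table and function name are hypothetical, the only mapping in it is the alcohol example above, field 3000 in the usage lines is invented, and the mapping of a bare field to a question concept is left out for brevity.

```python
from datetime import date

# Pre-coordinated (field, value) pairs; only the example from this post is listed.
PRECOORDINATED = {
    ("2001", "3"): {
        "observation_concept_id": 4052351,         # Alcohol Intake
        "observation_source_concept_id": 9876543,  # "Alcohol Use, Frequent drinker"
        "value_as_concept_id": 4322298,            # Frequent
    },
}


def to_observation(field, value):
    """Return the observation fields for one source (field, value) pair."""
    key = (str(field), str(value))
    if key in PRECOORDINATED:
        # Categorical answer: the source value carries the question-answer pair.
        return {"observation_source_value": f"{field}|{value}", **PRECOORDINATED[key]}
    # Numeric, date or free-text answer: the source value is just the field.
    rec = {"observation_source_value": str(field)}
    if isinstance(value, date):                    # also covers datetime
        rec["value_as_datetime"] = value
    elif isinstance(value, (int, float)):
        rec["value_as_number"] = value
    else:
        rec["value_as_string"] = str(value)
    return rec


coded = to_observation(2001, 3)
numeric = to_observation(3000, 12.5)   # field 3000 is a made-up numeric question
```

A user can then search on observation_source_value or observation_source_concept_id to find the exact source question-answer pair.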


(Alexander Davydov) #5

Well, you pre-coordinate for the source (even though it’s a question-answer pair), while you post-coordinate in the Standard target (even though OMOP wants the facts).

What if you have a fancy set of pre-coordinated alcohol consumption facts/concepts? The same as we proposed for smoking. You don’t want to move all the extra garbage to the target/Standard side, right? So why would you need to keep the survey Standard?
It’s already a problem when you try to map alcohol consumption to a proper target: look how many candidates Athena has. What if we add a couple more Survey vocabs soon?

The survey is not the only case where these additional fields are useful. Complex oncology, survey and lab test data, and all the stuff around the MAPPING table, require us to re-evaluate the definition of the source data fact:

  • Is it still a single entity or concatenated combination of 2-3 of them?
  • Or is it always a set of specific entities, so that we don’t concatenate but JOIN on several fields at once? Here is an example enriched with UKB mappings (mappings of questions to themselves are in red).

So I feel we’re in for some changes around this anyway.

