Survey data mapping - Question and Response pairs

Hello Everyone,

We have survey data that has to be transformed into the OMOP CDM format. I have given examples below to make the problem easier to understand.

Example 1

Survey Question - How often do you drink alcohol? (Concept_Id = 40771103 with class = “Survey” and domain = “Observation”)

User Response - 3 days a week (not found in Athena)

Example 2

Survey Question - What is your annual income? (not found in Athena. PS - I am looking for the individual’s annual income, not family income)

User Response - Annual Income: less 10k (Concept_Id = 1585376 with class = “Answer” and Domain = “Observation”)


  1. As the examples above show, I cannot always find the right concepts for questions/responses in Athena. What should I do in that case? Can we add the missing questions and responses as concepts in our Concept table with concept_id > 2B? My understanding is that I should just make an entry (using an Insert Into query) in my Concept table. Is that right? Is there anything I should do with the Source_to_Concept map table? Can you please guide us on mapping these survey items to the CDM format?

  2. In @ericaVoss @Patrick_Ryan @margaret 's work on survey data based on the NHANES dataset (http://www.ohdsi.org/web/wiki/lib/exe/fetch.php?media=resources:using_the_omop_cdm_with_survey_and_registry_data_v6.0.pdf), I see that the response “Nearly every day” is mapped to Concept_id = 45882010, under the “Answer” class and “Meas_value” domain. Can you please help me understand why it is not mapped to Concept_id = 763699, under the “Qualifier Value” class and “observation” domain? The latter seems to indicate the frequency. I might be totally wrong here. I have only recently started to learn OMOP and the medical terms, so please correct my understanding.

  3. a) Is there any specific class and domain I should be looking into for survey questions?
    b) Is there any specific class and domain I should be looking into for survey responses?

  4. Am I right to think that the “observation” table will be the largest table (in terms of size) compared to the other CDM tables, as it has 40 records per person, considering each patient has to answer 20 survey questions (20 records with different columns for questions (observation_concept_id) and responses (value_as_concept_id))?

  5. Is there anything else that I should be aware of when converting survey data to CDM data?

  6. Isn’t adding missing concepts time consuming? Surveys can vary a lot between regions/groups. Is adding the concepts the only way?

Once we learn how to do this for one table, I believe it will be easy for us to do the rest without any issues. We would appreciate your support.


Hello Selva,

Yes, exactly. This is actually quite common, as the OMOP vocabularies cannot contain every single entity in the world, so sometimes you need to add your own when you can’t find anything close in Athena.

I’d propose the following way:

  1. Add your >2B concept to the Concept table. You can come up with any vocabulary_id as long as it doesn’t clash with any of the OMOP standard vocabularies. The new concept is considered standard, valid from 1970-01-01 00:00:00 to 2099-12-31 00:00:00, and its invalid_reason is NULL.
  2. Add “Maps to” and “Mapped from” relationships in the Concept Relationship table (that is, two records for each new concept), where both concept_id_1 and concept_id_2 are equal to the new >2B concept_id.
  3. Add a record to Concept Ancestor table, with ancestor_concept_id and descendant_concept_id both equal to >2B concept_id.
  4. Now you can map the source codes to your new concept in Source to Concept Map table in the same way you’d map them to any other standard concept.

As this may seem complicated, here’s an example from survey data:

  • “Number of adults in household” survey question cannot be found in Athena. Creating a new concept:
    concept_id = 2000000001
    concept_name = “Number of Adults in household”
    domain_id = “Observation”
    vocabulary_id = “Survey_Example”
    concept_class_id = “Survey”
    standard_concept = “S”
    concept_code = “num_adults” (as it appears in the source data)
    valid_start_date = “1970-01-01 00:00:00”
    valid_end_date = “2099-12-31 00:00:00”
    invalid_reason = NULL

  • Add two records to Concept Relationship:
    2000000001,2000000001,“Maps to”,“1970-01-01 00:00:00”,“2099-12-31 00:00:00”,NULL
    2000000001,2000000001,“Mapped from”,“1970-01-01 00:00:00”,“2099-12-31 00:00:00”,NULL

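  • Add a record to Concept Ancestor (ancestor_concept_id, descendant_concept_id, then min and max levels of separation, which are 0 for a self-reference):
    2000000001,2000000001,0,0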

  • Map the source value to the new concept in STCM:
    source_code = “num_adults”
    source_concept_id = 0
    target_concept_id = 2000000001
    target_vocabulary_id = “Survey_Example”
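
The four steps above can be sketched end to end. This is only a minimal illustration using Python's built-in sqlite3 with an in-memory database; the table definitions are simplified subsets of the real vocabulary tables (an assumption made to keep it short), and the date strings drop the time part for brevity.

```python
import sqlite3

CUSTOM_ID = 2000000001            # any id above 2,000,000,000
START, END = "1970-01-01", "2099-12-31"

conn = sqlite3.connect(":memory:")
# Simplified table layouts (assumption): the real CDM DDL has more columns and constraints.
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY, concept_name TEXT, domain_id TEXT,
    vocabulary_id TEXT, concept_class_id TEXT, standard_concept TEXT,
    concept_code TEXT, valid_start_date TEXT, valid_end_date TEXT,
    invalid_reason TEXT);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT,
    valid_start_date TEXT, valid_end_date TEXT, invalid_reason TEXT);
CREATE TABLE concept_ancestor (
    ancestor_concept_id INTEGER, descendant_concept_id INTEGER,
    min_levels_of_separation INTEGER, max_levels_of_separation INTEGER);
CREATE TABLE source_to_concept_map (
    source_code TEXT, source_concept_id INTEGER,
    target_concept_id INTEGER, target_vocabulary_id TEXT);
""")

with conn:  # commit all four steps together
    # Step 1: the custom standard concept
    conn.execute(
        "INSERT INTO concept VALUES (?,?,?,?,?,?,?,?,?,?)",
        (CUSTOM_ID, "Number of Adults in household", "Observation",
         "Survey_Example", "Survey", "S", "num_adults", START, END, None))
    # Step 2: self-referencing "Maps to" / "Mapped from" relationships
    conn.executemany(
        "INSERT INTO concept_relationship VALUES (?,?,?,?,?,?)",
        [(CUSTOM_ID, CUSTOM_ID, rel, START, END, None)
         for rel in ("Maps to", "Mapped from")])
    # Step 3: self-referencing ancestor record (0 levels of separation)
    conn.execute("INSERT INTO concept_ancestor VALUES (?,?,0,0)",
                 (CUSTOM_ID, CUSTOM_ID))
    # Step 4: map the source code to the new concept
    conn.execute("INSERT INTO source_to_concept_map VALUES (?,?,?,?)",
                 ("num_adults", 0, CUSTOM_ID, "Survey_Example"))

# Resolve a source code to its target concept, as an ETL lookup would.
row = conn.execute("""
    SELECT c.concept_id, c.concept_name
    FROM source_to_concept_map m
    JOIN concept c ON c.concept_id = m.target_concept_id
    WHERE m.source_code = ?""", ("num_adults",)).fetchone()
print(row)
```

Against a real CDM database the same INSERT statements would apply, with the full column lists of your CDM version.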

It kinda is, but it’s quite easy to automate the process. Also, make sure you’ve searched thoroughly: sometimes it helps to filter the concepts in Athena by class/domain/vocabulary and just look through the whole list, because the same survey question (by meaning) can be expressed in totally different words.
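
As a rough idea of what the automation could look like, here is a small sketch that generates the rows for all four tables from a list of unmapped source items. The function name `build_custom_rows`, the second source item `num_children`, and the row layouts are all made up for illustration; adapt them to your own ETL.

```python
from itertools import count

FIRST_CUSTOM_ID = 2_000_000_001
START, END = "1970-01-01", "2099-12-31"

def build_custom_rows(source_items, vocabulary_id="Survey_Example",
                      start_id=FIRST_CUSTOM_ID):
    """Build rows for the four vocabulary tables.

    source_items: iterable of (source_code, concept_name) pairs that
    could not be mapped to a standard concept.
    """
    rows = {"concept": [], "concept_relationship": [],
            "concept_ancestor": [], "source_to_concept_map": []}
    for (code, name), cid in zip(source_items, count(start_id)):
        rows["concept"].append(
            (cid, name, "Observation", vocabulary_id, "Survey", "S",
             code, START, END, None))
        for rel in ("Maps to", "Mapped from"):
            rows["concept_relationship"].append(
                (cid, cid, rel, START, END, None))
        rows["concept_ancestor"].append((cid, cid, 0, 0))
        rows["source_to_concept_map"].append((code, 0, cid, vocabulary_id))
    return rows

rows = build_custom_rows([
    ("num_adults", "Number of Adults in household"),
    ("num_children", "Number of children in household"),  # hypothetical item
])
for table, table_rows in rows.items():
    print(table, len(table_rows))
```

The resulting tuples can be passed to `executemany` or written to CSV files for loading.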

Thanks @rookie_crewkie for explaining the procedure in detail with an example. This certainly helps. I will try this and reach out to you if I have any issues. Thanks a ton.

Hello @rookie_crewkie - I have a quick question. Do you foresee any issues if I create concepts for all my survey questions? With the searches I did in Athena for survey questions and responses, I could only find matches for 5-10% of the records in my source data. As I am automating this, I thought I would just do it for the complete source data instead of relying on Athena for that 5-10%. Will this create any issues? What are the drawbacks?

We are interested in getting accurate questions and answers. If I use Athena, I have to settle for a 60-70/80% match.

Looking forward to hearing from you.


Hello Selva,

Maybe it was my fault to incautiously use the “quite often” term above :slight_smile:

The main purpose of standardized OMOP vocabularies is to bring myriads of entities from various medical coding systems together and establish semantic connections between them. This is the hard job done by OHDSI Vocabulary team, and other community members benefit from it – having a single, well-maintained catalogue of medical terms means that a concept_id from your CDM instance will have the same meaning for every other researcher. This also means that community-created tools (e.g. ATLAS) rely on standardized vocabularies.

But in the real world, there are things in source data that do not fit nicely into the standardized vocabulary model, so it’s acceptable to create your own “additions” to the list of OHDSI-provided concepts. At the same time, this imposes a requirement to ship such additional concepts along with the dataset; otherwise the concepts you’ve created won’t have any meaning to anyone else.
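
One possible way to ship the additions is to export every concept above the 2B threshold into a file that travels with the dataset. The sketch below is only an illustration: the helper `export_custom_concepts` and the reduced set of columns are assumptions, and the first row is a placeholder for an OHDSI-provided concept.

```python
import csv
import io

CUSTOM_ID_THRESHOLD = 2_000_000_000
FIELDS = ["concept_id", "concept_name", "vocabulary_id"]

def export_custom_concepts(concepts, out):
    """Write only the locally added (>2B) concepts to `out` as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    custom = [c for c in concepts if c["concept_id"] > CUSTOM_ID_THRESHOLD]
    writer.writerows(custom)
    return len(custom)

concepts = [
    # Placeholder row standing in for a standard concept from Athena
    {"concept_id": 40771103, "concept_name": "(standard concept)",
     "vocabulary_id": "(OHDSI-provided)"},
    {"concept_id": 2000000001, "concept_name": "Number of Adults in household",
     "vocabulary_id": "Survey_Example"},
]

buffer = io.StringIO()  # stands in for a file opened next to the dataset
written = export_custom_concepts(concepts, buffer)
print(buffer.getvalue())
```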

So for the reasons described above, I’d say the best practice is to try to map as much as you can to the standard concepts. They can have different wording, be more general or more specific, or have different units/quantities (this applies mostly to Drugs/Devices), but in the majority of cases it’s feasible. If all else fails, map the source code to your own concept_id, and don’t forget to keep your vocabulary additions close to the dataset so that it can still be deciphered.
It is definitely tempting to just make everything “custom mapped” and finish the ETL in 5 minutes, but why use the CDM at all then?
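
The "standard first, custom only as a fallback" order can be expressed as a simple lookup. This is just a sketch: the source codes `alcohol_freq` and `shoe_size` and the function `resolve_concept` are hypothetical, and the two dictionaries stand in for whatever mapping tables you maintain.

```python
UNMAPPED = 0  # concept_id 0 is the CDM convention for "no matching concept"

def resolve_concept(source_code, standard_map, custom_map):
    """Prefer a standard concept; fall back to a custom (>2B) one."""
    if source_code in standard_map:
        return standard_map[source_code], "standard"
    if source_code in custom_map:
        return custom_map[source_code], "custom"
    return UNMAPPED, "unmapped"

standard_map = {"alcohol_freq": 40771103}   # hypothetical source code
custom_map = {"num_adults": 2000000001}

results = {code: resolve_concept(code, standard_map, custom_map)
           for code in ("alcohol_freq", "num_adults", "shoe_size")}

# Share of codes that needed a custom concept; worth keeping an eye on.
custom_share = sum(origin == "custom"
                   for _, origin in results.values()) / len(results)
print(results, round(custom_share, 2))
```

Tracking the custom share makes it visible when an ETL drifts toward "everything custom mapped".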

Thanks for the response @rookie_crewkie, I appreciate it. I understand the use of the CDM, but do you think mapping raw source terms to Athena terms with 60-70% accuracy is fine? I am thinking about the analysis phase. Let’s say we have a question:

How many hours do you run weekly? (source term)

How many hours do you run (daily) ? (Athena term)

Please note that this is just an example that I could think of immediately. A valid concept for the source term might exist in Athena.

In the above case, if I match the two terms, the match might be 80-90% accurate, but the meaning is totally different. This was the case with my actual data. We saw <=50% accuracy and also felt the meaning was different while reviewing. Hence I reached out to you.
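
To illustrate the point, a plain character-level comparison of the two example terms gives a high score even though the meaning differs. This sketch uses Python's difflib, which is not the scoring that Usagi uses; it only shows that a textual similarity score says little about semantic equivalence.

```python
from difflib import SequenceMatcher

source_term = "How many hours do you run weekly?"
athena_term = "How many hours do you run (daily) ?"

# ratio() returns a value in [0, 1] based on matching character blocks
score = SequenceMatcher(None, source_term, athena_term).ratio()
print(f"textual similarity: {score:.2f}")

# The time unit is what changes the meaning, and a character-level
# score does not weigh it any more heavily than other characters.
same_unit = ("weekly" in source_term.lower()) == ("weekly" in athena_term.lower())
print("same time unit:", same_unit)
```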

So, will this sort of information loss impact our analysis when answering questions about patients? I understand that using concept_ids from Athena enables reproducibility of our study at another site as well. But will it not affect us while deriving insights?

As you might be aware, we have something called features in Machine Learning. Let’s say our Athena term “How many hours do you run (daily)” turns out to be a significant variable in determining our outcome. But it is actually “weekly running hours” that is influencing the outcome (as we have compromised on accuracy and mapped it anyway).

I am just thinking out loud and might be wrong, as I am quite new to OMOP. I haven’t explored the analysis part yet.

  1. In addition, we enter our custom concepts into tables like source_to_concept_map, concept_ancestor and concept_relationship only to follow the usual procedure for vocabulary terms. Irrespective of whether I enter the custom concepts in these tables, I will still be able to use the concepts as long as they are in the concept table. Am I right? We do this insertion in multiple tables only to help us search terms by parent_id, child_id, etc. This is not going to impact our analysis anyway. Can you correct my understanding?

  2. Usagi mapping returns matching concepts with a score of <=50% for all our raw data. Hence I was thinking of adding everything. Your inputs would help us proceed further.

  3. I also tried doing it manually. For example, one question is “When was the last documented fasting plasma insulin done?” but the closest matching terms in Athena are “When was the last occasion?” and “When were you last able to do this?”. Searching with “when” and “doc” together doesn’t help either.

Kindly share your insights and experience. Thanks a ton.