
Deriving vocabulary for person table

Hi All

Just started learning the OMOP model… I am trying to understand the vocabulary for the Person table.
Based on the synpuf files in the OHDSI virtual box, I am trying to explore and understand the person table.

I am unable to figure out how to derive the data for the columns ethnicity_source_concept_id, race_source_concept_id, and gender_source_concept_id.

Thanks for your help in advance


@dirkgibs *source_concept_id stores the source data as it appears in your vocabulary. It is not mandatory, so you can use standard concepts for race, ethnicity, and gender and populate *source_value with the entities you have in your data (e.g. gender_source_value = ‘F’ and gender_concept_id = 8532 for a female patient).


Hate to say that, but have you looked at the documentation? It’s all in there. No need to reverse engineer from the VM box.


Unfortunately, I experienced the same confusion. Even after reading the documentation, I still didn’t understand exactly what it meant. For instance for gender_source_concept_id: “A foreign key to the gender concept that refers to the code used in the source.” This may seem clear to you, but not to me.

Eventually, I decided it means the following:

  1. IF your source system stores gender data according to some OMOP-recognized vocabulary, gender_source_concept_id should hold the concept_id for that value.
    1a. IF that concept is a standard concept, the gender_source_concept_id and the gender_concept_id will be the same.
    1b. IF that concept is NOT a standard concept, but one which maps to a standard concept, then the gender_source_concept_id and gender_concept_id will be different. The standard concept_id goes into gender_concept_id, and the non-standard concept_id goes into gender_source_concept_id.

  2. BUT IF #1 is false, then there is no gender_source_concept_id. Therefore, you should put 0 (zero) in it.

For example, the Standard OMOP concepts for gender are:

  male = code: M concept_id: 8507
  female = code: F concept_id: 8532

1a. If my source system actually stores these code values (M and F) to represent male and female, then 8507 or 8532 will go in both the gender_source_concept_id and gender_concept_id.

2. If my source system stores some other codes (say, 1 and 2) to represent male and female, then gender_concept_id would have either 8507 or 8532, but gender_source_concept_id would have a value of 0 (zero).

Case 1b is hypothetical here, because there are no non-standard gender vocabularies in OMOP.
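To make the three cases concrete, here is a minimal sketch of the rules as I read them. The `Concept` tuple, the `resolve_gender` function, the `etl_map` argument, and the concept_ids above 2 billion are all made up for illustration; only 8507 and 8532 come from the OMOP vocabulary.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Concept(NamedTuple):
    """Hypothetical stand-in for a row of the CONCEPT table."""
    concept_id: int
    is_standard: bool
    maps_to: Optional[int] = None  # standard concept_id, for non-standard concepts


def resolve_gender(
    source_code: str,
    vocabulary: Dict[str, Concept],
    etl_map: Dict[str, int],
) -> Tuple[int, int]:
    """Return (gender_concept_id, gender_source_concept_id) for one source code."""
    concept = vocabulary.get(source_code)
    if concept is None:
        # Rule 2: the source code is not part of any recognized vocabulary,
        # so the standard concept comes from a hand-written ETL mapping
        # and the source concept is 0.
        return etl_map.get(source_code, 0), 0
    if concept.is_standard:
        # Rule 1a: the source already uses the standard concept.
        return concept.concept_id, concept.concept_id
    # Rule 1b: a non-standard concept that maps to a standard one.
    return concept.maps_to or 0, concept.concept_id


# Rule 1a: source data coded with the standard OMOP gender concepts.
omop_gender = {"M": Concept(8507, True), "F": Concept(8532, True)}

# Rule 1b: a hypothetical non-standard vocabulary (made-up concept_ids).
local_gender = {
    "F": Concept(2_000_000_001, False, maps_to=8532),
    "M": Concept(2_000_000_002, False, maps_to=8507),
}

# Rule 2: codes 1 and 2 with no vocabulary behind them, mapped by hand.
hardwired = {"1": 8507, "2": 8532}

print(resolve_gender("F", omop_gender, {}))   # (8532, 8532)
print(resolve_gender("F", local_gender, {}))  # (8532, 2000000001)
print(resolve_gender("2", {}, hardwired))     # (8532, 0)
```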

Christian, is this essentially correct? Or am I misunderstanding?

Do you have a good way to explain it succinctly?

Gender is a bad example, @roger.carlson. The reason is that the two genders nominally constitute a vocabulary, but in practice people hardwire into their ETL whichever way they are represented in the source.

But overall you are correct. In theory there could be a source vocabulary “Roger’s Genders” with the entries “F” and “M” (your 1b). These would be represented in the gender_source_concept_id. They’d map to the standard OMOP concepts 8532 (female) and 8507 (male), which would go into the gender_concept_id because they are the standard concepts in the Gender domain. If your data use the same vocabulary as OMOP and use 8532 and 8507 (1a), both gender_concept_id and gender_source_concept_id would contain the same values.

For 2): Correct. And you would come to the vocabulary team and request the addition of this new gender vocabulary. Or you would build it yourself using concept_ids > 2 billion.
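As a rough sketch of what building it yourself could look like: the snippet below generates CONCEPT rows with concept_ids above 2 billion and matching "Maps to" rows for CONCEPT_RELATIONSHIP. The function name, the vocabulary_id, the concept_class_id, and the way ids are assigned are assumptions for illustration, not an official procedure, and only a subset of each table's columns is shown.

```python
from typing import Dict, List, Tuple

LOCAL_ID_START = 2_000_000_000  # locally built concepts use ids above this value


def build_local_vocabulary(
    vocabulary_id: str,
    code_to_standard: Dict[str, int],
    domain_id: str = "Gender",
) -> Tuple[List[dict], List[dict]]:
    """Create non-standard concept rows plus 'Maps to' relationship rows."""
    concepts: List[dict] = []
    relationships: List[dict] = []
    # Sort the codes so the same input always yields the same concept_ids.
    for offset, code in enumerate(sorted(code_to_standard), start=1):
        local_id = LOCAL_ID_START + offset
        concepts.append({
            "concept_id": local_id,
            "concept_name": f"{vocabulary_id}: {code}",
            "domain_id": domain_id,
            "vocabulary_id": vocabulary_id,
            "concept_class_id": "Gender",  # assumed class for this sketch
            "standard_concept": None,      # None marks a non-standard concept
            "concept_code": code,
        })
        relationships.append({
            "concept_id_1": local_id,
            "concept_id_2": code_to_standard[code],
            "relationship_id": "Maps to",
        })
    return concepts, relationships


# Hypothetical source vocabulary with entries "F" and "M".
concepts, relationships = build_local_vocabulary(
    "Rogers Genders", {"F": 8532, "M": 8507}
)
for row in relationships:
    print(row["concept_id_1"], "Maps to", row["concept_id_2"])
```

With rows like these in place, gender_source_concept_id would hold the local id and gender_concept_id the standard concept it maps to.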

Do you have a good way to explain it succinctly?

Ha! No I don’t. What I wrote above is about as succinct as I can get, and I’m not sure it will be any more understandable to someone who doesn’t already understand it. Perhaps it would be useful to have a link to a more verbose explanation with a few examples.

Gender Data:

Well, I went ahead and mapped gender, ethnicity, and race for the person table. It gave me experience using USAGI and the source_to_concept_map table. I did not, however, create my own concept_ids.

The source data I’m working with contains “M” and “F” to represent male and female.

My initial thought is the gender_source_value would contain the “M” or “F” values from the source data, gender_concept_id would contain 8507 for “M” and 8532 for “F”, and gender_source_concept_id would be populated with 0 since there is no source code for the values.

However, based on the previous comment above, it seems gender_source_concept_id AND gender_concept_id would be both populated with 8507 or 8532.

Which one should I follow?

gender_source_value should contain “M” or “F” values from the source data
gender_concept_id should contain 8507 for “M” and 8532 for “F”
gender_source_concept_id should be populated with 0 since there is no source code for the values
You had it correct the first time.
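In ETL terms, that first approach might look roughly like the sketch below. The function and constant names are made up, and falling back to concept_id 0 for a value other than “M” or “F” is my assumption, not something stated in this thread.

```python
# Standard OMOP gender concepts, as given earlier in the thread.
GENDER_CONCEPT_IDS = {"M": 8507, "F": 8532}


def gender_columns(source_value: str) -> dict:
    """Build the three gender columns of a person row from a raw source value."""
    return {
        # Keep the raw value exactly as it appears in the source data.
        "gender_source_value": source_value,
        # Hardwired mapping to the standard concept; 0 if unmapped (assumption).
        "gender_concept_id": GENDER_CONCEPT_IDS.get(source_value, 0),
        # No source vocabulary concept exists for these codes, so this is 0.
        "gender_source_concept_id": 0,
    }


for value in ["M", "F"]:
    print(gender_columns(value))
```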


For future reference: this has now been more clearly documented on the OMOP CDM conventions page dataModelConventions.knit
