
Mandatory fields in the PERSON table


The source data that I shall receive will only contain de-identified patient IDs, age relative to an admission date, and gender. However, in the PERSON table, several of the fields are documented as being required. I understand that to mean that these fields are mandatory. Examples are YEAR_OF_BIRTH and RACE_CONCEPT_ID.

Is it acceptable to set these fields to NULL when mapping the source data to the PERSON table, when source values do not exist?



year_of_birth: Can’t put null in there. Neither is 0 allowed. We don’t want all these “Jesusses” in the database. The age of a patient is so important to observational research that we have the convention to kick out patients without known age.

race_concept_id: You can put 0 in there, which means Unknown. But not null.

CDM documentation for the PERSON table specifies that the approximate year of birth should be “derived based on any age group categorization available” for data sources where the year of birth is not available. http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:person

Are there any sources that explain how approximate year of birth can be derived?


Some sources anonymize data by removing the birth date and instead giving you an age bucket. If you have one of those, you would create a fictitious birth year, putting the person exactly in the middle of the bucket. If you don’t have either, you have a problem. Do you have data without any birthday information?
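As a minimal sketch of the midpoint idea above, in Python: the function name, the `reference_year` parameter, and the assumption that buckets have inclusive integer bounds (such as 40-49) are illustrative and not part of any CDM convention.

```python
def birth_year_from_age_bucket(age_low: int, age_high: int, reference_year: int) -> int:
    """Estimate a birth year from an inclusive age bucket such as 40-49.

    Assumption: the bucket covers ages from age_low up to (but not
    including) age_high + 1, so its midpoint is (age_low + age_high + 1) / 2,
    rounded down to a whole year.
    """
    if age_low < 0 or age_high < age_low:
        raise ValueError(f"invalid age bucket: {age_low}-{age_high}")
    midpoint_age = (age_low + age_high + 1) // 2
    return reference_year - midpoint_age


# A person in the 40-49 bucket, recorded in 2015, gets a midpoint age of 45.
print(birth_year_from_age_bucket(40, 49, 2015))  # 1970
```

Here `reference_year` would be the year in which the age bucket was recorded, for example the year of the visit or of data collection.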

Yes, about 40% of our data sample does not contain any indication of the birth year. Because of such a large volume, I hesitate to simply exclude those records unless we have no other choice. I was hoping that there was some algorithm that could produce a meaningful birth year estimate based on other available data.

If you are really stuck with no birth year and no realistic hypothetical birth year, you could set an obviously fake birth year, for example ‘1800’, to denote patients with no known birth year.
And then make sure it is documented in the annotation somewhere. It would be even better to set a birth year that people know must be fake upon first inspection, for example the birth year ‘1000’.


That presupposes that whoever is “looking” at the data is a human, who knows that nobody alive can be born in 1800. But the most common analyst of data is a method or a tool without any human flesh. Therefore, if we want to have a convention like that we have to all agree and make it part of the CDM proper.

Based on what? Many people have no data, they are healthy. But they are still needed for the denominator. And there is no way you can know the age. The only thing you could do is to guess a mean birthdate for everybody (like 1-Jan-1970 or so). If an analytic stratifies by age there will be only a single age group. Not the end of the world.


@Mary_Regina_Boland, @Christian_Reich :
Inserting an obviously fake year of birth would just defeat making this column a required field, wouldn’t it? Is it an option to make this field nullable in CDM? It would seem that there is still some value in clinical data even if patient’s age is unavailable. This is just my impression, but I am not a researcher and may be speaking out of ignorance.

When I mentioned an algorithm for deriving a birth year based on other available data, I was thinking along these lines: finding a neonatal disease diagnosis would meaningfully place a patient’s year of birth within 2 years from the date of the diagnosis. I realize that it would only apply to a very small number of patients, so I am using this just as an example. I was wondering if there was an existing method to interpret various recorded events as some indication of a patient’s age, but it sounds from your response, Christian, that there isn’t.

I wonder if this is an application for a predictive model where you’re trying to predict an age for a patient based on a model derived from another source. I think the notion of building a model from a data-set with ages so that you can apply it to a data-set without ages is probably dubious, but I’m imagining if you did have a model trained, you can look at each patient’s record in the age-less data-set and determine what an age might be at each event, and then make an educated guess off of all the predicted ages you’ve made for a person and choose the most likely year of birth.


@Christian_Reich, @ZPGoldman:
Inserting a fake birth year doesn’t necessarily presuppose that the algorithm is human.
My algorithm currently restricts data to only those born after 1900, with the assumption that prior birth years are dubious/fake. And I have encountered obviously fake birth years in the past. So using such a restriction would not require a human algorithm, but rather a human programmer.
On the other hand, I totally agree that most data without birth year is not useful for research. That being said, I think it would be good if there was a distinction in the CDM between imputed birth years (from age range and collection year) and actual birth years reported by patients. This would allow researchers to easily distinguish different ‘qualities’ of birth year data, which could be more or less important depending on the specific analysis. I didn’t realize until this discussion that researchers were imputing birth years and reporting them in the CDM as true birth years.
Overall it might be better to allow a null birth year, and then if additional information is later gathered about the patients, it could be added at that point. Not allowing nulls in the birth year field would leave some users unable to transform their data at all, which is probably not ideal for other reasons.
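The restriction described above (keep only persons born after 1900) could be sketched roughly as follows in Python. The record layout, the helper name, and the sample rows are hypothetical; only the 1900 threshold comes from the post.

```python
from typing import Optional

MIN_PLAUSIBLE_BIRTH_YEAR = 1900  # threshold mentioned in the post above


def has_plausible_birth_year(year_of_birth: Optional[int]) -> bool:
    """True only for birth years strictly after 1900; None counts as implausible."""
    return year_of_birth is not None and year_of_birth > MIN_PLAUSIBLE_BIRTH_YEAR


# Hypothetical sample rows: one sentinel year, one real year, one missing year.
persons = [
    {"person_id": 1, "year_of_birth": 1800},
    {"person_id": 2, "year_of_birth": 1965},
    {"person_id": 3, "year_of_birth": None},
]

kept = [p for p in persons if has_plausible_birth_year(p["year_of_birth"])]
print([p["person_id"] for p in kept])  # [2]
```

Note that this filter lives in the analysis code, not in the CDM itself, so every analysis that relies on it has to apply it explicitly.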


Let me answer in detail, because we keep running into these arguments quite a lot, and I am pretty passionate about not going down a spiral that will turn a pragmatic and useful model into a nightmare.

:slight_smile: If you happen to know a “human algorithm” please make an introduction. I would love to talk about his/her/its feelings, maybe over lunch.

Here it is. Right here: “Researchers”. Look: The CDM has to support algorithms where no Researcher has seen the data. For example, for our distributed studies we will develop the code using one database, and then it will have to run on all the others with no more intervention, because most databases will be off limits to the Researcher. All content has to be organized so it can be blindly relied upon. That is why we cannot invent database-specific conventions as we go. It’s a standard.

We could do that, but: Each time we introduce such a detail all tools and methods will have to change and incorporate these options making if/then/else statements, when right now they just rely on the birth year. So, there is a tradeoff between capturing every possible piece of information and keeping the CDM from becoming unwieldy. Unless we have a really good use case for this imputed/non-imputed thing I would veto it.

Remember: The CDM is there to project real patients and their healthcare experience. Not the idiosyncrasies of all sorts of databases collected for all sorts of reasons.

If “Researchers” do that post-CDM, which means, in some piece of code that they write for themselves - fine. But they should not do that when filling out the CDM. Patients with no birth years should be out. If folks impute anyway - they can, however, it will make a lot of the algorithms fail, which are written with the expectation that the birth year is known. C*** in c*** out.

So, unless you have a good reason otherwise I’d say no Jesusses and no null birth years. If people have data that cannot be used for our type of work - don’t use the CDM.


Let’s take me and assume I am in the database. Thank God I have had only two interactions with the healthcare system: one for an external ear infection, and the other one because 10 years ago a chain saw cut into my hand (a little) when the tree came crashing down. Now what? How old am I? The guy holding the saw was 25 years older, btw., but that information, even if it were useful, isn’t recorded in the data.

Remember: The distribution of illnesses is very tail-heavy. Most people are healthy, and some are sick, and they are more likely very sick. The latter ones you might be able to guess better, particularly if they have diseases that are age-dependent.

We need the birth year.

Hello Christian,

If the source data only contains age of a patient and we use that to derive the year of birth, would that comply with the CDM requirement of having a value for that column? Or does the year of birth have to exist in the source data as an original value?



That’s totally fine. Even a rough age range of plus minus 5 years is fine. But Jesusses or everybody with the same age is problematic.
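For the case asked about above, where the source only has an age relative to an admission date, a minimal Python sketch might look like this. The function and parameter names are illustrative, and the result is only approximate because the birthday itself is unknown.

```python
from datetime import date


def birth_year_from_age(age_years: int, admission_date: date) -> int:
    """Approximate year of birth from an age in whole years at admission."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    # Without the actual birthday, the true birth year may be one year
    # earlier than this estimate (if the birthday falls later in the year).
    return admission_date.year - age_years


# A patient aged 30 at an admission on 15 June 2014.
print(birth_year_from_age(30, date(2014, 6, 15)))  # 1984
```

An error of at most one year is well within the "plus minus 5 years" tolerance mentioned above.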

By Christian tradition, my understanding is that Jesus was born in AD 1, which was preceded by 1 BC. So, by definition, 0 is an invalid value. :relaxed:

Sorry to bring back such an old post, but this came up in discussion during the ETL course today. While I understand that much observational research wants/needs an age, I would argue that an age is not necessary for all analyses. For a few of our use cases (post-market surveillance, blood transfusion safety) we may not have ages for all patients (e.g., unidentified patients, deidentified donor data). Forcing a birth year prohibits the use of the CDM for several potential projects, while allowing it to be nullable would only require that a filter/where be added to tools/queries.


Here is the thing: The birth year is usually not missing very frequently. So, keeping these records will require extra effort for the analyses that need age to work well. And for the ones that don’t need birth year a few missing records will not change the result much. So, lots of extra work when keeping the “Jesusses” and no harm when dropping them.

:slight_smile: Hilarious. Good thing “we” didn’t have EHR back in the days, and we don’t need to submit compliance reports to Pontius Pilatus.

@Christian_Reich - Sorry for reopening this post after a long time. I have a question regarding the gender_concept_id field, which is a mandatory one. For instance, if we don’t have gender (Male/Female) data present in the source data, does that mean, based on CDM convention, that we should drop those records? While the number of source records with no gender information is only around 20 patients, they do have other information on labs, drugs, etc. So, am I right to understand that it’s an OHDSI recommended practice to drop such records (no year of birth, no gender, etc.) because these fields are essential in observational research?

In the OMOP CDM, if you don’t have the content of a record you write concept_id=0. That’s true for gender and all other similar fields as well. For the other tables, you can also drop the record; for PERSON you cannot do that. Apart from year_of_birth (without which a Person should not be in the CDM instance), it is OK to have concept 0 for races, genders, providers and locations.
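Putting the conventions from this thread together, an ETL step for PERSON could be sketched roughly like this in Python. The function name, the source codes "M"/"F", and the partial row layout are assumptions for illustration; 8507 (male) and 8532 (female) are the usual standard gender concept IDs, but please verify them against your own vocabulary tables.

```python
from typing import Optional

# Assumed source codes mapped to standard OMOP gender concepts (verify locally).
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def to_person_row(
    person_id: int,
    year_of_birth: Optional[int],
    gender_source: Optional[str],
    race_concept_id: Optional[int] = None,
) -> Optional[dict]:
    """Build a partial PERSON row, or return None if the person must be excluded."""
    # Convention from this thread: no birth year means no PERSON record.
    if year_of_birth is None:
        return None
    gender_key = (gender_source or "").strip().upper()
    return {
        "person_id": person_id,
        "year_of_birth": year_of_birth,
        # Unknown or missing values become concept 0, never NULL.
        "gender_concept_id": GENDER_CONCEPTS.get(gender_key, 0),
        "race_concept_id": race_concept_id if race_concept_id is not None else 0,
    }


print(to_person_row(1, None, "M"))   # None: excluded, birth year unknown
print(to_person_row(2, 1980, None))  # kept, gender and race set to 0
```

The row is deliberately partial: other PERSON columns are left out here and would need the same treatment in a real ETL.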