The source data that I will receive contains only de-identified patient IDs, age relative to an admission date, and gender. However, in the PERSON table, several of the fields are documented as being required. I understand that to mean that these fields are mandatory. Examples are YEAR_OF_BIRTH and RACE_CONCEPT_ID.
Is it acceptable to set these fields to NULL when mapping the source data to the PERSON table, when source values do not exist?
year_of_birth: Can't put null in there. Neither is 0 allowed. We don't want all these "Jesusses" in the database. The age of a patient is so important to observational research that we have the convention to kick out patients without a known age.
race_concept_id: You can put 0 in there, which means Unknown. But not null.
The CDM documentation for the PERSON table specifies that the approximate year of birth should be "derived based on any age group categorization available" for data sources where the year of birth is not available. http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:person
Are there any sources that explain how approximate year of birth can be derived?
Some sources anonymize data by removing the birth date and giving you an age bucket instead. If you have one of those, you would create a fictitious birth year that puts the person exactly in the middle of the bucket. If you don't have either, you have a problem. Do you have data without any birthday information?
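The middle-of-the-bucket approach could be sketched as follows. This is only an illustration: the function name, its parameters, and the use of a reference year (for example, the year the age bucket was recorded) are assumptions, not part of the CDM documentation.

```python
def birth_year_from_age_bucket(min_age, max_age, reference_year):
    """Return a fictitious birth year that places the person in the
    middle of the age bucket [min_age, max_age].

    reference_year is the year in which the bucket applied
    (an assumption; the thread does not say which year to use).
    """
    # Integer midpoint of the bucket, rounded down for odd-width buckets.
    midpoint_age = (min_age + max_age) // 2
    return reference_year - midpoint_age


# A patient recorded in 2015 in the 40-44 bucket is treated as 42 years old.
print(birth_year_from_age_bucket(40, 44, 2015))
```

Every person in the same bucket and reference year ends up with the same birth year, so age stratification finer than the bucket width carries no real information.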
Yes, about 40% of our data sample does not contain any indication of the birth year. Because of such a large volume, I hesitate to simply exclude those records unless we have no other choice. I was hoping that there was some algorithm that could produce a meaningful birth year estimate based on other available data.
If you are really stuck with no birth year, and no realistic hypothetical birth year, you could set an obviously fake birth year - for example, "1800" to denote patients with no known birth year.
And then make sure it is in the annotation somewhere. Although it would be better to set a birth year that people know definitively must be fake upon first inspection - for example, the birth year "1000".
That presupposes that whoever is "looking" at the data is a human, who knows that nobody alive can be born in 1800. But the most common analyst of data is a method or a tool without any human flesh. Therefore, if we want to have a convention like that we have to all agree and make it part of the CDM proper.
Based on what? Many people have no data, they are healthy. But they are still needed for the denominator. And there is no way you can know the age. The only thing you could do is to guess a mean birthdate for everybody (like 1-Jan-1970 or so). If an analytic stratifies by age there will be only a single age group. Not the end of the world.
@Mary_Regina_Boland, @Christian_Reich :
Inserting an obviously fake year of birth would just defeat the purpose of making this column a required field, wouldn't it? Is it an option to make this field nullable in the CDM? It would seem that there is still some value in clinical data even if the patient's age is unavailable. This is just my impression, but I am not a researcher and may be speaking out of ignorance.
When I mentioned an algorithm for deriving a birth year based on other available data, I was thinking along these lines: finding a neonatal disease diagnosis would meaningfully place a patient's year of birth within 2 years from the date of the diagnosis. I realize that it would only apply to a very small number of patients, so I am using this just as an example. I was wondering if there was an existing method to interpret various recorded events as some indication of a patient's age, but it sounds from your response, Christian, that there isn't.
I wonder if this is an application for a predictive model where you're trying to predict an age for a patient based on a model derived from another source. I think the notion of building a model from a data-set with ages so that you can apply it to a data-set without ages is probably dubious, but I'm imagining if you did have a model trained, you can look at each patient's record in the age-less data-set and determine what an age might be at each event, and then make an educated guess off of all the predicted ages you've made for a person and choose the most likely year of birth.
@Christian_Reich, @ZPGoldman:
Inserting a fake birth year doesn't necessarily presuppose that the algorithm is human.
My algorithm currently restricts data to only those born after 1900, with the assumption that prior birth years are dubious/fake. And I have encountered obviously fake birth years in the past. So using such a restriction would not require a human algorithm, but rather a human programmer.
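A restriction like this might look like the sketch below. The function name and the default cutoff parameter are illustrative assumptions; the only thing taken from the post is the rule that birth years up to and including 1900 are treated as dubious.

```python
def has_plausible_birth_year(year_of_birth, cutoff=1900):
    """Return True only for birth years strictly after the cutoff.

    Missing values (None) are treated as implausible as well.
    """
    if year_of_birth is None:
        return False
    return year_of_birth > cutoff


# Hypothetical birth years, including two obviously fake ones and a missing one.
birth_years = [1800, 1000, 1955, None, 1987]
plausible = [y for y in birth_years if has_plausible_birth_year(y)]
print(plausible)
```

Note that the cutoff lives in the analysis code, not in the data, so every tool that reads the same database has to repeat it.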
On the other hand, I totally agree that most data without birth year is not useful for research. That being said, I think it would be good if there was a distinction in the CDM between imputed birth years (from age range and collection year) and actual birth years reported by patients. This would allow researchers to easily distinguish different "qualities" of birth year data, which could be more or less important depending on the specific analysis. I didn't realize until this discussion that researchers were imputing birth years and reporting them in the CDM as true birth years.
Overall it might be better to allow a null birth year, and then if additional information is later gathered about the patients, it could be added at that point. Not allowing nulls in the birth year field would leave some users unable to transform their data at all, which is probably not ideal for other reasons.
Let me answer in detail, because we keep running into these arguments quite a lot, and I am pretty passionate about not going down a spiral that will turn a pragmatic and useful model into a nightmare.
If you happen to know a "human algorithm" please make an introduction. I would love to talk about his/her/its feelings, maybe over lunch.
Here it is. Right here: "Researchers". Look: The CDM has to support algorithms where no Researcher has seen the data. For example, for our distributed studies we will develop the code using one database, and then it will have to run on all the others with no more intervention, because most databases will be off limits to the Researcher. All content has to be organized so it can be blindly relied upon. That is why we cannot invent database-specific conventions as we go. It's a standard.
We could do that, but: Each time we introduce such a detail, all tools and methods will have to change and incorporate these options with if/then/else statements, when right now they just rely on the birth year. So, there is a tradeoff between capturing every possible piece of information and keeping the CDM from becoming unwieldy. Unless we have a really good use case for this imputed/non-imputed thing I would veto it.
Remember: The CDM is there to project real patients and their healthcare experience. Not the idiosyncrasies of all sorts of databases collected for all sorts of reasons.
If "Researchers" do that post-CDM, which means, in some piece of code that they write for themselves - fine. But they should not do that when filling out the CDM. Patients with no birth years should be out. If folks impute anyway - they can, however, it will make a lot of the algorithms fail, which are written with the expectation that the birth year is known. C*** in c*** out.
So, unless you have a good reason otherwise, I'd say no Jesusses and no null birth years. If people have data that cannot be used for our type of work - don't use the CDM.
Let's take me and assume I am in the database. Thank God I have had only two interactions with the healthcare system: one for an external ear infection, and the other one because 10 years ago a chain saw cut into my hand (a little) when the tree came crashing down. Now what? How old am I? The guy holding the saw was 25 years older, btw., but that information, even if it were useful, isn't recorded in the data.
Remember: The distribution of illnesses is very tail-heavy. Most people are healthy, and some are sick, and those are more likely to be very sick. The latter ones you might be able to guess better, particularly if they have diseases that are age-dependent.
If the source data only contains age of a patient and we use that to derive the year of birth, would that comply with the CDM requirement of having a value for that column? Or does the year of birth have to exist in the source data as an original value?
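For concreteness, the derivation being asked about could look like the sketch below. The function name and inputs are assumptions for illustration, and whether the resulting value satisfies the CDM requirement is exactly the open question in this post.

```python
from datetime import date


def approximate_year_of_birth(admission_date, age_at_admission):
    """Estimate the year of birth from the age recorded at admission.

    The result can be off by one: a patient who had not yet had a
    birthday in the admission year was born one year earlier.
    """
    return admission_date.year - age_at_admission


# A patient aged 42 at an admission on 15 March 2016.
print(approximate_year_of_birth(date(2016, 3, 15), 42))
```

If the source shifts or masks admission dates as part of de-identification, the estimate inherits that shift.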
Sorry to bring back such an old post, but this came up during the ETL course today. While I understand that much observational research wants/needs an age, I would argue that an age is not necessary for all analyses. For a few of our use cases (post-market surveillance, blood transfusion safety) we may not have ages for all patients (e.g., unidentified patients, de-identified donor data). Forcing a birth year prohibits the use of the CDM for several potential projects, while allowing it to be nullable would only require that a filter/where be added to tools/queries.
Here is the thing: The birth year is usually not missing very frequently. So, keeping these records will require extra effort for the analyses that need age to work well. And for the ones that don't need birth year a few missing records will not change the result much. So, lots of extra work when keeping the "Jesusses" and no harm when dropping them.
@Christian_Reich - Sorry for reopening this post after a long time. I have a question regarding the gender_concept_id field, which is a mandatory one. For instance, if we don't have gender (Male/Female) data present in the source data, based on CDM convention, does that mean we should drop those records? While the number of source records with no gender information is only around 20 patients, they do have other information on labs, drugs, etc. So, am I right to understand that it's an OHDSI recommended practice to drop such records (no year of birth, no gender, etc.) because these fields are essential in observational research?
In the OMOP CDM, if you don't have the content of a record you write concept_id=0. That's true for gender and all other similar fields as well. For the other tables, you can also drop the record; for PERSON you cannot do that. Apart from year_of_birth (without which a Person should not be in the CDM instance), it is OK to have concept 0 for races, genders, providers and locations.
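Put together, the convention described in this thread could be sketched as a small mapping step. The source field names (`person_id`, `year_of_birth`, `gender`), the single-letter gender codes, and the function itself are hypothetical; 8507 and 8532 are the standard OMOP gender concepts for male and female, and 0 stands for unknown.

```python
# Standard OMOP gender concepts; the "M"/"F" source codes are an assumption.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def to_person_row(source):
    """Map one source record (a dict) to a PERSON row, or return None.

    A record without a year of birth is excluded entirely. Missing
    gender or race is kept and written as concept 0.
    """
    year_of_birth = source.get("year_of_birth")
    if year_of_birth is None:
        return None  # no birth year: the person stays out of the CDM
    return {
        "person_id": source["person_id"],
        "year_of_birth": year_of_birth,
        "gender_concept_id": GENDER_CONCEPTS.get(source.get("gender"), 0),
        "race_concept_id": 0,  # this source carries no race information
    }


print(to_person_row({"person_id": 1, "year_of_birth": 1974, "gender": None}))
print(to_person_row({"person_id": 2, "gender": "F"}))
```

The first record is kept with gender concept 0; the second is dropped because it has no birth year, even though its gender is known.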