One of the questions that came out of the OHDSI Symposium Themis Working Group meeting was: should all persons in the raw data be brought over to the CDM?
Our group has prepared a statement recommending which types of people to drop from a CDM build. We would like to propose adding this statement to the conventions for PERSON.
It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people whose data are not of high enough quality helps researchers using the CDM. Example scenarios where you might want to exclude a subject from the CDM include: unknown gender; the person lacks prescription or health benefits in a claims database; the person’s year of birth or age is unreasonable (e.g. born in 1800); or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Whether to remove a patient should be decided in light of the raw data source.
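To make the scenarios above concrete, here is a rough Python sketch of how an ETL step might decide which raw subjects to drop. Everything in it is an assumption made for illustration and not part of the proposed convention: the `RawPerson` fields, the `MIN_YEAR_OF_BIRTH` cutoff of 1900, treating a missing year of birth as unreasonable, and reading "lacks prescription or health benefits" as "is missing either one". A real build would adapt the rules to its own source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical cutoff; each source should choose its own.
MIN_YEAR_OF_BIRTH = 1900


@dataclass
class RawPerson:
    """Hypothetical shape of one subject in the raw data."""
    person_id: int
    gender: Optional[str]            # "M", "F", or None when unknown
    year_of_birth: Optional[int]
    has_rx_benefit: bool             # prescription coverage (claims data)
    has_medical_benefit: bool        # health/medical coverage (claims data)
    source_flagged_unusable: bool = False  # e.g. a source-provided quality flag


def exclusion_reasons(person: RawPerson, current_year: int) -> List[str]:
    """Return the reasons (possibly none) for excluding this subject."""
    reasons = []
    if person.gender not in ("M", "F"):
        reasons.append("unknown gender")
    # Assumption: a subject missing either benefit is excluded.
    if not (person.has_rx_benefit and person.has_medical_benefit):
        reasons.append("lacks prescription or health benefits")
    yob = person.year_of_birth
    # Assumption: a missing year of birth counts as unreasonable.
    if yob is None or yob < MIN_YEAR_OF_BIRTH or yob > current_year:
        reasons.append("unreasonable year of birth")
    if person.source_flagged_unusable:
        reasons.append("flagged by source as low research quality")
    return reasons


def split_persons(
    persons: List[RawPerson], current_year: int
) -> Tuple[List[RawPerson], Dict[int, List[str]]]:
    """Split subjects into those kept for the CDM and those dropped (with reasons)."""
    kept: List[RawPerson] = []
    dropped: Dict[int, List[str]] = {}
    for person in persons:
        reasons = exclusion_reasons(person, current_year)
        if reasons:
            dropped[person.person_id] = reasons
        else:
            kept.append(person)
    return kept, dropped
```

Recording the reasons for each dropped subject, instead of silently filtering, makes it easier to document in the ETL specification how many people were removed and why.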
@Rijnbeek has already commented about “what if I’m doing research where gender does not matter”. Very often gender will play a role in our analysis. For example, a Table 1 will often include gender breakouts. At Janssen we used to keep these people in, but almost every time I built my Table 1 and found I had 1 person of unknown gender, I was asked to remove them and perform the analysis again. Since they caused us more harm than analytical good, we have chosen to remove them. Additionally, unknown gender occurs very infrequently (<0.00% of the time in Truven CCAE/MDCR, <0.17% of the time in MDCD - which is slightly higher most likely due to the high birth rate and initial misclassification).