I disagree with this. I’m not advocating for things to be arbitrarily removed. However, as someone with years of experience with the raw data and as the manager of the ETL process, if I learn from the data vendors that they recommend excluding something, it is better to implement that in a repeatable, transparent, and standardized way within the CDM build than to leave each statistician to interpret how to implement the rule, or even to know whether they should be doing something in certain situations.
I do agree with you that ETLers should strive to bring as much over as possible, but if something is known to be suspect, then you do your statistician a service by eliminating it in a standardized way. Additionally, the ETL document should discuss all of these coding decisions so they are clear to the user.
I want to be clear, because I know this is a sensitive issue: when I’m discussing “unknown gender” here, I do not mean a switch in gender identification (there is a whole thread that digs more into that). If your data truthfully captures people who experience or identify with a gender change, then you should represent that, and the Vocabulary should help us represent that. However, at least in claims data, it is more likely an administrative error than someone having switched their gender. For example, we’ve found a handful of people whose IDs appear to have been reused: they were female for a few years, disappeared, and then suddenly reappeared as male and a different age.
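To make the "standardized way" idea concrete, here is a rough sketch of how a reused-ID check like that could live in the ETL rather than in each analysis. Everything in it is hypothetical: the `(person_id, gender, birth_year)` record layout, the function names, and the rule itself (flag an ID only when both gender and birth year disagree) are my assumptions for illustration, not a vendor recommendation or an OHDSI convention.

```python
from collections import defaultdict


def find_reused_ids(records):
    """Return sorted person IDs whose records disagree on BOTH gender and birth year.

    `records` is an iterable of (person_id, gender, birth_year) tuples.
    This layout is a hypothetical stand-in for a raw enrollment extract.
    """
    genders = defaultdict(set)
    birth_years = defaultdict(set)
    for person_id, gender, birth_year in records:
        genders[person_id].add(gender)
        birth_years[person_id].add(birth_year)
    # Requiring both fields to conflict is an assumed rule: a gender change
    # alone is not treated as an administrative error here.
    return sorted(
        pid for pid in genders
        if len(genders[pid]) > 1 and len(birth_years[pid]) > 1
    )


def split_records(records):
    """Split records into (kept, excluded) so the exclusion can be documented."""
    records = list(records)
    reused = set(find_reused_ids(records))
    kept = [r for r in records if r[0] not in reused]
    excluded = [r for r in records if r[0] in reused]
    return kept, excluded


# Made-up sample rows, not real data.
sample = [
    (101, "F", 1970),
    (101, "F", 1970),
    (102, "F", 1965),  # same ID later shows up as a different person
    (102, "M", 1988),
    (103, "M", 1950),
    (104, "F", 1980),  # gender differs but birth year is stable: kept
    (104, "M", 1980),
]
kept, excluded = split_records(sample)
```

Returning the excluded rows alongside the kept ones is deliberate: the counts can go straight into the ETL document, so the user can see what was dropped and why.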
All these points make me think the language needs to be softened and made clearer. This is meant to be a recommendation and not a requirement. I can take another crack at it, but I’m also open to input! Thank you for the discussion.