OK, I’m asking for trouble @Christian_Reich.
I think there is quite a bit of value in keeping the data needed to construct the following somewhere:
(A) the frequency of clearly incorrect data in source X for values of Y and, if you make any substitutions or deletions of Y in the ETL,
(B) the fact that some other value has been substituted for it and
(C) what value was substituted for the original value
A is essential to inform the level of trust we should have in source X for values of Y. Everyone should have less confidence in using X as a source for Y if a third of the values are bad than if 0.00001% are bad, and we need a way to capture and use that information. B and C are needed for sensitivity analyses that assess the impact of alternative choices of substituted values in analyses that include Y from X.
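To make this concrete, here is a minimal sketch of what keeping A–C somewhere could look like, assuming a hypothetical substitution record kept alongside the ETL. The names (ValueSubstitution, error_rate) are illustrative only and not part of the OMOP CDM or any existing convention:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record of a single ETL substitution/deletion (B and C);
# field names are illustrative, not an OMOP CDM or Themis standard.
@dataclass
class ValueSubstitution:
    source_table: str                  # source X
    source_field: str                  # value Y
    original_value: str                # the clearly incorrect source value
    substituted_value: Optional[str]   # None means the value was deleted

def error_rate(subs: list[ValueSubstitution], total_rows: int,
               source_table: str, source_field: str) -> float:
    """A: frequency of clearly incorrect data in source X for values of Y."""
    bad = sum(1 for s in subs
              if s.source_table == source_table and s.source_field == source_field)
    return bad / total_rows if total_rows else 0.0
```

With something like this, the error rate in A falls out of the same records that support the sensitivity analyses in B and C, rather than living only in ETL documentation.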
I agree that we can’t simply keep values that will introduce error into studies (implausible values or "9999"s) without some reliable method for removing them from analyses. But I strongly disagree that we should simply delete implausible values, if that’s what you are suggesting. We need to keep a standard, machine-readable representation of that fact in a standard location. Otherwise, we are ignoring important information that can and should be used to assess data quality and its contribution to measurement error; and if it’s only noted in a document somewhere, it’s not in a form that can be used efficiently in analyses.
A convention like “9999”, if used consistently (and that consistent use could be a Themis check), can identify these values and leave them out; a minimal sketch of such a check follows the list below. I don’t have a strong opinion on which strategy is used to keep known bad values out of analyses. In the end it only matters that:
- Some standard is followed that prevents incorrect data from being used in analyses,
- Important data quality information is not buried in the ETL process, and
- Value transformations from the source are captured and represented in a way that allows standardized sensitivity analyses.
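As for the “9999” convention, here is a minimal sketch of how a consistently applied sentinel could be recognized and excluded before analysis. The sentinel set and function name are assumptions for illustration, not an existing Themis rule:

```python
# Assumed sentinel convention for 'known bad' values; purely illustrative.
SENTINEL_VALUES = {"9999"}

def exclude_known_bad(records: list[dict], field: str) -> list[dict]:
    """Drop records whose `field` holds the agreed sentinel so analyses never
    see known bad values; a consistency check could verify the convention is
    applied uniformly across sources."""
    return [r for r in records if str(r.get(field)) not in SENTINEL_VALUES]
```

Whatever the agreed convention ends up being, the point is that it has to be applied (and checkable) uniformly, so the exclusion can be automated rather than rediscovered in each study.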