DQD - FAQs

Akshay · May 17, 2021, 9:52am

Hello,

We are trying to customize the DQD rules to suit our site. I have a few questions, listed below. Can you help us with them?

a) I see DQD has two contexts, verification and validation. How can I find the rules in the validation category? In the GitHub CSV files, I don't see any column that indicates the context (except for 3 rows in check_descriptions.csv). But in the DQD dashboard, I can see around 402 validation-context rules, and I am trying to locate these 402 DQ checks. Or does validation cover only 3 scenarios: implausible gender, person completeness, and null in a non-nullable field across different tables? Are there any other validation-based DQ checks?

b) Where can I find information on the external benchmarks/values used for our validation checks? I see that for validation checks, the data is compared with an external source. What is the comparator here?

For example, our dataset had 402 validation checks, of which 1 failed. Where does DQD get the external benchmark from? Against which value is it comparing our raw data? I know that for verification we can find the threshold limits for columns in the Excel sheet. But for validation, where can we find this?

c) In concept_level.csv and field_level.csv, I see columns like PlausibleGenderNotes, plausibleValueHighNotes, plausibleValueLowNotes, validPrevalenceLow, validPrevalenceLowThreshold, etc. I am unable to understand how these fields are used. What are they for, and are they even used in any DQ checks?

DTorok · May 17, 2021, 1:14pm

There are 4 CSV files that drive the DQD tests.

_Check_Descriptions.csv
_Table_Level.csv
_Field_Level.csv
_Concept_Level.csv

In _Check_Descriptions.csv, the column checkLevel defines which of the 3 other CSV files holds the test. The column kahnContext defines whether the test is Validation or Verification. The column evaluationFilter defines the column header and value to look for in one of the other files to determine whether the test should be run.
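As a rough illustration of how kahnContext can be used to list checks by context, here is a minimal Python sketch. The sample rows are made up for the example and are not copied from the real file; in practice you would read the _Check_Descriptions.csv that ships with your DQD version. The helper name checks_by_context is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Illustrative stand-in for _Check_Descriptions.csv.
# These rows are invented for the example; read the real file instead.
SAMPLE = """checkLevel,checkName,kahnContext,evaluationFilter
FIELD,plausibleDuringLife,Verification,plausibleDuringLife=='Yes'
FIELD,isRequired,Validation,isRequired=='Yes'
CONCEPT,plausibleGender,Validation,plausibleGender!=''
"""

def checks_by_context(csv_text):
    """Group (checkLevel, checkName) pairs by their kahnContext value."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["kahnContext"]].append((row["checkLevel"], row["checkName"]))
    return dict(grouped)

by_context = checks_by_context(SAMPLE)
print(by_context.get("Validation", []))
```

Each check type listed this way is run once for every row in the level file that passes its evaluationFilter, which may be why a handful of Validation check types show up as several hundred results on the dashboard.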

For example, in _Check_Descriptions.csv:
checkLevel := Field
checkName := plausibleDuringLife
kahnContext := Verification
evaluationFilter := plausibleDuringLife=='Yes'

Now look in the _Field_Level.csv file and find the column labelled plausibleDuringLife. For each row where the value in that column is ‘Yes’, the test will be run on that Table/Field. To see the code that implements the test, go back to _Check_Descriptions.csv: the column sqlFile names the SQL program that is used to run it. If you want to disable the test for a Table/Field, change the ‘Yes’ to a ‘No’.
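The lookup described here can be sketched in a few lines of Python. This is only an illustration of the idea, not how DQD itself evaluates the filter; the sample rows are invented, the column names cdmTableName and cdmFieldName are assumed, and rows_matching is a hypothetical helper that handles only the simple column=='value' and column!='value' forms.

```python
import csv
import io
import re

# Illustrative stand-in for _Field_Level.csv; rows and column names are assumptions.
FIELD_LEVEL = """cdmTableName,cdmFieldName,plausibleDuringLife,plausibleValueLow
CONDITION_OCCURRENCE,condition_start_date,Yes,
PERSON,year_of_birth,No,1850
DRUG_EXPOSURE,drug_exposure_start_date,Yes,
"""

# Matches filters of the form  column=='value'  or  column!='value'.
FILTER_RE = re.compile(r"^\s*(\w+)\s*(==|!=)\s*'(.*)'\s*$")

def rows_matching(filter_expr, rows):
    """Return the rows for which the evaluationFilter expression holds."""
    match = FILTER_RE.match(filter_expr)
    if match is None:
        raise ValueError(f"unsupported filter: {filter_expr!r}")
    column, op, value = match.groups()
    if op == "==":
        return [r for r in rows if r[column] == value]
    return [r for r in rows if r[column] != value]

rows = list(csv.DictReader(io.StringIO(FIELD_LEVEL)))
selected = rows_matching("plausibleDuringLife=='Yes'", rows)
targets = [(r["cdmTableName"], r["cdmFieldName"]) for r in selected]
print(targets)
```

Changing a ‘Yes’ to a ‘No’ in the sample data removes that Table/Field from targets, which mirrors how a test is disabled in the CSV.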

In _Check_Descriptions.csv, for the checkName plausibleValueLow, the evaluationFilter is plausibleValueLow!=''. This applies to the column headed plausibleValueLow in the _Field_Level.csv file: if the cell for a row is not the empty string, the test is run for that Table/Field and the value in the cell is used as the lower plausibility value.
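To see which fields currently carry a lower plausibility value, one could collect the non-empty cells of that column. The Python sketch below assumes the column names cdmTableName and cdmFieldName and uses invented rows; lower_bounds is a hypothetical helper, and the values are kept as strings exactly as they appear in the CSV.

```python
import csv
import io

# Illustrative stand-in for _Field_Level.csv; the rows are made up for the example.
FIELD_LEVEL = """cdmTableName,cdmFieldName,plausibleValueLow
PERSON,year_of_birth,1850
PERSON,gender_concept_id,
DRUG_EXPOSURE,days_supply,1
"""

def lower_bounds(csv_text):
    """Map (table, field) to its plausibleValueLow cell, skipping empty cells."""
    bounds = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row["plausibleValueLow"]
        if value != "":  # same condition as the filter plausibleValueLow!=''
            bounds[(row["cdmTableName"], row["cdmFieldName"])] = value
    return bounds

bounds = lower_bounds(FIELD_LEVEL)
for (table, field), low in bounds.items():
    print(f"{table}.{field}: lower plausibility value {low}")
```

Clearing a cell takes that field out of the result, and editing the cell changes the lower value the check would use.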

I think fields such as PlausibleGenderNotes represent a new feature that allows you to add notes to the DQD output. Someone else will have to provide more detail.