
What's the difference between "repetitions of 10-fold CV" and "repetitions of data"?

Hello.

In the Atlas Population Level Effect Estimation section, the Control Settings include these options:

  • Number of random folds to employ in cross validation.
  • Number of repetitions of 10-fold cross validation.
  • Number of repetitions of data for cross validation.

I know what the number of folds is.
But I don’t know the difference between “repetitions of 10-fold CV” and “repetitions of data”. Could somebody explain it, please?

Regards

I’m sure that is a typo. ‘Number of repetitions of 10-fold cross validation’ refers to the number of times the cross-validation is repeated. This is really only necessary for small data, where the random sampling of the cross-validation might get you into trouble if you do it only once.
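To make "repetitions" concrete, here is a minimal sketch in plain Python of repeating a k-fold cross-validation with a fresh random partition each time and averaging the results. It is not how Atlas or its back end implements this; the function names (`kfold_splits`, `repeated_cv_score`, `mse_of_train_mean`) and the toy scoring function are assumptions made up for illustration.

```python
import random
from statistics import mean


def kfold_splits(n, k, rng):
    """Yield (train_indices, test_indices) for one random k-fold partition."""
    idx = list(range(n))
    rng.shuffle(idx)                       # new random partition on every call
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def repeated_cv_score(data, score_fn, k=10, repetitions=1, seed=0):
    """Run k-fold CV `repetitions` times and average the per-repetition scores."""
    rng = random.Random(seed)
    per_repetition = []
    for _ in range(repetitions):
        fold_scores = [
            score_fn([data[j] for j in train], [data[j] for j in test])
            for train, test in kfold_splits(len(data), k, rng)
        ]
        per_repetition.append(mean(fold_scores))
    return mean(per_repetition), per_repetition


def mse_of_train_mean(train, test):
    """Toy score (hypothetical): squared error when predicting the training mean."""
    m = mean(train)
    return mean((x - m) ** 2 for x in test)


if __name__ == "__main__":
    rng = random.Random(42)
    small_data = [rng.gauss(0.0, 1.0) for _ in range(30)]  # deliberately small
    overall, reps = repeated_cv_score(
        small_data, mse_of_train_mean, k=10, repetitions=5
    )
    print("per-repetition scores:", reps)
    print("averaged over repetitions:", overall)
```

With only 30 data points the per-repetition scores differ from one random partition to the next, which is the situation where averaging over several repetitions helps.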

‘Number of repetitions of data for cross validation’ should really be ‘Minimum amount of data for cross-validation’. As an extreme example, if you only have 5 people in the comparator and you do a 10-fold cross-validation, some folds will have no people in the comparator group and will not produce valid estimates. By default we require 100 people in the smallest cohort, so on average 10 per fold in a 10-fold cross-validation. If the number is lower, an error will be thrown (silently, simply resulting in NA estimates).
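The extreme example can be checked with a small simulation. This is a sketch under assumptions, not the actual implementation: the helper names (`comparator_counts_per_fold`, `enough_data_for_cv`), the parameter name `min_cv_data`, and the target cohort size of 1000 are hypothetical; only the default threshold of 100 comes from the explanation above.

```python
import random
from collections import Counter


def comparator_counts_per_fold(n_target, n_comparator, n_folds, rng):
    """Randomly split everyone into near-equal folds; count comparator people per fold."""
    people = ["target"] * n_target + ["comparator"] * n_comparator
    fold_of = [i % n_folds for i in range(len(people))]
    rng.shuffle(fold_of)  # random fold assignment, fold sizes stay balanced
    counts = Counter(f for f, group in zip(fold_of, people) if group == "comparator")
    return [counts[f] for f in range(n_folds)]


def enough_data_for_cv(n_target, n_comparator, min_cv_data=100):
    """Hypothetical guard: the smallest cohort must reach the minimum size."""
    return min(n_target, n_comparator) >= min_cv_data


if __name__ == "__main__":
    rng = random.Random(0)
    counts = comparator_counts_per_fold(
        n_target=1000, n_comparator=5, n_folds=10, rng=rng
    )
    print("comparator people per fold:", counts)
    # 5 people can occupy at most 5 of the 10 folds, so at least 5 folds are empty.
    print("folds without any comparator person:", counts.count(0))
    print("enough data for CV?", enough_data_for_cv(1000, 5))
```

Whatever the random assignment, 5 people can land in at most 5 distinct folds, so at least half of the 10 folds contain nobody from the comparator group. A size check like the one sketched here would reject this case before any fold is fitted.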


Added a Github issue for this: https://github.com/OHDSI/Atlas/issues/2870

Thanks for the issue report, we’ve made a fix and it will be in the next release.
