I think by ‘calibration’ you mean the empirical method evaluation? I’m a bit on the fence about whether it should be included.
On the one hand, if you’re going to fly in a plane, you would appreciate knowing not just that the plane was built by an ISO-9000-compliant company, but also that someone made a test flight in it before you get on board. So running the Method Library through the Methods Benchmark is informative about the validity of the software. Imagine a method that just produces the same estimate for all controls in the Benchmark: that would be considered to invalidate the software. Getting the right answer somehow feels like it should be part of our definition of validity.
On the other hand, there are the inherent strengths and weaknesses of the methods, which are independent of whether they have been implemented correctly. Even the best implementation of case-control will get the answer wrong most of the time, so that should count against the validity of the method, not the validity of the software.
What do other people think?
On expanding section 6: I agree, but I’ve already tried to be as long-winded as I can. Any suggestions on how to be more verbose?