In terms of **utility for a specific question**: yes, that is something I am very interested in, and I discuss it in my white paper.

Let’s address it more directly here. Consider each question as its data-generating function Y = H(X, U, W), having treatment effect t(H) = E[H(X, U, 1) - H(X, U, 0)], which generates an observed dataset O(H) = (X, W, Y). We have a set of methods **M** where each method M takes a dataset and produces an estimate of the effect: t’ = M(O(H)).
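
To make the notation concrete, here is a minimal Python sketch of one question. Everything in it is an assumption made for illustration: the linear form of `H`, the Gaussian draws, the way `U` influences treatment assignment, and the single method `naive_difference` are hypothetical and are not taken from the white paper.

```python
import random
from statistics import mean


def H(x, u, w):
    """Hypothetical data-generating function Y = H(X, U, W)."""
    return 2.0 * x + 1.5 * u + 1.0 * w  # effect of w is 1.0 by construction


def true_effect(H, n=10_000, seed=0):
    """t(H) = E[H(X, U, 1) - H(X, U, 0)], approximated by Monte Carlo."""
    rng = random.Random(seed)
    draws = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    return mean(H(x, u, 1) - H(x, u, 0) for x, u in draws)


def observe(H, n=5_000, seed=1):
    """O(H) = (X, W, Y): U affects treatment and outcome but is not recorded."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x, u = rng.gauss(0, 1), rng.gauss(0, 1)
        w = 1 if x + u + rng.gauss(0, 1) > 0 else 0  # confounded assignment
        rows.append((x, w, H(x, u, w)))
    return rows


def naive_difference(rows):
    """One method M: difference in mean outcome, treated minus untreated."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    return mean(treated) - mean(control)


t = true_effect(H)                      # known only because we wrote H ourselves
t_hat = naive_difference(observe(H))    # t' = M(O(H))
print(f"t(H) = {t:.2f}, naive estimate = {t_hat:.2f}")
```

In this toy setup the naive method overstates the effect because `X` and `U` drive both treatment and outcome. With a real question we only ever see the output of `observe`, never `H` or `t`.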

We want to find the argmin of M(O(H*)) - t(H*) over M in **M** for a specific question H*. Note that this is totally analogous to the general learning problem: find the argmin of F(x*) - Y(x*) over F in **F** for a given point x*.

The problem is that we generally don’t have t(H*). If we did, we wouldn’t care what the best inference method for that question is because we already have the treatment effect. Again that’s analogous to the learning problem: we don’t have the value of the function Y at the point x* or else we wouldn’t care about learning it.

We’ll use the same trick that’s used in machine learning, which is to approximate F(x*) - Y(x*) as E[F(x) - Y(x)] where the expectation is taken over a set of points x in **X** for which we have measurements of Y(x). We can think of these points as forming a neighborhood around x* and we will use them as surrogates and average over the variation.

The fundamental question, and the essential difference between all learning algorithms, is how exactly you decide what is a “neighbor” and how you weigh the surrogate points to average over the variation. Perhaps the simplest algorithm is K-nearest-neighbors, which uses uniform weighting over a set of neighbors defined by a distance: D(x,x*).

The translation to our methods evaluation setting is to find M that minimizes E[M(O(H)) - t(H)] where H are from a set **H** of questions that are “neighbors” to H* for which t(H) is known. To define that set we need a distance metric G(H,H*). We never know the true data-generating functions, but we can do well by using the observed data as a proxy to define a distance metric J(O,O*). What this tells us is that the best datasets to use to evaluate methods for a question at hand are those that are most similar to the dataset at hand. It’s intuitive almost to the point of tautology. The algorithm I lay out in the white paper *does exactly this*. It finds the argmin of J(O(H),O*) subject to t(H) = t over a set of questions H in **H**: it finds the questions that are most statistically similar to the question at hand in terms of the generated datasets, but for which the treatment effect is known. That’s precisely the neighborhood we are looking for.
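
As a rough illustration of the neighborhood idea (and not of the white paper's algorithm itself), the sketch below picks a method by K-nearest-neighbors with uniform weights. The distance `J` is assumed here to be Euclidean distance between dataset summary vectors, squared error stands in for the loss, and the reference questions, summaries, and method names are all made up.

```python
from statistics import mean


def J(summary_a, summary_b):
    """Hypothetical dataset distance: Euclidean distance between summaries."""
    return sum((a - b) ** 2 for a, b in zip(summary_a, summary_b)) ** 0.5


def best_method_for(target_summary, references, k=2):
    """Pick the method with the lowest mean squared error over the k nearest
    reference questions, each weighted equally.

    Each reference is (summary, known_effect, {method_name: estimate}).
    """
    neighbors = sorted(references, key=lambda r: J(r[0], target_summary))[:k]
    method_names = neighbors[0][2].keys()
    mse = {
        name: mean((estimates[name] - effect) ** 2
                   for _, effect, estimates in neighbors)
        for name in method_names
    }
    return min(mse, key=mse.get), mse


# Made-up reference questions with known effects, for illustration only.
references = [
    ((0.1, 0.2), 1.0, {"matching": 1.1, "regression": 1.6}),
    ((0.2, 0.1), 0.5, {"matching": 0.4, "regression": 1.0}),
    ((5.0, 5.0), 2.0, {"matching": 3.0, "regression": 2.1}),
]

choice, mse = best_method_for((0.0, 0.0), references, k=2)
print(choice, mse)
```

With these invented numbers, restricting to the two nearest references selects a different method than averaging over all three (`k=3`), which is the whole point of evaluating on neighbors rather than on everything.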

There is one hitch: the distance G(H,H*) is not perfectly captured by J(O,O*). One part of that is the question of unobserved confounding: if the dataset O contains no information about the unobserved confounding U, then how can we tell whether the relationship between U and Y that exists in H* is preserved as much as possible in the generated neighbors H? The answer is that fundamentally we cannot. It is simply not possible because we can never observe that quantity. However, if Y* = H*(X*,U*,W*) is close to Y = H(X*,W*), then (Y*, X*,U*,W*) should not be very different from (Y, X*,U*,W*). That is to say, as long as we remain close to the observed data, the relationships with the unobserved variables will be relatively well preserved. We just can’t quantify by how much.

Is there an alternative? Because of the relative paucity of clinical trials and their quality, we don’t have many, if any, datasets O that are near to O*. There are innumerable differences between any two observational datasets, and I would argue that any real dataset O’ used to approximate O* would be further away than a semi-simulated dataset O, because O is generated in a way that specifically minimizes that distance. In other words, it will always be the case that J(O*, O’) > J(O*, O) except under extremely pathological conditions. In addition, the unobserved confounding structure present in H’ is no more likely than that of H to mimic that of H*: it cannot be shown that G(H*, H’) - J(O*, O’) < G(H*, H) - J(O*, O). In fact, because of the argument about small perturbations not disturbing the unmeasured confounding, it is more likely that G(H*, H’) - J(O*, O’) > G(H*, H) - J(O*, O). The conclusion is that it is difficult to conceive of a case where G(H*,H’) < G(H*,H) for any real clinical questions H* and H’ and a semi-simulated model H based on H*. It is therefore very difficult to make an argument for using only real data to evaluate the utility of a method for a specific question.

There are many possible ways to include the results for real questions and weight them appropriately relative to the results on semi-simulated datasets (that’s how I conceive of the Bayesian approach Martijn describes). Or, despite my theoretical arguments against it, one might use *only* real data. How can we empirically tell what’s the best way to do it?

Again there is a perfect analogy to the general learning problem. The different strategies and choices are analogous to different learning algorithms that make different assumptions. For instance, the analysis we are proposing to do for a large set of gold-standard RCTs is analogous to finding the mean of the response variable Y and saying F(x) is a constant function that is equal to the mean of Y (one-size-fits-all). Using subsets of those trials or doing a Bayesian analysis gets closer to adaptive learning methods. Making semi-simulated datasets with my algorithm or doing signal injection as before are analogous to data-augmentation techniques that are used in vision and speech processing.

Just as in the general learning problem, we can hold out a test set or use cross-validation to find the best question-specific method evaluation approach. For instance, to compare the general best method to using my algorithm to find the question-specific best method, one would do the following:

- Split the gold-standard RCTs 70/30
- Find the method that best predicts the treatment effect from the corresponding observational data on the 70% training sample of RCTs
- Find the MSE (or AUC) of using that method to predict treatment effects on the observational data corresponding to the 30% sample of RCTs
- For each trial in the 30% sample, run my algorithm on the corresponding observational data to find the best method
- Then run that dataset-specific method on the observational data to estimate each treatment effect and calculate the error from the real treatment effect
- Average those errors to get the MSE (or AUC) of my algorithm
- Compare the errors from the one-size-fits-all method to the errors from my algorithm
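
The steps above can be sketched in Python as follows. This is a sketch under stated assumptions, not the actual analysis: each trial is assumed to already carry the RCT effect (`truth`) and every method's estimate from the matching observational data (`estimates`), MSE is used rather than AUC, and `pick_method` is a hypothetical stand-in for the question-specific algorithm. The trials are made up, and the toy picker is built to choose well, so the output shows only the mechanics of the comparison.

```python
import random
from statistics import mean


def squared_error(trial, method):
    """Squared error of one method's estimate against the RCT effect."""
    return (trial["estimates"][method] - trial["truth"]) ** 2


def compare_strategies(trials, pick_method, train_frac=0.7, seed=0):
    """Hold-out comparison of one-size-fits-all vs. question-specific choice."""
    rng = random.Random(seed)
    shuffled = list(trials)
    rng.shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]

    # One-size-fits-all: the single method with the lowest MSE on training RCTs.
    methods = list(train[0]["estimates"])
    overall_best = min(
        methods, key=lambda m: mean(squared_error(tr, m) for tr in train)
    )
    return {
        "one_size_fits_all_method": overall_best,
        "one_size_fits_all_mse": mean(
            squared_error(tr, overall_best) for tr in test
        ),
        # Question-specific: choose a method per held-out trial, then score it.
        "question_specific_mse": mean(
            squared_error(tr, pick_method(tr)) for tr in test
        ),
    }


# Made-up trials: method "a" suits kind "x", method "b" suits kind "y".
trials = []
for i in range(10):
    kind = "x" if i < 6 else "y"
    truth = 0.1 * i
    errors = {"a": 0.1, "b": 0.4} if kind == "x" else {"a": 0.5, "b": 0.2}
    trials.append({
        "kind": kind,
        "truth": truth,
        "estimates": {m: truth + e for m, e in errors.items()},
    })


def pick_method(trial):
    """Hypothetical stand-in for a question-specific selection algorithm."""
    return "a" if trial["kind"] == "x" else "b"


result = compare_strategies(trials, pick_method)
print(result)
```

Note that `pick_method` never looks at `truth`; in the real comparison it would see only the observational data for the held-out trial.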

That will essentially tell us if we get anything out of doing question-specific evaluations and if my algorithm is useful in practice. All of this is precisely what I intend to do in parallel with the big general evaluation that we are working on as OHDSI.