There are other motivations driving the selection of variables for Table 1 besides the validity of results/covariate balance. Expert selection ensures that studies are meaningfully connected with prior studies and that knowledge of what drives outcomes accumulates.
I see two potential problems with a data-driven approach to selecting variables for Table 1s:
1) It would imply disregard for the accumulated evidence that the community of researchers use to model the outcome.
2) It would imply greater trust in data-driven solutions to modeling the outcome than might be warranted.
Reviewers would and, I think, should have a problem with this. In many cases they'd ask for a justification and/or require an explanation of why a Table 1 excluded covariates of known importance.
Sebastien Haneuse's work on selection bias is relevant here. You know better than I do that the covariate data at hand, however broad, may not cover the phenomena needed to account for group differences in an outcome. In some cases, optimal covariate matching on 10,000 irrelevant variables will fail to balance groups on the one critically important phenomenon that accounts for their different outcomes. This is a broad category of risk to inferences from observational research using secondary data.
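A toy simulation makes that risk concrete. This is only a sketch with simulated data - the sample sizes, effect sizes, and covariate counts are all illustrative assumptions - but it shows nearest-neighbor matching on many irrelevant covariates balancing them nicely while leaving the one unmeasured driver of treatment badly imbalanced:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# U: the one critically important (unmeasured) phenomenon that drives
# treatment assignment. X: 50 measured covariates irrelevant to treatment.
U = rng.normal(size=n)
X = rng.normal(size=(n, 50))

# Treatment depends only on U.
T = rng.random(n) < 1 / (1 + np.exp(-2 * U))

def smd(a, b):
    """Standardized mean difference between two groups."""
    return (a.mean() - b.mean()) / np.sqrt((a.var() + b.var()) / 2)

# Nearest-neighbor matching (with replacement) on the measured covariates:
# for each treated unit, take the control closest in Euclidean distance in X.
treated = np.where(T)[0]
controls = np.where(~T)[0]
matched = np.array([controls[np.argmin(np.linalg.norm(X[controls] - X[i], axis=1))]
                    for i in treated])

# Measured covariates look balanced; the unmeasured U does not.
smd_X = np.mean([abs(smd(X[treated, j], X[matched, j])) for j in range(50)])
smd_U = abs(smd(U[treated], U[matched]))
print(f"mean |SMD| over 50 measured covariates: {smd_X:.2f}")
print(f"|SMD| on the unmeasured U:              {smd_U:.2f}")
```

The matched Table 1 for this simulated study would look excellent - every measured covariate well balanced - while the variable that actually explains the outcome difference is nowhere in it.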
The implications of this fact suggest another motivation for Table 1s in observational studies: comparing the variables a primary data collection effort would be sure to include against the data that were actually available to the observational study. Standardizing this approach would ensure more, rather than less, connection to prior research. More specifically, it might require that Table 1s include: A) standard demographic characteristics (age, gender, race, socioeconomic status) regardless of their association with the outcome(s); and B) covariates that have been shown to be correlated with the outcome(s), regardless of whether they were measured directly by any variables used in the analysis. This evidence might be gathered by scraping, in the same way that evidence for drug-outcome pairs has been so brilliantly collected and exposed by Rich and everyone involved with LAERTES.
In any event, the justification for something like this seems clear and widely recognized. The STROBE statement recommends that publications "Indicate number of participants with missing data for each variable of interest".
By extension, if there are no data at all for a variable of interest, it should not be silently left out, as is typically done, but noted as absent - perhaps by including rows for those missing variables with indicators that they were unmeasured.
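Mechanically, that convention is simple to implement. A minimal sketch (the variable names and summary values here are entirely hypothetical): build the Table 1 from the full list of expected covariates, and flag any that the source data never measured instead of dropping them.

```python
# Covariates a primary data collection effort would be sure to include
# (hypothetical list for illustration).
expected_covariates = ["age", "gender", "race", "socioeconomic_status",
                       "smoking_status", "baseline_bmi"]

# Summaries actually computable from the data at hand (hypothetical values).
available = {
    "age": "62.4 (11.2)",
    "gender": "48% female",
    "race": "see categories",
    "socioeconomic_status": "see categories",
}

def table1_rows(expected, available):
    """One row per expected covariate; unmeasured ones are flagged, not omitted."""
    return [(var, available.get(var, "-- unmeasured --")) for var in expected]

for var, summary in table1_rows(expected_covariates, available):
    print(f"{var:24s} {summary}")
```

The reader then sees at a glance which covariates of known importance the study could not balance on, rather than having to infer their absence.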
The expanded dissemination enabled by OHDSI might go on to define a standardized addendum to Table 1 that summarizes any available evidence relevant to the question of whether the measured covariates were sufficient to enable covariate balance. In particular, it might summarize evidence from prior studies in which the pattern of missing variables and the covariate balancing approach were similar to those of the study being reported.
Bottom line: We might not want to pick an approach that implicitly assumes covariate balance has been achieved. We'll be better off exposing evidence that allows assessment of how likely it is to have been achieved - in keeping with all the other great work you and others have done.