Within PheValuator’s outputs, following a completed run on our cohort inputs:
How can we view the total number of patients, and their IDs, used at each step of the PheValuator process, specifically the Training, Testing, and Evaluation cohorts? We are seeking more transparency into how patients are allocated within each stage of the model development pipeline. Currently, it is unclear how to map some of the output files in R to the corresponding cohorts described in Parts B and C of the methodology (Swerdel et al. 2019, Fig. 1). We would like to know every cohort a unique patient is included in within PheValuator.
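For anyone following along, here is a minimal sketch of one way to pull the person IDs per input cohort straight from the CDM cohort table using plain DatabaseConnector (not a PheValuator function). The connection details, results schema, and cohort_definition_ids are placeholders for our environment, and this only covers the cohorts we can already see (xSpec, xSens, prevalence), not the internal Training/Testing/Evaluation splits we are asking about:

```r
# Sketch only: query the cohort table for the person IDs in each input cohort.
# All names and IDs below are placeholders for your own environment.
library(DatabaseConnector)

connectionDetails <- createConnectionDetails(
  dbms     = "postgresql",
  server   = "my_server/my_database",
  user     = "my_user",
  password = "my_password"
)
conn <- connect(connectionDetails)

# Placeholder cohort_definition_ids for the xSpec, xSens, and prevalence cohorts
members <- querySql(conn, "
  SELECT cohort_definition_id, subject_id
  FROM my_results_schema.cohort
  WHERE cohort_definition_id IN (1001, 1002, 1003);
")
disconnect(conn)

# Normalize column-name case (drivers differ in what they return)
names(members) <- toupper(names(members))

# Distinct persons per cohort
tapply(members$SUBJECT_ID, members$COHORT_DEFINITION_ID,
       function(x) length(unique(x)))
```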
How are patients selected for the Evaluation cohort, and can there be overlap with patients used in the Training/Testing or candidate phenotype cohorts (e.g., xSpec, xSens)? We are trying to understand whether these cohorts are mutually exclusive and how the tool handles overlap between them. This is important for interpreting which output tables refer specifically to the Evaluation vs. Training/Testing phases.
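To audit overlap empirically, we have been doing something like the following. This is a sketch only: it assumes we already have vectors of subject IDs for the Evaluation cohort and for xSpec (e.g., from the query above or from PheValuator’s output files), which is part of what we are asking how to obtain:

```r
# Sketch: simple set operations to check overlap between two cohorts.
# evaluationIds and xSpecIds are assumed to be vectors of subject IDs
# obtained elsewhere (e.g., from the cohort-table query above).
overlap <- intersect(evaluationIds, xSpecIds)

length(overlap)                           # persons appearing in both cohorts
length(overlap) / length(evaluationIds)   # share of the Evaluation cohort overlapping xSpec
```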
Which R parameters within PheValuator control the size of the Training/Testing and Evaluation cohorts? For example, if we have identified 500 cases (xSpec) and 10,000 non-cases (i.e., the full population minus xSens), with a prevalence of 12% (prevalence cohort ÷ population), what should we expect for the size of the target cohort under different configurations of the PheValuator setup?
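To make the arithmetic concrete, here is what we would naively expect under one plausible construction: keep all cases and sample non-cases so that cases make up roughly the prevalence fraction of the target cohort. This is only a sketch of our own reasoning, not PheValuator’s documented behaviour, which is exactly what we are asking about:

```r
# Sketch of the arithmetic for the scenario above (our assumption, not
# necessarily how PheValuator actually samples).
nCases     <- 500     # xSpec
nNonCases  <- 10000   # full population minus xSens
prevalence <- 0.12    # prevalence cohort / population

# Non-cases needed so that cases make up ~12% of the target cohort
nonCasesNeeded   <- round(nCases * (1 - prevalence) / prevalence)  # ~3667
targetCohortSize <- nCases + min(nonCasesNeeded, nNonCases)        # ~4167
targetCohortSize
```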
Tagging Dr Swerdel to this post @jswerdel
Thanks to Jon on our team for posting these questions.
We have previously met with Dr Swerdel, and we think it might be helpful to bring these questions to the community.
Using the terminology from your papers, particularly Fig. 1 of Swerdel et al. 2019, we would like to follow each person through the PheValuator process (to help audit and better understand how PheValuator is working).
Here are some of my own comments:
AW: What is the effect of changing the R parameters xSpecCohortSize, modelBaseSampleSize, and baseSampleSize on the selection of persons in the Training, Testing, and Evaluation cohorts? We have set xSpecCohort
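One practical step we have found useful is to inspect the installed package itself to see exactly which arguments (and defaults) the version being run exposes, including xSpecCohortSize, modelBaseSampleSize, and baseSampleSize if present. This sketch only assumes PheValuator is installed; function names vary between versions, so we list the exports first rather than guessing:

```r
# Sketch: discover which parameters your installed PheValuator version exposes.
library(PheValuator)

# List all exported functions in the package
ls("package:PheValuator")

# Then inspect the arguments and defaults of the relevant function,
# for example (the function name may differ between versions):
# formals(PheValuator::createEvaluationCohort)
```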
AW: The bottom line is: for a given person ID, after a PheValuator run, how do we determine which of the following that person fell into (referencing Swerdel et al. 2019, Figure 1)? See the code sketch after this list.
(a) xSpec (cases) – we can get this from the CDM cohort tables
(b) xSens – we can get this from the CDM cohort tables
(c) Non-cases (1 – xSens) – we can get this from the CDM cohort tables
(d) Target cohort overall (combined cases and non-cases, limited by the prevalence cohort)
(e) Training data persons (within the Target cohort – 75% of the Target cohort?)
(f) Testing data persons (within the Target cohort – 25% of the Target cohort?)
(g) Evaluation cohort (random sample of persons from the full data, within the Target cohort – do these overlap with Training/Testing?)
(h) Ambiguous cases (neither cases nor non-cases)
(i) PLP predicted likelihood (%) for all persons (regardless of being in cases, non-cases, training, test, etc.) – is the PLP model applied to all?
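To tie this together per person, here is a rough sketch of the membership table we are trying to build, covering the cohorts already stored in the CDM cohort table, i.e., (a), (b), and the prevalence cohort. It continues from the `members` data frame produced by the query sketched above, and the cohort IDs are placeholders; items (d)–(i) would still need to come from the PheValuator/PLP outputs, which is the part we cannot currently map:

```r
# Sketch: one row per person, one logical column per input cohort.
# cohortIds are placeholders; 'members' comes from the cohort-table query above.
cohortIds <- c(xSpec = 1001, xSens = 1002, prevalence = 1003)

membership <- data.frame(subjectId = sort(unique(members$SUBJECT_ID)))
for (nm in names(cohortIds)) {
  idsInCohort <- members$SUBJECT_ID[members$COHORT_DEFINITION_ID == cohortIds[[nm]]]
  membership[[nm]] <- membership$subjectId %in% idsInCohort
}

head(membership)

# Cross-tabulate to see how the input cohorts overlap
with(membership, table(xSpec, xSens))
```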