The FeatureExtraction package is used extensively in the Methods Library as well as in the PatientLevelPrediction package, so it is time to think about how we can make it more effective. We have a growing wish list:
- Would it be possible to do without the removal of redundant covariates?
- Could FeatureExtraction support characterization of cohorts? And, for efficiency, could per-person statistics be skipped when they are not needed?
- A more flexible approach to defining the time window when covariates are captured.
- More flexibility in adding and maintaining features.
Leaving the first wish aside for now, I’ve come up with a new architecture that should address the remaining issues.
At the core of the proposed architecture is a set of SQL files like this, this, and this. These files are a bit complex, but the advantage is that the same code can generate both per-person statistics and aggregated statistics for the entire cohort, and both regular covariates (one time period per analysis ID) and temporal covariates (multiple time periods per analysis ID). Even for the regular covariates, the time period is parameterized. Having all this functionality in one place makes it easier to guarantee consistency.
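To make the "one SQL template, several outputs" idea concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is not the actual FeatureExtraction SQL: the table and column names (`cohort`, `condition_occurrence`, `index_day`, `start_day`), the `render_sql` helper, and the covariate ID formula are all assumptions made for illustration, and days are plain integers to keep the example small.

```python
import sqlite3

# Hypothetical template; table and column names are illustrative only.
TEMPLATE = """
SELECT {select_list}
FROM cohort c
JOIN condition_occurrence co
  ON co.person_id = c.subject_id
 AND co.start_day BETWEEN c.index_day + {start_day} AND c.index_day + {end_day}
{group_by}
ORDER BY 1, 2
"""


def render_sql(aggregated, start_day, end_day, analysis_id):
    """Render the same template as a per-person or an aggregated query."""
    # Assumed covariate ID scheme: concept ID * 1000 + analysis ID.
    covariate_id = f"co.concept_id * 1000 + {int(analysis_id)}"
    if aggregated:
        select_list = (f"{covariate_id} AS covariate_id, "
                       "COUNT(DISTINCT c.subject_id) AS sum_value")
        group_by = f"GROUP BY {covariate_id}"
    else:
        select_list = (f"DISTINCT c.subject_id AS row_id, "
                       f"{covariate_id} AS covariate_id")
        group_by = ""
    return TEMPLATE.format(select_list=select_list, group_by=group_by,
                           start_day=int(start_day), end_day=int(end_day))


def build_demo_db():
    """Create a tiny in-memory database with made-up data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cohort (subject_id INTEGER, index_day INTEGER)")
    conn.execute("CREATE TABLE condition_occurrence "
                 "(person_id INTEGER, concept_id INTEGER, start_day INTEGER)")
    conn.executemany("INSERT INTO cohort VALUES (?, ?)",
                     [(1, 100), (2, 100), (3, 100)])
    conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?)",
                     [(1, 201, 90), (1, 201, 95), (2, 201, 50),
                      (2, 305, 99), (3, 305, -400)])
    return conn


def get_covariates(conn, aggregated, start_day, end_day, analysis_id=101):
    """Run the rendered query and return all rows as a list of tuples."""
    sql = render_sql(aggregated, start_day, end_day, analysis_id)
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(get_covariates(conn, aggregated=False, start_day=-30, end_day=0))
    print(get_covariates(conn, aggregated=True, start_day=-365, end_day=0))
```

Because the join and the time-window logic live in one template, the per-person and aggregated results cannot drift apart; only the select list and grouping differ.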
The new covariate settings objects will now just be a data frame with references to these SQL files, for example:
Such settings can be generated using the new `createCovariateSettings` function, which can be used just as the old one, or people can generate this data frame ‘manually’ for more flexibility.
The new `getDbCovariates` function now has arguments `aggregated` and `temporal` to switch between the type of covariate data to create.
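As a rough illustration of how a settings table plus the `aggregated` and `temporal` switches could drive execution, consider the Python sketch below. Everything in it is hypothetical: the `AnalysisSetting` fields, the SQL file name, and the `plan_queries` helper are invented for this example and do not describe the real settings data frame or the R function signatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisSetting:
    """One row of a hypothetical settings table."""
    analysis_id: int
    sql_file: str    # reference to a SQL file (name is made up)
    start_day: int   # window start relative to cohort start
    end_day: int     # window end relative to cohort start


def plan_queries(settings, aggregated=False, temporal=False, time_windows=None):
    """Return one (sql_file, parameters) pair per query that would be run.

    Regular covariates use the single window stored on each setting;
    temporal covariates use every window in `time_windows` instead.
    """
    if temporal and not time_windows:
        raise ValueError("temporal covariates need at least one time window")
    plans = []
    for setting in settings:
        if temporal:
            windows = time_windows
        else:
            windows = [(setting.start_day, setting.end_day)]
        for time_id, (start, end) in enumerate(windows, start=1):
            params = {
                "analysis_id": setting.analysis_id,
                "aggregated": aggregated,
                "start_day": start,
                "end_day": end,
            }
            if temporal:
                params["time_id"] = time_id
            plans.append((setting.sql_file, params))
    return plans


if __name__ == "__main__":
    settings = [AnalysisSetting(101, "ConditionOccurrence.sql", -365, 0)]
    for plan in plan_queries(settings, aggregated=True):
        print(plan)
    for plan in plan_queries(settings, temporal=True,
                             time_windows=[(-60, -31), (-30, 0)]):
        print(plan)
```

The point of the sketch is only that the settings stay declarative: the same rows can be reused for per-person, aggregated, regular, or temporal extraction without being redefined.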
Adding custom covariates is now easier because one can reference additional SQL files when calling FeatureExtraction. I would also leave it to users to ensure there are no conflicts in analysis IDs and covariate IDs (rather than having FeatureExtraction resolve conflicts automatically), thereby increasing reproducibility.
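If conflicts are no longer resolved automatically, a simple up-front check could fail loudly instead of silently renumbering. The Python sketch below shows one way this might look; the function name and the example IDs are assumptions for illustration, not part of the package.

```python
from collections import Counter


def check_analysis_id_conflicts(builtin_ids, custom_ids):
    """Raise ValueError if any analysis ID is used more than once.

    Failing early keeps IDs stable across runs, rather than having
    them reassigned behind the user's back.
    """
    counts = Counter(list(builtin_ids) + list(custom_ids))
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Conflicting analysis IDs: {duplicates}")


if __name__ == "__main__":
    check_analysis_id_conflicts([101, 102], [901])  # passes silently
    try:
        check_analysis_id_conflicts([101, 102], [102, 901])
    except ValueError as err:
        print(err)
```

The same check would apply to covariate IDs once they are computed.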
Looking forward to everyone’s thoughts on this, including @Patrick_Ryan, @Rijnbeek, @jennareps, and @Frank.