Great news, everyone! I’ve created a wrapper around RSQLite that I call Andromeda (AsynchroNous Disk-based Representation of MassivE DAta), and my tests show that it can indeed handle large objects without hitting memory limits. It is not as fast as ff was (two things are currently slow: saving Andromeda objects as compressed files, and tidying covariates), but I think there are still options to explore to make it faster. Using Andromeda also leads to much cleaner, and therefore more reliable, code than using ff.
Unfortunately, switching from ff to Andromeda will require rewriting pretty much our entire Methods Library, and I could use some help.
@jreps, @Rijnbeek: could you change the Patient-Level Prediction package?
@msuchard: could you change Cyclops?
I’ll take on CohortMethod and CohortDiagnostics. We can do other packages like SelfControlledCaseSeries at a later point in time.
I have already created Andromeda versions of DatabaseConnector (andromeda branch) and FeatureExtraction (andromeda branch). I recommend we create ‘andromeda’ branches for all our packages, so we can switch them all at the same time. I do not want to postpone the switch for too long, because the andromeda and develop branches are already starting to diverge, for example in DatabaseConnector (gotta keep adding code for BigQuery).
Here’s how to use Andromeda:
- I’ve added functions like sqlQueryToAndromeda to DatabaseConnector to download directly into an Andromeda environment. You can see an example of how I use that in FeatureExtraction here.
- There’s a vignette on Andromeda.
- FeatureExtraction now creates CovariateData objects that inherit from Andromeda, as you can see here. I suggest we use this same mechanism in PLP and CohortMethod: I intend to make the CohortMethodData object inherit from CovariateData. We could do the same for PlpData.
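
For anyone who wants a quick feel for it before reading the vignette, here is a minimal sketch of the basic workflow. It assumes the Andromeda and dplyr packages are installed and uses R's built-in `cars` data set as a stand-in for real data; the table name and the filter are purely illustrative, and the vignette remains the authoritative reference for the API.

```r
# Minimal sketch, assuming the Andromeda and dplyr packages are installed.
# The `cars` table and the speed filter are illustrative, not from any OHDSI package.
library(Andromeda)
library(dplyr)

# Create an Andromeda object holding one table; the data is stored on disk.
andr <- andromeda(cars = cars)

# Tables are accessed with `$` and queried lazily with dplyr verbs;
# collect() pulls only the filtered rows into memory.
fast <- andr$cars %>%
  filter(speed > 20) %>%
  collect()

# Save to a compressed file and load it back. By default saveAndromeda()
# closes the object it saves, so `andr` should not be used afterwards.
fileName <- tempfile(fileext = ".zip")
saveAndromeda(andr, fileName)
andr2 <- loadAndromeda(fileName)
tableNames <- names(andr2)

# Close the object to release the underlying temporary file.
close(andr2)
unlink(fileName)
```

The key habit to carry over from ff is to filter and aggregate before calling `collect()`, so that only the result, not the whole table, ends up in memory.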