In our R packages we often need to store large data objects that cannot always fit into memory. A primary example is the covariate data constructed using the FeatureExtraction package, where for every subject in our study we construct up to tens of thousands of covariates. During further analysis we often want to subset such a data object in various ways, before finally using it, for example, in a Cyclops regression.
Up to now we have relied on the ff package, which uses memory mapping to keep parts of the data object on disk while (to some extent) functioning as a standard data frame. However, support for the ff package stopped a long time ago, and we have had to use an increasing number of workarounds to keep it working. Performance on macOS dropped dramatically a few years ago due to new memory security models in that OS, with no workaround available. Recently, the bit package, an integral part of ff, was officially designated as 'orphaned'.
So I think it is time to look at alternatives. The requirements are: the ability to store objects larger than fit in memory, and quick performance not only when storing data, but also when retrieving, filtering, and merging with other large objects.
There’s the bigmemory package, but it seems suited only for numeric matrices, and its memory-mapping abilities appear limited to Unix. It’s also not clear it supports efficient filtering and merging.
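To illustrate the restriction I mean, here is a minimal sketch (hypothetical code, based on my reading of the bigmemory API; the file names are made up): a file-backed big.matrix holds a single numeric type, so string and date columns like those in my test data below would have to be encoded or dropped.

library(bigmemory)

# A file-backed matrix of doubles; every column shares this one type,
# so character columns cannot be stored directly:
bm <- filebacked.big.matrix(nrow = 1e6, ncol = 2, type = "double",
                            backingfile = "test.bin",
                            backingpath = "c:/temp",
                            descriptorfile = "test.desc")
bm[, 1] <- as.integer(runif(1e6, 0, 1e8))
bm[, 2] <- runif(1e6)

# Filtering: mwhich() returns the indices of matching rows:
rows <- mwhich(bm, cols = 1, vals = 12345, comps = "eq")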
I briefly contemplated writing my own package using the memory-mapping features in Java to make it cross-platform, but that seems like a ridiculous amount of work.
So I’ve started thinking about RSQLite. This is a full relational database platform that runs inside R and uses the file system for storage. I’ve done some tests below, showing that RSQLite uses more disk space and takes a bit longer to store data than ff, but has a smaller memory footprint and quicker data lookup.
What do people think? Would RSQLite be a good replacement? Here is the code I used for my tests:
# Simulate 1 million rows with an integer, a real, a string, and a date column:
nRows <- 1e6
data <- data.frame(anInteger = as.integer(runif(nRows, 0, 1e8)),
                   aReal = runif(nRows),
                   aString = sapply(1:nRows, function(x) paste(sample(c(0:9, letters, LETTERS), 10, replace = TRUE), collapse = "")),
                   aDate = as.Date(runif(nRows, 0, 1000), origin = "2000-01-01"))
library(RSQLite)
library(ffbase)
# Write the data to a SQLite database on disk, and time it:
sqliteDb <- dbConnect(RSQLite::SQLite(), "c:/temp/test.sqlite")
system.time(
  dbWriteTable(sqliteDb, "data", data)
)
# user  system elapsed
# 0.70    0.21    0.90

# Look up the rows matching a single integer value:
system.time(
  result <- dbGetQuery(sqliteDb, sprintf("SELECT * FROM data WHERE anInteger = %s;", data$anInteger[100]))
)
# user  system elapsed
# 0.03    0.04    0.08
# For comparison, convert the data to an ffdf object, and time it:
system.time(
  ffdf <- ff::as.ffdf(data)
)
# user  system elapsed
# 0.08    0.03    0.11

# The same lookup using ff:
system.time(
  result <- ffdf[ffdf$anInteger == data$anInteger[100], ]
)
# user  system elapsed
# 0.11    0.05    0.17
# In-memory footprint (the SQLite connection is just a handle; the data stays on disk):
object.size(sqliteDb)
# 1912 bytes
object.size(ffdf)
# 72016472 bytes

# On-disk footprint of each representation:
dbDisconnect(sqliteDb)
file.info("c:/temp/test.sqlite")$size
# [1] 41410560

save.ffdf(ffdf, dir = "c:/temp/testFfdf")
files <- list.files("c:/temp/testFfdf", full.names = TRUE)
sum(file.info(files)$size)
# [1] 2.4e+07
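One thing my test doesn’t cover is merging, and the lookup above is a full table scan. As a sketch of how both might look in RSQLite (hypothetical code using only standard DBI calls; the index name and the lookup table are made up for illustration):

# Reconnect to the database written above:
sqliteDb <- dbConnect(RSQLite::SQLite(), "c:/temp/test.sqlite")

# An index should let SQLite avoid a full table scan on subsequent lookups:
dbExecute(sqliteDb, "CREATE INDEX idx_an_integer ON data (anInteger);")

# Merging with another large object could be done as a SQL join,
# after writing the second object as a table in the same database:
lookup <- data.frame(anInteger = data$anInteger[1:100],
                     category = sample(c("a", "b"), 100, replace = TRUE))
dbWriteTable(sqliteDb, "lookup", lookup)
merged <- dbGetQuery(sqliteDb,
                     "SELECT data.*, lookup.category
                      FROM data
                      INNER JOIN lookup ON data.anInteger = lookup.anInteger;")
dbDisconnect(sqliteDb)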