Replacing ff with RSQLite?

schuemie · January 28, 2020, 10:53am

In our R packages we often need to store large data objects, objects that can’t always fit into memory. A primary example is the covariate data constructed using the FeatureExtraction package, where for every subject in our study we construct up to tens of thousands of covariates. During the further analysis we often want to subset such a data object in various ways, before finally for example using it in a Cyclops regression.

Up to now we have relied on the ff package, which uses memory mapping to keep parts of the data object on disk, while (to some extends) functioning as a standard data frame. However, support on the ff package has stopped a long time ago, and we have had to use increasing numbers of workarounds to keep it working. Performance on MacOS has dropped dramatically a few years ago due to new memory security models on that OS, with no workaround. Recently, the bit package, an integral part of ff, has officially been designated as ‘orphaned’.

So I think it is time to look at alternatives. The requirements are: ability to store objects larger than can fit in memory, but also quick performance not only in storing data, but also in retrieval, filtering, and merging with other large objects.

There’s the bigmemory package, but this seems suited only for numeric matrices, and its memory-mapping abilities seem limited to Unix. It’s also not clear it supports efficient filtering and merging.

I briefly contemplated writing my own package using the memory-mapping features in Java to make it cross-platform, but that seems like a ridiculous amount of work.

So I’ve started thinking about RSQLite. This is a full relational database platform that runs in R, and uses the file system for storage. I’ve done some tests below, showing RSQLite uses more disk space and takes a bit longer to store data in, but has a smaller memory footprint and quicker data lookup than ff.

What do people think? Would RSQLite be a good replacement?

nRows <- 1e6

data <- data.frame(anInteger = as.integer(runif(nRows, 0, 1e8)),
                   aReal = runif(nRows),
                   aString = sapply(1:nRows, function(x) paste(sample(c(0:9, letters, LETTERS), 10, replace = TRUE), collapse = "")),
                   aDate = as.Date(runif(nRows, 0, 1000), origin = "2000-01-01"))

library(RSQLite)
library(ffbase)

sqliteDb <- dbConnect(RSQLite::SQLite(), "c:/temp/test.sqlite")
system.time(
  dbWriteTable(sqliteDb, "data", data)
)
# user  system elapsed 
# 0.70    0.21    0.90

system.time(
  result <- dbGetQuery(sqliteDb, sprintf("SELECT * FROM data WHERE anInteger = %s;", data$anInteger[100]))
)
# user  system elapsed 
# 0.03    0.04    0.08 

system.time(
  ffdf <- ff::as.ffdf(data)
)
# user  system elapsed 
# 0.08    0.03    0.11 

system.time(
  result <- ffdf[ffdf$anInteger == data$anInteger[100], ]
)
# user  system elapsed 
# 0.11    0.05    0.17

object.size(sqliteDb)
# 1912 bytes

object.size(ffdf)
# 72016472 bytes

dbDisconnect(sqliteDb)
file.info("c:/temp/test.sqlite")$size
# [1] 41410560

save.ffdf(ffdf, dir = "c:/temp/testFfdf")
files <- list.files("c:/temp/testFfdf", full.names = TRUE)
sum(file.info(files)$size)
# [1] 2.4e+07

Chris_Knoll · January 28, 2020, 4:54pm

I like the idea, my only concern is the parity of features between dealing with a dataframe vs sql. I think it would be amazing if you could have a ‘data frame backed by sqllite’ so that something like

ffdf[ffdf$anInteger == data$anInteger[100], ]

transforms to an underlying SQL expression against the datastore. I think we’d miss the ability to use librarys like dplyr that do some magic with dataframes that we’d have to re-invent in SQL. Note, I think sql is very powerful and commonly I have times where the sql syntax of GROP BY … COUNT() is much more natrual than what I can do out of dataframes.

But, maybe there’s a middle ground where you can ftech the necessary subset of data out of the SQL-interface and do additional in-memory operations on a standard R dataframe.

So, in conclusion, I think it’s a good idea, and anyone can dive into SQL or R-functions based on the context and the problem at hand.

SCYou · January 28, 2020, 11:10pm

system.time(
    dbWriteTable(sqliteDb, "data", data)
)
#user  system elapsed 
#0.919   0.057   0.996 

system.time(
    result <- dbGetQuery(sqliteDb, sprintf("SELECT * FROM data WHERE anInteger = %s;", data$anInteger[100]))
)
# user  system elapsed 
# 0.055   0.016   0.075 

system.time(
    ffdf <- ff::as.ffdf(data)
)
# user  system elapsed 
# 0.113   0.040   0.164 

system.time(
    result <- ffdf[ffdf$anInteger == data$anInteger[100], ]
)
# user  system elapsed 
# 0.143   0.025   0.179

This is the result from my MacBook. @schuemie
As you said, SQLite is quicker to find and extract data.

The resultant DB of SQLite seems much more solid than that of ff, and it can be used in Python by using sqlite3.

Since it was hard for me to handle ff files, it would be much easier to use SQLite for me.

Still, many libraries depending on FeatureExtraction including PatientLevelPrediction have actively used ff package. Would it be possible to upgrade them all together to move from ff to SQLite?

schuemie · January 29, 2020, 5:02am

Well, there’s dbplyr that works seamlessly with RSQLite, and allows you to write things like

flights_db %>% 
  group_by(dest) %>%
  summarise(delay = mean(dep_time))

An extreme option would be to write a wrapper around RSQLite that would look and feel like a data frame, like the ffbase package did for ff. But that would require implementing many methods, like you can see here in ffbase. That would not only be a lot of work, but there’s an inherent mismatch between a SQL database and a data frame, that is that a data frame can easily be decomposed into separate vectors, and a database table can’t. So taking the analogy to far might be hard.

I am thinking of something in between: a wrapper at least allows easy ‘loading’ and ‘saving’ of the SQLite databases using compression. Loading a database would extract it to a temp folder, where you could modify it as you would an in-memory object without affecting the saved object. There’s also the problem that SQLite doesn’t handle dates very well.

We’d need to rewrite all our packages that use ff to work with the new data structure, no matter what we choose. We have a lot of ff-specific code like this that wouldn’t translate to any solution we choose.

‘Luckily’, because of all the problems we’ve had with ff in the past, we’ve only used it when absolutely necessary, which has primarily been to store covariate data. So the damage would be limited.

Vojtech_Huser · January 30, 2020, 1:46am

The dates problem seems like a significant issue.

schuemie · January 30, 2020, 4:16am

Not really. You just need to convert dates to integers and back. I already hacked this into DatabaseConnector so it appears SQLite does support dates, in this case specifically so we could use it for Eunomia. The catch in this case is your date field name should end with ‘_date’, but you could handle it in other ways too.

Vojtech_Huser · January 31, 2020, 2:17pm

Well, there is a dying package and the list of replacement has only one candidate on the list (that seems to satisfy all conditions). You are probably the best person to know possible candidates and the complexities needed. The decision seems clear that we go with the only candidate. Long live RSQLite !

spfohl · January 31, 2020, 5:47pm

How about something built on top of Apache Arrow (https://arrow.apache.org/)? It’s a cross-platform standard that has a lot of buy in from the community. It may not be straightforward to replicate the functionality that ff currently provides though.

Gowtham_Rao · February 3, 2020, 12:52am

dbplyr supports multiple rdbms
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html

dplyr generates sql, and translate it across sql dialects. Does using dplyr, allow us to take a uniform approach across data frame and rdms? i.e. not have to write SQL at all?

schuemie · February 3, 2020, 6:09am

This idea has been floating around for a few years. I even got it to work at one point, but then the internals of dbplyr changed and it broke. It is not a high priority for me (for example replacing ff is a much higher priority, because soon we may find none of our tools work anymore).

Also note that not all we can do in SQL + SqlRender can be done in dbplyr.

SCYou · February 3, 2020, 6:28am

Though @Gowtham_Rao 's idea is so fabulous, I don’t think the integration of dbplyr and OHDSI SqlRender/ DBConnector would be stable, either. We cannot control the development of dbplyr. We can use both

Gowtham_Rao · February 3, 2020, 9:31am

If we opt to use RSQlite in place of FF - would it then be reasonable for us to make it a convention that we use dbplyr/dplyr for create/read/update/delete (CRUD) tasks on data in RSQlite?

i.e. for example: we construct covariate data using dbplyr/dplyr.

Why is development using dbplyr/dplyr and not SQL important? Our data is only going to get bigger. And, maybe, the large objects that we are trying solve for in RSQlite, would be bigger than what RSQlite could support. So, if we decide to switch - then if our development was in dbplyr/dplyr - we just have to switch the end point from RSQlite to something else that is supported by dbplyr/dplyr (maybe another dedicated rdbms appliance). Plus, this may reduce the need to move data over network.

Follow-up question: if we do opt to use SQL to do CRUD with RSQLite - should we use OHDSISql with translate using SQLRender?

schuemie · February 3, 2020, 1:33pm

Thanks for pointing to arrow! I hadn’t heard of it before. It does seem this technology is a bit immature (see test below), and I guess it doesn’t solve our problem of storing things that are too big for memory?

system.time(
  arrowTable <- arrow::Table$create(data)
)
# user  system elapsed 
# 0.19    0.02    0.20 
# Nice!

system.time(
  result <- arrowTable[arrowTable$anInteger == data$anInteger[100], ]
)
# Crashes R

object.size(arrowTable)
# 416 bytes
# Not sure how to measure the memory footprint

Vojtech_Huser · February 3, 2020, 6:42pm

I am also a big fan of dbplyr!

spfohl · February 6, 2020, 5:32pm

That’s a shame that the R version doesn’t appear as mature. In python, arrow allows you to work nicely with the Apache parquet file format, which allows for partitioned datasets on disk, and has a high rate of compression, fast IO, and can represent out of memory datasets with tools like Dask or Spark. I have a prototype pipeline that goes FeatureExtraction -> ffdf -> csv -> parquet and the parquet files are roughly ~30% of the size of the respective csv or ff files.

Regardless, it looks like there isn’t really an Arrow/parquet solution that plays nicely with the current R ecosystem.

Adam_Black · February 17, 2020, 3:33pm

Hi everyone,

I’m fairly new to OHDSI and am just getting started using our CDM at Maine Medical Center. I have been using dbplyr for several years and absolutely love it. The way it provides a readable language that abstracts away differences between in-memory dataframes and SQL tables is pretty stellar. Also the developers at RStudio seem to do really high quality work so building on top of their work makes sense. That said I don’t think I would use dbplyr for CRUD operations. It seems best suited for Read operations and not for Create, Update, Delete operations.

I think RStudio is working with the Apache Arrow project but it sounds like that might not meet the need in the short term.

On (naive) question I have about this discussion is why we would take data out of a CDM database and put it into another database. Could we not just use temp tables in the CDM database to store the large result tables rather than loading them into SQLite?

schuemie · February 17, 2020, 4:17pm

Because we need to do a lot of post-processing of the data, once we have grabbed it from the (remote) database. For example, before fitting an outcome model we may want to get rid of data of non-informative strata, data that we did need just a few minutes earlier when fitting a propensity model. Yes, we could perform all these operations on the server, but we would get a massive increase in database IO, which currently isn’t our bottleneck, but would surely become so immediately.

Other reasons I like to do analytic steps locally:

I like one data pull to get all the data from the remote server, and then do all processing locally. This vastly increases stability (I think).
We officially support 4 or 5 different database platforms, unofficially a dozen or so, with very different SQL dialects. We can make our life a lot easier if most complex operations are done locally, in a uniform environment, rather than in the remote database jungle.

Adam_Black · February 17, 2020, 4:27pm

Makes sense. How large are the analytic datasets in terms of number of rows and/or Gigabytes? If everyone had >=32GB of RAM available on a local analytic machine could we just use in-memory dataframes and the data.table package?

schuemie · February 18, 2020, 7:25am

Usually between 1 and 10GB, although I’ve had objects that were around 32GB. I would use multi-threading, so would access 3 or 4 such objects at a time. But that is an extreme case (and I have 168GB of RAM, so still wouldn’t have needed ff).

Adam_Black · February 18, 2020, 2:53pm

I can’t work with large datasets locally since my work laptop is pretty under resourced and locked down. We have a remote R server that will to the analytical work and has a ton of RAM. One benefit of using dplyr is that you could support multiple backend technologies including RSQLite (dbplyr), Spark (sparklyr), data.table (dtplyr) with minimal effort. This is the direction RStudio is going and is conceptually similar to generic OHDSI SQL. Then each institution could use the best option for the infrastructure they have, maybe using RSQLite as a default backend.

Is there a work group around OHDSI R package development that I can join?