
Replacing ff with RSQLite?

Usually between 1 and 10GB, although I’ve had objects that were around 32GB. I would use multi-threading, so would access 3 or 4 such objects at a time. But that is an extreme case (and I have 168GB of RAM, so still wouldn’t have needed ff).

1 Like

I can’t work with large datasets locally since my work laptop is pretty under-resourced and locked down. We have a remote R server that will do the analytical work and has a ton of RAM. One benefit of using dplyr is that you could support multiple backend technologies, including RSQLite (dbplyr), Spark (sparklyr), and data.table (dtplyr), with minimal effort. This is the direction RStudio is going and is conceptually similar to generic OHDSI SQL. Then each institution could use the best option for the infrastructure they have, maybe using RSQLite as a default backend.
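
To make the backend-swapping idea concrete, here is a minimal sketch (not OHDSI code; the table and column names are made up) of a dplyr pipeline running against an RSQLite-backed table via dbplyr. Pointing the same verbs at Spark or data.table would only change where the covariates table comes from:

      # Minimal sketch: the same dplyr verbs against an RSQLite backend via dbplyr.
      library(dplyr)
      library(dbplyr)

      con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
      DBI::dbWriteTable(con, "covariates",
                        data.frame(rowId = c(1, 1, 2, 3), covariateValue = c(1, 2, 1, 5)))

      # Only the source of 'covariates' would change for a Spark or data.table backend:
      tbl(con, "covariates") %>%
        group_by(rowId) %>%
        summarise(total = sum(covariateValue, na.rm = TRUE)) %>%
        collect()

      DBI::dbDisconnect(con)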

Is there a work group around OHDSI R package development that I can join?

2 Likes

Is there a work group around OHDSI R package development

Currently I don’t think there is any such group.

Creating an R user work group is a great idea (in my view). It would not be about any specific package, but about the universe of them. For example, OhdsiRTools could be further developed. Having more R coders involved with development would be a good thing…

1 Like

There is no formal R developer group (as @Adam_Black asked), and there is also no R user group (as @Vojtech_Huser suggested). In our current organization of open source software development there is a lead for each R package. Mechanisms for others to contribute are the usual: creating issues, creating pull requests, etc., and of course participating in forum discussions such as these.

I prefer to keep it ‘simple’, with a single backend. If we have something that works as ff was intended to work, it will work for everyone. If we have to write our code considering all the different implementations we would be supporting, we would just make our lives a lot more difficult.

I like this idea. I have over a decade of enterprise experience in Java/C++/C#/Python, but I know just enough about R to shoot myself in the foot. I would love to lurk in said user group to learn the design principles of the tools.

This makes sense. Sounds like RSQLite might be the way to go then.

I would like to contribute but would need to start with something small. I still have a ways to go to understand the OHDSI R ecosystem. I also just have some basic questions like the one I posted here. Is it possible to use odbc instead of jdbc with the OHDSI tools?

I just tried to release a new version of DatabaseConnector to CRAN, and it was refused because it depends on an orphaned package (bit).

So I guess we’ll have to speed this up.

Since FeatureExtraction is one of the most fundamental packages in OHDSI, if we need to revise it, we should do so ASAP, even though we’re distracted by COVID-19 right now.

1 Like

Well, we can work without CRAN, it is just annoying.

@schuemie As you know, revision of FeatureExtraction is closely tied to reproducibility. And I think we need to move away from ff. If so, we should do it now, before more OHDSI packages are built on ff, such as a clinical characterization package for COVID-19.

One thing I thought of while reading this thread: when we move away from ffdf, we need to make sure we still have a way to load old data that was saved in ffdf files, otherwise we won’t be able to go back easily and view old study data.

1 Like

@jreps Indeed. I don’t think we can maintain reproducibility across the revision of the FeatureExtraction package, and that’s why we need to start revising it now…

We always need to make sure to have the exact same versions of the packages to reproduce a result. This is no different from all other changes we’ve made to our software over the years.

I agree we need to work on this sooner rather than later, but it is not a must-have for any evidence generation we do in the short term.

Great news everyone! I’ve created a wrapper around RSQLite I call Andromeda (AsynchroNous Disk-based Representation of MassivE DAta), and have tested it, showing that it indeed is able to handle large objects without breaking memory limits. It is not as fast as ff was (two things are currently slow: saving Andromeda objects as compressed files, and tidying covariates), but I think there are still options to explore to make it faster. And using Andromeda leads to much cleaner code (and therefore more reliable code) than when using ff.

Unfortunately, switching from ff to Andromeda will require rewriting pretty much our entire Methods Library, and I can use some help.

@jreps, @Rijnbeek: could you change the Patient-Level Prediction package?
@msuchard: could you change Cyclops?
I’ll take on CohortMethod and CohortDiagnostics. We can do other packages like SelfControlledCaseSeries at a later point in time.

I have already created Andromeda versions of DatabaseConnector (andromeda branch) and FeatureExtraction (andromeda branch). I recommend we create ‘andromeda’ branches for all our packages, so we can switch all at the same time. I do not want to postpone the switch for too long, because already we see the andromeda and develop branches start to diverge, for example for DatabaseConnector (gotta keep adding code for BigQuery :wink: ).

Here’s how to use Andromeda:

  • I’ve added functions like querySqlToAndromeda to DatabaseConnector to download directly into an Andromeda environment. You can see an example of how I use that in FeatureExtraction here.
  • There’s a vignette on Andromeda.
  • FeatureExtraction now creates CovariateData objects that inherit from Andromeda, as you can see here. I suggest we use this same mechanism in PLP and CohortMethod: I intend to make the CohortMethodData object inherit from CovariateData. We could do the same for PlpData.
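
(For anyone who hasn’t opened the vignette yet, here is a rough sketch of what basic Andromeda usage looks like, with example table and column names only:)

      # Rough sketch of basic Andromeda usage; table and column names are examples.
      library(Andromeda)
      library(dplyr)

      andr <- andromeda()

      # Assigning a data frame creates a table that lives on disk, not in RAM:
      andr$covariates <- data.frame(rowId = 1:5, covariateId = 1001, covariateValue = 1)

      # dplyr verbs run against the on-disk table; collect() pulls the result into memory:
      andr$covariates %>%
        filter(rowId > 2) %>%
        collect()

      # Save to a compressed file and reopen later:
      saveAndromeda(andr, "covariates.zip")
      andr2 <- loadAndromeda("covariates.zip")
      close(andr2)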
4 Likes

(Note: the saving speed problem was solved, and tidyCovariates now runs 25% faster in the latest release)

1 Like

@schuemie thank you for doing this and especially using dplyr syntax!

The intention of this package is to have data available to R functions locally, instead of having to do several I/Os over the network. The ‘local’ here would be a SQLite copy of data from a remote RDBMS. It is ideal for designs where the RDBMS compute environment is different from R’s compute environment, and the analysis requires I/O between the RDBMS and R environments.

Out of curiosity: what about a compute environment that allows us to run both SQL and R on the same machine (i.e., no I/O over the network is required)? I think the term used for this concept is ‘in-database processing’.

@schuemie – amazing work! Yes, I’ll get to work on switching Cyclops away from ff.

1 Like

This is great - I’m currently updating PLP - might take some time as there are lots of edits, but totally worth it :smile:

1 Like

Hey, Martijn. Thanks for your hard work on this.

My only criticism (from the API design perspective) is that I found it odd that there is a DatabaseConnector dependency on Andromeda, where the Andromeda abstraction seems to be more akin to an abstraction of the data frame vs. a database. For example, this code:

      covariateData <- Andromeda::andromeda()
      DatabaseConnector::querySqlToAndromeda(connection = connection, 
                                             sql = sql, 
                                             andromeda = covariateData, 
                                             andromedaTableName = "covariatesContinuous",
                                             snakeCaseToCamelCase = TRUE)

Is this not simply the same as doing the following without the dependency?

      covariateData <- Andromeda::andromeda()
      covariateData$covariatesContinuous <- DatabaseConnector::querySql(connection = connection, 
                                             sql = sql, 
                                             snakeCaseToCamelCase = TRUE)

Hi Chris. I wish it worked like that, but there’s an important difference between your code and mine: in your code, DatabaseConnector::querySql first loads the data into memory in a data frame, and covariateData$covariatesContinuous <- then converts that data frame to an Andromeda (i.e. SQLite) table. If the data is too large to fit in memory you get an error (and likely crash R).

Instead, DatabaseConnector::querySqlToAndromeda in the background loads the data in batches, and adds it to the Andromeda table in batches, so no running out of memory.
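
For the curious, here is a rough sketch of the batching idea only, not DatabaseConnector’s actual internals; fetchInBatches and batchSize are made-up names:

      # Sketch of the batching idea: fetch the result set in chunks and append each
      # chunk to the Andromeda (SQLite) table, so the full result never sits in memory.
      fetchInBatches <- function(connection, sql, andromeda, tableName, batchSize = 1e5) {
        resultSet <- DBI::dbSendQuery(connection, sql)
        on.exit(DBI::dbClearResult(resultSet))
        while (!DBI::dbHasCompleted(resultSet)) {
          batch <- DBI::dbFetch(resultSet, n = batchSize)
          if (tableName %in% names(andromeda)) {
            Andromeda::appendToTable(andromeda[[tableName]], batch)
          } else {
            andromeda[[tableName]] <- batch
          }
        }
        invisible(andromeda)
      }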
