
Replacing ff with RSQLite?

Because we need to do a lot of post-processing of the data once we have grabbed it from the (remote) database. For example, before fitting an outcome model we may want to get rid of data from non-informative strata, data that we needed just a few minutes earlier when fitting a propensity model. Yes, we could perform all these operations on the server, but that would massively increase database I/O, which currently isn’t our bottleneck but would surely become so immediately.

Other reasons I like to do analytic steps locally:

  1. I like one data pull to get all the data from the remote server, and then do all processing locally. This vastly increases stability (I think).
  2. We officially support 4 or 5 different database platforms, unofficially a dozen or so, with very different SQL dialects. We can make our life a lot easier if most complex operations are done locally, in a uniform environment, rather than in the remote database jungle.

Makes sense. How large are the analytic datasets in terms of number of rows and/or Gigabytes? If everyone had >=32GB of RAM available on a local analytic machine could we just use in-memory dataframes and the data.table package?

Usually between 1 and 10GB, although I’ve had objects that were around 32GB. I would use multi-threading, so would access 3 or 4 such objects at a time. But that is an extreme case (and I have 168GB of RAM, so still wouldn’t have needed ff).


I can’t work with large datasets locally since my work laptop is pretty under-resourced and locked down. We have a remote R server that will do the analytical work and has a ton of RAM. One benefit of using dplyr is that you could support multiple backend technologies, including RSQLite (dbplyr), Spark (sparklyr), and data.table (dtplyr), with minimal effort. This is the direction RStudio is going, and it is conceptually similar to generic OHDSI SQL. Then each institution could use the best option for the infrastructure they have, maybe using RSQLite as a default backend.
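
To illustrate, here is a minimal sketch of the same pipeline running on two of those backends; the table and column names are made up for illustration:

```r
library(dplyr)

covariates <- data.frame(rowId = 1:4, covariateValue = c(1, 0, 1, 1))

# Backend 1: RSQLite via dbplyr -- the dplyr verbs compile to SQL
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "covariates", covariates)
tbl(con, "covariates") %>%
  filter(covariateValue == 1) %>%
  summarize(n = n()) %>%
  collect()

# Backend 2: data.table via dtplyr -- identical verbs, different engine
library(dtplyr)
lazy_dt(covariates) %>%
  filter(covariateValue == 1) %>%
  summarize(n = n()) %>%
  as_tibble()

dbDisconnect(con)
```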

Is there a work group around OHDSI R package development that I can join?


Is there a work group around OHDSI R package development

Currently I don’t think there is any such group.

Creating an R user work group is a great idea (in my view). It would not be about any specific package, but about the universe of them. E.g., OhdsiRTools could be further developed. In my view, having more R coders involved with development might be good…


There is no formal R developer group (as @Adam_Black asked), and there is also no R user group (as @Vojtech_Huser suggested). In our current organization of open source software development there is a lead for each R package. Mechanisms for others to contribute are the usual: creating issues, creating pull requests, etc., and of course participating in forum discussions such as these.

I prefer to keep it ‘simple’, with a single backend. If we have something that works as ff was intended to work, it will work for everyone. If we have to write our code considering all the different implementations we would be supporting, we would just make our lives a lot more difficult.

I like this idea. I have over a decade of enterprise experience in Java/C++/C#/Python, but I know just enough about R to shoot myself in the foot. I would love to lurk in said user group to learn the design principles of the tools.

This makes sense. Sounds like RSQLite might be the way to go then.

I would like to contribute but would need to start with something small. I still have a ways to go to understand the OHDSI R ecosystem. I also just have some basic questions like the one I posted here. Is it possible to use odbc instead of jdbc with the OHDSI tools?
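
For context, a hypothetical sketch of the two connection styles being contrasted; the server, credentials, and DSN values are made up:

```r
# JDBC route, as used by the OHDSI DatabaseConnector package
library(DatabaseConnector)
connectionDetails <- createConnectionDetails(
  dbms = "postgresql",
  server = "myserver/mydb",  # made-up server
  user = "joe",
  password = "secret"
)
conn <- connect(connectionDetails)

# ODBC route via the DBI-compliant odbc package (not what the
# OHDSI tools currently use under the hood)
library(DBI)
odbcConn <- dbConnect(odbc::odbc(), dsn = "MyDsn")  # made-up DSN
```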

I just tried to release a new version of DatabaseConnector to CRAN, and it was refused because it depends on an orphaned package (bit).

So I guess we’ll have to speed this up.

Since the FeatureExtraction package is one of the most fundamental packages in OHDSI, if we need to revise it, we should do so ASAP, even though we’re distracted by COVID-19 right now.


Well, we can work without CRAN; it is just annoying.

@schuemie As you know, the revision of FeatureExtraction is highly related to reproducibility. And I think we need to escape from ff. If so, we should do it now, before more packages are developed based on ff in OHDSI, such as a clinical characterization package for COVID-19.

One thing I just thought of while reading this thread: when we change from ffdf, we need to make sure we still have a way to load old data that was saved in ffdf files, otherwise we won’t be able to go back easily to view old study data.
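
For instance, a hypothetical migration path to the proposed RSQLite backend, assuming the old objects were saved with ffbase::save.ffdf and are small enough to convert in memory; the directory and table names are made up:

```r
library(ffbase)    # provides load.ffdf() for the old ff-based files
library(DBI)

# load.ffdf() restores the saved ffdf objects (here assumed to
# include one called 'covariates') into the current environment
load.ffdf(dir = "oldStudyData")

# Convert to a plain data frame and copy it into the new backend;
# very large objects would need to be converted in chunks
con <- dbConnect(RSQLite::SQLite(), "oldStudyData.sqlite")
dbWriteTable(con, "covariates", as.data.frame(covariates))
dbDisconnect(con)
```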


@jreps Indeed. I don’t think we can maintain reproducibility between the versions before and after the revision of the FeatureExtraction package, and that’s why we need to start revising it now…

We always need to make sure to have the exact same versions of the packages to reproduce a result. This is no different from all other changes we’ve made to our software over the years.

I agree we need to work on this sooner rather than later, but it is not a must-have for any evidence generation we do in the short term.

Great news everyone! I’ve created a wrapper around RSQLite I call Andromeda (AsynchroNous Disk-based Representation of MassivE DAta), and have tested it, showing that it indeed is able to handle large objects without breaking memory limits. It is not as fast as ff was (two things are currently slow: saving Andromeda objects as compressed files, and tidying covariates), but I think there are still options to explore to make it faster. And using Andromeda leads to much cleaner code (and therefore more reliable code) than when using ff.

Unfortunately, switching from ff to Andromeda will require rewriting pretty much our entire Methods Library, and I can use some help.

@jreps, @Rijnbeek: could you change the Patient-Level Prediction package?
@msuchard: could you change Cyclops?
I’ll take on CohortMethod and CohortDiagnostics. We can do other packages like SelfControlledCaseSeries at a later point in time.

I have already created Andromeda versions of DatabaseConnector (andromeda branch) and FeatureExtraction (andromeda branch). I recommend we create ‘andromeda’ branches for all our packages, so we can switch all at the same time. I do not want to postpone the switch for too long, because we already see the andromeda and develop branches starting to diverge, for example for DatabaseConnector (gotta keep adding code for BigQuery :wink: ).

Here’s how to use Andromeda (a short sketch follows this list):

  • I’ve added functions like sqlQueryToAndromeda to DatabaseConnector to download directly into an Andromeda environment. You can see an example of how I use that in FeatureExtraction here.
  • There’s a vignette on Andromeda.
  • FeatureExtraction now creates CovariateData objects that inherit from Andromeda, as you can see here. I suggest we use this same mechanism in PLP and CohortMethod: I intend to make the CohortMethodData object inherit from CovariateData. We could do the same for PlpData.
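
Putting those pieces together, here is a minimal sketch of the intended workflow, based on the Andromeda vignette; the table name and query are illustrative:

```r
library(Andromeda)
library(dplyr)

# An Andromeda object keeps its tables in a temporary SQLite
# database on disk, so large data does not have to fit in RAM
andr <- andromeda()
andr$cars <- cars   # assigning a data frame creates an on-disk table

# dplyr verbs are translated to SQL and executed inside SQLite;
# collect() pulls only the (small) result into memory
andr$cars %>%
  filter(speed > 10) %>%
  summarize(meanDist = mean(dist, na.rm = TRUE)) %>%
  collect()

# Save to a compressed file for sharing, and reload later
saveAndromeda(andr, fileName = "cars.zip")
andr2 <- loadAndromeda("cars.zip")
close(andr2)
```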

(Note: the saving speed problem was solved, and tidyCovariates now runs 25% faster in the latest release)


@schuemie thank you for doing this and especially using dplyr syntax!

The intention of this package is to have data available to R functions locally, instead of having to do several I/Os over the network. The ‘local’ here would be a SQLite copy of data from a remote RDBMS. It is ideal for designs where the RDBMS compute environment is different from R’s compute environment, and the analysis requires I/O between the RDBMS and R environments.

Out of curiosity: what if there were a compute environment that allowed us to run both SQL and R on the same machine (i.e., no I/O over the network required)? I think the term used for this concept is ‘in-database processing’.

@schuemie – amazing work! Yes, I’ll get to work on switching Cyclops away from ff.


This is great. I’m currently updating PLP; it might take some time as there are lots of edits, but it’s totally worth it :smile:
