Replacing ff with RSQLite?

Vojtech_Huser · February 18, 2020, 5:24pm

Is there a work group around OHDSI R package development

Currently I don’t think there is any such group.

The idea of creating an R user work group is a great idea (in my view). It would not be about any specific package, but about the universe of them. E.g., OhdsiRTools could be further developed. In my view, having more R coders involved with development might be good…

schuemie · February 19, 2020, 5:39am

There is no formal R developer group (as @Adam_Black asked), and there is also no R user group (as @Vojtech_Huser suggested). In our current organization of open source software development there is a lead for each R package. Mechanisms for others to contribute are the usual: creating issues, creating pull requests, etc., and of course participating in forum discussions such as these.

I prefer to keep it ‘simple’, with a single backend. If we have something that works as ff was intended to work, it would work for everyone. If we have to write our code considering all the different implementations we would be supporting, we just made our life a lot more difficult.

Mark · February 19, 2020, 2:21pm

I like this idea. I have over a decade of enterprise experience in Java/C++/C#/Python, but I know just enough about R to shoot myself in the foot. I would love to lurk in said user group to learn the design principles of the tools.

Adam_Black · February 21, 2020, 1:32pm

This makes sense. Sounds like RSQLite might be the way to go then.

I would like to contribute but would need to start with something small. I still have a ways to go to understand the OHDSI R ecosystem. I also just have some basic questions like the one I posted here. Is it possible to use odbc instead of jdbc with the OHDSI tools?

schuemie · March 17, 2020, 11:44am

I just tried to release a new version of DatabaseConnector to CRAN, and it was refused because it depends on an orphaned package (bit).

So I guess we’ll have to speed this up.

SCYou · March 17, 2020, 12:35pm

Since FeatureExtraction package is the one of the most fundamental packages in OHDSI, if we need to revise, we should do ASAP, even though we’re distracted for the COVID-19 now.

schuemie · March 17, 2020, 1:16pm

Well, we can work without CRAN, it is just annoying.

SCYou · March 17, 2020, 1:24pm

@schuemie As you know, revision of FeatureExtraction is highly related with reproducibility. And, I think we need to escape from ff. If so, we should do it now, until more packages are developed based on ff in OHDSI, such as clinical characterization package for the covid-19.

jreps · March 17, 2020, 2:48pm

One thing I just thought while reading this thread - when we change from ffdf we need to make sure we still have a way to load old data that was saved in ffdf files otherwise we won’t be able to go back easily to view old study data.

SCYou · March 17, 2020, 3:00pm

@jreps Indeed. I don’t think we can maintain the reproducibility before and after the revision of FeatureExtraction package, and that’s why we need to start to revise it now…

schuemie · March 17, 2020, 3:48pm

We always need to make sure to have the exact same versions of the packages to reproduce a result. This is no different from all other changes we’ve made to our software over the years.

I agree we need to work on this sooner rather than later, but it is not a must-have for any evidence generation we do in the short term.

schuemie · April 21, 2020, 4:34am

Great news everyone! I’ve created a wrapper around RSQLite I call Andromeda (AsynchroNous Disk-based Representation of MassivE DAta), and have tested it, showing that it indeed is able to handle large objects without breaking memory limits. It is not as fast as ff was (two things are currently slow: saving Andromeda objects as compressed files, and tidying covariates), but I think there are still options to explore to make it faster. And using Andromeda leads to much cleaner code (and therefore more reliable code) than when using ff.

Unfortunately, switching from ff to Andromeda will require rewriting pretty much our entire Methods Library, and I can use some help.

@jreps, @Rijnbeek: could you change the Patient-Level Prediction package?
@msuchard: could you change Cyclops?
I’ll take on CohortMethod and CohortDiagnostics. We can do other packages like SelfControlledCaseSeries at a later point in time.

I have already created Andromeda versions of DatabaseConnector (andromeda branch) and FeatureExtraction (andromeda branch). I recommend we create ‘andromeda’ branches for all our packages, so we can switch all at the same time. I do not want to postpone the switch for too long, because already we see the andromeda and develop branches start to diverge, for example for DatabaseConnector (gotta keep adding code for BigQuery ).

Here’s how to use Andromeda:

I’ve added functions like sqlQueryToAndromeda to DatabaseConnector to download directly into an Andromeda environment. You can see an example of how I use that in FeatureExtraction here.
There’s a vignette on Andromeda.
FeatureExtraction now creates CovariateData objects that inherit from Andromeda, as you can see here. I suggest we use this same mechanism in PLP and CohortMethod: I intend to make the CohortMethodData object inherit from CovariateData. We could do the same for PlpData.

schuemie · April 21, 2020, 9:26am

(Note: the saving speed problem was solved, and tidyCovariates now runs 25% faster in the latest release)

Gowtham_Rao · April 21, 2020, 12:20pm

@schuemie thank you for doing this and especially using dplyr syntax!

The intention of this package is to have data available to R functions locally, instead of having to do several I/Os over the network. The local here would be sqlite copy of data from a remote rdbms. It is ideal for designs where the rdbms compute environment is different from R’s compute environment - and analysis requires I/O between the rdbms and R environments.

Out of curiosity - what about if there was a compute environment that allows us to run both SQL and R on the same machine (i.e. no I/Os over the network is required). I think the term used for this concept is ‘in-database processing’

msuchard · April 21, 2020, 1:19pm

@schuemie – amazing work! Yes, I’ll get to work on switching Cyclops away from ff.

jreps · April 21, 2020, 6:57pm

This is great - I’m currently updating PLP - might take some time as there are lots of edits, but totally worth it

Chris_Knoll · April 21, 2020, 7:47pm

Hey, Martijn. Thanks for your hard work on this.

Only criticism (from the API design perspective) is I found it odd that there was a DatabaseConnector depdency on Andromeda where the Andromeda abstraction seems to be more akin to an abstraction of the datatframe vs. a database. For example, this code:

      covariateData <- Andromeda::andromeda()
      DatabaseConnector::querySqlToAndromeda(connection = connection, 
                                             sql = sql, 
                                             andromeda = covariateData, 
                                             andromedaTableName = "covariatesContinuous",
                                             snakeCaseToCamelCase = TRUE)

Is this not simply the same as doing the following without the dependency?

      covariateData <- Andromeda::andromeda()
      covariateData$covariatesContinuous <- DatabaseConnector::querySql(connection = connection, 
                                             sql = sql, 
                                             snakeCaseToCamelCase = TRUE)

schuemie · April 22, 2020, 4:20am

Hi Chris. I wish it worked like that, but there’s an important difference between your code and mine: In your code, DatabaseConnector::querySql first loads the data in memory in a data frame, and covariateData$covariatesContinuous <- then converts that data frame to an Andromeda (i.e. SQLite) table. If the data is too large to fit in memory you get an error (and likely crash R).

Instead, DatabaseConnector::querySqlToAndromeda in the background loads the data in batches, and adds it to the Andromeda table in batches, so no running out of memory.

schuemie · April 30, 2020, 7:00am

After several years of radio silence I just received an (automated) e-mail from the ff developers. See text below. They’re de-orphaning bit, and have after 5 years finally fulfilled our request to allow zero-row ffdf objects.

@msuchard, @jreps I propose we stick to Andromeda? That way feature requests can be handled in a more reasonable time frame.

Dear package maintainers,

I hope you and your families do well in corona times.

Packages bit/bit64 and ff have been overhauled and moved from rforge to github/truecluster.

Please check whether your depending package needs to be adapted to copy with some interface changes.

I plan to upload to CRAN in June, please adapt your package until end of May.

Please let me know if you encounter any problems.

Regards and take care

Jens Oehlschlägel

Interface changes

=================

bit:

S3methods are no longer exported in NAMESPACE

bitwhich() has argument order changed

ff:

S3methods are no longer exported in NAMESPACE

Most important user visible changes

===================================

bit:

new class hierarchy for boolean types

including fully fledged type ‘bitwhich’ for skewed filters

dramatically faster methods for integer, e.g. bit_sort 2-10x faster than sort

see new vignettes bit_demo, bit_usage, bit_performance

bit64

improvements and bug fixes

ff

support for 0 length in ff and zero rows in ffdf

schuemie · May 12, 2020, 7:27am

Good news! Andromeda is now in CRAN!

We now have versions of DatabaseConnector, FeatureExtraction, PatientLevelPrediction, Cyclops, and CohortDiagnostics working with Andromeda (currently in separate branches), and I’m working on CohortMethod. Once that is done I think we should cut over, although we need to minimize the disruption this may cause.

Working with Andromeda, I do think our code will be a lot simpler than before, and therefore more robust and easier to maintain. I’m also trying to make the various objects that previously used ff, such as CovariateData and CohortMethodData, much less of a black box. Andromeda should make it easier for R developers to use these objects in their own functions if they want.