Replacing ff with RSQLite?


(Vojtech Huser) #21

Is there a work group around OHDSI R package development?

Currently I don’t think there is any such group.

Creating an R user work group is a great idea, in my view. It would not be about any specific package, but about the whole universe of them; for example, OhdsiRTools could be developed further. Having more R coders involved in development might be good…


(Martijn Schuemie) #22

There is no formal R developer group (as @Adam_Black asked), and there is also no R user group (as @Vojtech_Huser suggested). In our current organization of open source software development there is a lead for each R package. Mechanisms for others to contribute are the usual: creating issues, creating pull requests, etc., and of course participating in forum discussions such as these.

I prefer to keep it ‘simple’, with a single backend. If we have something that works as ff was intended to work, it would work for everyone. If we have to write our code considering all the different implementations we would be supporting, we just made our life a lot more difficult.


(Mark Seal) #23

I like this idea. I have over a decade of enterprise experience in Java/C++/C#/Python, but I know just enough about R to shoot myself in the foot. I would love to lurk in said user group to learn the design principles of the tools.


(Adam Black) #24

This makes sense. Sounds like RSQLite might be the way to go then.

I would like to contribute but would need to start with something small. I still have a ways to go to understand the OHDSI R ecosystem. I also just have some basic questions like the one I posted here. Is it possible to use odbc instead of jdbc with the OHDSI tools?


(Martijn Schuemie) #25

I just tried to release a new version of DatabaseConnector to CRAN, and it was refused because it depends on an orphaned package (bit).

So I guess we’ll have to speed this up.


(Seng Chan You) #26

Since FeatureExtraction is one of the most fundamental packages in OHDSI, if we need to revise it, we should do so ASAP, even though we are distracted by COVID-19 right now.


(Martijn Schuemie) #27

Well, we can work without CRAN, it is just annoying.


(Seng Chan You) #28

@schuemie As you know, revising FeatureExtraction is closely tied to reproducibility. And I think we need to move away from ff. If so, we should do it now, before more OHDSI packages are built on ff, such as the clinical characterization package for COVID-19.


(Jreps) #29

One thing I just thought of while reading this thread: when we move away from ffdf, we need to make sure we still have a way to load old data that was saved in ffdf files. Otherwise we won’t be able to easily go back and view old study data.


(Seng Chan You) #30

@jreps Indeed. I don’t think we can guarantee reproducibility across the revision of the FeatureExtraction package, and that’s why we need to start revising it now…


(Martijn Schuemie) #31

We always need to make sure to have the exact same versions of the packages to reproduce a result. This is no different from all other changes we’ve made to our software over the years.

I agree we need to work on this sooner rather than later, but it is not a must-have for any evidence generation we do in the short term.


(Martijn Schuemie) #32

Great news everyone! I’ve created a wrapper around RSQLite I call Andromeda (AsynchroNous Disk-based Representation of MassivE DAta), and have tested it, showing that it indeed is able to handle large objects without breaking memory limits. It is not as fast as ff was (two things are currently slow: saving Andromeda objects as compressed files, and tidying covariates), but I think there are still options to explore to make it faster. And using Andromeda leads to much cleaner code (and therefore more reliable code) than when using ff.

Unfortunately, switching from ff to Andromeda will require rewriting pretty much our entire Methods Library, and I can use some help.

@jreps, @Rijnbeek: could you change the Patient-Level Prediction package?
@msuchard: could you change Cyclops?
I’ll take on CohortMethod and CohortDiagnostics. We can do other packages like SelfControlledCaseSeries at a later point in time.

I have already created Andromeda versions of DatabaseConnector (andromeda branch) and FeatureExtraction (andromeda branch). I recommend we create ‘andromeda’ branches for all our packages, so we can switch all at the same time. I do not want to postpone the switch for too long, because already we see the andromeda and develop branches start to diverge, for example for DatabaseConnector (gotta keep adding code for BigQuery :wink: ).

Here’s how to use Andromeda:

  • I’ve added functions like querySqlToAndromeda to DatabaseConnector to download directly into an Andromeda environment. You can see an example of how I use that in FeatureExtraction here.
  • There’s a vignette on Andromeda.
  • FeatureExtraction now creates CovariateData objects that inherit from Andromeda, as you can see here. I suggest we use this same mechanism in PLP and CohortMethod: I intend to make the CohortMethodData object inherit from CovariateData. We could do the same for PlpData.
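
For readers new to the package, here is a minimal sketch of the basic Andromeda workflow. It assumes the Andromeda and dplyr packages are installed and uses R's built-in cars dataset purely as a stand-in for real covariate data; the variable names are illustrative and not taken from FeatureExtraction.

```r
library(Andromeda)
library(dplyr)

# Create an Andromeda object holding one table. The table is stored on disk
# rather than in RAM. 'cars' is a built-in dataset used only as a stand-in.
andr <- andromeda(cars = cars)

# Build a lazy dplyr query against the on-disk table, then collect() only
# the (small) result into an in-memory data frame.
fastCars <- andr$cars %>%
  filter(speed > 10) %>%
  collect()

# Release the temporary on-disk store when done.
close(andr)
```

The design point is that filtering happens against the on-disk table, so only the rows you collect need to fit in memory.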

(Martijn Schuemie) #33

(Note: the saving speed problem was solved, and tidyCovariates now runs 25% faster in the latest release)


(Gowtham Rao) #34

@schuemie thank you for doing this and especially using dplyr syntax!

The intention of this package is to make data available to R functions locally, instead of doing repeated I/O over the network. ‘Local’ here means a SQLite copy of data from a remote RDBMS. It is ideal for designs where the RDBMS compute environment is different from R’s compute environment, and the analysis requires I/O between the two.

Out of curiosity: what if there were a compute environment that allowed us to run both SQL and R on the same machine (i.e. no I/O over the network is required)? I think the term for this concept is ‘in-database processing’.


(Marc Suchard) #35

@schuemie – amazing work! Yes, I’ll get to work on switching Cyclops away from ff.


(Jreps) #36

This is great - I’m currently updating PLP - might take some time as there are lots of edits, but totally worth it :smile:


(Chris Knoll) #37

Hey, Martijn. Thanks for your hard work on this.

My only criticism (from the API design perspective) is that I found it odd that DatabaseConnector depends on Andromeda, when the Andromeda abstraction seems more akin to an abstraction of the data frame than of a database. For example, this code:

      covariateData <- Andromeda::andromeda()
      DatabaseConnector::querySqlToAndromeda(connection = connection, 
                                             sql = sql, 
                                             andromeda = covariateData, 
                                             andromedaTableName = "covariatesContinuous",
                                             snakeCaseToCamelCase = TRUE)

Is this not simply the same as doing the following without the dependency?

      covariateData <- Andromeda::andromeda()
      covariateData$covariatesContinuous <- DatabaseConnector::querySql(connection = connection,
                                                                        sql = sql,
                                                                        snakeCaseToCamelCase = TRUE)

(Martijn Schuemie) #38

Hi Chris. I wish it worked like that, but there’s an important difference between your code and mine: In your code, DatabaseConnector::querySql first loads the data in memory in a data frame, and covariateData$covariatesContinuous <- then converts that data frame to an Andromeda (i.e. SQLite) table. If the data is too large to fit in memory you get an error (and likely crash R).

Instead, DatabaseConnector::querySqlToAndromeda in the background loads the data in batches, and adds it to the Andromeda table in batches, so no running out of memory.
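
To illustrate the batch-wise idea, here is a rough sketch using plain DBI and RSQLite. This is not the actual DatabaseConnector implementation: the in-memory source database, the table names, and the batch size of 250 are all assumptions made up for the example.

```r
library(DBI)

# Hypothetical stand-in for the remote database: an in-memory SQLite source.
source <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(source, "covariates",
             data.frame(row_id = 1:1000,
                        covariate_value = as.numeric(1:1000) / 10))

# Disk-backed SQLite file playing the role of the Andromeda table store.
target <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

# Send the query once, then fetch a limited number of rows at a time, so
# only one batch is ever held in memory.
result <- dbSendQuery(source, "SELECT row_id, covariate_value FROM covariates")
while (!dbHasCompleted(result)) {
  batch <- dbFetch(result, n = 250)  # batch size is an arbitrary choice
  if (nrow(batch) > 0) {
    # append = TRUE creates the table on the first batch and appends after.
    dbWriteTable(target, "covariatesContinuous", batch, append = TRUE)
  }
}
dbClearResult(result)

# Check that every row arrived in the on-disk table.
rowCount <- dbGetQuery(target, "SELECT COUNT(*) AS n FROM covariatesContinuous")$n

dbDisconnect(source)
dbDisconnect(target)
```

Peak memory use is bounded by the batch size rather than by the size of the full result set, which is the difference from assigning the output of querySql.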


(Martijn Schuemie) #39

After several years of radio silence I just received an (automated) e-mail from the ff developers. See text below. They’re de-orphaning bit, and have after 5 years finally fulfilled our request to allow zero-row ffdf objects.

@msuchard, @jreps I propose we stick to Andromeda? That way feature requests can be handled in a more reasonable time frame.

Dear package maintainers,

I hope you and your families do well in corona times.

Packages bit/bit64 and ff have been overhauled and moved from rforge to github/truecluster.

Please check whether your depending package needs to be adapted to cope with some interface changes.

I plan to upload to CRAN in June, please adapt your package until end of May.

Please let me know if you encounter any problems.

Regards and take care

Jens Oehlschlägel

Interface changes
=================

bit:

  • S3methods are no longer exported in NAMESPACE

  • bitwhich() has argument order changed

ff:

  • S3methods are no longer exported in NAMESPACE

Most important user visible changes
===================================

bit:

  • new class hierarchy for boolean types

  • including fully fledged type ‘bitwhich’ for skewed filters

  • dramatically faster methods for integer, e.g. bit_sort 2-10x faster than sort

  • see new vignettes bit_demo, bit_usage, bit_performance

bit64

  • improvements and bug fixes

ff

  • support for 0 length in ff and zero rows in ffdf

(Martijn Schuemie) #40

Good news! Andromeda is now in CRAN!

We now have versions of DatabaseConnector, FeatureExtraction, PatientLevelPrediction, Cyclops, and CohortDiagnostics working with Andromeda (currently in separate branches), and I’m working on CohortMethod. Once that is done I think we should cut over, although we need to minimize the disruption this may cause.

Working with Andromeda, I do think our code will be a lot simpler than before, and therefore more robust and easier to maintain. I’m also trying to make the various objects that previously used ff, such as CovariateData and CohortMethodData, much less of a black box. Andromeda should make it easier for R developers to use these objects in their own functions if they want.

