Is there a work group around OHDSI R package development?
Currently I don’t think there is any such group.
Creating an R user work group is a great idea, in my view. It would not be about any specific package, but about the whole universe of them. For example, OhdsiRTools could be developed further. Having more R coders involved in development might be good…
There is no formal R developer group (as @Adam_Black asked), and there is also no R user group (as @Vojtech_Huser suggested). In our current organization of open source software development there is a lead for each R package. Mechanisms for others to contribute are the usual: creating issues, creating pull requests, etc., and of course participating in forum discussions such as these.
I prefer to keep it ‘simple’, with a single backend. If we have something that works as ff was intended to work, it would work for everyone. If we have to write our code considering all the different implementations we would be supporting, we just made our life a lot more difficult.
I like this idea. I have over a decade of enterprise experience in Java/C++/C#/Python, but I know just enough about R to shoot myself in the foot. I would love to lurk in said user group to learn the design principles of the tools.
This makes sense. Sounds like RSQLite might be the way to go then.
I would like to contribute but would need to start with something small. I still have a ways to go to understand the OHDSI R ecosystem. I also have some basic questions, like the one I posted here: is it possible to use ODBC instead of JDBC with the OHDSI tools?
Since the FeatureExtraction package is one of the most fundamental packages in OHDSI, if we need to revise it, we should do so ASAP, even though we’re distracted by COVID-19 right now.
@schuemie As you know, revising FeatureExtraction is closely tied to reproducibility. I also think we need to move away from ff. If so, we should do it now, before more OHDSI packages are built on ff, such as the clinical characterization package for COVID-19.
One thing I just thought of while reading this thread: when we change from ffdf, we need to make sure we still have a way to load old data that was saved in ffdf files. Otherwise we won’t be able to easily go back and view old study data.
@jreps Indeed. I don’t think we can maintain reproducibility across the revision of the FeatureExtraction package, and that’s why we need to start revising it now…
We always need to make sure to have the exact same versions of the packages to reproduce a result. This is no different from all other changes we’ve made to our software over the years.
I agree we need to work on this sooner rather than later, but it is not a must-have for any evidence generation we do in the short term.
Great news everyone! I’ve created a wrapper around RSQLite I call Andromeda (AsynchroNous Disk-based Representation of MassivE DAta), and have tested it, showing that it indeed is able to handle large objects without breaking memory limits. It is not as fast as ff was (two things are currently slow: saving Andromeda objects as compressed files, and tidying covariates), but I think there are still options to explore to make it faster. And using Andromeda leads to much cleaner code (and therefore more reliable code) than when using ff.
Unfortunately, switching from ff to Andromeda will require rewriting pretty much our entire Methods Library, and I could use some help.
@jreps, @Rijnbeek: could you change the Patient-Level Prediction package? @msuchard: could you change Cyclops?
I’ll take on CohortMethod and CohortDiagnostics. We can do other packages like SelfControlledCaseSeries at a later point in time.
I have already created Andromeda versions of DatabaseConnector (andromeda branch) and FeatureExtraction (andromeda branch). I recommend we create ‘andromeda’ branches for all our packages, so we can switch all at the same time. I do not want to postpone the switch for too long, because we already see the andromeda and develop branches starting to diverge, for example for DatabaseConnector (gotta keep adding code for BigQuery).
Here’s how to use Andromeda:
I’ve added functions like querySqlToAndromeda to DatabaseConnector to download directly into an Andromeda environment. You can see an example of how I use that in FeatureExtraction here.
FeatureExtraction now creates CovariateData objects that inherit from Andromeda, as you can see here. I suggest we use this same mechanism in PLP and CohortMethod: I intend to make the CohortMethodData object inherit from CovariateData. We could do the same for PlpData.
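The inheritance idea can be sketched in a language-neutral way. The following is only an illustration in Python, not how Andromeda, CovariateData, or CohortMethodData are implemented in R: the class names `DiskStore`, `CovariateStore`, and `CohortStore`, their methods, and the table layouts are all hypothetical. It shows a SQLite-backed container, a subclass that adds metadata, and a further subclass that adds its own table while remaining usable wherever the base container is accepted.

```python
import sqlite3


class DiskStore:
    """Hypothetical stand-in for a SQLite-backed container of tables."""

    def __init__(self, path=":memory:"):
        # Pass a file path to keep the tables on disk instead of in memory.
        self._conn = sqlite3.connect(path)

    def write_table(self, name, columns, rows):
        # Create the table and insert all rows in one transaction.
        self._conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        self._conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
        self._conn.commit()

    def table_names(self):
        cursor = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [row[0] for row in cursor]

    def count(self, name):
        return self._conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]

    def close(self):
        self._conn.close()


class CovariateStore(DiskStore):
    """Hypothetical subclass: a store that also carries metadata."""

    def __init__(self, path=":memory:", metadata=None):
        super().__init__(path)
        self.metadata = dict(metadata or {})


class CohortStore(CovariateStore):
    """Hypothetical subclass of the subclass: adds a cohorts table."""

    def write_cohorts(self, rows):
        self.write_table("cohorts", ["row_id", "treatment"], rows)


store = CohortStore(metadata={"source": "example"})
store.write_table(
    "covariates",
    ["row_id", "covariate_id", "covariate_value"],
    [(1, 1001, 1.0), (2, 1001, 1.0)],
)
store.write_cohorts([(1, 1), (2, 0)])
tables = store.table_names()
```

Any function written against the base container (here, anything that only calls `table_names` or `count`) keeps working on the more specific objects, which is the appeal of sharing one mechanism across packages.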
@schuemie thank you for doing this and especially using dplyr syntax!
The intention of this package is to have data available to R functions locally, instead of having to do several I/Os over the network. ‘Local’ here means a SQLite copy of data from a remote RDBMS. It is ideal for designs where the RDBMS compute environment is different from R’s compute environment, and analysis requires I/O between the RDBMS and R environments.
Out of curiosity: what if there were a compute environment that allowed us to run both SQL and R on the same machine (i.e. no I/O over the network is required)? I think the term used for this concept is ‘in-database processing’.
My only criticism (from the API design perspective) is that I found it odd for DatabaseConnector to have a dependency on Andromeda, when the Andromeda abstraction seems more akin to an abstraction of a data frame than of a database. For example, this code:
Hi Chris. I wish it worked like that, but there’s an important difference between your code and mine: in your code, DatabaseConnector::querySql first loads the data into memory as a data frame, and covariateData$covariatesContinuous <- then converts that data frame to an Andromeda (i.e. SQLite) table. If the data is too large to fit in memory, you get an error (and likely crash R).
Instead, DatabaseConnector::querySqlToAndromeda in the background loads the data in batches, and adds it to the Andromeda table in batches, so no running out of memory.
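To make the difference concrete, here is a minimal sketch of the batching idea, written in Python with the standard `sqlite3` module rather than R. It is not the actual implementation in DatabaseConnector or Andromeda: the function name `copy_in_batches`, the table layout, and the batch size are assumptions for illustration, and an in-memory SQLite database stands in for the remote server. The point is only that rows are fetched and written a batch at a time, so the full result set never has to sit in memory at once.

```python
import os
import sqlite3
import tempfile


def copy_in_batches(source_cursor, target_conn, table, columns, batch_size=1000):
    """Stream rows from source_cursor into a SQLite table, one batch at a time."""
    placeholders = ", ".join("?" for _ in columns)
    target_conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})")
    total = 0
    while True:
        # Only batch_size rows are held in memory at any moment.
        rows = source_cursor.fetchmany(batch_size)
        if not rows:
            break
        target_conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        total += len(rows)
    target_conn.commit()
    return total


# Stand-in for the remote database (assumption: 10,000 synthetic rows).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE covariates (row_id INTEGER, covariate_value REAL)")
source.executemany(
    "INSERT INTO covariates VALUES (?, ?)",
    ((i, i * 0.5) for i in range(10000)),
)

# Disk-backed local copy, playing the role of the local SQLite file.
target_path = os.path.join(tempfile.mkdtemp(), "local_copy.sqlite")
target = sqlite3.connect(target_path)

cursor = source.execute("SELECT row_id, covariate_value FROM covariates")
n_copied = copy_in_batches(
    cursor, target, "covariates", ["row_id", "covariate_value"], batch_size=1000
)
n_stored = target.execute("SELECT COUNT(*) FROM covariates").fetchone()[0]
```

By contrast, calling `cursor.fetchall()` and inserting the result afterwards would mirror the load-everything-then-convert pattern, which fails once the result no longer fits in memory.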
After several years of radio silence I just received an (automated) e-mail from the ff developers. See text below. They’re de-orphaning bit, and have after 5 years finally fulfilled our request to allow zero-row ffdf objects.
@msuchard, @jreps I propose we stick to Andromeda? That way feature requests can be handled in a more reasonable time frame.
Dear package maintainers,
I hope you and your families do well in corona times.
Packages bit/bit64 and ff have been overhauled and moved from rforge to github/truecluster.
Please check whether your depending package needs to be adapted to cope with some interface changes.
I plan to upload to CRAN in June, please adapt your package until end of May.
Please let me know if you encounter any problems.
Regards and take care
Jens Oehlschlägel
Interface changes
=================
bit:
S3methods are no longer exported in NAMESPACE
bitwhich() has argument order changed
ff:
S3methods are no longer exported in NAMESPACE
Most important user visible changes
===================================
bit:
new class hierarchy for boolean types
including fully fledged type ‘bitwhich’ for skewed filters
dramatically faster methods for integer, e.g. bit_sort 2-10x faster than sort
see new vignettes bit_demo, bit_usage, bit_performance
We now have versions of DatabaseConnector, FeatureExtraction, PatientLevelPrediction, Cyclops, and CohortDiagnostics working with Andromeda (currently in separate branches), and I’m working on CohortMethod. Once that is done I think we should cut over, although we need to minimize the disruption this may cause.
Working with Andromeda, I do think our code will be a lot simpler than before, and therefore more robust and easier to maintain. I’m also trying to make the various objects that previously used ff, such as CovariateData and CohortMethodData, much less of a black box. Andromeda should make it easier for R developers to use these objects in their own functions if they want.