
Replacing ff with RSQLite?

How about something built on top of Apache Arrow (https://arrow.apache.org/)? It’s a cross-platform standard that has a lot of buy-in from the community. It may not be straightforward to replicate the functionality that ff currently provides, though.

dbplyr supports multiple RDBMSs:
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html

dplyr generates SQL and translates it across SQL dialects. Would using dplyr allow us to take a uniform approach across data frames and RDBMSs, i.e. not have to write SQL at all?
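For illustration, this is the kind of uniformity I mean (a minimal sketch; table and column names are hypothetical):

library(dplyr)

# The same verbs work on a local data frame...
cohort <- data.frame(subject_id = 1:100, cohort_definition_id = 1)
cohort %>%
  filter(cohort_definition_id == 1) %>%
  count()

# ...and on a database table, where dbplyr translates them to SQL:
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "cohort", cohort)
tbl(con, "cohort") %>%
  filter(cohort_definition_id == 1) %>%
  count() %>%
  show_query()  # prints the generated SQL instead of executing the query
DBI::dbDisconnect(con)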

This idea has been floating around for a few years. I even got it to work at one point, but then the internals of dbplyr changed and it broke. It is not a high priority for me (for example, replacing ff is a much higher priority, because soon we may find none of our tools work anymore).

Also note that not everything we can do in SQL + SqlRender can be done in dbplyr.

Though @Gowtham_Rao’s idea is fabulous, I don’t think an integration of dbplyr and OHDSI SqlRender/DatabaseConnector would be stable either: we cannot control the development of dbplyr. We can use both :slight_smile:

If we opt to use RSQLite in place of ff, would it then be reasonable for us to make it a convention that we use dbplyr/dplyr for create/read/update/delete (CRUD) tasks on data in RSQLite?

For example: we construct covariate data using dbplyr/dplyr.

Why is development using dbplyr/dplyr rather than SQL important? Our data is only going to get bigger, and the large objects we are trying to solve for in RSQLite may eventually be bigger than what RSQLite can support. If our development was in dbplyr/dplyr, then when we decide to switch we would just have to change the endpoint from RSQLite to something else supported by dbplyr/dplyr (maybe a dedicated RDBMS appliance). Plus, this may reduce the need to move data over the network.

Follow-up question: if we do opt to use SQL to do CRUD with RSQLite, should we use OHDSI SQL translated with SqlRender?
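That is, something like the following (a minimal sketch, assuming SqlRender has a sqlite target dialect; the schema name is hypothetical):

library(SqlRender)
sql <- "SELECT TOP 10 * FROM @cdm_schema.person;"
sql <- render(sql, cdm_schema = "main")
translate(sql, targetDialect = "sqlite")
# should yield something like: "SELECT * FROM main.person LIMIT 10;"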

Thanks for pointing to arrow! I hadn’t heard of it before. It does seem this technology is a bit immature (see test below), and I guess it doesn’t solve our problem of storing things that are too big for memory?

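# 'data' here is a previously created data frame with (at least) a column 'anInteger'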
system.time(
  arrowTable <- arrow::Table$create(data)
)
# user  system elapsed 
# 0.19    0.02    0.20 
# Nice!

system.time(
  result <- arrowTable[arrowTable$anInteger == data$anInteger[100], ]
)
# Crashes R

object.size(arrowTable)
# 416 bytes
# Not sure how to measure the memory footprint

I am also a big fan of dbplyr!

It’s a shame that the R version doesn’t appear as mature. In Python, arrow allows you to work nicely with the Apache Parquet file format, which allows for partitioned datasets on disk, has a high rate of compression and fast IO, and can represent out-of-memory datasets with tools like Dask or Spark. I have a prototype pipeline that goes FeatureExtraction -> ffdf -> csv -> parquet, and the parquet files are roughly ~30% of the size of the respective csv or ff files.
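For reference, the csv -> parquet step looks something like this in R (a minimal sketch using the arrow package; file names are hypothetical):

library(arrow)
# read the csv exported from the ffdf object
covariates <- read_csv_arrow("covariates.csv")
# write a compressed, columnar parquet file
write_parquet(covariates, "covariates.parquet")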

Regardless, it looks like there isn’t really an Arrow/parquet solution that plays nicely with the current R ecosystem.

Hi everyone,

I’m fairly new to OHDSI and am just getting started using our CDM at Maine Medical Center. I have been using dbplyr for several years and absolutely love it. The way it provides a readable language that abstracts away differences between in-memory data frames and SQL tables is pretty stellar. Also, the developers at RStudio seem to do really high-quality work, so building on top of their work makes sense. That said, I don’t think I would use dbplyr for CRUD operations. It seems best suited for Read operations and not for Create, Update, Delete operations.
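To make that concrete, the split I have in mind looks something like this (a minimal sketch; table and column names are hypothetical): plain DBI for the writes, dplyr for the reads.

library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "covariates.sqlite")

# Create / Update / Delete: plain DBI
dbWriteTable(con, "covariates",
             data.frame(row_id = 1:3,
                        covariate_id = c(101, 101, 102),
                        covariate_value = 1),
             overwrite = TRUE)
dbExecute(con, "DELETE FROM covariates WHERE covariate_id = 102")

# Read: dplyr/dbplyr
tbl(con, "covariates") %>%
  group_by(covariate_id) %>%
  summarise(n = n()) %>%
  collect()

dbDisconnect(con)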

I think RStudio is working with the Apache Arrow project but it sounds like that might not meet the need in the short term.

One (naive) question I have about this discussion is why we would take data out of a CDM database and put it into another database. Could we not just use temp tables in the CDM database to store the large result tables rather than loading them into SQLite?

Because we need to do a lot of post-processing of the data once we have grabbed it from the (remote) database. For example, before fitting an outcome model we may want to get rid of data in non-informative strata, data that we did need just a few minutes earlier when fitting a propensity model. Yes, we could perform all these operations on the server, but we would get a massive increase in database IO, which currently isn’t our bottleneck but would surely become so immediately.

Other reasons I like to do analytic steps locally:

  1. I like one data pull to get all the data from the remote server, and then do all processing locally. This vastly increases stability (I think).
  2. We officially support 4 or 5 different database platforms, unofficially a dozen or so, with very different SQL dialects. We can make our life a lot easier if most complex operations are done locally, in a uniform environment, rather than in the remote database jungle.

Makes sense. How large are the analytic datasets in terms of number of rows and/or Gigabytes? If everyone had >=32GB of RAM available on a local analytic machine could we just use in-memory dataframes and the data.table package?

Usually between 1 and 10GB, although I’ve had objects that were around 32GB. I would use multi-threading, so would access 3 or 4 such objects at a time. But that is an extreme case (and I have 168GB of RAM, so still wouldn’t have needed ff).


I can’t work with large datasets locally since my work laptop is pretty under-resourced and locked down. We have a remote R server that will do the analytical work and has a ton of RAM. One benefit of using dplyr is that you could support multiple backend technologies, including RSQLite (dbplyr), Spark (sparklyr), and data.table (dtplyr), with minimal effort. This is the direction RStudio is going, and it is conceptually similar to generic OHDSI SQL. Then each institution could use the best option for the infrastructure they have, maybe using RSQLite as a default backend.
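For example, the same dplyr pipeline can target data.table through dtplyr (a minimal sketch; column names are hypothetical):

library(dplyr)
library(dtplyr)

df <- data.frame(covariate_id = c(101, 101, 102), covariate_value = 1)

lazy_dt(df) %>%          # data.table backend instead of a database
  group_by(covariate_id) %>%
  summarise(n = n()) %>%
  as_tibble()            # executes the translated data.table code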

Is there a work group around OHDSI R package development that I can join?


Is there a work group around OHDSI R package development

Currently I don’t think there is any such group.

Creating an R user work group is a great idea (in my view). It would not be about any specific package, but about the universe of them. E.g., OhdsiRTools could be further developed. Having more R coders involved with development might be good…


There is no formal R developer group (as @Adam_Black asked), and there is also no R user group (as @Vojtech_Huser suggested). In our current organization of open source software development there is a lead for each R package. Mechanisms for others to contribute are the usual: creating issues, creating pull requests, etc., and of course participating in forum discussions such as these.

I prefer to keep it ‘simple’, with a single backend. If we have something that works as ff was intended to work, it would work for everyone. If we have to write our code considering all the different implementations we would be supporting, we would just make our life a lot more difficult.

I like this idea. I have over a decade of enterprise experience in Java/C++/C#/Python, but I know just enough about R to shoot myself in the foot. I would love to lurk in said user group to learn the design principles of the tools.

This makes sense. Sounds like RSQLite might be the way to go then.

I would like to contribute but would need to start with something small. I still have a ways to go to understand the OHDSI R ecosystem. I also just have some basic questions like the one I posted here. Is it possible to use odbc instead of jdbc with the OHDSI tools?

I just tried to release a new version of DatabaseConnector to CRAN, and it was refused because it depends on an orphaned package (bit).

So I guess we’ll have to speed this up.

Since FeatureExtraction is one of the most fundamental packages in OHDSI, if we need to revise it we should do so ASAP, even though we’re distracted by COVID-19 right now.


Well, we can work without CRAN, it is just annoying.
