64-bit integers in HADES

schuemie · October 28, 2020, 8:10am

As you may know, R natively supports only 32-bit integers, while a lot of our data has integer IDs, and even counts, that are too large to fit in 32 bits. Our solution up till now has been to represent these numbers as numeric (double), which is far from ideal: a double can represent integers exactly only up to 2^53, so larger values can silently lose information.
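
To make the information loss concrete, here is a minimal sketch. It assumes the bit64 package is installed; the variable names are only for illustration.

```r
# Above 2^53, neighbouring integers no longer have distinct double representations.
asDouble <- 2^53
asDouble + 1 == asDouble  # TRUE: the increment is silently lost

# The same value stored as integer64 keeps full precision.
# (Created from a string so the literal itself never passes through a double.)
asInt64 <- bit64::as.integer64("9007199254740992")
asInt64 + 1L == asInt64   # FALSE
as.character(asInt64 + 1L)  # "9007199254740993"
```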

The next version of DatabaseConnector will introduce support for 64-bit integers using the bit64 package (similar to several packages in the tidyverse). This means BIGINT values on the server will be imported as integer64 types in R. Most packages, such as dplyr, Andromeda, and FeatureExtraction, readily work with these integers, and Cyclops is being updated for this.

One part of R that does not work well with these new integers is reading from CSV files (writing works fine): when using, for example, readr::read_csv, these numbers are still interpreted as numeric. The developers of readr are aware of this issue, but have no plans to fix it anytime soon. So for now, the only solution I see is to force these fields to character and convert them to integer64 afterwards, e.g.:

# Force the covariateId column to character. All other column types are guessed:
covariate <- readr::read_csv("covariate.csv", col_types = readr::cols(covariateId = "c"))

# Convert to integer64:
covariate$covariateId <- bit64::as.integer64(covariate$covariateId)
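
For packages that read several such files, the two steps above could be wrapped in a small helper. The following is only a sketch under assumptions: readCsvWithInteger64 is a hypothetical name, not an existing function in readr or any HADES package, and the temporary file and its values are made up for illustration.

```r
# Hypothetical helper: read a CSV, forcing the listed columns to character,
# then convert those columns to integer64. All other column types are guessed.
readCsvWithInteger64 <- function(file, int64Columns) {
  characterSpecs <- rep(list(readr::col_character()), length(int64Columns))
  names(characterSpecs) <- int64Columns
  spec <- do.call(readr::cols, characterSpecs)
  data <- readr::read_csv(file, col_types = spec)
  for (column in int64Columns) {
    data[[column]] <- bit64::as.integer64(data[[column]])
  }
  data
}

# Made-up example file containing one ID that does not fit exactly in a double:
csvFile <- tempfile(fileext = ".csv")
writeLines(c("covariateId,covariateValue",
             "9007199254740993,0.5",
             "1002,1.5"),
           csvFile)

covariate <- readCsvWithInteger64(csvFile, "covariateId")
class(covariate$covariateId)            # "integer64"
as.character(covariate$covariateId[1])  # "9007199254740993"
```

The columns to convert still have to be listed by hand, so this only tidies the workaround; it does not remove the need for it.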

If anyone knows a more elegant solution, please let me know.

@jweave17, @jennareps, @Gowtham_Rao, @anthonysena: this will have consequences for our packages (and package skeletons) that produce CSV files locally for sharing, and then need to read them centrally. Could you update them?