Hi all,
Just wanted to highlight a new feature I’ve added to the DatabaseConnector R package: the ability to bulk load large data sets into a new permanent table in PDW or Redshift using methods far more efficient than insert statements, which are notoriously slow in those systems.
In PDW, the DwLoader tool is the most efficient way to bulk load data, while in Redshift, the best approach is to stage the data on S3 and run a COPY command.
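For those less familiar with the Redshift path, the general pattern is to write the data to a file, stage it on S3, and then issue a COPY command against the cluster. Here's a rough sketch of that pattern for illustration only – it is not necessarily the package's exact implementation; it assumes the aws.s3 package for the upload step, and the bucket, object key, and credential values are placeholders:
# Illustration of the S3 + COPY pattern; not the package's exact implementation.
library(aws.s3)
library(DatabaseConnector)

# Stage the data frame as a CSV file
csvFile <- tempfile(fileext = ".csv")
write.csv(data, csvFile, row.names = FALSE, na = "")

# Upload the CSV to S3 (placeholder bucket and object key)
put_object(file = csvFile, object = "some_object_key", bucket = "some_bucket_name")

# Load the staged file into the target table with a COPY command
sql <- paste0("COPY scratch.somedata ",
              "FROM 's3://some_bucket_name/some_object_key' ",
              "CREDENTIALS 'aws_access_key_id=some_access_key_id;",
              "aws_secret_access_key=some_secret_access_key' ",
              "CSV IGNOREHEADER 1")
executeSql(connection, sql)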
Per @anthonysena's suggestion, I’d like to open this up to users/developers of other MPP systems (Netezza, Impala, BigQuery). @tomwhite @Gowtham_Rao @Robert_Carroll @MatthewYoungLai – do you think there are analogous bulk-loading techniques that would be useful to add to DatabaseConnector for inserting data?
Here’s how the current setup works for PDW and Redshift:
Bulk data insert with Redshift or PDW – with the parameter explicitly set:
insertTable(connection = connection,
            tableName = "scratch.somedata",
            data = data,
            dropTableIfExists = TRUE,
            createTable = TRUE,
            tempTable = FALSE,
            useMppBulkLoad = TRUE)
Bulk data insert with Redshift or PDW – with the system environment variable set:
This is useful when utilizing other OHDSI R packages that leverage insertTable but aren’t designed to request MPP bulk loading explicitly.
Sys.setenv("USE_MPP_BULK_LOAD" = TRUE) insertTable(connection = connection, tableName = "scratch.somedata", data = data, dropTableIfExists = TRUE, createTable = TRUE, tempTable = FALSE)
For PDW, you need to set the path to the DwLoader executable (and have the application installed locally):
Sys.setenv("DWLOADER_PATH" = "C:/Program Files/Microsoft SQL Server Parallel Data Warehouse/100/DWLoader.exe")
For Redshift, you need to set your Amazon S3 credentials:
Sys.setenv("AWS_ACCESS_KEY_ID" = "some_access_key_id", "AWS_SECRET_ACCESS_KEY" = "some_secret_access_key", "AWS_DEFAULT_REGION" = "some_aws_region", "AWS_BUCKET_NAME" = "some_bucket_name", "AWS_OBJECT_KEY" = "some_object_key", "AWS_SSE_TYPE" = "server_side_encryption_type")
Thanks,
Ajit