Hi all,
Just wanted to highlight a new feature I’ve added to the DatabaseConnector R package: the ability to bulk load large data sets into a new permanent table in PDW or Redshift using methods far more efficient than insert statements, which are notoriously slow in those systems.
In PDW, the DwLoader tool is the most efficient way to bulk load data, while in Redshift, the best approach is to stage the data on S3 and run a COPY command.
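For those less familiar with the Redshift path, the general pattern is to write the data to a file, stage it on S3, and then issue a COPY command against the cluster. Here's a rough sketch of that pattern for illustration only – it is not necessarily the package's exact implementation; it assumes the aws.s3 package for the upload step, and the bucket, object key, and credential values are placeholders:
# Illustration of the S3 + COPY pattern; not the package's exact implementation.
library(aws.s3)
library(DatabaseConnector)

# Stage the data frame as a CSV file
csvFile <- tempfile(fileext = ".csv")
write.csv(data, csvFile, row.names = FALSE, na = "")

# Upload the CSV to S3 (placeholder bucket and object key)
put_object(file = csvFile, object = "some_object_key", bucket = "some_bucket_name")

# Load the staged file into the target table with a COPY command
sql <- paste0("COPY scratch.somedata ",
              "FROM 's3://some_bucket_name/some_object_key' ",
              "CREDENTIALS 'aws_access_key_id=some_access_key_id;",
              "aws_secret_access_key=some_secret_access_key' ",
              "CSV IGNOREHEADER 1")
executeSql(connection, sql)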
Per @anthonysena's suggestion, I’d like to open this up to users/developers of other MPP systems (Netezza, Impala, BigQuery). @tomwhite @Gowtham_Rao @Robert_Carroll @MatthewYoungLai – do you think there are analogous bulk-loading techniques that would be useful to add to DatabaseConnector for inserting data?
Here’s how the current setup works for PDW and Redshift:
Bulk data insert with Redshift or PDW – with the parameter explicitly set:
insertTable(connection = connection,
            tableName = "scratch.somedata",
            data = data,
            dropTableIfExists = TRUE,
            createTable = TRUE,
            tempTable = FALSE,
            useMppBulkLoad = TRUE)
Bulk data insert with Redshift or PDW – with the system environment variable set:
This is useful when utilizing other OHDSI R packages that leverage insertTable but aren’t designed to request MPP bulk loading explicitly.
Sys.setenv("USE_MPP_BULK_LOAD" = TRUE) insertTable(connection = connection, tableName = "scratch.somedata", data = data, dropTableIfExists = TRUE, createTable = TRUE, tempTable = FALSE)
For PDW, you need to set the path to the DwLoader executable (and have the application installed locally):
Sys.setenv("DWLOADER_PATH" = "C:/Program Files/Microsoft SQL Server Parallel Data Warehouse/100/DWLoader.exe")
For Redshift, you need to set your Amazon S3 credentials:
Sys.setenv("AWS_ACCESS_KEY_ID" = "some_access_key_id", "AWS_SECRET_ACCESS_KEY" = "some_secret_access_key", "AWS_DEFAULT_REGION" = "some_aws_region", "AWS_BUCKET_NAME" = "some_bucket_name", "AWS_OBJECT_KEY" = "some_object_key", "AWS_SSE_TYPE" = "server_side_encryption_type")
Thanks,
Ajit