
Databricks (Spark) coming to OHDSI stack

Currently we are doing a prototype where we are transforming data from FHIR R4 to OMOP via Azure Pipelines (Spark). We have deployed the CDM schema manually using the DDL scripts.
The steps/tweaks I followed to deploy the Broadsea containers and to enable Atlas, WebAPI and Achilles to work with Databricks are outlined here.
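
For anyone trying the same thing, here is a rough sketch of how the manual DDL deployment can be scripted with DatabaseConnector and SqlRender (the connection string, driver path, DDL file name and schema name are placeholders, not our exact setup):

library(DatabaseConnector)

connectionDetails <- createConnectionDetails(
  dbms = "spark",
  connectionString = "jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>",
  pathToDriver = "/path/to/jdbc/drivers"
)

# DDL script taken from the CommonDataModel repo; the Spark flavor uses the
# @cdmDatabaseSchema parameter just like the other dialects.
ddl <- SqlRender::readSql("OMOPCDM_spark_5.4_ddl.sql")
sql <- SqlRender::render(ddl, cdmDatabaseSchema = "omop_cdm")

connection <- connect(connectionDetails)
executeSql(connection, sql)
disconnect(connection)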

Hi @venkyvb,
I noticed in the setup documentation you mentioned the JDBC driver that WebAPI uses:

Note that up to version 2.6.25 the Databricks JDBC driver is based on the Simba JDBC driver. After 2.6.25 it was changed to Databricks' own JDBC driver. Using a version > 2.6.25 would require changes in the WebAPI DataAccessConfig, which would need to be tested (and may require additional changes to WebAPI itself).

I don't expect any WebAPI change (aside from loading the correct JDBC driver here). If the newer Spark JDBC driver can be downloaded from Maven Central, even better. If you'd like to see the driver updated, please open an issue in the WebAPI repo.

Update: it looks like the latest JDBC driver is available on Maven Central! We should try to use that instead of forcing the additional step of downloading the driver externally.

Thanks for the confirmation. Let me create an issue for this in the WebAPI repo. And yes, if we switch to the Databricks driver we can just use Maven Central instead of needing to manually download the driver.

Thinking aloud, I wonder if it would make sense to add the Databricks driver in addition to the existing Spark driver. This way other (non-Databricks) Spark engines with Delta Lake as the storage format could be supported as well.

One thing we may need to verify is whether the jdbc:spark:// URI would work with the Databricks driver (the Databricks driver documentation seems to refer to a jdbc:databricks:// URI). If this doesn't work then changes would be needed in other modules like Achilles (e.g. here and here), where there is a specific check for the Spark JDBC URI path.
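
To illustrate the difference (hosts and paths are placeholders, and the check below is just a simplified stand-in for the kind of string match those modules do):

# URI accepted by the current Simba-based driver:
sparkUri <- "jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3"

# URI documented for the newer Databricks driver:
databricksUri <- "jdbc:databricks://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3"

# A prefix check like the ones linked above would only match the first form:
grepl("^jdbc:spark:", sparkUri)       # TRUE
grepl("^jdbc:spark:", databricksUri)  # FALSE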

Well, it's actually better that they changed the protocol in the JDBC URL from spark with the old driver to databricks with the new one; that makes it simpler. We can leave the existing driver loading alone and just add the new one for Databricks.

Hi team,
Is there any solution available now to fix the "[Simba]JDBC Cannot use commit while Connection is in auto-commit mode." issue when running the R execute function against Databricks CDM tables? Below is the full error message. Thanks!

Error in rJava::.jcall(conn@jConnection, "Ljava/sql/DatabaseMetaData;", : java.sql.SQLException: [Simba]JDBC Cannot use commit while Connection is in auto-commit mode.

  1. stop(structure(list(message = "java.sql.SQLException: [Simba]JDBC Cannot use commit while Connection is in auto-commit mode.", call = rJava::.jcall(conn@jConnection, "Ljava/sql/DatabaseMetaData;", "getMetaData"), jobj = new("jobjRef", jobj = <pointer: 0x00000236d214c930>, jclass = "java/sql/SQLException")), class = c("SQLException", …

  2. .jcheck()

  3. rJava::.jcall(conn@jConnection, "Ljava/sql/DatabaseMetaData;", "getMetaData")

  4. .local(conn, …)

  5. DBI::dbListTables(conn = connection, databaseSchema = databaseSchema)

  6. DBI::dbListTables(conn = connection, databaseSchema = databaseSchema)

  7. DatabaseConnector::getTableNames(connection, cohortDatabaseSchema)

  8. CohortGenerator::generateCohortSet(connection = connection, cohortDatabaseSchema = cohortDatabaseSchema, cohortTableNames = cohortTableNames, cdmDatabaseSchema = cdmDatabaseSchema, tempEmulationSchema = tempEmulationSchema, cohortDefinitionSet = cohortDefinitionSet) at CreateAllCohorts.R#35

  9. createCohorts(connectionDetails = connectionDetails, cdmDatabaseSchema = cdmDatabaseSchema, cohortDatabaseSchema = cohortDatabaseSchema, cohortTableNames = list(cohortTable = cohortTable, cohortInclusionTable = cohortInclusionTable, cohortInclusionResultTable = cohortInclusionResultTable, cohortInclusionStatsTable = cohortInclusionStatsTable, … at Main.R#135

  10. execute(connectionDetails = connectionDetails, cdmDatabaseSchema = cdmDatabaseSchema, cohortDatabaseSchema = cohortDatabaseSchema, cohortTable = cohortTable, outputFolder = outputFolder, databaseId = databaseId, databaseName = databaseName, databaseDescription = databaseDescription, verifyDependencies = TRUE, …

Hi @gongliangz

I fixed this error by adding "ignoreTransactions=1" to the end of the connectionString argument in createConnectionDetails.

Then I realized that I was still using DatabaseConnector v5.0.4. I tried the same fix using DatabaseConnector v6 and I was still getting that error. :man_shrugging:
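
For reference, the v5.0.4 connection setup looked roughly like this (host, HTTP path, and token are placeholders; the important bit is the ignoreTransactions=1 at the end of the connection string):

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "spark",
  connectionString = paste0(
    "jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;",
    "httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>;",
    "ignoreTransactions=1"
  ),
  pathToDriver = "/path/to/jdbc/drivers"
)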

Hi Kyle,
Thank you for the information. I will try it.
Best,
GZ

Hi team,
Databricks has a JDBC driver (https://www.databricks.com/spark/jdbc-drivers-download).
However, the DatabaseConnector R package has some predefined settings (see below). Did anyone succeed in using the Databricks JDBC driver? Thanks!

connectSpark <- function(connectionDetails) {
  inform("Connecting using Spark JDBC driver")
  jarPath <- findPathToJar("^SparkJDBC42\\.jar$", connectionDetails$pathToDriver)
  driver <- getJbcDriverSingleton("com.simba.spark.jdbc.Driver", jarPath)
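
As a side note, would something like the following work to load the Databricks driver directly via RJDBC, bypassing the hard-coded Simba driver class? (The driver class name and URL pattern are taken from the Databricks JDBC documentation; host and HTTP path are placeholders.)

library(RJDBC)

# Load the newer Databricks driver class from the downloaded jar.
drv <- JDBC(
  driverClass = "com.databricks.client.jdbc.Driver",
  classPath = "/path/to/DatabricksJDBC42.jar"
)

conn <- dbConnect(
  drv,
  "jdbc:databricks://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>"
)

dbGetQuery(conn, "SELECT 1")
dbDisconnect(conn)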

@kzollove: Below is a summary of my error messages.
DatabaseConnector v5.0.4, with or without "ignoreTransactions=1": the execute function can run for a certain number of steps, then fails with this error message:
ad 1 returns error: "object 'connectionString' not found" when using argument(s): list(args = list(connectionDetails = list(dbms = "spark", extraSettings = NULL, oracleDriver = "thin", pathToDriver = "c:/temp/jdbcDrivers", user = function () rlang::eval_tidy(userExpression)

Dr. Schuemie suggested using DatabaseConnector v6. I tried v6 with or without "ignoreTransactions=1", and an error occurred when loading cohort definitions:
Error in rJava::.jcall(conn@jConnection, "Ljava/sql/DatabaseMetaData;", :
java.sql.SQLException: [Simba]JDBC Cannot use commit while Connection is in auto-commit mode.
