
Connect to Databricks using Broadsea deployment

I am trying to do a Broadsea deployment that needs to connect to a CDM database in Databricks. I cloned the Broadsea repo, went through the steps to connect to the Postgres webapi schema, and ran the source daimon script:

INSERT INTO webapi.source( source_id, source_name, source_key, source_connection, source_dialect, is_cache_enabled) 
VALUES (7, 'ADB_OMOP_7', 'ADB_OMOP_7', 
  'jdbc:spark://<databricks_url>:443/default;transportMode=http;ssl=1;httpPath=<somePathHere>;AuthMech=3;UseNativeQuery=1;UID=token;PWD=<personalAccessToken>', 'spark', true); 
 
-- CDM daimon 
INSERT INTO webapi.source_daimon( source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (23, 7, 0, 'omopcdm_demo', 2); 
 
-- VOCABULARY daimon 
INSERT INTO webapi.source_daimon( source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (24, 7, 1, 'omopcdm_demo', 2); 
 
-- RESULTS daimon 
INSERT INTO webapi.source_daimon( source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (25, 7, 2, 'omopcdm_demo', 2); 
 
-- EVIDENCE daimon 
INSERT INTO webapi.source_daimon( source_daimon_id, source_id, daimon_type, table_qualifier, priority) VALUES (26, 7, 3, 'omopcdm_demo', 2); 
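
As a sanity check, the inserted rows can be read back from the same webapi schema (a sketch; daimon_type values 0/1/2/3 correspond to CDM, Vocabulary, Results, and Evidence, matching the comments above):

-- verify the source and its daimons were registered
SELECT s.source_key, d.daimon_type, d.table_qualifier, d.priority
FROM webapi.source s
JOIN webapi.source_daimon d ON d.source_id = s.source_id
WHERE s.source_id = 7;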

I downloaded the Spark JDBC driver from Databricks and placed it in the root directory of the cloned Broadsea repository (where docker-compose.yml is located).

However, when I run Broadsea and check the Docker logs for the ohdsi-webapi container, I see this error:

http-nio-8080-exec-7 com.odysseusinc.logging.LoggingService - [] - Could not get JDBC Connection; nested exception is java.sql.SQLException: No suitable driver found for jdbc:spark:/...

When I check the logs, I see that there is an error loading the Spark JDBC driver:

...
ohdsi-webapi      | ... error loading com.simba.googlebigquery.jdbc42.Driver driver. com.simba.googlebigquery.jdbc42.Driver
ohdsi-webapi      | 2023-03-18 05:25:33.216 INFO localhost-startStop-1 org.ohdsi.webapi.DataAccessConfig - [] - error loading org.apache.hive.jdbc.HiveDriver driver. org.apache.hive.jdbc.HiveDriver
ohdsi-webapi      | 2023-03-18 05:25:33.217 INFO localhost-startStop-1 org.ohdsi.webapi.DataAccessConfig - [] - error loading com.simba.spark.jdbc.Driver driver. com.simba.spark.jdbc.Driver
ohdsi-webapi      | 2023-03-18 05:25:33.217 INFO localhost-startStop-1 org.ohdsi.webapi.DataAccessConfig - [] - error loading net.snowflake.client.jdbc.SnowflakeDriver driver. net.snowflake.client.jdbc.SnowflakeDriver
....

Any hints as to what I am doing wrong here?

The drivers won't be found in the root folder of the repo. Instead, if you are able to navigate to the filesystem inside the broadsea container, you can place the JDBC drivers under the Tomcat directory: webapps/WebAPI/WEB-INF/lib. If you restart the Tomcat service, it should reload the libs found there for WebAPI, and the driver should be found (you should see it loaded in your startup log).
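
For example, from the host, something like this (a sketch; the container name and the Tomcat webapps path inside the image are assumptions, so confirm them with docker ps and docker exec first):

docker cp SparkJDBC42.jar ohdsi-webapi:/usr/local/tomcat/webapps/WebAPI/WEB-INF/lib/
docker restart ohdsi-webapi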

I think the 'correct' way is to have the WebAPI build include the webapi-snowflake profile in the Maven build command so that the drivers are downloaded and installed into the WEB-INF folder automatically. @Ajit_Londhe, do you have information on how this can work? I remember you were working on adding parameterization to the Docker configuration; I wonder if there's a way to add additional properties for Databricks and Google BigQuery? Or maybe have the Docker build include those profiles by default?
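
The build invocation would look something like this (a sketch; webapi-docker and webapi-snowflake are profile names mentioned in this thread, and whether each profile's driver artifact actually resolves from Maven Central is exactly the open question):

mvn clean package -DskipTests -Pwebapi-docker,webapi-snowflake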


Bringing a custom JDBC driver into the WebAPI build isn't currently in scope for me. I was focused on using WebAPI as-is, but with security, SSL, Ares, and Solr Vocab conveniences.

I'll need to review how to bring in a different JDBC driver.

Thanks @Chris_Knoll and @Ajit_Londhe for the responses!
@Chris_Knoll - what you said makes total sense. I was going through the Broadsea README, specifically this section, which gave me the impression that the non-shipped drivers need to be added to the host directory where the docker-compose.yml is located. As you suggested, I'll try to build a container image using the webapi-spark profile and see if that works.

@Ajit_Londhe - since you have contributed to the Databricks integration for WebAPI, I was wondering how WebAPI can connect to Databricks via JDBC. When I look at the WebAPI project POM, I see a profile called webapi-spark. However, when I use it to build a Docker image, the Maven build fails because the jar spark-2.6.22.1040.jar is not available in Maven Central. Is there a way of using WebAPI as-is to connect to Databricks?

As a test, I was able to get the setup working with Databricks by including the Spark JDBC driver downloaded from Databricks, tweaking the webapi-docker profile to include the Spark dependencies (which also needed adjusting), building a container image, and using that with the Broadsea docker-compose.yml.
Would love to understand if there is a 'recommended' way of doing this for Databricks.
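
For anyone attempting the same, one way to make the webapi-spark profile's dependency resolve is to register the downloaded jar in the local Maven repository before building (a sketch; the groupId/artifactId/version here are inferred from the missing artifact name above and must match what the profile's pom expects):

mvn install:install-file -Dfile=SparkJDBC42.jar -DgroupId=com.simba \
  -DartifactId=spark -Dversion=2.6.22.1040 -Dpackaging=jar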

Another observation: it looks like the Simba JDBC driver is only available from the Databricks site (not sure if it is licensed software), and the newer versions of the Databricks driver no longer use the com.simba.* package namespace but com.databricks.* instead. This would need to be adapted in the WebAPI DataAccessConfig as well.
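
For reference, the two driver generations differ roughly like this (an assumption to verify against the documentation of whichever driver version you download):

legacy Simba driver:      class com.simba.spark.jdbc.Driver, URL prefix jdbc:spark://
newer Databricks driver:  class com.databricks.client.jdbc.Driver, URL prefix jdbc:databricks://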

Hi @venkyvb - I am attempting to perform similar steps for our use case and was curious whether you would be willing to share any further details about what was updated in the docker-compose.yml. I updated the data source details, added logic to include the Spark driver, and have it spinning up the Databricks cluster, but I'm not sure what details (if any) would need to be updated for the Spring/Flyway inputs.

Thank you in advance!

Hi, in the docker-compose.yml the key change was to point to the custom ohdsi-webapi container image: https://github.com/venkyvb/Broadsea/blob/master/docker-compose.yml#L76

Thank you, this was helpful! I was actually able to get the Spark driver to load successfully without a custom image by adding a volume to the ohdsi-webapi section of the docker-compose.yml file as follows:

  volumes:
    - "./SparkJDBC42.jar:/var/lib/ohdsi/webapi/WEB-INF/lib/SparkJDBC42.jar"

What I was missing was how to add the custom source daimons to the webapi database, but was able to connect the dots after seeing your work. Thank you again!
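
One note in case it helps others: if the ohdsi-webapi container already exists, a new bind mount like the one above only takes effect after the container is recreated, e.g. (a sketch, assuming the service name used in the Broadsea compose file):

docker compose up -d --force-recreate ohdsi-webapi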


How do I get to the Docker filesystem? I'm not seeing anything in the places mentioned here:
https://stackoverflow.com/questions/33848947/accessing-docker-container-files-from-windows

C:\ProgramData\DockerDesktop\vm-data\DockerDesktop.vhdx

No vm-data folder

\\wsl$\docker-desktop-data\version-pack-data\community\docker\overlay2

No version-pack-data folder

Making some progress here with lots of great help from @Sanjay_Udoshi!

But I'm still getting the "No suitable driver found" exception. The stack trace is attached:
stack-trace.txt (91.1 KB)

My docker-compose.yml is attached.
docker-compose.yml.txt (6.3 KB)

I’ve added the JDBC driver to the Broadsea root directory.

My source and source_daimon entries look like this:


docker ps looks like this:

The output of docker images looks like this:

I'm able to connect to this Databricks instance using the URL shown here and the JDBC driver I've added to the Broadsea root directory.

What do I need to do to fix this?

I’ve typed up my notes and created a How-to page for setting up Broadsea with Databricks. This is currently a DRAFT version. Let me know what you think.

https://ohdsi.github.io/DatabaseOnSpark/developer-how-tos_broadsea.html

— EDIT -----------------------
Adding this as an edit as the forum software won’t let me do 3 or more replies in a row
— EDIT -----------------------

Well, it looks like I'm not out of the woods yet…

I’ve run Achilles and put the results in demo_cdm_ach_res. However, when I launch Atlas, navigate to the “Data Sources” page, and select a report, I get the “Loading Report” gif and everything seems to stop there. There doesn’t seem to be any activity in the broadsea or webapi logs in Docker.

I'm thinking this line at the end of the broadsea log in Docker might be a clue, but I'm not sure what it means:

2023-05-16 15:09:55 ohdsi-atlas | 2023/05/16 19:09:55 [error] 9#9: *1 "/usr/share/nginx/html/atlas/js/config-gis.js/index.html" is not found (2: No such file or directory), client: 172.23.0.6, server: , request: "GET /atlas/js/config-gis.js/ HTTP/1.1", host: "127.0.0.1", referrer: "http://127.0.0.1/atlas/"

I'm wondering if this is due to the missing achilles_analysis table (see last screenshot below) described here (with a workaround): https://forums.ohdsi.org/t/error-running-achilles-against-databricks/18575/3

Any help resolving this would be greatly appreciated.

Here's my docker-compose.yml file:
docker-compose.yml.txt (6.3 KB)

Here's what my webapi records look like in PostgreSQL:
SOURCE

SOURCE_DAIMON

Here’s my vocabulary database in Databricks:

Here’s my Achilles results database in Databricks

Here’s my webapi log in Docker:

Here’s my broadsea log from Docker:

More tables in the demo_cdm_ach_res schema:

I'm still getting the same behavior: the application seems to hang if I try to run a "Data Sources" report (e.g. Dashboard). The application also seems to hang if I try to generate a cohort (I can generate against the PostgreSQL database, but not against the Databricks database). It does not look like much is being written to the logs (nothing for the Data Sources report). @Chris_Knoll and @Ajit_Londhe: is there a setting that will provide additional logging, or are there different logs somewhere else I should be looking at for more information (e.g. what tables are being queried for the Data Sources report, and/or whether there are any errors/exceptions)?

Full log files and docker-compose.yml are attached:
web-api-log.txt (36.3 KB)
broadsea-log.txt (291.8 KB)
docker-compose.yml.txt (6.3 KB)

— EDIT ----------------------------

I was able to get everything (Atlas with a PostgreSQL webapi schema and everything else in Databricks) to work in a standard/non-Docker environment. I'm able to see much more output in the Tomcat console window than in the Docker webapi and broadsea logs shown in the Docker Desktop application. I'm still not sure how to view all of the Tomcat logs when running in Docker (i.e. Broadsea).
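
A generic way to get at the full container stdout (which should include the Tomcat console output) is docker logs on the host; a sketch, assuming the WebAPI container is named ohdsi-webapi (confirm with docker ps):

docker logs -f ohdsi-webapi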

I'll keep you posted 🙂

So you do have the 3 required Achilles tables in your results schema? The screenshot shows only the temp tables produced by Achilles during the achilles() execution. Please confirm you have achilles_analysis, achilles_results, and achilles_results_dist.
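
A quick way to check from the Databricks side (a sketch in Spark SQL, using the results schema name from earlier in the thread):

SHOW TABLES IN demo_cdm_ach_res LIKE 'achilles*';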
