OHDSI Home | Forums | Wiki | Github

Databricks (Spark) coming to OHDSI stack

Hey @Ajit_Londhe,

Yesterday there was a breakthrough with respect to the issue of the spark JDBC driver throwing errors on some of the SQL in the HADES packages.

It appears that legacy Spark JDBC drivers translate ANSI-SQL-92 queries to the Databricks SQL dialect before sending them to the server. If the application generates (spark) Databricks SQL directly, UseNativeQuery=1 should be stated as a connection configuration. With that setting, the driver passes the SQL queries verbatim to Databricks.


We’re running tests now with the new connection string and it seems to have addressed the SQL hiccups we had. I will post back next week with a more thorough assessment.

If this solves the issue, maybe DatabaseConnector can continue supporting Spark with the added reference to the configuration, at least until such time as a partnership with Databricks can be finalised, I would imagine.

1 Like

@Luis_Pinheiro and @schuemie – I owe you and other Spark folks annoyed an apology. I neglected to mention that we used the “UseNativeQuery=1” jdbc parameter at our site to get it to work.

Documentation, documentation, documentation…

@schuemie – this is what I wanted to verify with the OHDSI spark instance. We will test this and get back to you.

@Ajit_Londhe I sent an email to your OHDSI address yesterday re convening a group to discuss support for DataBricks in OHDSI. Let me know if you didn’t get it and if you’re willing let me know your new email address. Thanks!

Just replied

Thanks Ajit!

We’ll be going forward with a call about community support for Databricks and related test environment needs.

John Gresh, your idea of a working group will be on the agenda and I will assume you want an invite unless you say otherwise.

We will make sure to include Martijn and Lee so we have a full understanding of the testing environment needs.

Everyone on this thread and everyone involved in using Databricks in the community is welcome.

Chris Roeder, Brad Rechkemmer, Thomas White, Luis Pinheiro, and @venkyvb please post here or email me if you are interested or know of others who should be included.

Amir Kermany would you or someone else from Databricks be able to join?

In addition to those on the thread, we’ll invite Janos Hajagos who disseminated great information about his work at Stoney Brook U in this poster, and the amazing Jacob Zelco who let me know that GTRI is very interested in all this. We’ll invite the estimable Paul Nagy and anyone else from the fantastic JHU team he thinks should be involved.

Various folks from Tufts working our ETL rebuild including Kyle Zolo-Venecek and Cal Harding and Elina Hadelia and hopefully some new faces will be there also.

It will probably take a minute to get everyone’s schedule aligned, so please be as prompt and flexible as you can when Kevin Auguste contacts you about this. Thanks!

1 Like

Thanks @Andrew ! I would be interested in this topic.

Glad to hear it Venki. If you send me your email address, I’ll make sure Kevin includes you in the Doodle polls for scheduling.

Hey Andrew, I have just replied with my email address. Thanks !

Hi @Andrew
I replied with my email and a colleague of mine’s email address.

I’m in! @Andrew, I messaged you my email.

Hello @Luis_Pinheiro - I am trying to set-up OHDSI tools to work with OMOP CDM in Databricks. As a part of this I am at the stage where I want to run Achilles scripts to create the correspond tables. I came across this thread and wanted to check if you have some details around the steps and the SQL notebooks needed to run Achilles with CDM in Databricks (delta lake).
Thanks a lot !

Hi @venkyvb,

Since the inclusion of the UseNativeQuery=1 argument in the connection string, Achilles runs fine for me! I’m not sure what you mean with SQL notebooks, if you mean creating them on Databricks, I don’t do that unless I need to debug the SQL.

I’m also aware that there might be other SQL translation issues with other OHDSI packages, but I’ve not yet come across these.

Finally, I’d note that an OHDSI Databricks User Group has been created that will help address issues with Databricks/Spark.

Does this answer your question?

1 Like

@venkyvb @Luis_Pinheiro

I have been troubleshooting running Achilles against our CDM in Databricks. I am running Achilles from my desktop RStudio using a JDBC connection to Databricks. Along with UseNativeQuery=1, I also needed to add flag ignoreTransactions=1 to the connection string in order to alleviate the 10040 error (though I am not sure if this error was affecting output):

If I run a small subset of analyses, I get the “Error in writeChar…” seen above. It shows up in achilles_log.txt as a FATAL.

For a small subset of analyses, it seems like Achilles results are being written (…correctly possibly?), but the achilles function clearly does not finish running correctly as none of the temporary files are deleted.

Are most people running achilles using a remote JDBC connection to databricks?

Hopefully there is a useful nugget in here for someone less far along than me. If anyone has advice on fixing this recent error, I’d love to know!

1 Like

I am trying to do the same (run Achilles from my laptop) connecting to CDM in Databricks. I am using the Broadsea-Achilles image and I am facing a connection issue. Are you using Personal Access Token to authenticate to Databricks or using Username/Password ?

Achilles uses JDBC connections, which Databricks supports. You wouldn’t use Databricks notebooks.

Personal Access Token

Hi Everybody,

We’re in the process of setting up our Databricks OHDSI instance. Is there existing documentation on how to stand up an instance of OMOP/OHDSI/WebAPI specifically or is it currently embedded in the existing documentation listed below?

Do folks generally create a schema using ddl or do most (some/any) etl the data directly (e.g. from exported .csv files that are uploaded directly and used to create the CDM tables)?

Is the process to deploy using Databricks basically: create/populate your CDM tables, update your webapi data source tables to point to the CDM in Databricks, update the config files for WebAPI/Atlas (and of course, integrate the correct jdbc jar and configure security), and your done or is there more to it that this?

Is there a recommendation for weather to create the webapi schema in Databricks or to keep it in a relational database like PostgreSql?

@venkyvb @kzollove: Can you provide the configuration files (with redacted uid/pwd/etc.) you used for the Broadsea implementation?

WebAPI Deployment: The following are the high-level steps for deploying WebAPI

Pre-requite software is listed here:
Install and configure a PostreSQL server, database and users for use when deploying WebAPI:
PostgreSQL Installation Guide · OHDSI/WebAPI Wiki · GitHub.
Verify the database connectivity:
WebAPI Installation Guide · OHDSI/WebAPI Wiki · GitHub
Install WebAPI:
WebAPI Installation Guide · OHDSI/WebAPI Wiki · GitHub
Deploy WebAPI to Tomcat:
WebAPI Installation Guide · OHDSI/WebAPI Wiki · GitHub
Verify deployment:
WebAPI Installation Guide · OHDSI/WebAPI Wiki · GitHub
Troubleshooting FAQ:
WebAPI Installation Guide · OHDSI/WebAPI Wiki · GitHub
Configure the CDM for use with WebAPI:
CDM Configuration · OHDSI/WebAPI Wiki · GitHub

Atlas deployment
The ATLAS setup guide is found here:
https://github.com/OHDSI/Atlas/wiki/Atlas-Setup-Guide. The high level steps include:
Troubleshooting guide:

We used Odysseus to create our initial OHDSI instance, which used SQL Server. When we shifted to using DataBricks, we didn’t have to change anything in the OHDSI stack. So, the standard install is likely to work OK.

We kept all of the WebAPI calls using PostGres, both because it has great performance, and because the queries aren’t certified for Databricks.

For the last 9 months, we’ve been using Databricks to do all of our OMOP data pipelining & ETL - from creating schemas & tables through populating the OMOP instances, plus running all of the QA. So, rather than using R to run Achilles or DQD directly, we used R to generate Spark SQL, then run that SQL directly in our ETL pipeline. This has improved performance and made it easier to monitor, start and resume if anything goes wrong. This also let us easily customize what we needed (e.g. adding data lineage tracking + local columns to certain OMOP tables, plus tweaking or commenting out sections of Achilles code that we didn’t need or which were inefficient) - while maintaining all of the packaged ETL in our git repo to facilitate distributed development, testing, and release management.

As a last thought, I recommend using a DataBricks SQL Warehouse instance (rather that a shared compute cluster) to support your instance. The Warehouse approach has better caching, logging, and throughput than a data engineering compute cluster.

Although Databricks gives use much better performance on standard Atlas analyses (e.g. cohort generation, characterization, incidence rates, etc.), we haven’t gotten to the point that we can use it for GUI navigation (e.g. clicking searching for concepts and clicking through hierarchy and relationships). So, we have a separate vocabulary_only data source connected to SQL Server to get the desired GUI performance. A benchmarks, when we use SQL Server to do vocabulary searches, the usual response time is about 3 seconds per click. Using a DataBricks SQL Warehouse, it is about 6-8 seconds per click (hierarchy or relationships). Using a non-warehouse compute cluster, it is about 25-30 seconds per click. My hope is that with better table or query optimization, we can get comparable performance on GUI navigation using a DataBricks SQL Warehouse.

Theoretically, Serverless Warehouses will provide even better throughput, so that may be worth exploring to ensure that the GUI has adequate performance.


The WebAPI schema (and the actual database) should be on PostgreSQL because the application migration scripts are only maintained for PGSql. But the WebAPI database is not your analytic platform: WebAPI just stores information on definitions, job execution status, security roles, and other application support functions. It depends on typical RDBMS functions that are not available in those large-scale data-warehouse platforms (like sequences, foreign keys, etc). So, you should expect to have a dedicated PG instance to host WebAPI and N number of external CDMs that WebAPI references through the SOURCE table (which points to a variety of CDM platforms like mssql, oracle, PG, databricks, netezza, pdw, redshift, etc)