
Databricks (Spark) coming to OHDSI stack

Hello all,

I’m happy to announce that we’re getting closer to official Databricks (Spark) support in much of the OHDSI stack! We will present some of this work during the August 17, 2021 community call.

Here’s a roadmap of upcoming deliverables:

Add Spark translation support to SqlRender
ETA: Master branch in August 2021
We now have Spark support in the develop branch of SqlRender. We’ve provided a testing Databricks cluster to the OHDSI CI platform, but some configuration is pending.

Add Spark support to Atlas / WebAPI
ETA: Atlas v2.10 end of August 2021
Thanks to @gregk and the Odysseus team, we have a working version of Atlas with Spark support. We are aiming to submit a PR for these changes in August, targeting Atlas v2.10.

Add Spark support for DatabaseConnector
ETA: September 2021
We are checking with Databricks whether the JDBC driver can be included (ideally yes, but we can move forward even if not). This code also handles bulk insert via DBFS (Databricks File System).

Testing data quality R packages
ETA: November 2021
Thus far, we have Achilles working, but DataQualityDashboard is pending.

Testing PLE HADES R packages
ETA: January 2022
CohortDiagnostics and CohortMethod will be prioritized here.

Tagging a few folks I believe have been interested:
@Vojtech_Huser @krfeeney @Christian_Reich @msuchard



Congratulations - great addition, Ajit! I know this specific database addition will definitely make a few folks excited.

Hi @Ajit_Londhe, we are currently in the process of implementing an OMOP data model here at NACHC in our Databricks environment. We are very interested in collaborating and contributing with others in the community that are also working in Databricks/Apache Spark. We are especially interested in testing/validation tools (as we would like to test/validate our system) as well as what we need to do to enable OMOP users to query our system. Please let us know how we can best engage with the community.


Sure, would be great to connect. We will be presenting a bit about this work during the Aug 17 community call, but happy to jump on a call.

At the moment, our public work is the develop branch of SqlRender (OHDSI/SqlRender on GitHub), which is a good place to get started.
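For anyone wondering what "Spark translation" means in practice, here is a toy Python sketch of the kind of dialect rewriting involved. It is only an illustration under my own assumptions: the function name `translate_to_spark` is hypothetical, the two rules shown (T-SQL `TOP n` to `LIMIT n`, and two-argument `ISNULL` to `COALESCE`) are just well-known differences between SQL Server syntax and Spark SQL, and none of this is SqlRender's actual rule set or implementation. It also only handles a single, simple statement.

```python
import re


def translate_to_spark(sql: str) -> str:
    """Toy rewrite of a single T-SQL-style statement into Spark SQL.

    Hypothetical helper for illustration; NOT SqlRender's real rules.
    """
    # Two-argument ISNULL(a, b) has no direct Spark equivalent
    # (Spark's ISNULL takes one argument), so use COALESCE(a, b).
    sql = re.sub(r"\bISNULL\s*\(", "COALESCE(", sql, flags=re.IGNORECASE)

    # SELECT TOP n ...  ->  SELECT ... LIMIT n
    match = re.search(r"\bSELECT\s+TOP\s+(\d+)\s+", sql, flags=re.IGNORECASE)
    if match:
        limit = match.group(1)
        # Drop the "TOP n" part but keep the SELECT keyword.
        sql = sql[:match.start()] + "SELECT " + sql[match.end():]
        # Append LIMIT before the trailing semicolon, if there is one.
        body = sql.rstrip()
        has_semicolon = body.endswith(";")
        body = body.rstrip(";").rstrip()
        sql = f"{body} LIMIT {limit}" + (";" if has_semicolon else "")
    return sql


if __name__ == "__main__":
    source = (
        "SELECT TOP 10 person_id, "
        "ISNULL(gender_source_value, 'unknown') FROM person;"
    )
    print(translate_to_spark(source))
```

The real package covers far more cases (temp tables, date functions, and so on), so treat this purely as a mental model and look at the SqlRender repository for the actual behavior.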

Congratulations! We are super excited about this.

Thanks @Ajit_Londhe! I’ll take a look at what is there. We’re working on getting some of our general Databricks utilities published to a public Maven repository (the code is on GitHub at NACHC-CAD/core). I would be excited to share what we are doing with you (and anyone else interested). Is there an OMOP Databricks/Apache Spark working group? Please reach out to me with your availability to meet at johngresh@curlewconsulting.com.

Has there been any progress?

The SqlRender master branch now has all the necessary Spark translations. In DatabaseConnector, we ran into an issue with insertTable (batch load), which is the current blocker before we raise a PR.

Atlas 2.11 has the Spark support, though it is not officially released yet.

DQD and HADES are backlogged for us at this point, but we’re hoping to pick them back up in a few months.

Ajit, we’re looking forward to being able to connect the OHDSI stack to Databricks, so thanks for your leadership on this.

I have a PR for DQD that works for Databricks in sqlOnly mode. It also works for Spark on HDInsight (also in sqlOnly mode). It may work from R, but I’ve never tested that, as I just drop the SQL into my ETL pipeline.