
Databricks (Spark) coming to OHDSI stack

Hello all,

I’m happy to announce that we’re getting closer to official Databricks (Spark) support in much of the OHDSI stack! We will present on this briefly during the August 17, 2021 community call.

Here’s a roadmap of upcoming deliverables:

Add Spark translation support to SqlRender
ETA: Master branch in August 2021
We now have Spark support in the develop branch of SqlRender. We’ve provided a test Databricks cluster for the OHDSI CI platform, but some configuration is still pending.
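For anyone who wants to experiment now, translating OHDSI SQL to the Spark dialect looks roughly like this (a minimal sketch against the develop branch; the schema name is a placeholder):

# a minimal sketch, assuming the develop branch of SqlRender
# remotes::install_github("OHDSI/SqlRender", ref = "develop")
library(SqlRender)

# OHDSI SQL with a construct Spark does not support natively (TOP)
sql <- render("SELECT TOP 10 * FROM @cdm_schema.person;", cdm_schema = "omop")

# translate to the Spark dialect
translate(sql, targetDialect = "spark")
# e.g. "SELECT * FROM omop.person LIMIT 10;"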

Add Spark support to Atlas / WebAPI
ETA: Atlas v2.10 end of August 2021
Thanks to @gregk and the Odysseus team, we have a working version of Atlas with Spark support. We are aiming to submit a PR for these changes in August, targeting Atlas v2.10.

Add Spark support for DatabaseConnector
ETA: September 2021
We are checking with Databricks whether the JDBC driver can be bundled (ideally yes, but we can move forward even if not). This code also handles bulk insert via DBFS (Databricks File System).
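As a rough sketch of what connecting could look like once this lands (host, httpPath, and token are placeholders, and the "spark" dbms value assumes the pending DatabaseConnector changes):

library(DatabaseConnector)

# placeholder connection string; the dbms value "spark" assumes the pending PR
connectionDetails <- createConnectionDetails(
  dbms = "spark",
  connectionString = "jdbc:spark://<host>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3",
  user = "token",
  password = Sys.getenv("DATABRICKS_TOKEN")
)

connection <- connect(connectionDetails)
querySql(connection, "SELECT COUNT(*) AS n FROM omop.person;")
disconnect(connection)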

Testing data quality R packages
ETA: November 2021
Thus far, we have Achilles working, but DataQualityDashboard is pending.
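For anyone following along, running Achilles on Spark uses the standard entry point; a sketch (schema names are placeholders, connectionDetails as in the sketch above):

library(Achilles)

# runs the Achilles characterization analyses against the Spark CDM
achilles(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "omop",
  resultsDatabaseSchema = "omop_results"
)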

Testing PLE (population-level estimation) HADES R packages
ETA: January 2022
CohortDiagnostics and CohortMethod will be prioritized here.

Tagging a few folks I believe have been interested:
@Vojtech_Huser @krfeeney @Christian_Reich @msuchard

Thanks,
Ajit

Congratulations on a great addition, Ajit! I know this particular database will definitely make a few folks excited.

Hi @Ajit_Londhe, we are currently implementing an OMOP data model here at NACHC in our Databricks environment. We are very interested in collaborating with and contributing alongside others in the community who are also working with Databricks/Apache Spark. We are especially interested in testing/validation tools (as we would like to test and validate our system), as well as in what we need to do to enable OMOP users to query our system. Please let us know how we can best engage with the community.

Thanks!
John

Sure, it would be great to connect. We will be presenting a bit about this work during the Aug 17 community call, but I’m happy to jump on a call as well.

At the moment, the public part of our work is SqlRender (GitHub - OHDSI/SqlRender at develop); that’s a good place to get started.

Congratulations! We are super excited about this.

Thanks @Ajit_Londhe! I’ll take a look at what is there. We’re working on getting some of our general Databricks utilities published to a public Maven repository (the code is on GitHub here: GitHub - NACHC-CAD/core). I would be excited to share what we are doing with you (and anyone else interested). Is there an OMOP Databricks/Apache Spark working group? Please reach out to me at johngresh@curlewconsulting.com with your availability to meet.

Has there been any progress?

The SqlRender master branch now has all the necessary Spark translations. For DatabaseConnector, we ran into an issue with insertTable (batch load), so that is the current blocker before we raise a PR.

Atlas 2.11 has Spark support, though it has not been officially released yet.

DQD and HADES are backlogged for us at this point, but we’re hoping to pick them back up in a few months.

Ajit, we’re looking forward to being able to connect the OHDSI stack to Databricks, so thanks for your leadership on this.

I have a PR for DQD that works for Databricks in sqlOnly mode. It also works for Spark on HDInsight (also in sqlOnly mode). It may work from R, but I’ve never tested that, as I just drop the SQL into my ETL pipeline.
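For anyone who wants to reproduce that workflow, the call is roughly the following (a sketch; schema names and the output folder are placeholders, connectionDetails as sketched earlier in the thread):

library(DataQualityDashboard)

# sqlOnly = TRUE writes the check SQL to outputFolder instead of executing it,
# so the statements can be dropped into an ETL pipeline
executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "omop",
  resultsDatabaseSchema = "omop_results",
  cdmSourceName = "my_cdm",
  sqlOnly = TRUE,
  outputFolder = "dqd_sql"
)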

@Ajit_Londhe, we’ve been really happy with OHDSI performance on Databricks. However, we’ve run into two issues and hope you know of a solution:

  1. Heracles Cohorts - these run for the expected duration, but fail to insert results in one of the final steps.
  2. Cohort Pathways - a similar challenge: they run for the expected duration, but don’t insert results into the final tables.

My guess is that this has something to do with SqlRender. Have you run into this issue and/or found a solution?

/Tom

Hi @Thomas_White ,

Heracles we just haven’t used, so that could be a blind spot that needs fixing.

Pathways work, though. Are you using the official SqlRender release and not the old fork version I shared previously?

Tagging @Brad_Rechkemmer as my role is changing.

Thanks,
Ajit

@Ajit_Londhe - thanks. All the best with your new role. Tagging @Brad_Rechkemmer too.

Yes, we’re using the official release of SqlRender that comes with Atlas 2.11.1.

For Heracles, we’re getting this message:

[Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: [DELTA_INSERT_COLUMN_MISMATCH] com.databricks.sql.transaction.tahoe.DeltaAnalysisException: Column stratum_5 is not specified in INSERT

When running the final query:

insert into omop.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, count_value)
select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING),
 cast(stratum_3 as STRING), cast(stratum_4 as STRING), count_value
from tmp.xrfc9243results_1
UNION ALL
select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING),
 cast(stratum_3 as STRING), cast(stratum_4 as STRING), count_value
from tmp.xrfc9243results_2

Effectively, since stratum_5 isn’t explicitly mentioned in the INSERT statement, Spark throws an exception.

I’m guessing that is an issue with SqlRender or the Spark JDBC driver rather than an issue with Heracles itself.

I think it’s because we didn’t add the sparkHandleInsert function call from SqlRender for Heracles in WebAPI. Essentially, we need to intercept any INSERT commands and reconstruct the query to use the full column list of the destination table. This works everywhere else in Atlas, except with Heracles.

This could also be resolved by adopting a ‘proper form’ of SQL: we just need to be careful how we structure the code so that it works on the widest possible range of DBMSs. In this case, if the columns need to be declared (i.e., cast(x as int) as colName), we should do so. Putting in special dialect-specific handlers is something we want to avoid. Perhaps there’s a way we could take a SQL statement and scan it for potential problems, such as SELECTs that derive a column value without naming the column, or mismatches where the INSERT declares one set of columns but the SELECT produces a different set. Then, even though the statement may work on some platforms, we could log warnings.
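To make that concrete, a defensively written version of the failing Heracles insert would name every destination column and alias every derived column; a sketch (assuming stratum_5 should simply be NULL for these analyses):

insert into omop.heracles_results (cohort_definition_id, analysis_id,
  stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value)
select cohort_definition_id, analysis_id,
  cast(stratum_1 as STRING) as stratum_1, cast(stratum_2 as STRING) as stratum_2,
  cast(stratum_3 as STRING) as stratum_3, cast(stratum_4 as STRING) as stratum_4,
  cast(null as STRING) as stratum_5, count_value
from tmp.xrfc9243results_1

With the column lists aligned on both sides, Spark no longer has to infer a value for the missing stratum_5.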

Great thread. I didn’t find any announcement threads, but wanted to check whether OHDSI supports Spark (Delta tables)? I see that under the DDL (CommonDataModel/inst/ddl/5.4 at main · OHDSI/CommonDataModel · GitHub) there seems to be support for Spark. A related question: does usage of Spark require Databricks, or would any Spark cluster work?

Hi @venkyvb – any Spark cluster should do, as this is based on the Spark 3.1+ specification.

At this time, it has been tested with Atlas, but more testing is needed for the HADES packages. Tagging @Brad_Rechkemmer
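If it helps, the CommonDataModel R package can apply that DDL for you; a sketch, assuming the executeDdl() interface and reusing the connectionDetails shape sketched earlier in the thread:

library(CommonDataModel)

# instantiates the CDM 5.4 tables in the target schema; the SQL dialect is
# taken from the connectionDetails (schema name is a placeholder)
executeDdl(
  connectionDetails = connectionDetails,
  cdmVersion = "5.4",
  cdmDatabaseSchema = "omop"
)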

Thanks @Ajit_Londhe. This is really awesome!

Hi all,

I’m looking for broader feedback on your experience with Databricks.

Our experience using OHDSI packages from a local R instance has been somewhat frustrating. A number of SQL statements from Achilles and other packages throw errors with the old Simba JDBC driver, but do run in a Databricks cloud SQL notebook. We’ve spent quite some time changing the SQL by hand, but that’s not sustainable.

Any similar experiences that could guide us a bit?

Thanks a lot!
Luis

As @Luis_Pinheiro has pointed out, the experience so far with Spark has been pretty bad. I’ve been asking for a Databricks Spark testing environment (which apparently is something different from an Apache Spark testing environment), but this seems to be almost impossible.

If this persists, I’ll have no choice but to remove Spark from the list of supported platforms for HADES. We shouldn’t pretend to support a platform when in fact we don’t.

@Luis_Pinheiro – can you share your JDBC URL (without any sensitive items like credentials)?
