SqlRender and Spark

Vojtech_Huser · November 5, 2020, 6:51pm

CMS VRDC (Virtual Research Data Center) new platform is based on Databricks. I understand it has a database layer (called Delta). In order to run SQL, I think the flavor to use is Spark.
See links below (optional)

https://docs.databricks.com/delta/intro-notebooks.html#delta-lake-quickstart-sql-notebook

I have questions to the community:

Did you have to deal with Spark SQL flavor at your site (in relationship to OMOP CDM shaped data)?

Did you try to add Spark support to SqlRender (and with what results)? (how different is that flavor and can it be ever supported?)

Ajit_Londhe · November 5, 2020, 7:38pm

Hi @Vojtech_Huser,

This is the platform we use. I’ve created a fork of both DatabaseConnector and SqlRender, and it’s mostly good, but needs some more validation.

A few items to note:

I use delta tables for all tables, so that all standard update/delete operations are allowed (standard tables do not allow this)
There is an attempt at MPP bulk loading for insertTable(), using the python library and DBFS
Temp tables aren’t supported, so I’m using the oracleTempSchema cadence to point to an actual schema where permanent tables that are then dropped are kept

I’m trying to clean this work up for a PR into the master branches by end of December.

acoltri · December 9, 2020, 7:03pm

Hello Ajit - I am very interested in your progress. We are currently split - using Databricks/Spark for a main datalake, but moving data to a MSSQL based server to take advantage of all the OHDSI tools.

Are you envisioning a complete Spark implementation - all the CDM tables, vocabularies, result tables etc. in Spark - but still with access to atlas and the DQD, and the R libraries?

Thank you, Alan Coltri

Ajit_Londhe · December 10, 2020, 12:57am

I’m not a spark expert, so the work I’ve been doing is to just get all the OHDSI tools to work with Databricks by ensuring SqlRender has all the translation patterns needed to go from OHDSql (SQL Server) syntax to Spark SQL (with a few Databricks-specific items).

There are spark-based approaches that could be more performant, but I’m more interested in having all of the OHDSI tools working for my organization than the performance.

Currently, I have Achilles, DQD, and Atlas working against Databricks. I haven’t tried the statistical packages like CohortMethod yet.

Vojtech_Huser · April 26, 2021, 2:25pm

I want to post an update to this. We now have access to DataBricks environment (from CMS, in VRDC). I am looking for other OHDSIers that work within VRDC and would be willing to exchange lessons learned. (or sites that also use Spark SQL dialect).

For us the task at hand is to write ETL into OMOP CDM (using Spark SQL only or SQL+R/Python) from TMSIS/TAF Medicaid data. See here for what is TAF format. https://www.researchgate.net/publication/346632577_Data_Characterization_of_Medicaid_Legacy_and_New_Data_Formats_in_the_CMS_Virtual_Research_Data_Center

Ajit_Londhe · April 27, 2021, 8:56pm

An update on Spark: I’ve raised a PR for SqlRender’s develop branch. As @schuemie has pointed out, we don’t have a good testing environment for OHDSI, so we can’t yet merge into that branch. I’ve inquired with Databricks if such a cluster can be made available.

@Vojtech_Huser what version of Spark are you using?

Vojtech_Huser · May 6, 2021, 2:50pm

The platform we are mere users of(not admins) is using databricks v 7.1
That should have spark v 3.

Here is the help link https://docs.databricks.com/spark/latest/spark-sql/index.html

In that link - we go with help for: “7.x and above”