
SqlRender and Spark

(Vojtech Huser) #1

The new CMS VRDC (Virtual Research Data Center) platform is based on Databricks. I understand it has a database layer called Delta. To run SQL against it, I think the dialect to use is Spark SQL.
See links below (optional)


I have questions for the community:

Have you had to deal with the Spark SQL dialect at your site (in relation to OMOP CDM shaped data)?

Have you tried to add Spark support to SqlRender, and with what results? How different is that dialect, and can it ever be supported?

(Ajit Londhe) #2

Hi @Vojtech_Huser,

This is the platform we use. I’ve created a fork of both DatabaseConnector and SqlRender, and it’s mostly good, but needs some more validation.

A few items to note:

  1. I use delta tables for all tables, so that all standard update/delete operations are allowed (standard tables do not allow this)
  2. There is an attempt at MPP bulk loading for insertTable(), using the Python library and DBFS
  3. Temp tables aren’t supported, so I’m using the oracleTempSchema mechanism to point to an actual schema that holds permanent tables, which are dropped once they are no longer needed
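To illustrate item 3, here is a minimal sketch of the temp table emulation idea: SQL Server style `#name` references are rewritten to permanent tables in a designated schema, and matching drop statements are generated for cleanup. This is not the code from the fork or from SqlRender itself. The function names, the `scratch` schema, and the `s1_` session prefix are all made up for illustration.

```python
import re

# Matches SQL Server style temp table names such as "#cohort".
# The lookbehind skips a "#" that sits in the middle of an identifier.
# Limitation of this sketch: a "#" inside a string literal is also rewritten.
TEMP_NAME = re.compile(r"(?<!\w)#(\w+)")


def emulate_temp_tables(sql: str, temp_schema: str, session_prefix: str) -> str:
    """Rewrite every #name to temp_schema.session_prefix + name."""
    return TEMP_NAME.sub(
        lambda m: f"{temp_schema}.{session_prefix}{m.group(1)}", sql
    )


def drop_statements(sql: str, temp_schema: str, session_prefix: str) -> list:
    """Build one DROP statement per distinct emulated temp table."""
    names = sorted(set(TEMP_NAME.findall(sql)))
    return [
        f"DROP TABLE IF EXISTS {temp_schema}.{session_prefix}{name};"
        for name in names
    ]


# Hypothetical input; schema and prefix are placeholders.
source_sql = (
    "CREATE TABLE #cohort (person_id BIGINT); "
    "INSERT INTO #cohort SELECT person_id FROM person;"
)
rewritten = emulate_temp_tables(source_sql, "scratch", "s1_")
cleanup = drop_statements(source_sql, "scratch", "s1_")
```

The session prefix is there so that two concurrent sessions writing to the same schema do not collide on table names. The sketch only handles naming; it does not translate any other syntax.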

I’m trying to clean this work up for a PR into the master branches by end of December.

(Acoltri) #3

Hello Ajit - I am very interested in your progress. We are currently split: we use Databricks/Spark for our main data lake, but move data to an MSSQL-based server to take advantage of all the OHDSI tools.

Are you envisioning a complete Spark implementation, with all the CDM tables, vocabularies, result tables, etc. in Spark, but still with access to Atlas, the DQD, and the R libraries?

Thank you, Alan Coltri

(Ajit Londhe) #4

I’m not a Spark expert, so the work I’ve been doing is just to get all the OHDSI tools to work with Databricks by ensuring SqlRender has all the translation patterns needed to go from OHDSql (SQL Server) syntax to Spark SQL (with a few Databricks-specific items).
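As a rough illustration of what "translation patterns" means here, the toy sketch below rewrites a few SQL Server constructs into forms that Spark SQL accepts. It is not SqlRender's rule set or implementation. The rule table and the `translate` function are hypothetical, and a real translator needs many more patterns plus proper handling of nesting and string literals.

```python
import re

# Hypothetical search/replace rules: SQL Server construct -> Spark SQL form.
RULES = [
    (re.compile(r"\bGETDATE\(\)", re.IGNORECASE), "CURRENT_TIMESTAMP"),
    (re.compile(r"\bISNULL\(", re.IGNORECASE), "COALESCE("),
    (re.compile(r"\bLEN\(", re.IGNORECASE), "LENGTH("),
]

# "SELECT TOP n ..." has no direct equivalent; the limit moves to the end.
# Only a single top-level statement is handled in this sketch.
TOP_PATTERN = re.compile(
    r"\bSELECT\s+TOP\s+(\d+)\s+(.*?)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def translate(sql: str) -> str:
    """Apply the simple rules, then rewrite a leading SELECT TOP n to LIMIT n."""
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    match = TOP_PATTERN.match(sql.strip())
    if match:
        sql = f"SELECT {match.group(2)} LIMIT {match.group(1)};"
    return sql


# Hypothetical query against an OMOP CDM person table.
source_sql = (
    "SELECT TOP 10 person_id, ISNULL(gender_source_value, 'unknown') AS g, "
    "GETDATE() AS run_time FROM person;"
)
spark_sql = translate(source_sql)
```

Simple function renames are easy to express as patterns; structural changes such as `TOP` versus `LIMIT` are where most of the effort goes, since part of the statement has to move.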

There are Spark-based approaches that could be more performant, but I’m more interested in having all of the OHDSI tools working for my organization than in performance.

Currently, I have Achilles, DQD, and Atlas working against Databricks. I haven’t tried the statistical packages like CohortMethod yet.