Just wondering what folks are using to import the large OHDSI terminology tables into Databricks from the tab-delimited text files downloaded from Athena.
We’re currently using the API, but it’s taking a long time for the larger files, hours for concept_relationship (~2 GB) and concept_ancestor (~1.3 GB).
What sort of times are you seeing for these files, and what are you using to load them into Databricks?
I’m no expert on this, as our ETL is handled by a different team in my company, but I’d imagine a multipart S3 upload shouldn’t be too bad for files of that size, no?
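For what it’s worth, here’s a rough sketch of what I have in mind using boto3 (the bucket and key names are just placeholders); its managed transfer switches to multipart automatically once a file crosses the configured threshold:

```python
# Sketch of a multipart upload with boto3; bucket and key are hypothetical.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
    max_concurrency=8,                     # upload parts in parallel
)

# concept_relationship (~2 GB) goes up as parallel parts instead of one long stream
s3.upload_file(
    Filename="CONCEPT_RELATIONSHIP.csv",
    Bucket="my-vocab-bucket",              # placeholder bucket
    Key="athena/CONCEPT_RELATIONSHIP.csv",
    Config=config,
)
```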
On the Athena side, it seems like there is no way around manually downloading and uploading files whenever there is an update to the vocabulary tables. Is that correct?
Correct, unfortunately there’s no Athena API for creating a vocabulary bundle without manual interaction. Many people (me too) would like to use that for automation, and the question often pops up on the forum, but apparently this would require major changes in the current implementation.
Long story short, the best solution I came up with, inspired by the Databricks example, was storing the vocabulary files in an S3 bucket, downloading/extracting them using %sh magic commands in a cell, creating a DataFrame for each CSV in the extracted directory, and writing each one out with the DataFrame writer.
I can share my example with you if you’re interested. It’s much less complex than the Databricks example, plus it works in the community edition.
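Roughly, the notebook looks like this. It’s a minimal sketch: the bucket name, local paths, and target database (`vocab`) are placeholders, not the actual values I use.

```python
# Step 1 is a %sh cell that downloads and extracts the Athena bundle, e.g.:
#   %sh
#   aws s3 cp s3://my-vocab-bucket/athena/vocabulary.zip /tmp/vocab/
#   unzip -o /tmp/vocab/vocabulary.zip -d /tmp/vocab/extracted

import os

local_dir = "/tmp/vocab/extracted"

spark.sql("CREATE DATABASE IF NOT EXISTS vocab")

# Step 2: one DataFrame per file, written out with the DataFrame writer.
# Athena's files are tab-delimited even though they end in .csv.
for name in os.listdir(local_dir):
    if not name.lower().endswith(".csv"):
        continue
    table_name = name[:-4].lower()            # CONCEPT.csv -> concept
    df = (spark.read
          .option("header", True)
          .option("sep", "\t")
          .csv(f"file:{local_dir}/{name}"))   # driver-local path is fine on a single-node cluster
    df.write.mode("overwrite").saveAsTable(f"vocab.{table_name}")
```

On a multi-node cluster you’d copy the extracted files to DBFS first instead of reading the driver’s local filesystem, but for the community edition the above is enough.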
Hi Philip, good point. The example runs faster if you have the compute in your own workspace. If you’re working in the community edition of Databricks, be prepared to wait about 15 minutes.
A quick note about managing these tables: I haven’t been able to import the CSVs into Usagi when they’re exported from the notebook using the GUI. Usagi appears to want a tab-delimited file (TSV), but Databricks writes comma-separated CSV by default. If you build your own pipeline to bring these into Usagi, make sure you export with the correct delimiter via the DataFrame writer.
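For example, something like this should produce a tab-delimited export (the table name and output path are placeholders); the key is the `sep` option on the DataFrame writer:

```python
# Export a vocabulary table as tab-delimited so Usagi can read it.
# coalesce(1) yields a single output file; the path is a placeholder.
(spark.table("vocab.concept")
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", True)
 .option("sep", "\t")   # Spark defaults to "," unless you set this
 .csv("dbfs:/FileStore/exports/concept_tsv"))
```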