Development of an ED Python-based Data Pipeline

vsocrates · August 8, 2022, 3:27pm

Hi everyone! I’ve also posted on the Working Group teams page, so sorry for the spam. I’m new to the community (Yale just released the internal beta of our OMOP analytics setup). My lab is developing a Python-based dataset pipeline similar to HADES FeatureExtraction (with a few aspects of ACHILLES), specific to the emergency department (ED). We would like to release the package as a public tool, and we would love some feedback!

Mainly, we use PySpark DataFrames as the underlying data structures. Is this common, or do other teams use other structures? I’m guessing that since HADES is in R, R data frames are used, but how about Python shops (e.g. pandas DataFrames)? Thank you in advance, and sorry if I’m not making sense; I’m still new to this!

Jinchoi · August 11, 2022, 1:07am

I’m not an OHDSI package expert, but as far as I know…

Generally, data is fetched from a relational database over JDBC, and the OHDSI R packages work with structures such as R data.frame, matrix, and dgCMatrix.
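
For anyone coming from Python: dgCMatrix is the compressed sparse column class for numeric values in R's Matrix package, which stores only the non-zero entries. Below is a rough, dependency-free sketch of that layout, assuming rows are persons and columns are covariates. The helper name `to_csc` and the toy values are made up for illustration and are not taken from any OHDSI package.

```python
from typing import Dict, List, Tuple


def to_csc(entries: Dict[Tuple[int, int], float], n_cols: int):
    """Convert {(row, col): value} into compressed sparse column arrays.

    Returns (x, i, p), mirroring the slot names of an R dgCMatrix:
      x: non-zero values, stored column by column
      i: 0-based row index of each value in x
      p: column pointers; column c occupies x[p[c]:p[c + 1]]
    """
    x: List[float] = []
    i: List[int] = []
    p: List[int] = [0]
    # Order the entries by column first, then by row within each column.
    ordered = sorted(entries.items(), key=lambda kv: (kv[0][1], kv[0][0]))
    pos = 0
    for col in range(n_cols):
        while pos < len(ordered) and ordered[pos][0][1] == col:
            (row, _), value = ordered[pos]
            i.append(row)
            x.append(value)
            pos += 1
        # Record where this column ends (an empty column repeats the pointer).
        p.append(len(x))
    return x, i, p


# Hypothetical toy data: 3 persons (rows) by 3 covariates (columns).
entries = {(0, 0): 1.0, (2, 0): 1.0, (1, 2): 3.0}
x, i, p = to_csc(entries, n_cols=3)
```

In practice a library class such as `scipy.sparse.csc_matrix` would be the usual choice in Python; the hand-rolled version is only meant to show what the format stores.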

The community likely has very little experience with Spark.
A few people want to analyze CDM data with Python, and most of them seem to be doing pandas-based analysis.
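
As a rough idea of what pandas-based analysis can look like, here is a minimal sketch that turns a toy stand-in for the CDM condition_occurrence table into one binary covariate per condition concept. The column names follow the OMOP CDM, but the rows and concept IDs are made up, and this is not how FeatureExtraction itself is implemented.

```python
import pandas as pd

# Toy stand-in for the OMOP CDM condition_occurrence table.
# The person and concept IDs are invented for illustration only.
condition_occurrence = pd.DataFrame(
    {
        "person_id": [1, 1, 2, 3, 3, 3],
        "condition_concept_id": [101, 102, 101, 103, 103, 102],
    }
)

# Count occurrences per person and concept, then cap the counts at 1
# so each cell becomes a binary "ever had this condition" covariate.
features = pd.crosstab(
    condition_occurrence["person_id"],
    condition_occurrence["condition_concept_id"],
).clip(upper=1)

print(features)
```

The result has one row per person and one column per concept, which is the same person-by-covariate shape that a sparse matrix would hold at larger scale.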

A feature extraction pipeline built in Python would be very welcome. However, please keep in mind that the strength of the OHDSI packages relies heavily on the participation of domain experts through the ATLAS GUI. Compatibility with ATLAS would greatly increase usability.