Use of Python in OHDSI projects

jbadger3 · May 31, 2020, 4:13pm

Hi Christian,
Yes it’s all set. I didn’t see any way to add repos from the OHDSI github repo page…my guess is you have to be part of the github org…can you help me with the next step?

Thanks! Jon

schuemie · June 1, 2020, 6:54am

@jbadger3: I just created the InspectOmop repo and made you admin

AnthonyDubois · June 1, 2020, 2:00pm

Hi! I have been using Python on OMOP for years and already have many tools available. They work with Dask / Pandas. Would that be of interest?

jbadger3 · June 1, 2020, 2:29pm

@schuemie & @Christian_Reich: Thanks for the help guys. Repo is posted.

Christian_Reich · June 1, 2020, 7:45pm

@AnthonyDubois:

What do you have?

AnthonyDubois · June 1, 2020, 9:14pm

All you need for:

creating a cohort from a JSON definition
getting treatment patterns for a cohort of patients (based on a selected ingredient), with specific rules for oncology treatments
feature generation and selection, patient matching, etc.

Part of it is here, but it still requires some cleaning: https://github.com/anthonydubois/rwd_analytics/

schuemie · June 2, 2020, 6:43am

Could you help me understand the goal here? Is the idea to

provide example Python code showing how folks can talk to the CDM,
provide a library of Python functions for others to use, or
start creating an ecosystem of OHDSI analytics in Python?

All three are valid, and not mutually exclusive.

gkennos · June 3, 2020, 10:02pm

Hi there.

I’m a Python dev and am currently working on an Oncology ETL that will be pure python.

I’m building a pipeline app that focuses on automating not just the ETL but also the ETL authoring and unit-test generation. The idea is to focus on csv configuration files that can be authored by clinical users so all direct mappings or vocabulary translation mappings can be auto-generated.

At the moment my app is highly interconnected and specific to the project, but I think there are portions of it that could be split out and made generic in order to share. For example, I have my own 6.x (oncology extension) SQLAlchemy ORM that is auto-generated from the PostgreSQL DDLs.

Also working on training NLP models for de-identification of clinical notes and NER extraction to include in our mappings.

Very keen to stay in the loop for these discussions.

hungdo1129 · June 3, 2020, 11:42pm

Hi @gkennos, our group is also using Python and is also trying to do everything that you mentioned. Let's keep in contact. I will be very happy to contribute to your repos.

hungdo1129 · June 30, 2020, 3:36pm

[Would a Python implementation be friendly to Redshift and Impala?]
Hi all,
I am trying to perform ETL and data quality checks (Achilles/DQD) on data from EHRs. I am not proficient in R or SQL, so I am wondering whether it is worthwhile rewriting OHDSI’s code from R to Python, especially if I want the Python code to scale easily to Redshift and Impala.

From what I understand, the current R packages mainly serve the purpose of automation: they connect to the database, run a series of SQL files, then feed the output to the web dashboard.

I also notice that Achilles can work with Impala, and Data Quality Dashboard can work with Redshift. I guess that is because the SQL code is somewhat ‘universal’.

In my current Python package, since we have less than 10GB of data, I am actually loading the data from PostgreSQL into pandas dataframes and implementing all the table manipulations on the dataframes, so there is no SQL code in my package.
PRO: (1) I guess that python and pandas can perform more sophisticated mapping logic and quality check than SQL. (2) I am more familiar with Python than SQL
CON: In the long term, will this approach scale when we have ‘very big’ data and have to connect to Redshift or Impala?
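To make the trade-off concrete, here is a minimal sketch of one data quality check written both ways. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere. The rows, the `MAX_YEAR` cut-off, and the check itself are invented for illustration and are not taken from Achilles or DQD.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, 1950), (2, 1985), (3, 2150), (4, None)],
)

MAX_YEAR = 2020  # hypothetical cut-off for a plausible birth year


def implausible_in_python(conn):
    """Pull every row into Python, then check. Memory use grows with the table."""
    rows = conn.execute("SELECT person_id, year_of_birth FROM person").fetchall()
    return sorted(pid for pid, year in rows if year is None or year > MAX_YEAR)


def implausible_in_sql(conn):
    """Let the database do the filtering. Only the failing rows are returned."""
    rows = conn.execute(
        "SELECT person_id FROM person "
        "WHERE year_of_birth IS NULL OR year_of_birth > ? "
        "ORDER BY person_id",
        (MAX_YEAR,),
    ).fetchall()
    return [pid for (pid,) in rows]


python_result = implausible_in_python(conn)
sql_result = implausible_in_sql(conn)
print(python_result, sql_result)
```

Both functions flag the same people. The difference is where the work happens: the first needs the whole table in local memory, while the second only transfers the rows that fail, which is the part that matters once the table no longer fits on one machine.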

Another related question is how ‘big’ is really ‘big data’, and when would we have to move from PostgreSQL to Redshift or Impala? Some say that PostgreSQL can store up to 1TB of data.

It would be very helpful if someone can give us a limit or some perspective based on your experience.

@jbadger3: You seem to be working on this issue in your repos. I see that you are using SQLAlchemy. Is this library ‘universal’ across different database backends?

Thank you a lot.
Cheers,
Hung

jbadger3 · July 2, 2020, 1:41pm

@hungdo1129

There is quite a bit to unpack in your post, but let me address a few of the topics that I am more familiar/comfortable with.

‘is it worthwhile rewriting OHDSI’s code from R to Python’. There are a number of packages from the OHDSI group that I think would benefit from an accessible Python interface. The problem is that if you write a port of work that has already been done, you will spend a large amount of effort duplicating functionality that already exists, and it will be nearly impossible to keep the Python and R versions in sync. That being said, I think that projects to call into existing OHDSI R packages from Python (aka bindings) would be a real benefit to the community. There is a Python package called rpy2 that might provide the groundwork.
On scalability and big data.
On your question about ‘big data’ and how big is big: I am not going to go into a pedantic discussion of what should really be designated as big data, but here is my informal definition. If the data you intend to work with won’t fit on a single storage device (needs to be sharded) and/or requires multiple computers to get the work done in a reasonable amount of time, then you are working with big data. In my own experience, I have operated on 250GB of EHR data using SQLite on a single machine without any issues, so assuming your 10GB doesn’t scale to multiple terabytes anytime soon, I think you will be fine sticking with PostgreSQL for a while. The key to performance here is making sure your database is properly indexed.
Working with the CDM using pure python and InspectOMOP gives you the freedom to be agnostic to the backend database, which has a real advantage if you want to transport your code to another institution for replication or if you end up wanting to switch the software you are using for backend data storage. SQLAlchemy gives you the ‘universal’ backend, so if you want to switch to Redshift in the future you should be able to do so (there is a third party package separate from SQLAlchemy for this).
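On the indexing point above, here is a minimal sketch of how to check whether a lookup actually uses an index, again with the built-in sqlite3 module. The table is a cut-down, illustrative version of `condition_occurrence`; the index name `idx_condition_person` and the generated rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE condition_occurrence ("
    "condition_occurrence_id INTEGER PRIMARY KEY, "
    "person_id INTEGER NOT NULL, "
    "condition_concept_id INTEGER NOT NULL)"
)
# Invented rows: 1000 records spread over 100 people.
conn.executemany(
    "INSERT INTO condition_occurrence (person_id, condition_concept_id) "
    "VALUES (?, ?)",
    [(i % 100, 1000 + i % 7) for i in range(1000)],
)

QUERY = "SELECT * FROM condition_occurrence WHERE person_id = ?"


def query_plan(conn):
    """Return SQLite's plan for QUERY as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no index on person_id yet: full table scan
conn.execute(
    "CREATE INDEX idx_condition_person ON condition_occurrence (person_id)"
)
plan_after = query_plan(conn)  # the planner can now search via the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, and PostgreSQL has its own `EXPLAIN`, but the habit is the same: look at the plan for your slowest queries before concluding that the database is too small for the data.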

Hope this helps a bit!
Jon

alexander · July 9, 2020, 4:01am

Hi @MaximMoinat, could you share an ETL example in Python with me too? I’m working on an ETL for EHR data now and use Python as my main language.

beapen · July 12, 2020, 9:56pm

I started pyomop before I saw @jbadger3's inspectomop. pyomop is similar but may be easier to extend for ETL and machine learning. https://github.com/dermatologist/pyomop (I have not tested it much yet).

hungdo1129 · July 17, 2020, 7:50am

Hi all and @jbadger3 ,
Thank you a lot for suggesting different packages. I spent some time studying both pyomop and inspectomop.
I am currently more inclined towards the approach of inspectomop, which uses SQLAlchemy to connect to all database back-ends. Our database may not grow beyond 500GB very soon; however, we might need to move to Redshift in order to allow connections from multiple users. (Another reason is that I am not an expert in SQL anyway, so I would rather learn SQLAlchemy.)
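As a rough illustration of what that back-end independence looks like, here is a minimal sketch using SQLAlchemy Core directly (version 1.4 or later assumed). It is not inspectomop's API. The two-column `person` table and its rows are a cut-down, invented example, and in-memory SQLite stands in for the real database.

```python
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql, sqlite

metadata = sa.MetaData()
# A cut-down, illustrative version of the CDM person table.
person = sa.Table(
    "person",
    metadata,
    sa.Column("person_id", sa.Integer, primary_key=True),
    sa.Column("year_of_birth", sa.Integer),
)

# One query object, written once in Python.
stmt = (
    sa.select(person.c.person_id)
    .where(person.c.year_of_birth < 1960)
    .order_by(person.c.person_id)
)

# The same statement rendered as SQL text for two different back-ends.
sql_for_postgres = str(stmt.compile(dialect=postgresql.dialect()))
sql_for_sqlite = str(stmt.compile(dialect=sqlite.dialect()))

# Run it for real against in-memory SQLite; only this URL is back-end specific.
engine = sa.create_engine("sqlite:///:memory:")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(
        person.insert(),
        [
            {"person_id": 1, "year_of_birth": 1950},
            {"person_id": 2, "year_of_birth": 1985},
            {"person_id": 3, "year_of_birth": 1942},
        ],
    )
    older_ids = [row[0] for row in conn.execute(stmt)]

print(sql_for_postgres)
print(sql_for_sqlite)
print(older_ids)
```

Moving to another database would, in principle, mean changing the URL passed to `create_engine` and installing the matching driver or dialect package; the query code itself stays the same.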

In contrast, I think pyomop is still using the raw SQL queries which are stored in the file sqldict.py. I am wondering how this package is easier to extend for machine learning? I guess that after each query, we just convert the result to a dataframe (which both packages offer) and then use the dataframe for data exploration/machine learning?

By the way, @jbadger3, thank you for your long insightful reply. Are you still updating the package on your personal repos, or have you completely migrated to OHDSI’s repos?
(I downloaded it from your personal link 1 or 2 weeks ago)

Cheers and have a nice weekend to all,
Hung

MaximMoinat · July 28, 2020, 11:11am

Great to see developments in Python packages for OHDSI.

Sure, a lot of what we develop at The Hyve is open source: https://github.com/thehyve/ohdsi-etl-caliber

jaan · January 27, 2021, 6:36pm

I would also be curious about how easy it might be to train pytorch models on OHDSI data. Glad to see there are tools that are starting to make it easier!

jposada · January 27, 2021, 6:41pm

hi @jaan,

That is already happening in the PLP package. Check this out

mikecjohn · February 23, 2023, 6:00pm

Hi everyone, I realize this thread is old, but I am interested in connecting with people using Python with OHDSI tools. I see the risk mentioned above about keeping R and Python tools in sync, so does anyone have guidance on how the community is thinking about incorporating Python in ways that don’t duplicate the existing functionality of the R tools?

Adam_Black · March 1, 2023, 7:17pm

Here is one interesting python package to check out.

https://inpsectomop.readthedocs.io/en/master/

frankm · August 9, 2023, 2:04pm

At IKNL we started to develop a Python package which wraps the OHDSI tools (DatabaseConnector, Circe, FeatureExtraction, etc.) using the rpy2 library. We are still figuring out how to make it as Pythonic an OHDSI experience as we can. We are happy to accept contributions and to consider suggestions from the community.

https://python-ohdsi.readthedocs.org

We started the development for coupling our federated learning framework vantage6 to OMOP data sources. Related post: Feature Extraction for vantage6 Federated Analysis - Developers - OHDSI Forums