
Use of Python in OHDSI projects

Hello Everyone,

I see that in OHDSI group/github R is the most widely used language for everything.

I use Python and would like to know whether there are any groups or projects where people with Python skills can volunteer.

Is Python being used for any development?

Where can I see the work that is being done currently using Python?


PatientLevelPrediction uses Python: https://github.com/OHDSI/PatientLevelPrediction/tree/master/inst/python


If you start one, I would join. Also AllOfUs teams are using Python a lot.


At The Hyve, we often use a combination of Python and SQL to implement OMOP ETL. Please reach out if you are looking for an ETL example.

Hi @MaximMoinat - Thanks for the response. I was looking for information on how to participate in any open source Python activity in the OHDSI community. If there are any projects where you are all working together to build a tool or utility for the benefit of the community, I would be keen to participate as well.

Hi! Being mainly a Python dev, I would also be interested in being able to use some of the tools from Python. For example, the cohort creation and extraction tools would be nice to have without explicitly calling R. I would be willing to spend some time on this.


I am interested in helping to create utilities/tools using Python as well.


Hello everyone. I am also a Python dev and have been working on an open source package called InspectOMOP for interfacing with the CDM. Does anyone know how to get a project put on the list of GitHub repositories? In case anyone is curious, here is a link to the repo. https://github.com/jbadger3/inspectomop



Easy. You get a repo and put it in there. Is it ready?

Hi Christian,
Yes it’s all set. I didn’t see any way to add repos from the OHDSI github repo page…my guess is you have to be part of the github org…can you help me with the next step?

Thanks! Jon

@jbadger3: I just created the InspectOmop repo and made you admin


Hi! I have been using Python on OMOP for years and I already have many tools available. They work with Dask / Pandas. Would they be of interest?

@schuemie & @Christian_Reich: Thanks for the help guys. Repo is posted.


What do you have?

All you need for:

  • creating a cohort from JSON
  • getting treatment patterns for a cohort of patients (based on a selected ingredient), with specific rules for oncology treatments
  • doing feature generation and selection, patient matching, etc.

Part of it is here, but it requires some cleaning: https://github.com/anthonydubois/rwd_analytics/

Could you help me understand the goal here? Is the idea to

  • provide example Python code of how folks can talk to the CDM,
  • provide a library of Python functions for others to use, or
  • start creating an ecosystem of OHDSI analytics in Python?

All three are valid, and not mutually exclusive.

Hi there.

I’m a Python dev and am currently working on an Oncology ETL that will be pure Python.

I’m building a pipeline app that focuses on automating not just the ETL but also the ETL authoring and unit-test generation. The idea is to focus on csv configuration files that can be authored by clinical users so all direct mappings or vocabulary translation mappings can be auto-generated.

At the moment my app is highly interconnected and specific to the project, but I think there are portions of it that could be split out and made generic in order to share. For example, I have my own 6.x (oncology extension) SQLAlchemy ORM that is auto-generated from the PostgreSQL DDLs.

Also working on training NLP models for de-identification of clinical notes and NER extraction to include in our mappings.

Very keen to stay in the loop for these discussions.


Hi @gkennos, our group is also using Python and is also trying to do everything that you mentioned. Let’s keep in contact. I will be very happy to contribute to your repos.

[Would a Python implementation be friendly to Redshift and Impala?]
Hi all,
I am trying to perform ETL and data quality checks (Achilles/DQD) on EHR data. I am not proficient in R or SQL, and I am wondering whether it is worthwhile to rewrite OHDSI’s code from R in Python, especially if I want the Python code to scale easily to Redshift and Impala.

From what I understand, the current R packages mainly serve the purpose of automation: they connect to the database, run a series of SQL files, and feed the output to the web dashboard.
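For readers who want to see that pattern in Python, here is a minimal sketch. It is not how Achilles or the Data Quality Dashboard are implemented. It only mirrors the connect / run SQL / collect output flow described above, using the standard-library sqlite3 module as a stand-in database. The check names, the SQL statements, the cut-down person table, and the run_checks helper are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical named checks; in a real setup each statement could live in its own .sql file.
CHECKS = {
    "person_count": "SELECT COUNT(*) FROM person",
    "missing_birth_year": "SELECT COUNT(*) FROM person WHERE year_of_birth IS NULL",
}


def run_checks(conn, checks):
    """Run each single-value SQL check and collect the results by name."""
    results = {}
    for name, sql in checks.items():
        results[name] = conn.execute(sql).fetchone()[0]
    return results


# Tiny in-memory stand-in for a CDM database (illustrative subset of the person table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, 1980), (2, None), (3, 1975)],
)

results = run_checks(conn, CHECKS)
print(results)
```

The resulting dictionary is the kind of output that could then be written to JSON or a results table for a dashboard to read.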

I also notice that Achilles can work with Impala, and the Data Quality Dashboard can work with Redshift. I guess that is because the SQL code is somewhat ‘universal’.

In my current Python package, since we have less than 10GB of data, I load the data from PostgreSQL into pandas DataFrames and implement all the table manipulations on the DataFrames, so there is no SQL code in my package.
PRO: (1) I guess that Python and pandas can express more sophisticated mapping logic and quality checks than SQL. (2) I am more familiar with Python than SQL.
CON: In the long term, will this approach scale when we have ‘very big’ data and have to connect to Redshift or Impala?
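One common way to soften the scaling concern, without giving up Python-side logic, is to stream rows in fixed-size chunks instead of loading a whole table at once. The sketch below is only an illustration under stated assumptions: it uses the standard-library sqlite3 module rather than PostgreSQL, a made-up measurement table, and a hypothetical iter_chunks helper. With pandas, the chunksize argument of pandas.read_sql gives a similar iterator of DataFrames.

```python
import sqlite3
from collections import Counter


def iter_chunks(cursor, chunk_size):
    """Yield lists of at most chunk_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


# Small in-memory stand-in table; the data is synthetic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (person_id INTEGER)")
conn.executemany(
    "INSERT INTO measurement VALUES (?)",
    [(i % 3,) for i in range(10)],
)

# Count measurements per person while holding only one chunk in memory at a time.
per_person = Counter()
chunks_seen = 0
cursor = conn.execute("SELECT person_id FROM measurement")
for rows in iter_chunks(cursor, chunk_size=4):
    chunks_seen += 1
    per_person.update(person_id for (person_id,) in rows)

print(dict(per_person), chunks_seen)
```

Memory use is bounded by the chunk size rather than the table size, although simple aggregations like this one are usually cheaper still when pushed into SQL with GROUP BY.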

Another related question is how ‘big’ ‘big data’ really is, and when we would have to move from PostgreSQL to Redshift or Impala. Some say that PostgreSQL can store up to 1TB of data.

It would be very helpful if someone can give us a limit or some perspective based on your experience.

@jbadger3: You seem to be working on this issue in your repos. I see that you are using SQLAlchemy. Is this library ‘universal’ across different database backends?

Thanks a lot.


There is quite a bit to unpack in your post, but let me address a few of the topics that I am more familiar/comfortable with.

  1. ‘is it worthwhile rewriting OHDSI’s code from R to Python’. I think a number of OHDSI packages would benefit from an accessible Python interface. The problem is that a port of work that has already been done means spending a large amount of effort duplicating existing functionality, and it will be nearly impossible to keep the Python and R versions in sync. That said, I think projects that call into the existing OHDSI R packages from Python (i.e. bindings) would be a real benefit to the community. There is a Python package called rpy2 that might provide the groundwork.

  2. On scalability and big data.
    On your question of how big ‘big data’ really is: I am not going to get into a pedantic discussion of what should be designated as big data, but here is my informal definition. If the data you intend to work with won’t fit on a single storage device (i.e. it needs to be sharded) and/or it requires multiple computers to get the work done in a reasonable amount of time, then you are working with big data. In my own experience, I have operated on 250GB of EHR data using SQLite on a single machine without any issues, so unless your 10GB grows to multiple terabytes anytime soon, I think you will be fine sticking with PostgreSQL for a while. The key to performance here is making sure your database is properly indexed.

  3. Working with the CDM using pure Python and InspectOMOP gives you the freedom to be agnostic to the backend database. That is a real advantage if you want to move your code to another institution for replication, or if you end up switching the software you use for backend data storage. SQLAlchemy gives you the ‘universal’ backend, so if you want to switch to Redshift in the future you should be able to do so (there is a third-party package, separate from SQLAlchemy, for this).
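To make points 2 and 3 a bit more concrete, here is a minimal sketch, assuming SQLAlchemy 1.4 or later. It is not taken from InspectOMOP. The table is a cut-down stand-in for the CDM drug_exposure table, the concept ids are made up, and exposures_per_person is a hypothetical helper. The query is built with SQLAlchemy Core rather than a dialect-specific SQL string, and index=True asks SQLAlchemy to create an index on the columns used for filtering and grouping.

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, func, select

metadata = MetaData()

# Simplified stand-in for the CDM drug_exposure table (illustrative subset of columns).
drug_exposure = Table(
    "drug_exposure",
    metadata,
    Column("drug_exposure_id", Integer, primary_key=True),
    # index=True makes create_all() emit CREATE INDEX for these columns.
    Column("person_id", Integer, nullable=False, index=True),
    Column("drug_concept_id", Integer, nullable=False, index=True),
)


def exposures_per_person(engine, drug_concept_id):
    """Return {person_id: exposure_count} for one drug concept."""
    stmt = (
        select(drug_exposure.c.person_id, func.count().label("n"))
        .where(drug_exposure.c.drug_concept_id == drug_concept_id)
        .group_by(drug_exposure.c.person_id)
    )
    with engine.connect() as conn:
        return {person_id: n for person_id, n in conn.execute(stmt)}


# In-memory SQLite keeps the sketch runnable; only this URL is backend-specific.
engine = create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(
        drug_exposure.insert(),
        [
            {"drug_exposure_id": 1, "person_id": 10, "drug_concept_id": 111},
            {"drug_exposure_id": 2, "person_id": 10, "drug_concept_id": 111},
            {"drug_exposure_id": 3, "person_id": 20, "drug_concept_id": 111},
            {"drug_exposure_id": 4, "person_id": 20, "drug_concept_id": 222},
        ],
    )

counts = exposures_per_person(engine, 111)
print(counts)
```

Pointing create_engine at a PostgreSQL or Redshift URL, with the matching driver or dialect package installed, should leave the table definition and the query unchanged.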

Hope this helps a bit!