Interest in sharing pretrained deep learning models from OMOP CDM?

michael.a.rosenberg · September 18, 2018, 2:04pm

Just joined the forum, so I apologize if this topic has been addressed elsewhere. We’re developing some projects using deep learning on OMOP CDM to predict clinical outcomes. I was curious whether anyone has applied deep learning to OMOP CDM in supervised or unsupervised analysis and would be interested in sharing weights that we might use as a pretrained model. On the flip side, would anyone be interested in validating our model in your EHR as a pretrained network?

It would be nice if we could identify some common pretrained models that could be applied to EHR data, the way AlexNet, VGGNet, Inception, ResNet, and Xception have been used for ImageNet. If these already exist, that would be very helpful; otherwise we’d be very interested in starting such an initiative.

farbodr · September 18, 2018, 8:09pm

I am not aware of any pretrained models for healthcare, but we are starting to experiment with the recent work by Andrew Beam et al. on embeddings.

‘In this article, we present a new set of embeddings for medical concepts learned using an extremely large collection of multimodal medical data. Leaning on recent theoretical insights, we demonstrate how an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts.’

https://arxiv.org/abs/1804.01486

The embedding file is available for download.
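For anyone who has not worked with concept embeddings before, a minimal sketch of how such vectors are typically queried is shown here: nearest neighbours by cosine similarity. The concept names and the two-dimensional vectors are made up for illustration, and nothing here assumes the actual layout of the downloadable file.

```python
import numpy as np

def nearest_concepts(query, names, vectors, k=2):
    """Return the k concepts most similar to `query` by cosine similarity."""
    # Normalise rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[names.index(query)]
    order = np.argsort(-sims)  # most similar first
    return [(names[i], float(sims[i])) for i in order if names[i] != query][:k]

# Hypothetical toy embeddings; real ones would be loaded from a file.
names = ["concept_a", "concept_b", "concept_c"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

neighbours = nearest_concepts("concept_a", names, vectors, k=1)
```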

Rijnbeek · September 18, 2018, 8:35pm

@michael.a.rosenberg In the Patient Level Prediction workgroup we are looking into Deep Learning on top of the OMOP-CDM. We have recently added Deep Learning Algorithms to the framework we have developed:

I have a PhD student and a PostDoc working on this topic. Happy to discuss further.

Patrick_Ryan · September 18, 2018, 8:44pm

Hi @michael.a.rosenberg, welcome to the community. You are looking for our Patient-Level Prediction workgroup, led by @Rijnbeek and @jennareps. In their group, they have developed a standardized framework for designing, implementing, and executing predictive models (whether you are using deep learning or any other algorithm), and have developed an open-source package for codifying this framework. Various deep learning algorithms are included in this package.

SCYou · September 18, 2018, 11:19pm

@michael.a.rosenberg I’m training a deep learning model to predict in-hospital mortality using CDM data from a Korean tertiary teaching hospital, and I can share the model after completion. But I’m not sure that you can use a pre-trained model from Korea, because the health care system in Korea differs from those in other countries.

With regard to medical images, it would be much easier to implement distributed deep learning, as in a previous paper ( https://academic.oup.com/jamia/article/25/8/945/4956468 ). We’re now planning to implement distributed deep learning to predict cardiovascular outcomes using fundoscopy images based on the radiology CDM ( http://forums.ohdsi.org/t/radlex-and-standardization-of-ontology-for-radiology-procedures/).
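To make the idea of distributed training concrete, here is a minimal sketch of one possible heuristic: model weights are passed from site to site and updated locally, so raw data never leaves a site. It is a generic illustration, not the implementation from the linked paper; the logistic regression model, the synthetic tabular features (standing in for images), and all function names are assumptions.

```python
import numpy as np

def local_update(weights, features, labels, lr=0.5, steps=20):
    """Run a few gradient-descent steps of logistic regression on one site's data."""
    w = weights.copy()
    for _ in range(steps):
        preds = 1.0 / (1.0 + np.exp(-features @ w))
        grad = features.T @ (preds - labels) / len(labels)
        w -= lr * grad
    return w

def cyclical_training(sites, n_features, cycles=10):
    """Pass the weights around the sites in turn; only weights are exchanged."""
    w = np.zeros(n_features)
    for _ in range(cycles):
        for features, labels in sites:
            w = local_update(w, features, labels)
    return w

# Three hypothetical sites with synthetic data drawn from the same rule.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(3):
    x = rng.normal(size=(200, 2))
    y = (x @ true_w > 0).astype(float)
    sites.append((x, y))

weights = cyclical_training(sites, n_features=2)
accuracy = np.mean([((x @ weights > 0) == (y == 1.0)).mean() for x, y in sites])
```

In a real deployment each call to local_update would run inside the participating institution, and only the weight vector would travel between sites.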

michael.a.rosenberg · September 19, 2018, 12:10am

Thanks so much for your replies! I’ll definitely check out some of those articles.

To provide a little background, I’m coming into this from the clinical risk prediction side of things, and have recently started to expand into using machine learning and deep learning approaches in EHR data. My prior work was mostly with cohort data using Cox models. We’ve been playing around with Keras run in Python in Jupyter Notebook on the Google Cloud. Thanks to efforts by @mgkahn an entire copy of our EHR is available in the Google Cloud, so we can run models basically in the same location as the data.

My sense is that, much like image recognition, EHR data has an underlying structure reflecting the types of patients and visits that in theory could be uncovered using deep learning methods. How similar or different one EHR’s data might be from another’s (Korea, etc.) is an interesting question that could reflect treatment approaches, although ultimately patients should have some underlying similarity, and so I (perhaps naively) suspect that there should be a way to capture this structure using similar approaches (CNN?). Anyway, we’re just getting started building our own models, but once we have something I’d definitely be open to sharing (as long as it’s within the confines of de-identified data, etc.). We’ve done a run with a 3-layer stacked autoencoder so far, but that wasn’t in OMOP CDM.

Anyone else who might be interested, please feel free to post. Also, any additional suggestions are more than welcome. I’m not sure whether the forum will be the best long-term way to move forward, but it seems to be quite useful thus far. Thanks!

dlrubin · September 19, 2018, 1:07am

FYI, my lab was part of the group that developed the distributed deep learning method in that cited paper, and we’re nearly done completing a deployable implementation. Happy to collaborate with groups interested in trying this out in imaging use cases. But we’d have to make some adaptations if this needs to be compliant with OMOP…

SCYou · September 19, 2018, 2:54am

Thank you for your great work, @dlrubin
Actually, the Radiology CDM was designed to enable distributed deep learning on various types of medical images. I think the most important barrier to distributed deep learning on medical images is that information about phase or resolution is often missing.
Hence, R-CDM stores the resolution in four dimensions (width x height x slice thickness and position in the phase x time), along with phase information, to standardize the medical images. I hope this work will be helpful for the deployment and expansion of distributed deep learning.

dlrubin · September 19, 2018, 4:34am

Thanks for your reply, @SCYou. Can you explain a bit more exactly how you plan on implementing distributed deep learning, i.e., what participating sites would need to do in terms of preparing their image data? I’m not clear on how the images themselves vs. metadata extracted from them (and put in R-CDM format) would be stored/accessed.

SCYou · September 19, 2018, 4:49am

@dlrubin
Yes, we’ll release the whole process and the analytic code showing how to build a deep learning model based on the radiology CDM soon (within a week, I hope). Then you can see how this model works.

Andrew · September 19, 2018, 1:08pm

The prospect of collaborating on distributed deep learning for medical imaging and kicking the tires on the Radiology CDM is very exciting. I look forward to finding out more.

michael.a.rosenberg · September 27, 2018, 8:29pm

Thanks again for your replies! Do you know how I might go about checking out the Patient-Level Prediction workgroup that you mentioned? Also, any chance that they have a Python version of the PatientLevelPrediction package? We generally use Python (Sklearn and Keras) to develop our models.

Andrew · September 27, 2018, 9:01pm

The github site for the PLP package is: https://github.com/OHDSI/PatientLevelPrediction
It is Python-friendly.
The wiki for the Patient-Level Prediction workgroup is at: http://www.ohdsi.org/web/wiki/doku.php?id=projects:workgroups:patient-level_prediction
The call schedule, connection info, recordings of prior calls, etc. are all there. Peter and Jenna are both the leads and the main creators and drivers of the package. It is an exciting and ambitious effort and a welcoming group.

Rijnbeek · September 27, 2018, 9:05pm

Hi @michael.a.rosenberg

If you are at the symposium, I am happy to update you. We are currently in the process of updating all the vignettes and releasing a new version that will be of interest to you.

We interact with Sklearn and Keras in Python through R.

See also my post above for the link to the paper.

Feel free to reach out to me directly.

Peter

michael.a.rosenberg · September 27, 2018, 10:11pm

Excellent, thanks! I’ll check out the package and paper, and look forward to the next version.

Unfortunately, I just learned about the symposium recently and won’t be able to make it this year. I’m definitely interested in the workgroup, and will try to get on the next call if possible.

aschuler · May 28, 2019, 6:50pm

Here are a handful of relevant papers:

I don’t see any reason why OHDSI shouldn’t have at a minimum a set of concept embeddings for everything in the CDM standard vocabulary, including conditions, procedures, labs, etc., not just from text.

MarkSamuelTuttle · May 28, 2019, 11:13pm

Thank you for posting these; very helpful. Giving a related presentation next week.

Matthieu · April 18, 2023, 2:19pm

Hi there,

Has there been any work in this direction lately? Or is there any reason this line of research was dropped?

I am currently evaluating the transfer of embeddings built from medical claims (French insurance) to hospital data from a Paris hospital (a cohort of 200 000 random patients; I might try with 10 times that number if I manage to scale my code on the hospital compute infrastructure). The tasks are length-of-stay (LOS) prediction for the current stay and ICD10 chapter prediction for the next visit.

I am considering publishing them with OMOP vocabulary concept ids, if that is of any interest to the community.

Since the insurance-based embeddings work as well as embeddings built locally, I am also interested in building distributed OMOP embeddings, taking inspiration from the SVD-PPMI decomposition described in Beam et al., 2019, which is easily distributable. Would such a project interest anyone here?
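A minimal sketch of why this kind of approach distributes easily: co-occurrence counts are additive, so each site can compute its own count matrix and only the matrices need to be pooled before the PPMI transform and the SVD. This is an illustration under simplifying assumptions, not the pipeline from Beam et al.: co-occurrence is counted per patient rather than within time windows, concepts are toy integer indices rather than OMOP concept ids, and scaling the singular vectors by the square root of the singular values is just one common choice.

```python
import numpy as np

def cooccurrence_counts(patients, n_concepts):
    """Count how often each pair of concepts appears in the same patient record."""
    counts = np.zeros((n_concepts, n_concepts), dtype=np.float64)
    for concepts in patients:
        unique = sorted(set(concepts))
        for i in unique:
            for j in unique:
                if i != j:
                    counts[i, j] += 1.0
    return counts

def ppmi(counts):
    """Positive pointwise mutual information of a co-occurrence matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # pairs that never co-occur
    return np.maximum(pmi, 0.0)

def svd_embeddings(ppmi_matrix, dim):
    """Truncated SVD of the PPMI matrix; rows are concept embeddings."""
    u, s, _ = np.linalg.svd(ppmi_matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

# Two hypothetical sites; each patient is a list of concept indices.
site_a = [[0, 1, 2], [0, 1], [3, 4]]
site_b = [[0, 1], [3, 4], [2, 3, 4]]
n_concepts = 5

# Each site shares only its count matrix, never patient-level data.
pooled_counts = (cooccurrence_counts(site_a, n_concepts)
                 + cooccurrence_counts(site_b, n_concepts))
embeddings = svd_embeddings(ppmi(pooled_counts), dim=2)
```

Whether sharing count matrices is acceptable from a privacy standpoint (for example, small cell counts for rare concepts) would need to be decided separately.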

schuemie · April 18, 2023, 3:02pm

Hi @Matthieu. I’ve also been working on this a little bit. I implemented the GloVe algorithm on CDM data here. It works, and I was able to show it sometimes improved prediction performance, but I have no real applications for it otherwise.

We’re currently working on getting CEHR-BERT running on various OHDSI data sources. See the CEHR-BERT code here. For this reason (and fitting pretrained models in general) I created the GeneralPretrainedModelTools, which leverages OHDSI’s DatabaseConnector R package to connect to a wide range of database platforms, but stores the results in local Parquet files, which can be used in Python.

Matthieu · April 20, 2023, 8:19pm

Hi @schuemie!

Thanks a lot for the pointers!

Really interesting to see that there is some interest in the community for these methods.

I am considering benchmarking simple methods such as SVD-PPMI or GloVe against more elaborate models such as CEHR-BERT, and will definitely look into the code.

Out of curiosity, on what type of predictive tasks did you benchmark the GloVe vectors? My intuition is that simple models trained on large volumes of data would a) be solid baselines and b) be super easy to use and share in the community.