Omics data

Has anyone developed prediction models that use -omics data (transcriptomics, proteomics, metabolomics, etc.) to predict outcomes in OMOP’d EMR data? I’m interested to know which methods researchers have used.

@CSung: Would you expect them to be different than the normal PLP methods?

A number of research groups have collected longitudinal samples from ~3,000 Covid-19 patients at different stages of disease under different research protocols. Different -omics evaluations are being done, generating high-dimensional data, too much to bring into the OMOP CDM Measurement table. Each -omics technology seems to have its own bioinformatics pipeline. The hope is to identify some combination of biomarkers associated with severity of disease. Yet the patients’ clinical and demographic data are not harmonized. The OMOP CDM may be one approach to standardizing across all the different studies, though it would require re-consenting the patients to link their EMR to the research study (creating a limited data set) AND getting the researchers to do an ETL to the OMOP CDM. The benefit is that the patient’s entire history could be part of developing a machine learning algorithm, not just what the researchers anticipated a priori was important to record in a CRF (each group uses different CRFs). I was curious, however, how to bridge between -omics data stored in one bioinformatics platform and EHR data in the OMOP CDM to create prediction algorithms. Can such studies still be done in a federated system? I would appreciate hearing about other people’s experience (e.g. the All of Us initiative, or any study using array data, proteomics or metabolomics). Apologies if this is a very basic PLP question.

@CSung:

Understood. There are essentially three approaches now floating:

  1. Phenotype-only data: In this approach, a local pipeline for variant calling and interpretation is used for the omics, while the CDM delivers the phenotype.
  2. Reference omics: In this model, only variants well established and relevant to a therapeutic area (currently only oncology) are represented as standard Vocabulary Concepts, and your pipeline’s output is mapped to it.
  3. Full omics: In this model, both the phenotype and all the omics data are represented in a standardized model.

In all three, the OMOP CDM is used for the phenotype: the definition of the intervention (usually drugs) and the outcomes. Option 2 can be used if there is a set of well-known variants you are expecting, and if so you can use the full set of standardized analytic methods OHDSI has published. In your case, you need to consider any omic variant you find in the data as related to the outcome, Covid. That means you should pick 1 or 3. In either case you need to develop and apply your own analytic methods. With 1, you also cannot do that easily in a network, while 3 enables that but requires extra work to adapt your pipeline to the model.
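To make option 1 a bit more concrete, here is a minimal sketch of how a locally produced omics feature set might be joined to a phenotype pulled from the CDM. Everything in it is an assumption for illustration: the in-memory SQLite database stands in for a real CDM instance (only a few columns of `person` and `condition_occurrence` are created), the concept ID `999001` is a made-up placeholder rather than a real OMOP concept, the marker names and values in `TOY_OMICS` are invented, and the helper names (`build_toy_cdm`, `phenotype_labels`, `join_features`) are hypothetical. It also assumes the omics samples can be linked to the CDM through a shared `person_id`, which is exactly the linkage the re-consent step would have to provide.

```python
import sqlite3

# Hypothetical placeholder for the outcome concept; NOT a real OMOP concept ID.
SEVERE_OUTCOME_CONCEPT_ID = 999001

# Toy output of a local omics pipeline, keyed by person_id (invented values).
TOY_OMICS = {
    1: {"IL6": 8.2, "CRP": 120.0},
    2: {"IL6": 1.1, "CRP": 4.0},
    4: {"IL6": 3.0},  # has omics data but no record in the toy CDM
}


def build_toy_cdm():
    """Create an in-memory stand-in for a CDM with only the columns used here."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
        CREATE TABLE condition_occurrence (
            person_id INTEGER,
            condition_concept_id INTEGER,
            condition_start_date TEXT
        );
        INSERT INTO person VALUES (1, 1950), (2, 1980), (3, 1965);
        INSERT INTO condition_occurrence VALUES
            (1, 999001, '2020-04-01'),
            (3, 999001, '2020-05-10');
        """
    )
    return con


def phenotype_labels(con, outcome_concept_id):
    """Return {person_id: 0 or 1}; 1 means the person has the outcome condition."""
    rows = con.execute(
        """
        SELECT p.person_id,
               EXISTS (SELECT 1
                       FROM condition_occurrence co
                       WHERE co.person_id = p.person_id
                         AND co.condition_concept_id = ?)
        FROM person p
        """,
        (outcome_concept_id,),
    ).fetchall()
    return {person_id: int(flag) for person_id, flag in rows}


def join_features(labels, omics_by_person):
    """Inner-join omics features with phenotype labels on person_id.

    Returns (feature_names, rows), where each row is
    (person_id, feature_values, label). Missing markers become None.
    """
    feature_names = sorted(
        {name for feats in omics_by_person.values() for name in feats}
    )
    rows = []
    for person_id in sorted(labels.keys() & omics_by_person.keys()):
        feats = omics_by_person[person_id]
        values = [feats.get(name) for name in feature_names]
        rows.append((person_id, values, labels[person_id]))
    return feature_names, rows
```

The resulting rows could then be handed to whatever modelling code you use locally. Because the omics features live outside the CDM, each site would need its own version of the join step, which is one way to see why option 1 is hard to run across a network.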

Let us know what you decide and how it goes.

@Christian_Reich Thank you for laying out the different options. I think option 1 is going to be the path that will work best for this particular case.

I didn’t find a place where the sequence data itself would land. Does that mean it still sits in an external repository?
So what would be “4” then?

True. But I think there is little appetite for anything going deeper into sequencing analysis and variant calling. There are tons of public and commercial offerings, like the Galaxy project.

True. I tried this one ~5 years ago, and my understanding is that they don’t handle large-scale clinical data.
