
Bioinformatics / external resource conceptual vocabularies

All,

We’re working with a client on an extension of the v5 model to support metadata about the many genomic (and other) data files they have produced. This extension will give them a record of their external resources (file, URI…), resource formats (Illumina BAM…), and resource contents (DNA methylation), with relationships to the patient and other related domain objects (visit, specimen) via fact_relationship.
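To make this concrete, here is a rough sketch of the kind of table we have in mind. The table name, columns, and concept IDs below are our working draft for illustration only; none of this is part of the released CDM:

```python
import sqlite3

# Working draft only: the RESOURCE table and all concept IDs below are
# illustrative placeholders, not part of the released OMOP v5 CDM.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (
    resource_id                 INTEGER PRIMARY KEY,
    person_id                   INTEGER NOT NULL,  -- link to the CDM person
    resource_type_concept_id    INTEGER,           -- e.g. file vs. URI
    resource_format_concept_id  INTEGER,           -- e.g. Illumina BAM, gz
    resource_content_concept_id INTEGER,           -- e.g. DNA methylation
    resource_uri                TEXT               -- file path or external URI
);
""")
conn.execute(
    "INSERT INTO resource VALUES (?, ?, ?, ?, ?, ?)",
    (1, 42, 900001, 900002, 900003, "/hpc/storage/wgs/subject_42.bam"),
)
```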

I’ve been using the NCBO BioPortal to identify candidate vocabularies / ontologies that would suffice for my needs. So far, the following ontologies appear to fit the bill (in this order):

  1. Software Ontology (SWO) - specifically the “information content entity” tree
  2. Eagle-I Resource Ontology (ERO) - leverages SWO

Can anyone share the process for review and inclusion of a new terminology into the CDM Vocabularies? I would love to be able to add this to my local DB following the accepted processes used for the current vocabularies.

Thanks,
Bill

Bill:

Questions:

  1. What do you need the “external resource” for in modeling the data?
  2. “Resource content” - are you trying to cover the type of variants?
  3. Why fact_relationship? Why not just person_id and date? The fact_relationship table shouldn’t create a new alternative way to link tables together. The most common link should be to the Person, and then everything that happens to her.

The process we were anticipating is the following:

  • Define the problem (done)
  • Define an analytical use case (partially done: what genomic information do you want to capture, and what for?)
  • Provide a strawman
  • Try it out
  • Refine
  • Build the vocabulary ETL into concept_stage, concept_relationship_stage, concept_synonym_stage (see the sketch after this list)
  • Test and release
  • Smile.
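For orientation, staging a single term could look roughly like this; the column list is abridged and approximate, and the real schema and conventions live in the vocabulary ETL documentation:

```python
import sqlite3

# Abridged sketch of the vocabulary staging step; the real concept_stage
# schema has more columns -- consult the vocabulary ETL documentation.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE concept_stage (
    concept_name     TEXT,
    vocabulary_id    TEXT,
    concept_class_id TEXT,
    concept_code     TEXT,
    valid_start_date TEXT,
    valid_end_date   TEXT
)""")
# 'information content entity' is the SWO/IAO class Bill mentioned; the
# class id and dates here are illustrative.
conn.execute(
    "INSERT INTO concept_stage VALUES (?, ?, ?, ?, ?, ?)",
    ("information content entity", "SWO", "Class",
     "IAO_0000030", "1970-01-01", "2099-12-31"),
)
```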

C

Our customer has several thousand fully sequenced trios of patients (WGS, variants, RNAseq) for a long-term longitudinal study. The metadata for these genomic files is available in their data warehouse. Their bioinformatics team has a workflow where they identify cohorts of patients using established cohort designations and through queries against their EHR data (demographics, visits, specimens…), and then select files of interest for investigation on their HPC cluster. We believe that by adding a “resources” table to the model we can better enable the identification of desired 'omics data for their bioinformatics pipelines, while getting their data into the CDM, where we can help them leverage the outcomes-related science being created in OHDSI.

We also hope to pull some discrete 'omics information into the CDM (SNPs, perhaps). The resources table allows us to link out, through the use of URIs, to external annotation resources like dbSNP.
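For example, the URI for a dbSNP annotation is just a function of the rsID, so a resource row can be as light as one string (the tiny helper below is illustrative):

```python
def dbsnp_uri(rsid: str) -> str:
    """Build a dbSNP annotation URI from an rsID such as 'rs53576'."""
    return f"https://www.ncbi.nlm.nih.gov/snp/{rsid}"

print(dbsnp_uri("rs53576"))  # https://www.ncbi.nlm.nih.gov/snp/rs53576
```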

I’m trying to model the fact that these resources (files) can have the same format (gz compressed) but may contain different data (indels, BAM, SNPs, genome…).

Mapping to person is great, but when dealing with specimen collections and labs across long-term trials, our clients are requesting the ability to more easily identify pre-, during-, and post-treatment data. I’m not really a fan of expecting dates to match, especially when the patient/visit data is from an EHR and the labs/specimens are from a LIMS.

Bill

Bill:

Interesting. Questions:

  1. What are you trying to achieve? Do you want to model genetic variants, add them to the CDM, and then use all the goodies from the methods group to find causal relationships?
  2. You are saying “trios of patients”. By “trios” do you mean families, or do you mean the data types WGS, variants and RNAseq?
  3. What are the “resources” you are mentioning? I understand there are indels and SNPs, but BAM and Genome are essentially everything in the sequence. Do you want to have just the SNPs and other variants, or the entire sequence information? The latter would probably break the disk. The former would be an interesting challenge we should take on. But even the SNPs would be big, on the order of 20k per patient.
  4. The last question, about the specimen dates: why is this a problem? We know when the treatment happens; any specimen after that is collected after that. It doesn’t have to match more than that. What am I missing?

Bigger question: should we start a session and bring these people in? I would contribute folks who know this stuff inside and out, and we could ask Vojtech; I think Nigam is also dipping his toes into the water.

Let me know.
C

Great, valuable questions and answers. Thanks, Bill and Christian.

Here - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3824370/ - the author mentions that "For example, although SAFTINet does not have a requirement to represent primary “-omics” or sequence data, the OMOP model is robust enough to accommodate such data in future."

Questions:

  1. Can the v5 model support the storage of WGS (entire sequence information), including the related vocabularies? If yes, a brief how-to or a reference to online documentation would be of immense help.
  2. If the answer is ‘no’ to the above question, then is there a plan in the OHDSI’s roadmap to include the ability to incorporate 'omics data including the multitude of vocabularies used in this area?
  3. Slightly off the genomics topic - is there an example showing how to populate the “cohort_definition” and “attribute_definition” tables? I am interested in understanding the kind of data these tables can hold and how to generate it.

Many thanks,
Sekhar H.

Hi Sekhar:

Yes, you nailed the gist of this conversation. Let me answer your questions one by one:

  1. Today - no. We have nothing for genomic information. We are thinking of modeling this out and adding it to the model. This is what’s happening right now.
  2. There is. The roadmap for an addition works like this:
  • Is there a type of analysis that would involve these new data?
  • If yes, provide a use case so we can start the modeling.
  • If possible, produce a strawman solution.
  • Let the community discuss it.
  • Refine and release.
  3. Good question. Let me get back to you on that. Definitely for the cohort definition, but I am not sure folks have created attribute_definitions yet.

C

This is a funded pilot project. First, we’re trying to map their current data model to the CDM in order to get them in line with our product and the features enabled by OHDSI/OMOP. The use case we’re working on is to enable cohort discovery by the bioinformatics team against the CDM, and to let them obtain the file locations for external 'omic data of interest so they can pull files down to their HPC environment for analysis.

Ideally, this would lead to extraction of more discrete data elements that could flow back into the CDM as measurements/observations/diagnoses (specific SNPs, biomarkers…).

Families. I’ll be leveraging fact_relationship for this.
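Roughly like this; the domain and relationship concept IDs in the sketch are placeholders, since the real “mother of” / “father of” concepts would come from the standardized vocabularies:

```python
import sqlite3

# FACT_RELATIONSHIP shaped as in CDM v5; all concept IDs below are
# placeholders for the real Person-domain and family-relationship concepts.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_relationship (
    domain_concept_id_1     INTEGER,  -- domain of fact 1 (here: Person)
    fact_id_1               INTEGER,  -- person_id of the parent
    domain_concept_id_2     INTEGER,  -- domain of fact 2 (here: Person)
    fact_id_2               INTEGER,  -- person_id of the child
    relationship_concept_id INTEGER   -- 'mother of' / 'father of'
)""")
PERSON_DOMAIN = 900000  # placeholder
conn.executemany(
    "INSERT INTO fact_relationship VALUES (?, ?, ?, ?, ?)",
    [
        (PERSON_DOMAIN, 101, PERSON_DOMAIN, 103, 900101),  # mother -> child
        (PERSON_DOMAIN, 102, PERSON_DOMAIN, 103, 900102),  # father -> child
    ],
)
```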

The resources are the file locations on their storage environment, along with several metadata fields describing them.

This is a client request from our oncology use case, which heavily leverages specimens. They want to easily identify specimens from specific periods of a clinical trial (pre, cycle N, post…). The visits/procedures come from one system and the specimens from another, so I’m trying to control for variability in dates across the systems.

Bill

How do you do this intercalation, putting your answers into the middle of my questions?

Here is what scares me: “The use case we’re working on is to enable cohort discovery by the bioinformatics team against the CDM, and to let them obtain the file locations for external 'omic data of interest so they can pull files down to their HPC environment for analysis.” The words “pull files”. Why wouldn’t we model the content? The analysis of these data happens by querying the relational data and then doing some fancy stats with it. Files are notoriously disobedient, and you can’t enforce any structure on them, which makes standardized analyses impossible. Essentially what you said in your second sentence (“Ideally, this would lead to extraction of more discrete data…”).

Why are the families “trios”? Why not quartets? Or families of 12?

Resources: I am very wary of introducing files and storage details as part of a model. If we want to model the genomic data, we would just do that. Can’t be that hard. You’ve got a bunch of different variant types, you’ve got locations, and maybe quality and coverage information, and you are good to go. There is a company called Knome that sells this for a living. Do you know them?

“They want to easily identify specimens from specific periods of a clinical trial (pre, cycle N, post…)” - I understand. However, I would not make that part of the model, as the model already supports it. Instead, I would create standard queries answering the needs these guys have.
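Something along these lines, sketched with a hard-coded treatment window; in practice the window would come from drug_exposure or a cohort table rather than a literal date:

```python
import sqlite3

# Sketch of a 'pre-treatment specimens' standard query against existing
# CDM fields; the treatment date is hard-coded here for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimen (specimen_id INTEGER, person_id INTEGER, specimen_date TEXT)"
)
conn.executemany(
    "INSERT INTO specimen VALUES (?, ?, ?)",
    [(1, 42, "2014-01-05"), (2, 42, "2014-06-20")],
)
treatment_start = "2014-03-01"  # would come from drug_exposure in practice
pre_treatment = conn.execute(
    "SELECT specimen_id FROM specimen WHERE person_id = ? AND specimen_date < ?",
    (42, treatment_start),
).fetchall()
print(pre_treatment)  # [(1,)]
```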

Who are they anyway? Or can’t you tell?

C

I select the text in your post and click the “quote reply” popup link that appears. Then I manually control the quote markup.

This organization has several thousand genomes at petabyte scale. All of this data is never going into an RDBMS as discrete data. In this case we’ll likely leverage the CDM for clinical data and most of the extractable discrete data, and other means for the raw genomics.

The initial study was for mother/father/child trios. As often happens, some of these family groups are expanding.

I don’t know of Knome. You don’t need to think of resources as just files. They could be URLs to external annotations as well (e.g., http://www.snpedia.com/index.php/). This type of resource is the real target.

Yeah. Queries and cohorts of specimens are the other approach.

Can’t say at this point, but we’ve made it clear that there is a ton of work underway with CDM. We’ll raise the option of bringing in some OHDSI folks and see if they are interested.

Bill

Thanks, Bill. I learned something.

Gotcha. You want to link the raw sequence files. Yes, we don’t want them. We want the result of the variant calling here, as the earliest work product. Essentially, whatever goes into the VCF files. Would you agree?

But of course we could have a field to identify raw data information. The actual information would then reside outside the CDM. That would work. Like the PAYER_PLAN_PERIOD - we don’t have any payer information per se either.

So, you would be fine with the FACT_RELATIONSHIP solution?

Understood. We could add that to the new GENOMIC_VARIATION table.
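Purely as a strawman (no such table exists in the CDM yet), the fields might echo what we discussed above: variant type, location, quality, coverage, and a pointer back to the raw resource:

```python
import sqlite3

# Strawman only: GENOMIC_VARIATION is not part of the CDM; every field
# below is a proposal echoing the discussion in this thread.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE genomic_variation (
    variation_id              INTEGER PRIMARY KEY,
    person_id                 INTEGER NOT NULL,
    variation_type_concept_id INTEGER,  -- SNP, indel, CNV, ...
    chromosome                TEXT,
    position                  INTEGER,
    reference_allele          TEXT,
    alternate_allele          TEXT,
    quality                   REAL,
    read_depth                INTEGER,  -- coverage at the locus
    raw_resource_id           INTEGER   -- link to the raw-data resource row
)""")
```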

Yeah, bring them on. Tell them we don’t bite, we are very friendly.

:smile:

Thanks for clarifying, Christian.

I think we will need the ability to store genomic sequence and functional data. Let me try to clarify (there are many open-ended questions for which we may not have answers yet, but I hope this provides a direction to think and advance):

Purpose: (a) perform complex queries and analyses, (b) browse the data, and (c) possibly provide some visualization.

Data models in this area can be broadly classified into two categories - broad and deep.

A broad database model stores essentially a single kind of data, but stores such data from many organisms. Examples of broad databases include Swiss-Prot (for protein sequences), PRINTS (for protein fingerprints) and WIT (for pathway data).

A deep database model focuses on one or a small number of species, but stores many different kinds of data, generally including both sequence and functional data. Examples of deep databases are MIPS, SGD and YPD.

The requirement that I foresee is for a deep database model, in that it stores many different kinds of data from a single organism (say, the human).

The work you are doing now to extend the CDM should not specifically seek to subsume the functionality of existing databases, many of which do a good job at storing, organising and disseminating biological data. Instead, the emphasis should be on the close association of analyses with the stored data, so that users interact with the database principally in terms of the analysis tasks they want to carry out, and not so much in terms of the stored data.

For example, when analysing gene expression data, it may be useful to have access to the sequences upstream of the genes, or to the cellular location of their protein products.

Some analyses that I would like to see supported in the CDM (a toy sketch of the first one follows this list):

  1. Rather than analysing transcriptome data in a spreadsheet and asking which genes are up- or down-regulated, it should be possible to ask which mRNAs encoding membrane-associated proteins are up-regulated.

  2. Relating Gene Expression to Gene Structure.

  3. Relating Gene Expression to Cellular Location.

  4. Relating Gene Expression to Chromosome Position.

  5. Relating Regulatory Sequences to Protein-Protein Interactions.
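To make (1) concrete, here is a toy sketch; the table and column names are invented for illustration and stand in for real expression and annotation sources:

```python
import sqlite3

# Toy sketch of analysis (1): intersect up-regulated genes with a
# made-up annotation table of cellular locations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE expression (gene TEXT, log2_fold_change REAL);
CREATE TABLE gene_annotation (gene TEXT, cellular_location TEXT);
INSERT INTO expression VALUES ('EGFR', 2.1), ('TP53', -0.4);
INSERT INTO gene_annotation VALUES ('EGFR', 'membrane'), ('TP53', 'nucleus');
""")
up_regulated_membrane = conn.execute("""
    SELECT e.gene
    FROM expression e
    JOIN gene_annotation a ON a.gene = e.gene
    WHERE e.log2_fold_change > 1.0
      AND a.cellular_location = 'membrane'
""").fetchall()
print(up_regulated_membrane)  # [('EGFR',)]
```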

Hope this helps.

Also, I am looking forward to a solution for the “cohort_definition” and “attribute_definition” types of data.

P.S.: Do you have a timeline in mind for incorporating sequence and functional information into the model? If so, would you be kind enough to let me know?

Many thanks,
Sekhar H.

Sekhar:

Wait. MIPS, SGD and YPD contain general curated information about the species, while the CDM contains the concrete information about individuals, which you can then study in cohorts. So I was thinking the extension of the CDM would contain the genetic variants of individual subjects, so we can start doing causal relationship detection between these variants and the outcomes (phenotypes). Is that not what you have in mind? The YPD stuff would at best be in the vocabulary in my model.

I like these. They are the use cases we need. However, now I am confused that you are introducing expression data. I was under the impression we are talking germline or somatic mutations or other variants. That’s what Bill was mentioning (WGS, variants, RNAseq), even though you could use RNAseq to derive expression levels.

Christian -

Sorry for the radio silence!

My point was: effective analysis of genome sequences and associated functional data requires access to many different kinds of biological information, regardless of the species (human, non-human, etc.). What would be interesting is an extension to the model that integrates genome sequence data (WGS, SNP, CNV, RNAseq, etc.) with functional data on the transcriptome and on protein-protein interactions in a single data warehouse.

Hope this makes sense.

Thanks,
Sekhar H.

Sekhar et al.

I am wondering whether we should open a little sub-workstream to solve this problem. This will come to us: even though few people have the data today, it should be a matter of months before these data start popping up and folks will want to be able to do standard genotype-phenotype studies.

Any appetite for that? In January? We could do it in person, or we could start virtually. I have a few people in the company who would be interested in collaborating, and who know NGS inside out.

Let me know.

Christian

Feel free to add a section to the wiki for Bioinformatics / Genomics under Project & Workgroups. It would be useful for keeping organized links and conclusions of the forum discussions.

Re the bioinformatics subgroup: I think we should start as a virtual group first. I would be very interested to join and contribute!

In our NIH CC repository, we have 800+ exomes in VCF format (and 1,000 more coming in 2015), stored both as files in BLOBs and as rows in a designated variation table, plus a genomic BI report (in Cognos) where you can combine phenotype with genomic data.

(See the report in the video here; jump to 9 min 30 sec: http://videocast.nih.gov/summary.asp?Live=15002&bhcp=1)
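For anyone unfamiliar with VCF, the fixed columns of a data line map onto variation-table rows roughly like this (toy illustration; a real pipeline would use a dedicated VCF parser):

```python
# Toy illustration of turning one VCF data line into a variation-table
# row; a production pipeline would use a dedicated VCF library.
line = "chr1\t10177\trs367896724\tA\tAC\t100\tPASS\t."
chrom, pos, rsid, ref, alt, qual, _filter, _info = line.split("\t")
row = (chrom, int(pos), rsid, ref, alt, float(qual))
print(row)  # ('chr1', 10177, 'rs367896724', 'A', 'AC', 100.0)
```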

Excellent, Vojtech. Will set it up.

What kind of basement wall is he standing in front of? :smile:

[quote="Christian_Reich, post:10, topic:157"]
Understood. We could add that to the new GENOMIC_VARIATION table.
[/quote]

Curious where OHDSI is with this table. Also, what vocabularies are being used for it?

Has there been any talk or communications on continuing this work? We have started to investigate the possibility of mapping this data and this thread actually provides some interesting use cases to work towards. I did not see a project created on the projects page as recommended.

t