Genomic Data in the CDM

Ajinkya_Patale · October 28, 2017, 2:24am

Please sign me up too.

clairblacketer · October 30, 2017, 4:00pm

Thank you everyone for the interest! Please fill out the Doodle poll with your availability. Right now the times are all in Eastern Standard Time, so please keep that in mind when filling it out.

Clair

davidfasel · October 31, 2017, 4:34pm

Just some thoughts to get the conversation going.

Data model:
Vertical/EAV (similar to the existing data model of OHDSI), horizontal (Kyu Pyo Kim’s model), or a hybrid model (Seng Chan You’s model).
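
To make the vertical versus horizontal distinction concrete, here is a minimal Python sketch. The field names (`person_id`, `gene`, and so on), the identifier 100, and all values are made up for illustration and are not taken from any of the proposed models.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class VariantRow:
    """Horizontal layout: one row per variant, one column per attribute."""
    person_id: int
    gene: str
    chromosome: str
    position: int
    ref_allele: str
    alt_allele: str


def to_eav(variant_id: int, row: VariantRow) -> List[Tuple[int, str, str]]:
    """Vertical/EAV layout: one (entity, attribute, value) row per attribute."""
    return [(variant_id, name, str(value)) for name, value in asdict(row).items()]


# Placeholder values only; "GENE_A" is not a real gene symbol.
horizontal = VariantRow(1, "GENE_A", "1", 12345, "A", "G")
eav_rows = to_eav(100, horizontal)

for eav_row in eav_rows:
    print(eav_row)
```

The horizontal row keeps typed columns and is easy to query directly. The EAV rows accept new attributes without a schema change, but every variant becomes as many rows as it has attributes.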

Basic variant info / Annotations:
What annotations should we include? Just enough data to define a variant, or include some annotation data? The argument for including annotation data is that it is extremely useful for searching for variants or understanding why they were classified as pathogenic at the time. The argument against is that annotations can become obsolete and should be left in external data sources. Semi-static annotation data includes gene name, functional type, accession, rsID, quality scores from the chromatograms, etc. Rapidly changing annotation data includes MAF and versions of pathogenicity scores (e.g. CADD and PolyPhen).

Interpretation / Report:
A single field with all of the interpretation concatenated into one string, or should we define multiple fields for the interpretation? Genetic pathology reports are highly variable and we have to create a data model that accommodates all of them without becoming overly complex.

Scope:
What type of variants should be stored? My opinion on this is that the focus on storing genetic data in OHDSI should be limited to variants with a clinical interpretation. These would include genetic reports from pathology or variants identified by a robust Clinical decision support pipeline with high confidence of pathogenicity. OHDSI is not meant to have thousands of variants per individual; there are plenty of other systems that are meant to deal with that kind of research data (DNANexus, GeneInsight, cBioPortal, etc.).

Use cases:

Search for all conditions in patients with a pathogenic variant in a specific gene. (Find new links between phenos and genes.)
Discover variants which have been changed from pathogenic to benign. This may be important for patient notifications.
New discoveries in PGx
Let’s come up with more!

I’ve written a document that tries to capture some of these thoughts. I’ve also created a spreadsheet of the potential fields of the data model here. The “Proposed” sheet shows a list of proposed fields, and their FHIR counterparts. This is an aggregation of fields that I have seen across many pathology reports. I’ve also created a brief presentation which includes notes for each slide. Slides 7-9 describe trying to use the existing CDM with minimal changes. Slide 10 describes a potential horizontal data model.

These documents are missing the refinements or alternate models presented by @KKP1122, Yurang_Park, & @SCYou. I will add them soon and try to represent their work as best as I can. If anyone would like for me to add you as an editor to these documents, then please message me.

Looking forward to the discussion.

davidfasel · October 31, 2017, 5:31pm

Could you send us a link to your poster / data model? I can’t find it in the list of posters from the symposium.

rimma · October 31, 2017, 5:38pm

Clair,

Thank you for organizing. Please sign me up.

rimma · October 31, 2017, 5:52pm

David,

One important use case and modeling consideration: an absence of a variant may be as important for the analysis as its presence. The usual OMOP CDM convention “no record” means “NO” would not work in this case.

Looking forward to the discussion.

Thank you.

Christian_Reich · October 31, 2017, 6:45pm

Love the fact that this would be use case driven, and we don’t try to boil the ocean or recreate another variant calling and storage mechanism. Or fall into the “attic trap”, trying to store any potentially useful information by all means.

Don’t understand this. Whom do you want to notify of what?

Please do. I’d put in:

Create hypothesis generation or testing methods for connecting variants to any type of phenotype (could be Condition, but also timing of things, severity, pharmacological effect, etc.). We are the only ones who would be able to pull that off.

Can you define where we get that from? Generally, in the OMOP CDM we have no verbatim text (with very few exceptions), so annotations would have to be conceptualized.

How do you make that call?

SCYou · November 1, 2017, 12:15am

With help from @ShinSeojeong and my colleagues, our first draft of the Genetic CDM has been released on Google Drive.
The model is based on the ISO standard for reporting NGS results (ISO/TS 20428, ‘Health informatics - Data elements and their metadata for describing structured clinical genomic sequence information in electronic health records’).

This is our first draft and we need your thorough review and comments!
I agree with @rimma's comment: it is important to know that ‘there is no mutation in certain genes’. We need to figure out how to record the target genes of a targeted NGS panel.
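
One way to picture this, as a rough sketch only: if the list of genes targeted by the panel is recorded alongside the variants, each gene can be placed in one of three states instead of two. The enum, function, and gene names below are hypothetical and do not come from the draft model.

```python
from enum import Enum
from typing import Set


class GeneStatus(Enum):
    VARIANT_FOUND = "variant found"
    TESTED_NO_VARIANT = "tested, no variant found"
    NOT_TESTED = "not tested"


def gene_status(gene: str, panel_genes: Set[str], variant_genes: Set[str]) -> GeneStatus:
    """Classify a gene for one patient and one sequencing run.

    panel_genes: genes covered by the targeted panel (assumed to be stored).
    variant_genes: genes in which at least one variant was reported.
    """
    if gene in variant_genes:
        return GeneStatus.VARIANT_FOUND
    if gene in panel_genes:
        # Covered by the panel with nothing reported: a meaningful negative.
        return GeneStatus.TESTED_NO_VARIANT
    # Outside the panel: "no record" says nothing about this gene.
    return GeneStatus.NOT_TESTED


# Placeholder gene symbols for illustration.
panel_genes = {"GENE_A", "GENE_B"}
variant_genes = {"GENE_A"}

for gene in ("GENE_A", "GENE_B", "GENE_C"):
    print(gene, gene_status(gene, panel_genes, variant_genes).value)
```

Without the panel gene list, the second and third states cannot be told apart, which is exactly the case where "no record" stops meaning "no".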

Thank you, @davidfasel, for your comments.
Basic variant info / Annotations:
Basically, I agree with David’s thoughts. Annotation data is useful, but it changes rapidly and is very large. I would also prefer to leave this data in external data sources, if possible. That is the reason why we added a separate table for annotation or basic variant info.

Interpretation/Report
I think we can store the original pathology report and the genetic pathology report in the ‘note’ table of the existing CDM.

Scope
As @Christian_Reich said, I think it would be hard to define ‘limited to variants with a clinical interpretation’ (in our model, information on clinical implications is stored in the ‘variant_annotation’ table). I don’t think data on thousands of variants is overwhelming compared with what the current CDM already holds. We already store every single device, medication, and note in the CDM. The current variant_occurrence table in our model has 23 columns, and most patients have a single NGS result.

Use case
An ongoing project of mine is developing machine learning models to predict outcomes in cancer patients using combined genomic and clinical data. Owing to the great contributions of @Rijnbeek, @jennareps, @schuemie and their colleagues, it won’t be that hard to build this by modifying the feature extraction package and using the patient-level prediction package.

Another ambitious goal of mine is converting existing open genomic databases of cancer patients into the OMOP CDM. This would make it possible to leverage accumulated genomic and clinical databases to generate better evidence from combined genetic and clinical information. Collaboration with the oncology group is absolutely essential for capturing information from existing oncology registries in the OMOP CDM.

rwpark · November 1, 2017, 1:23am

Please sign me up.

llange5225 · November 1, 2017, 6:51pm

I would also like to join. Thanks!

mgurley · November 2, 2017, 7:33pm

Please sign me up.

KKP1122 · November 5, 2017, 3:49am

Dear David,

Sorry for the late reply. I had to catch up at the clinic during the week after being away for OHDSI.

Thanks for summarizing the models. You pointed out many things that were discussed while we were building the final proposal. We think the various points come down to balancing flexibility and operationality.

We first tried to use the observation tables in the way the hybrid model describes. However, we quickly found that it was too burdensome on the system and queries took forever. We work mostly with a gene panel of 250 genes, but some samples are from WES or WGS. The observation tables therefore grew out of hand and would require over 3 gigabases (the size of the human reference genome) x 10 rows.

We then went with our present model (the horizontal model). As you have pointed out, it is quite “rigid”. However, we thought it included the core information and was “simple” enough without losing important information. This allowed us to handle the large amount of data promptly. As the poster focused on the API, we did not mention that we allowed for flexibility through an additional observation table that could include new types of measurements.
YuRang told me she would put up our API on GitHub tomorrow. I hope you can critique it as you have in the ppt; we really appreciate your comments.

Best regards,
KP

I have added a point-by-point reply to your comments below.
Data model:
Vertical/EAV (similar to the existing data model of OHDSI), horizontal (Kyu Pyo Kim’s model), or a hybrid model (Seng Chan You’s model).

Basic variant info / Annotations:
What annotations should we include? Just enough data to define a variant, or include some annotation data? The argument for including annotation data is that it is extremely useful for searching for variants or understanding why they were classified as pathogenic at the time. The argument against is that annotations can become obsolete and should be left in external data sources. Semi-static annotation data includes gene name, functional type, accession, rsID, quality scores from the chromatograms, etc. Rapidly changing annotation data includes MAF and versions of pathogenicity scores (e.g. CADD and PolyPhen).

We believe that an input format carrying semi-static annotation data, like the MAF file, would be preferred. This was based on a lively interdisciplinary discussion focused on science and operations.

If we use a VCF file, put the variants in the measurement table and the annotation information in the observation table, and link them with a foreign key, the model becomes flexible. However, once the table size increases (e.g. WGS would go over 3 gigabases x 10 annotations per mutation), the parsing time becomes impractical. This could be overcome by including the annotation information within the input file, as a MAF file does. To address the flexibility issues, we included the essential components (chromosome, start, end, reference allele, variant allele) in the measurement table and the more in-depth data (e.g. hugo symbol, tumor_total_depth, tumor_reference_depth, tumor_alternative_depth, tumor_allele_frequency, normal_total_depth (optional), normal_reference_depth (optional), normal_alternative_depth (optional), mutation_status, HGVSc, HGVSp, strand, exon, intron, transcript_id) in an observation table, so that various data formats can be accommodated. To record the source and validity of the molecular diagnostics, we included an omics_meta table that holds items such as the panel version number and details of the test (e.g. WGS, WXS, targeted-seq, Sanger sequencing). This allowed us to include data from various sources as well as historical molecular data, which was usually produced with Sanger sequencing and immunohistochemistry.
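
A rough Python sketch of this split, assuming a MAF-like record has already been parsed into a dictionary. The key names for the core fields and the helper `split_record` are illustrative assumptions, not the actual column names or code of the model described here.

```python
from typing import Any, Dict, List, Tuple

# Essential components kept together in the core (measurement-style) row.
# These key names are assumed for the example.
CORE_FIELDS = ("chromosome", "start", "end", "reference_allele", "variant_allele")


def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], List[Tuple[str, Any]]]:
    """Split one parsed record into a core row and annotation (name, value) rows."""
    core = {name: record[name] for name in CORE_FIELDS}
    annotations = [
        (name, value)
        for name, value in record.items()
        # Optional fields that are absent (None) produce no annotation row.
        if name not in CORE_FIELDS and value is not None
    ]
    return core, annotations


# Placeholder record; values are invented for illustration.
record = {
    "chromosome": "1",
    "start": 100,
    "end": 100,
    "reference_allele": "A",
    "variant_allele": "T",
    "hugo_symbol": "GENE_A",
    "tumor_total_depth": 250,
    "normal_total_depth": None,
}
core, annotations = split_record(record)
print(core)
print(annotations)
```

The core row has a fixed shape regardless of the source format, while the annotation rows grow or shrink with whatever the input file provides.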

Interpretation / Report:
A single field with all of the interpretation concatenated into one string, or should we define multiple fields for the interpretation? Genetic pathology reports are highly variable and we have to create a data model that accommodates all of them without becoming overly complex.

As mentioned above, this is a trade-off between flexibility and operationality. If we define multiple fields for the interpretation, parsing simply takes too long because the observation table becomes enormous (e.g. WGS would go over 3 gigabases x 10 annotations per mutation). In addition, as prior pathology reports include Sanger sequencing and immunohistochemistry, the list of fields to include would just get longer and longer. Therefore, by limiting the annotation contents to the more robust elements that are used in everyday research and practice (chromosome, start, end, reference allele, variant allele), practicality in parsing and communication would be preserved. An additional observation table would handle the remaining data so that no information is lost.

Scope:
What type of variants should be stored? My opinion on this is that the focus on storing genetic data in OHDSI should be limited to variants with a clinical interpretation. These would include genetic reports from pathology or variants identified by a robust Clinical decision support pipeline with high confidence of pathogenicity. OHDSI is not meant to have thousands of variants per individual; there are plenty of other systems that are meant to deal with that kind of research data (DNANexus, GeneInsight, cBioPortal, etc.).

We think that all of the variants should be stored, because present clinical decision systems are limited. There are many instances where even the same variant is classified inconsistently (e.g. the same variant is pathogenic in one database while it is a variant of unknown significance in another). When we use knowledge bases, differences between versions over time will be another issue. Therefore, if we confine ourselves to storing variants that are clinically significant according to today’s knowledge, we may lose opportunities for tomorrow’s patients.

Use cases:

Search for all conditions in patients with a pathogenic variant in a specific gene. (Find new links between phenos and genes.)

We are presently working on this. We would be glad to discuss strategies.
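
As a sketch of what such a search could look like on toy in-memory data (the record layout, gene names, and condition names are invented for illustration; a real implementation would query the CDM tables instead):

```python
from collections import Counter
from typing import Dict, List


def conditions_for_gene(variants: List[Dict], conditions: List[Dict], gene: str) -> Counter:
    """Count conditions among patients carrying a pathogenic variant in `gene`."""
    carriers = {
        v["person_id"]
        for v in variants
        if v["gene"] == gene and v["classification"] == "pathogenic"
    }
    return Counter(c["condition"] for c in conditions if c["person_id"] in carriers)


# Placeholder data only.
variants = [
    {"person_id": 1, "gene": "GENE_A", "classification": "pathogenic"},
    {"person_id": 2, "gene": "GENE_A", "classification": "benign"},
    {"person_id": 3, "gene": "GENE_A", "classification": "pathogenic"},
    {"person_id": 4, "gene": "GENE_B", "classification": "pathogenic"},
]
conditions = [
    {"person_id": 1, "condition": "condition_x"},
    {"person_id": 1, "condition": "condition_y"},
    {"person_id": 2, "condition": "condition_x"},
    {"person_id": 3, "condition": "condition_x"},
    {"person_id": 4, "condition": "condition_y"},
]

result = conditions_for_gene(variants, conditions, "GENE_A")
print(result.most_common())
```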

Discover variants which have been changed from pathogenic to benign. This may be important for patient notifications.

As mentioned above, the significance of variants can change according to the knowledge base that is current at the time of analysis. Therefore, we think that this information should not be included in the essential components, but rather in the in-depth annotation table, which should be updated regularly.
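
A minimal sketch of that idea, assuming the classification of each variant is available as one snapshot per knowledge-base version. The variant identifiers, labels, and the helper `reclassified` are placeholders for illustration.

```python
from typing import Dict, List


def reclassified(
    old: Dict[str, str],
    new: Dict[str, str],
    from_label: str = "pathogenic",
    to_label: str = "benign",
) -> List[str]:
    """Return variants labelled `from_label` in `old` and `to_label` in `new`."""
    return sorted(
        variant
        for variant, label in old.items()
        if label == from_label and new.get(variant) == to_label
    )


# Two hypothetical snapshots of the annotation table.
snapshot_v1 = {"var1": "pathogenic", "var2": "pathogenic", "var3": "benign"}
snapshot_v2 = {"var1": "benign", "var2": "pathogenic", "var3": "benign"}

changed = reclassified(snapshot_v1, snapshot_v2)
print(changed)
```

Keeping the classification outside the core variant row means a comparison like this only touches the annotation snapshots, and the stored variants themselves never need rewriting.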

New discoveries in PGx

As the in-depth annotation table is subject to regular updates, we do not think this will be an issue.

Let’s come up with more!

We are discussing adding pathologist comments, export functions, and basic design templates for an annotation database that could be used across institutes.

cahilton · November 7, 2017, 10:01pm

Also interested.

tomwhite · November 9, 2017, 5:34pm

I’d like to join this too.

Jerome_Dixon · November 12, 2017, 9:25am

Just came across this:

Yurang_Park · November 13, 2017, 3:58am

Sorry for the late release; it was delayed by API validation.
We are releasing the common data model extensions and APIs created by AMC.
The links are as follows.
Genomic Common data model: https://github.com/yrpark/GenomicCommonDataModel
API: https://github.com/yrpark/GenomicWebAPI
I am not registered as a collaborator on the OHDSI GitHub, so I have posted them on my public GitHub first.

Please feel free to send us any comments or improvements on the Genomic CDM and API.

Sulev_Reisberg · November 13, 2017, 1:41pm

I’d like to join!

Vojtech_Huser · November 13, 2017, 5:08pm

A more precise link would be: https://github.com/yrpark/GenomicCommonDataModel/blob/cc274d0de0419050843c1478a3c9fb2ca44d5854/PostgreSQL/OMOP%20CDM%20ddl%20-%20PostgreSQL.sql#L702

hongna · November 22, 2017, 6:41pm

I am interested in this work. Please sign me up.

EranOr · November 27, 2017, 7:13pm

Hi Clair, I’d like to join.
Thanks.