Sorry for the late reply. I had to catch up at the clinic during the week after being away for OHDSI.
Thanks for summarizing the models. Actually, you pointed out many things that were discussed while we were building the final proposal. We think the various points come down to a balance between flexibility and operationality.
We first tried to use the observation tables in the way the hybrid model describes. However, we quickly found that this was too burdensome on the system and queries took forever. We work mostly with a gene panel of 250 genes, but some samples are from WES or WGS. The observation tables therefore became unmanageably large and would require over 3 gigabases (the size of the human reference genome) x 10 rows.
We then moved to our present model (the horizontal model). As you have pointed out, it is quite "rigid". However, we thought it included the core information and was "simple" enough without losing important information. This allowed us to handle the large amount of data promptly. As the poster focused on the API, we did not mention that we allowed flexibility through an additional observation table that can include new types of measurements.
YuRang told me she would put up our API on GitHub tomorrow. I hope you can critique it as you have in the ppt; we really appreciate your comments.
Below I add a point-by-point reply to your comments.
Vertical/EAV (similar to the existing data model of OHDSI), horizontal (Kyu Pyo Kim's model), or a hybrid model (Seng Chan You's model).
Basic variant info / Annotations:
What annotations should we include? Just enough data to define a variant, or include some annotation data? The argument for including annotation data is that it is extremely useful for searching for variants or understanding why they were classified as pathogenic at the time. The argument against is that annotations can become obsolete and should be left in external data sources. Semi-static annotation data includes gene name, functional type, accession, rsID, quality scores from the chromatograms, etc. Rapidly changing annotation data includes MAF and versions of pathogenicity scores (e.g. CADD and PolyPhen).
We believe that a format carrying semi-static annotation data, such as a MAF file, would be preferred as the input format. This was the outcome of a heated interdisciplinary discussion covering both science and operations.
If we use a VCF file, put the variants in the measurement table and the annotation information in the observation table, and link the two with a foreign key, the model becomes flexible; however, once the table size increases (e.g. WGS would go over 3 gigabases x 10 annotations per mutation), the parsing time becomes impractical. This can be overcome by including the annotation information within the input file, as a MAF file does. To address the flexibility issue, we put the essential components (chromosome, start, end, reference allele, variant allele) in the measurement table and the more in-depth data (e.g. hugo symbol, tumor_total_depth, tumor_reference_depth, tumor_alternative_depth, tumor_allele_frequency, normal_total_depth (optional), normal_reference_depth (optional), normal_alternative_depth (optional), mutation_status, HGVSc, HGVSp, strand, exon, intron, transcript_id) in an observation table, so that various data formats can be accommodated. To record the source and validity of the molecular diagnostics, we added an omics_meta table that holds items such as the panel version number and details of the test (e.g. WGS, WXS, targeted-seq, Sanger sequencing). This allows us to include data from various sources, as well as historical molecular data, which was usually generated with Sanger sequencing and immunohistochemistry.
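To make the split concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is only an illustration under assumptions, not our actual schema or API: the column names (start_position, end_position, field_name, field_value, test_type, panel_version), the key/value layout of the observation table, and the helper load_variant are all hypothetical, and the input row is a plain dict standing in for one parsed MAF-like record.

```python
import sqlite3

# Essential components named in the post; every other key in a row is treated
# as in-depth data and stored in the observation table.
CORE_FIELDS = ("chromosome", "start", "end", "reference_allele", "variant_allele")

# Hypothetical, simplified schema: one metadata row per test, one measurement
# row per variant, and one observation row per in-depth field.
SCHEMA = """
CREATE TABLE omics_meta (
    omics_meta_id INTEGER PRIMARY KEY,
    test_type     TEXT NOT NULL,   -- e.g. WGS, WXS, targeted-seq, Sanger sequencing
    panel_version TEXT
);
CREATE TABLE measurement (
    measurement_id   INTEGER PRIMARY KEY,
    omics_meta_id    INTEGER NOT NULL REFERENCES omics_meta(omics_meta_id),
    chromosome       TEXT NOT NULL,
    start_position   INTEGER NOT NULL,
    end_position     INTEGER NOT NULL,
    reference_allele TEXT NOT NULL,
    variant_allele   TEXT NOT NULL
);
CREATE TABLE observation (
    observation_id INTEGER PRIMARY KEY,
    measurement_id INTEGER NOT NULL REFERENCES measurement(measurement_id),
    field_name     TEXT NOT NULL,  -- e.g. hugo_symbol, tumor_total_depth, HGVSc
    field_value    TEXT
);
"""


def load_variant(conn, omics_meta_id, row):
    """Store one parsed record: core fields in measurement, the rest in observation."""
    cur = conn.execute(
        "INSERT INTO measurement (omics_meta_id, chromosome, start_position,"
        " end_position, reference_allele, variant_allele)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (omics_meta_id, *(row[name] for name in CORE_FIELDS)),
    )
    measurement_id = cur.lastrowid
    # Optional fields that are missing (None) are simply not stored.
    extras = [
        (measurement_id, name, str(value))
        for name, value in row.items()
        if name not in CORE_FIELDS and value is not None
    ]
    conn.executemany(
        "INSERT INTO observation (measurement_id, field_name, field_value)"
        " VALUES (?, ?, ?)",
        extras,
    )
    return measurement_id
```

With this layout, a query that only needs the position and alleles of a variant never touches the observation table, and a new annotation type becomes a new field_name value rather than a schema change.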
Interpretation / Report:
A single field with all of the interpretation concatenated into one string, or should we define multiple fields for the interpretation? Genetic pathology reports are highly variable and we have to create a data model that accommodates all of them without becoming overly complex.
As mentioned above, this is a trade-off between flexibility and operationality. If we set out to define every field of the interpretation, parsing simply takes too long because the observation table becomes enormous (e.g. WGS would go over 3 gigabases x 10 annotations per mutation). In addition, as prior pathology reports include Sanger sequencing and immunohistochemistry, the list of factors to include would just get longer and longer. Therefore, by limiting the core contents to the more robust elements that are used in everyday research and practice (chromosome, start, end, reference allele, variant allele), we preserve practicability in parsing and communication. An additional observation table handles the remaining data so that no information is lost.
What type of variants should be stored? My opinion on this is that the focus on storing genetic data in OHDSI should be limited to variants with a clinical interpretation. These would include genetic reports from pathology or variants identified by a robust clinical decision support pipeline with high confidence of pathogenicity. OHDSI is not meant to have thousands of variants per individual; there are plenty of other systems that are meant to deal with that kind of research data (DNAnexus, GeneInsight, cBioPortal, etc.).
We think that all of the variants should be stored, because present clinical decision systems are limited. There are many instances in which the same variant is classified inconsistently (e.g. the same variant is pathogenic in one database but a variant of unknown significance in another). When we use knowledge bases, differences between versions over time are another issue. Therefore, if we confine ourselves to storing only the variants that are clinically significant according to today’s knowledge, we can lose opportunities for tomorrow’s patients.
Search for all conditions in patients with a pathogenic variant in a specific gene. (Find new links between phenos and genes.)
We are presently working on this. We would be glad to discuss strategies.
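As a rough illustration only (this is not our implementation, which is still in progress), one way such a search might look is sketched below with Python and SQLite. Everything here is an assumption: the three flattened tables (variant, annotation, condition_occurrence), their columns, and the function conditions_for_gene are hypothetical and much simpler than the real OHDSI tables.

```python
import sqlite3

# Hypothetical, flattened tables used only for this sketch.
DEMO_SCHEMA = """
CREATE TABLE variant (
    variant_id  INTEGER PRIMARY KEY,
    person_id   INTEGER NOT NULL,
    hugo_symbol TEXT NOT NULL
);
CREATE TABLE annotation (
    variant_id     INTEGER NOT NULL REFERENCES variant(variant_id),
    classification TEXT NOT NULL   -- e.g. pathogenic, benign
);
CREATE TABLE condition_occurrence (
    person_id      INTEGER NOT NULL,
    condition_name TEXT NOT NULL
);
"""


def conditions_for_gene(conn, gene, classification="pathogenic"):
    """Count distinct patients per condition among carriers of a matching variant."""
    query = """
        SELECT c.condition_name, COUNT(DISTINCT c.person_id) AS n_patients
        FROM variant v
        JOIN annotation a ON a.variant_id = v.variant_id
        JOIN condition_occurrence c ON c.person_id = v.person_id
        WHERE v.hugo_symbol = ? AND a.classification = ?
        GROUP BY c.condition_name
        ORDER BY n_patients DESC, c.condition_name
    """
    return conn.execute(query, (gene, classification)).fetchall()
```

Because the classification is read from the annotation table at query time, the result reflects whichever annotation version is currently loaded.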
Discover variants which have been changed from pathogenic to benign. This may be important for patient notifications.
As mentioned above, the significance of a variant can change with the state of the knowledge base at the time of analysis. Therefore, we think that this information should not be included in the essential components, but rather in the in-depth annotation table, which should be updated regularly.
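For example, if each regular update of the annotation table is kept as a snapshot, reclassified variants can be found by comparing two snapshots. The sketch below is a minimal Python illustration under assumptions: the snapshots are plain dicts keyed by the essential components, and the function find_reclassified and the label strings are hypothetical.

```python
from typing import Dict, List, Tuple

# Key built from the essential components:
# (chromosome, start, end, reference allele, variant allele)
VariantKey = Tuple[str, int, int, str, str]


def find_reclassified(
    previous: Dict[VariantKey, str],
    current: Dict[VariantKey, str],
    from_label: str = "pathogenic",
    to_label: str = "benign",
) -> List[VariantKey]:
    """Return variants labelled from_label in the old snapshot and to_label in the new one."""
    return sorted(
        key
        for key, old_label in previous.items()
        if old_label == from_label and current.get(key) == to_label
    )
```

The returned keys could then be joined back to the stored variants to list the patients who may need to be notified.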
New discoveries in PGx
As the in-depth annotation table is subject to regular updates, we do not think this will be an issue.
Let’s come up with more!
We are discussing adding pathologist comments, export functions, and basic design templates for an annotation database that could be used across institutes.