
Question about SNPs data handling in the OMOP CDM


Our research group has been using the CDM for a while now, and following some recent developments, we are going to get access to genotype data for several thousand patients, with the aim of using this data for SNP analyses.
We are interested in loading this SNP data into the CDM, and after some searching in these forums, I have come across Genomic-CDM. Is anyone here using it on a regular basis for SNP data analysis?
We would appreciate some guidance, as well as tips and tricks, from an experienced user.

Thank you in advance.


What kind of SNP data do you have? Somatic for oncology, or germline for all sorts of diseases?

Hi Christian,

We have germline for all sorts of diseases.

I see. Right now, we have made a collection for cancer. If we wanted to extend that we should figure out how to create a list of meaningful SNPs. Do you have anything like this?

Hey Christian,

I am very sorry for the late reply.
Is the collection you are talking about implemented in Genomic CDM?

We have some ideas for the SNPs we would like to analyse in our projects, mainly related to immune cell or function related genes (interleukin receptors, toll-like receptors … etc.)

Does this answer your question?

Sincerely yours, Doruk

No. It’s implemented through a Measurement vocabulary called OMOP Genomic. It contains variants, either of defined sequences or generally of a gene. There are no categorical concepts like families of genes (interleukin receptors) or pathways. There are also no concepts for gene or domain functions that may be affected.

Show us what you have.

Dear @Christian_Reich , @LawrenceArcher and all,
I am continuing the topic on this question because I think my questions are mostly relevant to the SNP data handling in OMOP CDM.

Firstly, please let me briefly describe the type of data we are working on. We are working on a healthy birth cohort where the objective is to study how conditions in pregnancy and early childhood influence the subsequent health and development of women and children. The subjects in the cohort consist of mother and child. We have the demographics profile, biomarkers profile, the maternal and child metabolic and body composition, the sleep pattern, life event and social relationship, paternal factors, imaging and omics data among others. The omics data include genome, transcriptome, lipidome, proteome, etc.
In the cohort, some of the subjects (women) developed GDM (gestational diabetes mellitus) and we have also observed some Type 2 Diabetes patients in the follow-up. Therefore, we are interested not only in the variants that are linked to the GDM, but also the risk factors contributing towards postpartum Type 2 Diabetes development.

We have already put some effort into mapping the demographics profile, clinical measurement as well as other observations to the OMOP CDM. Additionally, we also wanted to map the omics data to the OMOP CDM. In this effort, we start our journey with mapping the genomic data first.

To give you an idea, our genomic data consists not only of array genotyping data but also of WGS (whole genome sequencing) data for some of the subjects, so what we have are the germline variants of the subjects. The question is how to map these germline variants to the OMOP CDM, in particular to the OMOP Genomic vocabulary. You may remember the difficulties we described in the meeting in Dec 2023. We are happy to see that the latest release of the OMOP Genomic vocabulary resolved some of the issues we faced with the previous release.

So far, we have established a preliminary workflow for mapping our genomic data to OMOP Genomic, together with a non-standard genomic vocabulary that we created internally. We shared this experience with the OHDSI APAC community in the 4th April meeting. After getting some feedback, we reevaluated our approach and came up with some questions that we would like to raise.

I hope the above description gives you an overview of what we have, and the below questions (as well as points of discussion) will revolve around it. I am still learning, and I could be wrong in certain aspects, so please correct me if that’s the case. Thanks so much.

Question 1: Do you foresee that we can use the OMOP Genomic for germline mutation as well? If yes, may I know what is the best approach to record the homozygous REF allele? If not, is there any plan to distinguish the germline mutation from somatic mutation? Recently, I have realized that in LOINC vocabulary there are concepts that could be used to record the different genotypes (in the field value_as_concept_id). For example: concept id 36660258, which can take the answers of “A/A (homozygous)”, “G/A (heterozygous)” or “G/G (wild type)”. I am wondering if a similar approach can be used for recording of germline mutation using OMOP Genomic vocabulary. Any advice is highly appreciated.

Question 2: Related to 1st question, I have a question regarding the approach for “Result recording” as described in this document ([Genomic Variants in the OMOP CDM – August 8, 2023](https://ohdsiorg.sharepoint.com/:b:/r/sites/Workgroup-Oncology/Shared Documents/Oncology -Omics Subgroup/Genomic WG Goals and Status Reports/Genomic Variants in the OMOP CDM.pdf?csf=1&web=1&e=uXrzr2)). It is recommended that for “Result recording”, you MUST fill in the field value_as_concept_id as “Positive (concept_id=9191)”, “Negative (concept_id=9189)” or “Equivocal (concept_id=4172976)”. May I know how this is defined? What is the meaning of “Positive” and “Negative” in this case? Is it “positive” with mutation? Or “positive” in the case of phenotype or drug effect observation?

Point of Discussion 1: I watched the video recording ([2024 Onc WG Genomic Meeting Series_4th Tuesday-20240326_090842-Meeting Recording](https://ohdsiorg.sharepoint.com/:v:/r/sites/Workgroup-Oncology/Shared Documents/Oncology -Omics Subgroup/Recordings/2024 Onc WG Genomic Meeting Series_4th Tuesday-20240326_090842-Meeting Recording.mp4?csf=1&web=1&e=wI6TNI)) with respect to VRS. We are glad to know that there is an ongoing discussion on this topic. In fact, this topic is currently in discussion within our local community on representing genomic data in GA4GH VRS schema. Is there any plan to integrate VRS in KOIOS (@LawrenceArcher) for mapping to OMOP Genomic vocabulary? I think this will be very helpful in further strengthening the capability of KOIOS for mapping to OMOP Genomic standard concept.

Point of Discussion 2: From the document ([Genomic Variants in the OMOP CDM – August 8, 2023](https://ohdsiorg.sharepoint.com/:b:/r/sites/Workgroup-Oncology/Shared Documents/Oncology -Omics Subgroup/Genomic WG Goals and Status Reports/Genomic Variants in the OMOP CDM.pdf?csf=1&web=1&e=uXrzr2)), I have learned that the desired variants to be included are “Clinically relevant variants (driver mutations, frequent correlates, mutations relevant for drug effect)”, which is great. We are wondering whether the OMOP Genomic vocabulary can be extended to include the ACMG SF v3.2 list (PMID: 37347242). I am not sure if anyone has mentioned this before, but we think it is a good source: it consists of 81 genes that have been recommended as the minimum list of gene-phenotype pairs for opportunistic screening, to facilitate identification and/or management of risks for selected genetic disorders through established interventions aimed at preventing or significantly reducing morbidity and mortality (PMID: 27854360). Based on our current observation, there could be approximately 40,000 additional risk variants to be included in the OMOP Genomic vocabulary.

Sorry for the lengthy write-up and questions.
We would be happy either (1) to have a meeting for discussion or (2) simply to continue through this forum.

Thanks very much for your input, support and advice.

Best regards,

Hi @wint:

Welcome to the family. And as a hint: The likelihood of and response time to an OHDSI Forum post are inversely proportional to its length. :slight_smile: But let me create the exception that proves the rule:

I don’t see why not. We have been focused on the somatic mutations relevant to cancer because that use case popped up first. But folks have been asking about germline mutations, and this should be picked up.

There are two approaches to variants: the Closed World approach, which creates an OMOP-specific repository of clinically relevant variants that can then be used in the typical observational research manner, as covariates in statistical analyses; and the Open World approach, which allows anything that comes in: GCDM. The latter is an OMOP Expansion, meaning it is not a ratified part of the CDM proper but is designed to work together with the rest of the model.

I think you are talking about the former, meaning, you would like to add variants to the OMOP Genomic vocabulary, relevant to your line of research. Should not be a problem if you can do the following:

  • You need to create a repository of clinically relevant variants that doesn’t blow up the OHDSI Vocabularies (i.e. <0.5M, if possible a lot less), but covers what is needed for your domain.
  • You need to solve the standardization problems:
    • Currently, we use HGVS notation. Koios can also process VCF, but not yet VRS.
    • The reference sequence must come from MANE. Accession numbers not in MANE need to be mapped, in the worst case using sequence alignment.
    • The modality of the variant (DNA, RNA, protein) must match what is measured. The accession numbers must point to the right modality (e.g. genomic variants should have reference genome accession numbers and coordinates).
    • Mutation definitions must be de-duped and disambiguated, such as deletions, substitutions, frameshifts, duplications, insertions, and extensions.
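
The HGVS and de-duplication points above can be sketched in Python for the simplest case, a single-nucleotide substitution taken from a VCF row. This is only an illustration, not how Koios does it: the `Snv` type and both helper functions are hypothetical names, and the accession and coordinate in the example are illustrative values that should be checked against your own reference build.

```python
from typing import Iterable, NamedTuple


class Snv(NamedTuple):
    """One single-nucleotide variant as read from a VCF row (hypothetical helper type)."""
    accession: str  # reference sequence accession for the chromosome
    position: int   # 1-based position, as in the VCF POS column
    ref: str
    alt: str


def to_hgvs_g(v: Snv) -> str:
    """Render a simple substitution in HGVS genomic (g.) notation."""
    ref, alt = v.ref.upper(), v.alt.upper()
    if len(ref) != 1 or len(alt) != 1 or ref == alt:
        raise ValueError(f"not a simple substitution: {v}")
    return f"{v.accession}:g.{v.position}{ref}>{alt}"


def dedupe(variants: Iterable[Snv]) -> list[str]:
    """Collapse records describing the same substitution, keeping first-seen order."""
    return list(dict.fromkeys(to_hgvs_g(v) for v in variants))


# Illustrative values only; verify accession and coordinate before use.
calls = [
    Snv("NC_000006.12", 26092913, "G", "A"),
    Snv("NC_000006.12", 26092913, "g", "a"),  # same variant, different letter case
]
print(dedupe(calls))
```

Deletions, insertions, duplications and frameshifts need proper normalization (for example left- or right-shifting in repeats) and are deliberately rejected here rather than guessed at.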

Since it has little relevance to oncology, we currently have no agreed-upon way to distinguish germline and somatic other than to pre-coordinate that into the name. It should work in my opinion. We also need to solve the problem of homo- or heterozygosity, as you mentioned.

The former. This is to record facts about variants present in the patient.

Not to throw a wet blanket on this, but to point out the issues we are encountering all the time: The list contains genes, not actual variants. In the paper, the table requires recording all “likely pathogenic” and “pathogenic” variants for these genes. Who defines what they are? For hereditary hemochromatosis, it actually provides a variant, HFE p.C282Y, and resolves the HFE to the accession number NM_000410.3, which is an obsolete non-MANE reference to an mRNA sequence. But the mutation p.C282Y indicates a replacement of a cysteine by a tyrosine at position 282 of the protein sequence, which is not referenced.
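
To make the modality point concrete, here is a minimal Python sketch that reads a one-letter missense description such as p.C282Y and spells it out in three-letter form. The function name `parse_missense` is hypothetical, and the sketch handles only simple one-residue substitutions. It shows that the description is purely protein-level and says nothing about which reference sequence it belongs to.

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

_MISSENSE = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def parse_missense(change: str) -> tuple[str, int, str]:
    """Split e.g. 'p.C282Y' into (reference residue, position, new residue)."""
    m = _MISSENSE.match(change)
    if m is None or m.group(1) not in ONE_TO_THREE or m.group(3) not in ONE_TO_THREE:
        raise ValueError(f"not a simple one-letter missense change: {change!r}")
    return ONE_TO_THREE[m.group(1)], int(m.group(2)), ONE_TO_THREE[m.group(3)]


ref_aa, pos, alt_aa = parse_missense("p.C282Y")
print(f"p.{ref_aa}{pos}{alt_aa}")  # prints p.Cys282Tyr
```

The parsed change still needs a protein reference sequence to be unambiguous; pairing it with an mRNA accession, as the paper does, mixes modalities.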

Bottom line: Lots of fun, but requires work.

Hi @Christian_Reich:
Sorry for my late response. Somehow, I didn’t see this post earlier, even though I came back to check.
Thanks very much for your kind reply and I will try to keep my post short and precise. :grinning:

I am glad to know that this can work for germline mutations as well. And this is what we will proceed to do.

Following the OncologyWG discussion that the OMOP Genomic vocabulary should include only clinically relevant variants with supporting evidence, our strategy for the moment is to (1) perform a standard discovery analysis to look for variants of interest in our domain; (2) add the variants of interest as a non-standard genomic vocabulary; and finally (3) submit a request to add them to the standard OMOP Genomic vocabulary. If you think there is a better strategy, please let us know. Thanks very much.

Is there any plan for mapping VRS (either as a computed identifier or as VRS JSON)? Also, what would be the better strategy: adding the VRS computed identifier as a synonym, or using Koios for the mapping? Is there any discussion going on internally? How about Categorical VRS?

Thanks for this information. I had mistakenly thought that we could only capture the positive case (i.e. the case with the mutation). Since “positive” means positive for the mutation and “negative” means no mutation, we can capture both the reference and alternative alleles for the subjects. Do I understand it correctly?

Understood perfectly. Thanks @Christian_Reich

Precisely. :grinning_face_with_smiling_eyes:
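
As a rough Python sketch of the convention confirmed above: an alternate-allele count from a genotype call can be turned into a MEASUREMENT-style record whose `value_as_concept_id` is Positive (9191) or Negative (9189), the IDs quoted earlier in this thread. The variant concept ID below is a made-up local placeholder, not a real OMOP Genomic concept, and `measurement_row` is a hypothetical helper that fills only a few of the table's fields.

```python
POSITIVE_CONCEPT_ID = 9191  # "Positive", as quoted in this thread
NEGATIVE_CONCEPT_ID = 9189  # "Negative", as quoted in this thread

# Placeholder only: a made-up local concept ID, not a real OMOP Genomic concept.
HYPOTHETICAL_VARIANT_CONCEPT_ID = 2_000_000_001


def measurement_row(person_id: int, alt_allele_count: int,
                    variant_concept_id: int = HYPOTHETICAL_VARIANT_CONCEPT_ID) -> dict:
    """Build a partial MEASUREMENT record for one variant in one person."""
    if alt_allele_count not in (0, 1, 2):
        raise ValueError("expected a diploid alternate-allele count of 0, 1 or 2")
    present = alt_allele_count > 0
    return {
        "person_id": person_id,
        "measurement_concept_id": variant_concept_id,
        "value_as_concept_id": POSITIVE_CONCEPT_ID if present else NEGATIVE_CONCEPT_ID,
    }


print(measurement_row(person_id=1, alt_allele_count=0))  # homozygous REF -> Negative
print(measurement_row(person_id=2, alt_allele_count=1))  # heterozygous   -> Positive
```

Heterozygous and homozygous ALT calls both come out as Positive here, so zygosity is lost; that is the open problem mentioned earlier in the thread.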


Is that used in the field? Do you have VRS-encoded variants coming out of your pipeline? I would start with HGVS.


Thank you @Christian_Reich for your helpful input.
Our plan is to standardize the pipeline to represent the variants in the GA4GH VRS standard. I will get back to you when we have more questions, especially on the VRS-encoded variants.

Thanks a lot.