Gene and Biomarker ontologies in OHDSI

Odysseus has been converting quite a few EHR and TA-specific data sets (for example FlatIron / Oncology, GI, etc.) into the OMOP CDM. One area that has consistently required intensive custom mapping is lab tests, specifically biomarker and gene data. From talking to a few folks on our vocabulary team and outside it, I understand that no comprehensive biomarker and gene vocabularies are available in the OMOP Standardized Vocabularies. LOINC does have some coverage, but it is incomplete and insufficient. Custom mapping of such data enables researchers to perform local research, but it defeats the goal of query interoperability. And using biomarker data in OMOP CDM queries has become very common.

Reflecting on my days back in Pharma TI/TV area, I remember there were quite a few Gene and Biomarker ontologies back in those days, some open source and some commercial. However, I admit it’s been quite a few years and things have changed quite a bit. So, opening up this discussion here:

  1. Is this an acute problem for others? Should one of our 2020 goals be to bring a comprehensive Gene/Biomarker/Genomics vocabulary or ontology into OHDSI?

  2. What curated ontologies exist in this space today that we can bring into OHDSI?

@Christian_Reich @Patrick_Ryan @hripcsa @rimma @Dymshyts @Alexdavv @mik


I have been trying to advocate in the Genomics Subgroup following the lead of mCODE. See here:



The HGNC and HGVS nomenclatures have starring roles in mCODE.

HUGO Gene Nomenclature Committee (HGNC)


HGNC and HGVS have similar starring roles in the beta Genomic CDM. See here

I think the primary challenge with HGNC/HGVS will be figuring out how nomenclatures fit into the OMOP vocabulary model.


So glad to hear this, @gregk. I’ve been reluctant to incorporate a genomic ontology into the OMOP vocabulary because I wasn’t sure who would manage it. Now that you’re in, I’d like to move this further.

Again, we need to think about the use case.

My use case is to build a ‘transferable PLP model based on clinical and genomic features from NGS’.
It was possible to build a ‘PLP model based on clinical and genomic features’ without an OMOP genomic vocabulary, but it’s not transferable (I cannot validate the model externally).

I think we need, at a minimum, complete vocabularies for the following:

  • All human genes. We can leverage HUGO’s ontology, as @mgurley mentioned.
  • Variants in HGVS, as @mgurley mentioned. I mean I need only the categories of variants, such as ‘Substitution’, ‘Deletion’, ‘Duplication’, etc. at the DNA level (ref), and likewise at the other levels (RNA and protein).
  • A representation of other kinds of variants, such as copy number variation, which is not supported by HGVS to my knowledge.

By concatenating the concept IDs for the gene name (HUGO), the position of the variant within the gene, and the category of variant (HGVS), it’s possible to make a unique and transferable covariate ID in the FeatureExtraction package.
Then we can build a ‘transferable PLP model based on clinical and genomic features from NGS’.
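
To make the idea concrete, here is a minimal sketch in Python (not the FeatureExtraction package itself, which is an R package). The concept IDs, the `VariantKey` type, and the `covariate_key` helper are all hypothetical and only illustrate the composition step. One assumption worth stating: the parts are joined with a delimiter, because plain concatenation of numeric IDs can be ambiguous.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """Hypothetical composite identifier for a variant-level covariate."""
    gene_concept_id: int          # assumed concept ID for the gene (e.g. from HUGO/HGNC)
    position: int                 # position of the variant relative to the gene
    variant_type_concept_id: int  # assumed concept ID for the HGVS category


def covariate_key(v: VariantKey) -> str:
    # A delimiter keeps the three parts recoverable and avoids collisions.
    return f"{v.gene_concept_id}:{v.position}:{v.variant_type_concept_id}"


def naive_concat(v: VariantKey) -> str:
    # Plain concatenation, shown only to illustrate the ambiguity.
    return f"{v.gene_concept_id}{v.position}{v.variant_type_concept_id}"


a = VariantKey(gene_concept_id=12, position=345, variant_type_concept_id=6)
b = VariantKey(gene_concept_id=123, position=45, variant_type_concept_id=6)

same_naive = naive_concat(a) == naive_concat(b)     # two different variants collide
same_keyed = covariate_key(a) == covariate_key(b)   # delimited keys stay distinct
```

Two sites that share the same vocabulary would produce the same key for the same variant, which is what makes the covariate transferable. Whether the key should be a string or a packed integer would depend on what the downstream package accepts.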


This topic is interesting! Just want to make sure that you are talking about the nomenclature for genomic variants (https://varnomen.hgvs.org/), and not the Gene Ontology (http://geneontology.org/) per se. Please clarify or confirm.


Yes, sorry for my misapplying “nomenclature” to the HUGO gene ontology. Only HGVS is a nomenclature. My guess is that HUGO will be less of a challenge to incorporate into the OMOP vocabulary structures than HGVS.

In addition to @mgurley’s mention, we’re talking about incorporating an ontology to represent genetic variants.

What I proposed is:

  1. Adding an ontology for the ‘gene’ itself to the OMOP vocabulary (by using HUGO), that is, adding every gene to the OMOP vocabulary (about 33,000 genes).
  2. Adding several concepts from HGVS for categorizing variants.

By combining the ‘gene’, the ‘category of variant’, and the ‘position’, I think we can make a unique and transferable ID for variants from NGS data.


If you want to include class-level information about genes, and variation types, I highly recommend using the Sequence Ontology to represent them, as this is the standard vocabulary used by the main sequence organizations (NCBI, EBI, etc). http://obofoundry.org/ontology/so.html

If you want class-level information about genotypes (haplotypes, haplogroups, major rearrangements), I’d go with the genotype ontology, which also integrates well with the Sequence Ontology. http://obofoundry.org/ontology/geno.html

For gene ids, using the HUGO identifiers is a great idea, so long as you also include equivalences to NCBI and EBI, and synonyms. Those can be obtained here: https://www.genenames.org/download/custom/

If you want to standardize the variants as a vocabulary, try to use those with identifiers leveraged by the genomics community, such as in dbSNP, dbVar, or ClinVar, which have strong deprecation policies. Using the gene name and position as an identifier (such as in HGVS) is fraught with issues, as symbols become deprecated and positions change across genome versions (which itself is not part of the standard HGVS notation).
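
A small Python sketch of the position problem described above. The two records are illustrative: they are meant to represent the same BRAF variant reported by two sites on different genome assemblies, and the coordinates and rsID should be verified against dbSNP before being relied on. The field names are hypothetical.

```python
# Two hypothetical site extracts describing the same variant,
# aligned to different genome assemblies.
records = [
    {"site": "A", "assembly": "GRCh37", "gene": "BRAF",
     "chrom": "7", "pos": 140453136, "rsid": "rs113488022"},
    {"site": "B", "assembly": "GRCh38", "gene": "BRAF",
     "chrom": "7", "pos": 140753336, "rsid": "rs113488022"},
]

# Keying on gene symbol + position: the same variant yields two different keys.
position_keys = {(r["gene"], r["pos"]) for r in records}

# Keying on a community identifier: one key, regardless of assembly.
stable_keys = {r["rsid"] for r in records}

n_position_keys = len(position_keys)
n_stable_keys = len(stable_keys)
```

Under these assumptions the position-based scheme splits one variant into two identifiers, while the identifier-based scheme keeps it as one. A position-based key would at least need the assembly recorded alongside it.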

Happy to discuss options here as I’ve worked on integrating this type of data together for the Monarch Initiative previously.


All very apropos. We are right now in the middle of figuring this out. Please come and help. We are planning to post our proposal here soon.

Has there been any progress or a proposal on this?

@Sanjay_Udoshi please come to the oncology WG-Genomic meetings (9-10 am EST, 2nd and 4th Tuesday of the month). We are still in the middle of it and making progress. We can use all the help. 🙂

@Sanjay_Udoshi And before you go there, please have a look at OMOP Genomic in Athena and the status update.

…with great progress since 2020, I’d say!

Thank you! I will make time to attend.

