Representing germline & somatic variants in OMOP — who's actively using the Genomic CDM extension?

Hi all,

Hoping to compare notes with folks who are operationally running the OMOP Genomic CDM extension (or a parallel schema) for translational genetics work — i.e., starting from a gene target, characterizing the phenotypic spectrum of variant carriers in EHR/claims data, identifying “human knockout” analogs of pharmacologic inhibition, and refining endpoints for genetically-defined patient populations.

OMOP gives a clean substrate for the clinical half of that. The genomic half is where I’d love to learn how this community is actually shipping it.

A few things I’d like to compare notes on:

  1. Variant representation. For germline variants, HGVS + ClinVar IDs alongside the OMOP Genomic Vocabulary seems like the default — but how are people handling somatic calls, structural variants, and CNVs in the same schema? Are they stored in measurement with custom concept IDs, in an extension table, or out of the CDM entirely with a join key?
  2. Linkage at scale. For those linking biobank-scale genotype data (UKBB, AoU, institutional biobanks) to OMOP EHR — what’s the practical pattern? A person_id ↔ subject_id crosswalk plus a parallel variant store? Anything that’s actually survived an analyst handoff?
  3. Phenotype-genotype cohorts. When you define a cohort by genotype (e.g., LoF carriers of gene X) and then ask OMOP for downstream conditions/labs/meds, what tools are you using? Atlas + a pre-filtered person list, or something more bespoke (HADES, custom SQL)?
  4. Genomic CDM extension status. Is the Genomic CDM extension the right place to converge, or have most groups built parallel schemas? I see vocabulary-side progress (HPO WG, etc.) but less visibility into who’s running it end-to-end in production.

If any of this resonates — even partially — happy to start a thread, set up a small working call, or just trade notes async. Would also be glad to raise this at the Global Symposium if there’s appetite.

Thanks!

@Shicheng_Guo:

Wonderful set of questions. You hit the nail on the head. Also, you are starting from the analytical use case, which is exactly how we should approach this. Has anybody done it? Not much. Somehow, either the egg or the chicken is always missing. We need to get over that hump.

In particular:

  • The OMOP Genomic vocab was born in oncology, so it is a collection of somatic variants with relevance to cancer, to the chagrin of people with syndromic and rare disease use cases and their quite different germline variants. We need to close that gap.
  • The OMOP Genomic vocab is based on ClinVar and HGVS, so we can treat those as interchangeable if done right.
  • Copy numbers: we don’t have those. We need them. I don’t know what people do, if anything.
  • Loss of function: There is no demand or development right now as far as I know.
  • Linking: I don’t see any difference to any other data source linkage. Everything should end up on person_id, so the queries are fast.
  • Phenotypes (well, genotypes, really): You can use Atlas if you have the variant in an OMOP table, for example, MEASUREMENT. Of course, it’s a little bumpy since germline variants don’t really have a timestamp that makes any sense, but you can just ignore that and pick any time.
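To make the MEASUREMENT idea in the last bullet concrete, here is a minimal sketch of what "put the variant in MEASUREMENT and pick any date" could look like. Everything specific in it is an assumption for illustration: the table is a heavily reduced subset of the real MEASUREMENT columns, the concept IDs are made-up local placeholders (not real OMOP Genomic concepts), and the fixed placeholder date is just one possible convention, not an agreed one.

```python
import sqlite3

# Hypothetical local concept IDs standing in for variant concepts; a real ETL
# would look up the proper OMOP Genomic concept in the CONCEPT table.
VARIANT_CONCEPT_ID = 2000000001
OTHER_VARIANT_CONCEPT_ID = 2000000002
# Assumed convention: germline findings all get the same arbitrary date,
# because no date is clinically meaningful for them.
PLACEHOLDER_DATE = "1970-01-01"

conn = sqlite3.connect(":memory:")
# Reduced subset of MEASUREMENT columns, enough to show the pattern.
conn.execute(
    """
    CREATE TABLE measurement (
        measurement_id INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL,
        measurement_concept_id INTEGER NOT NULL,
        measurement_date TEXT NOT NULL,
        measurement_source_value TEXT
    )
    """
)

def add_germline_variant(conn, person_id, concept_id, source_value):
    """Store one germline variant as a MEASUREMENT row with the placeholder date."""
    conn.execute(
        "INSERT INTO measurement "
        "(person_id, measurement_concept_id, measurement_date, measurement_source_value) "
        "VALUES (?, ?, ?, ?)",
        (person_id, concept_id, PLACEHOLDER_DATE, source_value),
    )

def carriers(conn, concept_id):
    """Return the sorted person_ids that have at least one row for the variant concept."""
    rows = conn.execute(
        "SELECT DISTINCT person_id FROM measurement "
        "WHERE measurement_concept_id = ? ORDER BY person_id",
        (concept_id,),
    )
    return [row[0] for row in rows]

add_germline_variant(conn, 101, VARIANT_CONCEPT_ID, "variant A (source notation)")
add_germline_variant(conn, 102, VARIANT_CONCEPT_ID, "variant A (source notation)")
add_germline_variant(conn, 103, OTHER_VARIANT_CONCEPT_ID, "variant B (source notation)")

carrier_ids = carriers(conn, VARIANT_CONCEPT_ID)
print(carrier_ids)
```

Because the date carries no meaning, any cohort entry logic built on these rows has to avoid "first occurrence" style time windows and treat the row purely as a carrier flag.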

Bottom line: We need to create all those conventions. I don’t think it matters that much what they are, as long as people start populating data tables and start doing research. How about starting with the Rare Disease WG?

This resonates with me.

Asieh Golozar did a really great job on highly complex data. Genomics isn’t easy and needs extending, but this will need a community effort. She developed an R script called Koios which reads a VCF file into OMOP. It doesn’t support MAF files (I think). GitHub - OHDSI/Koios: Tool to identify concept in the OMOP Genomic vocabulary from VCF and other files as well as HGVS notations

Recently an Italian group extended it (I think) to KOIOS-VRS (https://www.biorxiv.org/content/10.64898/2026.02.09.702490v1). The code is on GitHub.
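For anyone who hasn't worked with these notations: one step any VCF-to-vocabulary tool has to get through is turning a VCF record into an HGVS-style string that can be matched against concepts. The sketch below is not taken from Koios or KOIOS-VRS (I have not checked how either does it); it is only a rough illustration that handles single-nucleotide substitutions. The function name, the one-entry accession table, and the example record are all assumptions for illustration.

```python
# Chromosome name -> RefSeq accession for GRCh38. Only the single entry used
# in the example is listed; a real tool would carry the full table per build.
REFSEQ_GRCH38 = {"7": "NC_000007.14"}

def vcf_snv_to_hgvs_g(vcf_line, accessions=REFSEQ_GRCH38):
    """Convert one tab-separated VCF data line (SNV only) to HGVS genomic notation.

    Raises ValueError for anything that is not a simple A/C/G/T substitution
    (indels, multi-allelic ALT, symbolic alleles) and KeyError for chromosomes
    missing from the accession table.
    """
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    is_base = lambda allele: len(allele) == 1 and allele in "ACGT"
    if not (is_base(ref) and is_base(alt)):
        raise ValueError("only single-nucleotide substitutions are handled")
    return f"{accessions[chrom]}:g.{int(pos)}{ref}>{alt}"

# Illustrative record (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).
example_line = "chr7\t140753336\t.\tA\tT\t.\tPASS\t."
hgvs = vcf_snv_to_hgvs_g(example_line)
print(hgvs)
```

Indels, structural variants and CNVs need normalization and different HGVS syntax, which is exactly where the real tools earn their keep.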

I have not checked whether these support GA4GH standards including Cat-VRS, align with ESMO guidelines on genomics, are compatible with 1+MG GDI/EHDS, or address compatibility between FHIR and OMOP genomics.

There is possibly a difference in support for Sequence Ontology (SO) between them. SO is embedded into FHIR/HL7 genomics but not in OMOP (as far as I know), which treats genomic measurements independently. OMOP has relationships between concepts (e.g. protein is translated from mRNA, mRNA is transcribed from genomic DNA, which links protein-mRNA-DNA), e.g. BRAF V600E (athena.ohdsi.org concepts 19596981 or 35981340; sorry, couldn’t add links, the forum will only allow 2 URLs in a message). This might complicate PGRS analysis, gene set or signature calculations, but I haven’t tested it.

Happy to connect if you are thinking about this…

Thanks everyone for the thoughtful responses and for sharing your experiences.

One theme I’m hearing consistently is that while OMOP provides a strong framework for clinical phenotyping, many groups are still maintaining genomic data in parallel infrastructures rather than fully representing variants within the core CDM. The practical pattern seems to be a linked architecture: OMOP for phenotypes, outcomes, labs, medications, and healthcare utilization, combined with external genomic stores for variant-level data and annotations.

It’s also interesting that the implementation approaches vary substantially depending on the use case—ranging from cohort discovery and translational research to oncology-specific workflows and biobank-scale analyses. That suggests the community may still be converging on best practices rather than a single dominant model.

For my use case, the key question remains how to operationalize gene-to-phenotype analyses at scale: starting with carriers of LoF variants in a target gene, systematically characterizing their clinical spectrum, and using those insights to inform target validation and endpoint development.

I’d be very interested in continuing the conversation with anyone actively working in this space. If there is enough interest, it may be worthwhile to organize a small working discussion around genomics-enabled translational research on top of OMOP and compare implementation patterns across institutions.

Thanks again for the insights—this has been extremely helpful.

Based on the discussion, I’m thinking about framing a small 2-week POC around one gene, one variant class, and one OMOP-linked phenotype analysis: define LoF carriers in a parallel genomic table, map them to person_id, then use OMOP clinical domains to characterize diagnoses, labs, medications, and potential endpoints. The goal would not be to solve genomic representation fully, but to test a practical analyst-handoff pattern and identify where the Genomic CDM extension helps versus where a linked external variant store is still necessary. Would others find this useful as a concrete community test case?
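To make the proposal concrete, here is a rough sketch of the handoff pattern I have in mind, using an in-memory SQLite database as a stand-in. All table and column names outside OMOP (person_genomic_link, variant_call, gene_symbol, consequence) are hypothetical, the OMOP table is cut down to two columns, the data and concept IDs are invented, and the list of consequence terms treated as LoF is an assumption that each project would need to settle for itself.

```python
import sqlite3

# Assumed LoF definition based on Sequence Ontology consequence terms.
LOF_CONSEQUENCES = (
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
)

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical crosswalk between genomic subject IDs and OMOP person_id.
    CREATE TABLE person_genomic_link (subject_id TEXT PRIMARY KEY, person_id INTEGER NOT NULL);
    -- Hypothetical parallel variant store, outside the CDM.
    CREATE TABLE variant_call (subject_id TEXT NOT NULL, gene_symbol TEXT NOT NULL, consequence TEXT NOT NULL);
    -- Reduced subset of the OMOP CONDITION_OCCURRENCE table.
    CREATE TABLE condition_occurrence (person_id INTEGER NOT NULL, condition_concept_id INTEGER NOT NULL);
    """
)
conn.executemany(
    "INSERT INTO person_genomic_link VALUES (?, ?)",
    [("S1", 1), ("S2", 2), ("S3", 3)],
)
conn.executemany(
    "INSERT INTO variant_call VALUES (?, ?, ?)",
    [
        ("S1", "GENE_X", "stop_gained"),
        ("S2", "GENE_X", "missense_variant"),   # not LoF under the assumed definition
        ("S3", "GENE_X", "frameshift_variant"),
        ("S3", "GENE_X", "stop_gained"),        # second LoF call in the same person
        ("S1", "GENE_Y", "stop_gained"),
    ],
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?)",
    [(1, 111), (1, 111), (1, 222), (2, 111), (3, 111)],  # placeholder concept IDs
)

def conditions_in_lof_carriers(conn, gene_symbol):
    """Count distinct LoF carriers of a gene per condition concept."""
    placeholders = ",".join("?" * len(LOF_CONSEQUENCES))
    sql = f"""
        SELECT co.condition_concept_id, COUNT(DISTINCT co.person_id) AS n_carriers
        FROM condition_occurrence co
        WHERE co.person_id IN (
            SELECT l.person_id
            FROM variant_call v
            JOIN person_genomic_link l ON l.subject_id = v.subject_id
            WHERE v.gene_symbol = ? AND v.consequence IN ({placeholders})
        )
        GROUP BY co.condition_concept_id
        ORDER BY n_carriers DESC, co.condition_concept_id
    """
    return conn.execute(sql, (gene_symbol, *LOF_CONSEQUENCES)).fetchall()

result = conditions_in_lof_carriers(conn, "GENE_X")
print(result)
```

The same carrier subquery could feed labs or drug exposures, or be materialized as a person list for Atlas. The part I would most like feedback on is whether the crosswalk plus external variant table is what people actually run, or whether the carriers should be written back into the CDM.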