Background
Custom concept_ids can be created using identifiers greater than two billion. Because the concept_id field is an int4 (a signed 32-bit integer), custom IDs have an upper limit of 2,147,483,647, or ~2.147 billion (read: ~147M available custom identifiers). Custom concept_ids can be a great temporary solution for using vocabulary terms that will be added to the official vocabulary in a future release, or a full solution for terms that are too niche to ever be made official.
If you are using custom concept_ids across a network, a management issue arises: avoiding collisions.
For example, if you are participating in an OMOP network study or federated research project (e.g. N3C, CureID, Oncology Maturity Sprint, and many more), you may be required to add a list of custom concepts with concept_ids above two billion to your vocabulary.
If only one of the studies/projects in which you participate is disseminating custom concept_ids, the story ends there.
However, if you participate in multiple studies/projects that each provide a list of custom concept_ids (which likely all started their numbering at 2,000,000,001…), what ensures that these lists do not overlap? You may quickly grow tired of managing numerous custom concepts, switching between vocabularies, or adding and deleting concepts for multiple monthly data extractions… but what else is there to do?
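To make the collision problem concrete, here is a minimal sketch of how a site might check whether the lists it has received overlap. The function name `find_collisions`, the project names, and the IDs are all made up for illustration; it assumes each project's list has already been loaded as a plain list of integers.

```python
from itertools import combinations


def find_collisions(lists_by_project):
    """Return {(project_a, project_b): sorted shared concept_ids} for each overlapping pair."""
    collisions = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(lists_by_project.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            collisions[(name_a, name_b)] = sorted(shared)
    return collisions


# Placeholder data: two studies that both started numbering at 2,000,000,001.
example_lists = {
    "study_a": [2_000_000_001, 2_000_000_002, 2_000_000_003],
    "study_b": [2_000_000_002, 2_000_000_003, 2_000_000_004],
    "study_c": [2_000_500_001],
}

for pair, shared_ids in find_collisions(example_lists).items():
    print(pair, shared_ids)
```

A check like this only detects collisions after the fact; it does nothing to prevent them.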
Proposal
@Daniel_Smith, @rtmill, and I are proposing a registry (or address book?) of custom concept_id blocks.
Studies and projects that want to disseminate custom concept_ids can reserve a block (neighborhood?) in the two-billion concept_id range.
Since projects do not typically need more than 100K custom concepts, blocks of 100K IDs could be managed in a very simple, public-facing list:
Example:
Project Name | Block Range (billions) |
---|---|
General Use | 2.0000-2.0001 |
[empty] | 2.0001-2.0002 |
[empty] | 2.0002-2.0003 |
N3C Custom Concepts | 2.0003-2.0004 |
[empty] | 2.0004-2.0005 |
CureID | 2.0005-2.0006 |
CureID | 2.0006-2.0007 |
Oncology Maturity Sprint | 2.0007-2.0008 |
[empty] | 2.0008-2.0009 |
[empty] | 2.0009-2.0010 |
… | … |
[empty] | 2.1472-2.1473 |
[empty] | 2.1473-2.1474 |
In this hypothetical, N3C vocabulary developers would restrict themselves to creating and disseminating custom concepts with IDs 2,000,300,001 to 2,000,400,000. The CureID project, which reserves two slots, could use concept_ids 2,000,500,001 to 2,000,700,000. Some number of blocks at the start of the list could be set aside for “general use”, keeping the earliest two-billion concept_ids open for local vocabulary development.
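The block arithmetic in this hypothetical can be sketched in a few lines. This is only an illustration, assuming 100K-ID blocks numbered from 0 (the row labelled 2.0000-2.0001) and the int4 maximum of 2,147,483,647. The names `block_range`, `block_of`, `owner_of`, and `REGISTRY` are invented here and are not part of any existing OHDSI tool.

```python
INT4_MAX = 2**31 - 1          # 2,147,483,647, the signed 32-bit integer maximum
CUSTOM_BASE = 2_000_000_000   # custom concept_ids start above this value
BLOCK_SIZE = 100_000          # one registry row, as in the example table


def block_range(block_index, n_blocks=1):
    """Return the inclusive (first_id, last_id) covered by a reservation."""
    if block_index < 0 or n_blocks < 1:
        raise ValueError("block_index must be >= 0 and n_blocks >= 1")
    first_id = CUSTOM_BASE + block_index * BLOCK_SIZE + 1
    last_id = CUSTOM_BASE + (block_index + n_blocks) * BLOCK_SIZE
    if last_id > INT4_MAX:
        raise ValueError("reservation exceeds the int4 limit")
    return first_id, last_id


def block_of(concept_id):
    """Return the index of the block that contains a custom concept_id."""
    if not CUSTOM_BASE < concept_id <= INT4_MAX:
        raise ValueError("not a custom concept_id")
    return (concept_id - CUSTOM_BASE - 1) // BLOCK_SIZE


# Hypothetical registry mirroring the example table: block index -> project.
REGISTRY = {
    0: "General Use",
    3: "N3C Custom Concepts",
    5: "CureID",
    6: "CureID",
    7: "Oncology Maturity Sprint",
}


def owner_of(concept_id):
    """Look up which project, if any, has reserved the block holding this ID."""
    return REGISTRY.get(block_of(concept_id), "[empty]")


print(block_range(3))        # N3C's single block
print(block_range(5, 2))     # CureID's two adjacent blocks
print(owner_of(2_000_650_000))
```

Under these assumptions, `(INT4_MAX - CUSTOM_BASE) // BLOCK_SIZE` gives 1,474 full blocks before the int4 ceiling is reached, so a registry of this shape would need at most that many rows.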
This doesn’t come close to solving every issue with custom concept_ids, but it does start to ease their use across large studies and projects, ultimately allowing for faster study deployment and more rapid development of tools.
Thoughts?