Creating a registry of custom concept_ids (2-Billionaire Club) to avoid collisions across networks

kzollove · February 7, 2024, 4:01pm

Background

Custom concept_ids can be created using identifiers greater than two billion. Because the concept_id field is an int4 (a signed 32-bit integer), the upper limit for custom IDs is 2,147,483,647, or ~2.147 billion (read: ~147M available custom identifiers). Custom concept_ids can be a great temporary solution for using vocabulary terms that will be added to the official vocabulary in a future release, or a full solution for terms that are too niche to ever be made official.

If you are using custom concept_ids across a network, a management issue arises: avoiding collisions.

For example, if you are participating in an OMOP network study or federated research project (e.g. N3C, CureID, Oncology Maturity Sprint, and many more), you may be required to add a list of custom concepts with concept_ids in the greater-than two billion range to your vocabulary.

If only one of the studies/projects in which you participate is disseminating custom concept_ids, the story ends there.

However, if you participate in multiple studies/projects that are providing you with a list of custom concept_ids (who likely all started their numbering at 2,000,000,001…) what ensures that there is not overlap across these lists of custom concept_ids? You may quickly grow tired of managing numerous custom concepts, switching between vocabularies or adding and deleting concepts for multiple monthly data extractions… but what else is there to do?
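As a rough illustration of the collision problem described above, here is a minimal Python sketch that checks several project lists for concept_ids claimed more than once. The project names and ID values are hypothetical, and `find_collisions` is a helper invented for this example, not part of any OHDSI tool.

```python
from collections import defaultdict


def find_collisions(project_ids):
    """Return {concept_id: sorted project names} for IDs claimed by 2+ projects."""
    owners = defaultdict(set)
    for project, ids in project_ids.items():
        for concept_id in ids:
            owners[concept_id].add(project)
    return {cid: sorted(names) for cid, names in owners.items() if len(names) > 1}


# Hypothetical lists: both projects started numbering at 2,000,000,001.
lists = {
    "project_a": {2_000_000_001, 2_000_000_002, 2_000_000_003},
    "project_b": {2_000_000_002, 2_000_000_003, 2_000_000_004},
}
collisions = find_collisions(lists)
for concept_id in sorted(collisions):
    print(concept_id, "claimed by", ", ".join(collisions[concept_id]))
```

Running a check like this before loading a new project's list would at least surface the overlap early, even though it does not resolve it.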

Proposal

@Daniel_Smith, @rtmill, and I are proposing a registry (or address book?) of custom concept_id blocks.

Studies and projects that want to disseminate custom concept_ids can reserve a block (neighborhood?) in the two-bil concept_ids.

Since projects do not typically need more than 100K custom concepts, this could be managed in a very simple, public-facing list:

Example:

Project Name	Block Range
General Use	2.0000-2.0001
[empty]	2.0001-2.0002
[empty]	2.0002-2.0003
N3C Custom Concepts	2.0003-2.0004
[empty]	2.0004-2.0005
CureID	2.0005-2.0006
CureID	2.0006-2.0007
Oncology Maturity Sprint	2.0007-2.0008
[empty]	2.0008-2.0009
[empty]	2.0009-2.0010
…	…
[empty]	2.1472-2.1473
[empty]	2.1473-2.1474

In this hypothetical, N3C vocabulary developers would restrict themselves to creating and disseminating custom concepts with IDs 2,000,300,001 to 2,000,400,000. The CureID project, which reserves two slots, could use concept_ids 2,000,500,001 to 2,000,700,000. Some number of IDs at the start of the list could be blocked off for “general use”, keeping the early 2bil concept_ids open for local vocabulary development.
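The mapping between a table row and its concept_id range can be sketched in a few lines of Python. This assumes blocks of 100K IDs counted from 2,000,000,000, as in the example table; the `REGISTRY` dictionary simply mirrors the hypothetical rows above, and the function names are made up for illustration.

```python
BASE = 2_000_000_000   # custom concept_ids start just above this value
BLOCK_SIZE = 100_000   # one registry row, e.g. "2.0003-2.0004"

# Block index -> project, mirroring the hypothetical table above.
REGISTRY = {
    0: "General Use",
    3: "N3C Custom Concepts",
    5: "CureID",
    6: "CureID",
    7: "Oncology Maturity Sprint",
}


def block_range(index, count=1):
    """Return (first_id, last_id) for `count` consecutive blocks starting at `index`."""
    first = BASE + index * BLOCK_SIZE + 1
    last = BASE + (index + count) * BLOCK_SIZE
    return first, last


def owner_of(concept_id):
    """Return the project that reserved the block containing concept_id, or None."""
    if concept_id <= BASE:
        return None
    index = (concept_id - BASE - 1) // BLOCK_SIZE
    return REGISTRY.get(index)


print(block_range(3))           # N3C block
print(block_range(5, count=2))  # CureID's two blocks
print(owner_of(2_000_650_000))
```

With this layout, checking whether an incoming concept_id belongs to a reserved neighborhood is a single integer division, so the registry itself can stay a flat public list.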

This doesn’t solve nearly all the issues with custom concept_ids, but it does start to ease their use across large studies and projects, ultimately allowing for faster study deployment and more rapid development of tools.

Thoughts?

@aostropolets @MPhilofsky @Christian_Reich @Andrew

Daniel_Smith · February 7, 2024, 4:25pm

Thanks for getting the conversation started on the forums Kyle! A couple thoughts to add

Maintaining sequential blocks via including buffers

I think we’d want to allow a small buffer between 100K “neighborhoods” in case a project goes over a little and didn’t foresee the increased need. Given the amount of space available for 100K neighborhoods in the ~147M codes that fit under the int4 limit, I’d think that a small buffer of 200K between projects should allow a project to maintain sequential numbering.
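To get a feel for how many buffered neighborhoods would fit, here is a small Python sketch. It assumes the usable space runs from 2,000,000,001 up to the signed 32-bit maximum (2,147,483,647), that each 100K neighborhood is followed by a 200K buffer, and that the final neighborhood does not need a trailing buffer. The function names are illustrative only.

```python
FIRST_ID = 2_000_000_001
INT4_MAX = 2**31 - 1  # 2,147,483,647, assumed upper bound for concept_id


def neighborhood_range(k, block_size=100_000, buffer_size=200_000):
    """Return (first_id, last_id) of the k-th neighborhood (0-based), skipping buffers."""
    stride = block_size + buffer_size
    first = FIRST_ID + k * stride
    return first, first + block_size - 1


def neighborhoods_available(block_size=100_000, buffer_size=200_000):
    """Count neighborhoods that fit below INT4_MAX; the last one needs no trailing buffer."""
    stride = block_size + buffer_size
    total_ids = INT4_MAX - FIRST_ID + 1
    full, remainder = divmod(total_ids, stride)
    return full + (1 if remainder >= block_size else 0)


print(neighborhoods_available())
print(neighborhood_range(1))
```

Under these assumptions a buffered layout still leaves room for several hundred projects, at the cost of roughly two thirds of the ID space being held in reserve.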

Including “date added/modified” type columns as well as contact info of initial maintainer teams for the project’s likely concept_id assignments.

This would allow us to remove reservations for older neighborhoods or from inactive developers. If someone maintaining a custom concept neighborhood for a project no longer needs that neighborhood because the vocab has been integrated into the vocabularies as a community contribution, then the range could be released, but tracked over time. For example, maybe N3C will soon be able to integrate key concepts related to SARS-CoV-2 into an official community contribution, but they’ll have their range until then.
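One way to picture such a registry row is the Python sketch below. The field names, the example contact address, and the two-year staleness threshold are all assumptions made for illustration; nothing here reflects an agreed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Reservation:
    project: str
    first_id: int
    last_id: int
    contact: str                        # initial maintainer team (hypothetical field)
    date_added: date
    date_modified: date
    released_on: Optional[date] = None  # set when the range is given back

    @property
    def active(self):
        return self.released_on is None


def release(reservation, when):
    """Mark a range as released but keep the row so its history stays visible."""
    reservation.released_on = when
    reservation.date_modified = when


def stale(reservations, today, max_age_days=730):
    """Active reservations not modified within max_age_days (assumed threshold)."""
    return [r for r in reservations
            if r.active and (today - r.date_modified).days > max_age_days]


n3c = Reservation("N3C Custom Concepts", 2_000_300_001, 2_000_400_000,
                  "maintainers@example.org", date(2024, 2, 7), date(2024, 2, 7))
print([r.project for r in stale([n3c], today=date(2026, 6, 1))])
release(n3c, date(2026, 6, 1))
print(n3c.active)
```

Keeping released rows rather than deleting them matches the idea of a range being "released, but tracked over time".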

Mark · February 7, 2024, 4:26pm

Then they are not truly custom concepts. It seems you are trying to do something similar to the blocks of ports defined in TCP/IP. If that is what you are trying to do, this is a big ask, and even with the registered/assigned ports, there is a range of truly custom ports.

Should you decide to go down this route, I would advise looking at how IANA registers ports. Whatever is done now will most likely break current use of custom concepts and create pain. If this was to be done, it should have been done at the inception, not this late in the game.

EDIT:
The OMOP structure is in danger of becoming a swiss army hammer… and we see how much pain this created with XML and forced the creation of new protocols.

Daniel_Smith · February 7, 2024, 4:54pm

Thanks for your perspective and references @Mark !

I don’t think the proposal is to have a registered group for long. Just while a site works on several projects at the same time with unofficial vocabularies across each project. With the example given by Kyle, across different Emory groups (my institution), we’re using N3C, Oncology Maturity sprint, and local concepts unlikely to be useful elsewhere. We personally have to be aware of what standards the different projects use, and could run into problems if we have the same concept_id across projects. I can certainly initiate a pipeline to ingest these different concepts into my CDM across the different projects, but what happens when this data should get pushed to a network study? We’ll have to possibly readjust for the specific project. Definitely doable, but might be painful to maintain within an ETL pipeline, and for a CDM instance hopefully contributing to network studies.

Ideally, if a vocabulary of custom concepts is useful at the network level for an extended period (e.g., more than 1 or 2 years), it would have a plan for a community contribution into the OMOP vocabularies. If it has been around that long without one, maybe the vocabulary in the custom concept range does not have enough use to justify a continued registration and is not clearly planned for community contribution.

MPhilofsky · February 8, 2024, 3:28pm

I like the idea, @kzollove! Also, the title is clever

From the Healthcare Systems Interest Group (HSIG) point of view:

This would be very beneficial for all of us (lol, unintended pun) who create custom concept_ids for our uncoded data. Have you heard about the work HSIG is doing this year? We are creating a repository for the OHDSI community to contribute their source text string to standard concept_id custom maps for reuse by the community. If your organization allows, we would like you to contribute these mappings to our repository. Sharing is caring! We are in the beginning stages of this work, so we don’t have a process quite yet. Also, we’d be happy to have your input on the metadata needed for the custom concept contributions. Should we collaborate?

Also, we need a landscape assessment to see what concept_ids > 2 billion exist out in the wild before “reserving” groups of concept_ids for specific projects.

From the Themis WG pov:

It looks like you want to put in place some guidelines, rules or conventions. The Themis WG provides direction when there is ambiguity on how or where data should be inserted into the OMOP CDM. We offer guidance and rules when there is ambiguity on what data belong where, how to derive data when the source lacks a required data element, which concept to choose when there is >1 standard concept with the same meaning, and general guidelines on all source to CDM questions. Once you complete this initial fact gathering, please reach out to Themis to start the ratification process.

From the University of Colorado’s pov:

Yes, this is a great idea! We participate in N3C, and our likelihood of participating in other research utilizing the OMOP CDM is very high. We have already created thousands of custom concept_ids. I will review our concept_ids to see how our integers compare to the proposed chart. You’ll probably want to survey the larger OHDSI community to ensure you aren’t creating any collisions.

From the Vocabulary WG & the rest of OHDSI pov:

Have you thought about creating a vocabulary from these source values? The Vocabulary team has created a community contribution process for similar use cases.

Christian_Reich · February 8, 2024, 3:40pm

Friends:

This is a good initiative. But we should make distinctions:

I can see 5 categories here:

1. Source concepts which are truly local - no need to reserve space.
2. Source concepts with mapping to standard which are shared amongst sites - should be in the Vocabularies.
3. Source concepts not mapped to standard but shared amongst sites - could be a candidate for the range blocking. But why are they not standard?
4. Standard concepts for use cases the community has - should be in the Vocabularies.
5. Standard concepts for use cases outside OHDSI - why would we have those?

In other words: If concepts are not local and if they are needed - they should go in properly. Otherwise we are creating Standard Vocabularies outside the Standard Vocabularies. However, if they are not ready and you want to do experiments first - makes sense to create reserved spaces. I think that is what @Daniel_Smith says. A 500k concept name space should carry us for a while.

MPhilofsky · February 8, 2024, 3:54pm

Wait! I think I misunderstood this effort.

@kzollove @Daniel_Smith

Are you going to make these > 2 billion concept_ids standard? I answered from the idea you were creating > 2 billion concept_ids to map to standard concept_ids. If you want to make these 2 billionaires standard, then what @Christian_Reich said is applicable.

Mark · February 8, 2024, 4:12pm

A much more elegant way of saying what I was attempting to.

Daniel_Smith · February 8, 2024, 5:04pm

@MPhilofsky, thanks for the varied perspectives shared and for providing feedback in context. That’s very helpful!

@MPhilofsky, I won’t speak for @kzollove and @rtmill, but I agree with @Christian_Reich’s assessment of my intention, that a name space would be useful during experimentation en route to a community contribution as an end state, ideally.

Again, won’t speak for Kyle and Robert, but I believe the intention is to create experimental standards under a new vocabulary in the 2 billion range.

At Emory, we still use the source-to-concept-map table for ETL pipelines, given that the Corewell scripts for Epic currently support this table without modification. All of the STCM mappings we have locally have standard vocabularies in V5 as standard concepts. But, in the case of N3C, and the new ICDO3 concepts we’re playing with as part of the Oncology Maturity Sprint, these would be members of the 2-Billionaire Club.

That would be great, and we can do the same assessment at Emory. I think 500k at the start of the 2B (and for each project thereafter) would be more than enough buffer for local mappings that use the concept and concept_relationship tables, as you’re currently doing at University of Colorado, rather than the source-to-concept-map table, as we’re doing.

Daniel_Smith · February 8, 2024, 5:09pm

@clairblacketer , pinging you here for potential assessment and feedback, given your observations at the recent Technical Advisory Board meeting, discussing the challenges with reserving in int8 ranges, while most tools (e.g., ATLAS) only support concepts in the int4 range.

Eduard_Korchmar · February 8, 2024, 10:53pm

To approach this problem in a technical sense, I see two possible absolutes (with the end decision lying anywhere between them, inclusive):

1. Centralization absolute:
If a custom concept is needed at more than one site, it needs to be in OHDSI Athena and maintained centrally, even if license-gated. Otherwise, the whole idea of a single central Standard Vocabulary hierarchy becomes pointless. Furthermore, if a concept collection is shared between CDM instances, that means it is, by necessity, mapped, maintained and deduplicated at least at some level, making it a candidate for Vocabulary inclusion.
2. Decentralization absolute:
Why not go a step further and implement a protocol for sourcing concept collections outside the centralized Vocabulary distribution? I would argue that, in a technical sense, the challenges we face here are exactly the same as the ones with concept range reservation. Then, OMOP CDM users would choose to download standardized vocabularies from different Athena deployments, specified in the network study protocol, and merge them.

On a completely different (also a little crazy) note: all of the officially provided DDL scripts on the OMOP CDM GitHub, for every popular DBMS, specify the range for concept_ids as a signed 64-bit integer. Signed in this case means positive-negative signed. Negative concepts from -2 (avoiding the usual sentinel -1) to -2 billion are technically possible and indistinguishable, for the DBMS, from positive ones. If we want to give away concept ranges by the million, we will have to look into this, because 21 possible ranges above positive 2 billion (without considering buffers) are not enough.
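For reference, the bounds of the two signed integer widths discussed in this thread can be worked out directly. The Python sketch below is only arithmetic on two's-complement ranges; it makes no claim about which width any particular DDL or tool actually uses, and `fits` is a helper named for this example.

```python
# Two's-complement bounds for the integer widths discussed in this thread.
INT4_MIN, INT4_MAX = -(2**31), 2**31 - 1   # signed 32-bit
INT8_MIN, INT8_MAX = -(2**63), 2**63 - 1   # signed 64-bit


def fits(concept_id, lo=INT4_MIN, hi=INT4_MAX):
    """True if concept_id is representable in the given signed range."""
    return lo <= concept_id <= hi


# Negative IDs from -2 (skipping the -1 sentinel) down to the int4 minimum.
negative_int4_ids = -2 - INT4_MIN + 1

print(f"int4 range: {INT4_MIN:,} .. {INT4_MAX:,}")
print(f"negative int4 IDs available: {negative_int4_ids:,}")
print(fits(-2), fits(2_000_300_001), fits(4_000_000_001))
print(fits(4_000_000_001, INT8_MIN, INT8_MAX))
```

A check like `fits` could be run over a custom concept list before sharing it, to flag IDs that a 32-bit column or tool would reject.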

Daniel_Smith · February 9, 2024, 6:08pm

Thanks @Eduard_Korchmar! Continuing with these absolutes:

My initial impression thinking about #1 is that we would need to work with the vocabulary working group to figure out a method of speedy community contribution for experimental vocabulary builds? At this stage I don’t think N3C can be considered experimental, but I think I can speak for both Robert and Kyle on this one that they would not want the oncology maturity sprint to be in Athena as of yet until we have more data that it solves some of the issues for observational research in oncology using the OMOP CDM.

Decentralization absolute: I don’t think we want this scenario, as it is a long-term breaking case for vocabularies and makes it untenable to always have the two. Christian previously brought up that point in one of the conversations regarding the creation of an experimental paradigm for a vocabulary, as did Mark and Christian above. I think that the argument for maintaining a single centralized source is important. As with my thoughts for #1, having some process for the sharing of experimental vocabularies, or immediate use vocabularies required to solve a problem quickly (i.e., N3C custom concepts), will be needed if we want to shy completely away from decentralization, and reside instead in absolute centralization. As you say, we do not need to live in that absolute, and will likely end up existing in the grey, but just to continue with the “absolutes” example.

On my side, we are using the Redshift DDL, which specifies signed 32-bit integers (int4). This is actually what started me down this path! We were locally doing some custom concept reservations to steer completely away from N3C and other 2-billionaire club members by converting the DDLs to use int8 (signed 64-bit) wherever there were int4s, and giving local custom concepts that we didn’t want to contribute upward to OHDSI a 4 billion+ designation. This doesn’t break some tools, like HADES, which can currently handle codes up to 2^64, but it will break other tools, such as ATLAS (which is how Emory discovered this problem!). Wondering if there is a roadmap to incorporate signed 64-bit into the Redshift DDLs and other tools? May be a question for the CDM WG!

Edit: clarification on our build and where errors occurred for [x]-bit integers

Eduard_Korchmar · February 9, 2024, 10:05pm

Vocabulary WG was discussing exactly that on a call this very Tuesday! The problem with current community contribution model is there is no community contribution culture, which would drive a purposeful development of a community contribution model. This is a kind of a chicken and egg problem: nobody can contribute to vocabulary because there is no established process nor toolset, and no process can be established until contributions become a pattern. But we are working on a solution.

aostropolets · February 14, 2024, 12:02am

Well, I wouldn’t say we don’t have an established process. You can refer to the documentation and formalized process on GitHub, the poster, and the slides that correspond to community talks.

That being said, we want more simple contributions like fixes in domains or mappings, and baaaadly need additions of mappings as contributions. What you are talking about may or may not be a simple contribution (community contribution part I), which makes a big difference. If it is not, there is no “speedy” way, only “slow” (which is much better than “no way”). Why? Because the person making the contribution needs to learn about vocabularies, the dev process and QA. If you know that, or have a person who knows that, then it will be faster.

Daniel_Smith · February 27, 2024, 9:12pm

Cross posting this issue, pertinent to the adoption of a C/CR process in light of coordination with custom concepts across several institutions:

Methodology for Converting Source-To-Concept-Map to Concept/Concept_Relationship (2 billionaires) - CDM Builders - OHDSI Forums

Daniel_Smith · March 25, 2024, 2:36pm

@roger.carlson pinging you on this topic, as sometimes “2billionaires” have run into collisions if you’re working across projects

roger.carlson · May 13, 2024, 5:04pm

@Daniel_Smith Thanks for bringing this to my attention. Only now getting a chance to look into it.

roger.carlson · May 13, 2024, 5:16pm

This issue is a little beyond the scope of my post. I was only considering the mapping of local codes to standard concepts. If I’m following this thread correctly, the concept_ids in the 2Billionaires would need to be preserved. Is that correct?

My assumption is that if you only intend to maintain local mapping, the concept_ids would NOT have to be conserved to be able to use the local codes in Atlas. In other words, every time the mappings are refreshed, the 2Billionaires can be wiped and re-written with different concept_ids.

If these are not and never can be standard codes, it doesn’t matter what the concept_id is.
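The "wipe and re-write" idea above can be sketched in Python as follows. This is only an illustration under the stated assumption that the concepts stay local: the source codes, the target IDs (111, 222) and the `assign_local_ids` helper are all hypothetical.

```python
from itertools import count


def assign_local_ids(source_codes, start=2_000_000_001):
    """Give each distinct local source code a fresh concept_id on every refresh."""
    ids = count(start)
    return {code: next(ids) for code in sorted(set(source_codes))}


# Hypothetical local map: source code -> standard concept_id (placeholder values).
local_map = {"LAB_B": 111, "LAB_C": 222}

first_refresh = assign_local_ids(["LAB_B", "LAB_C"])
second_refresh = assign_local_ids(["LAB_A", "LAB_B", "LAB_C"])

# LAB_B gets a different 2-billion ID after the refresh...
print(first_refresh["LAB_B"], second_refresh["LAB_B"])
# ...but its mapping to the standard concept is keyed on the source code, so it is unchanged.
print(local_map["LAB_B"])
```

The sketch also shows why this stops working once IDs are shared across sites: any list distributed after the first refresh would no longer match the second.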