Jackalope: Making new standard concepts ad hoc – and keeping compatibility

Eduard_Korchmar · October 31, 2022, 11:13am

Jackalope

Hello! Our team at Sciforce has developed a new tool that lets users create their own Standard concepts within the SNOMED hierarchy using SNOMED CT Compositional Grammar, without breaking compatibility for network studies. The tool is meant to be applied at the ETL stage to increase mapping coverage for important concepts that do not have a precise counterpart, and to guarantee internal compatibility of custom Standard concepts through automated evaluation against the SNOMED hierarchy. You cannot use these concepts to define concept sets, but they will be placed “correctly” in the CONCEPT_ANCESTOR table, meaning you can define concept sets as usual and have these new concepts included for you.

Jackalope was previously presented at the OHDSI APAC and Health Data Interests WG community calls, as well as in a poster presentation at the OHDSI '22 Symposium.

The tool is, of course, open-source. We are looking to make Jackalope, if not a standard, then a usual and accepted solution for OMOP CDM ETL, and we are looking for community input.

How to participate:

The most important step in the development of Jackalope right now is to live-test it with actual data. This would achieve the following:

You get better mapping coverage of your source data in OMOP CDM.
We get to improve Jackalope, bringing it closer to real-world use.
The entire OHDSI community benefits from the introduction of a brand-new approach to data standardization.

If you are not sure whether Jackalope's ETL approach is applicable to your data, you can still help us a lot by providing any number of examples of concepts that you could not map to OMOP CDM precisely, where you had to accept a loss of precision or make other concessions.

Of course, contributions in the form of convention suggestions or commits to the Jackalope source code are also welcome.

If you want a better idea of how everything is implemented, unfold the collapsed text below. The community call and symposium links also contain a lot of information.

FAQ is displayed on click.

Q: Why not just map everything to existing Standard concepts?
A: Mapping to Standard concepts is a reliable and correct way to represent source data in OMOP CDM in the vast majority of cases. That is why these concepts are Standard. However, in some cases (e.g. new technologies, domain-specific terminology or local specifics), mapping to existing Standard concepts may lead to a loss of precision.

Q: Which concepts can be standardized through Jackalope?
A: There are two answers to this question: a mathematically correct one and an honest one. In the mathematically correct and practically useless sense, every imaginable thing can be post-coordinated through an arbitrary number of SNOMED sub-expressions, all of which can be reliably evaluated by Jackalope to obtain a consistent result. Realistically, you would use Jackalope to obtain relatively simple refined versions of already existing concepts from SNOMED subhierarchies (e.g. combining diabetes mellitus with a specific complication, or creating a missing medical imaging procedure by combining a specific modality, topography and other details). The full potential of Jackalope lies somewhere in between.

Q: How does it work, exactly?
A: SNOMED expressions are evaluated against the SNOMED RF2 source to find, if not an exact match, then a set of close semantic parents. Unless an exact match is found, a new Standard concept is created (with a VOCABULARY_ID of Jackalope and a DOMAIN_ID inherited from its parents) and placed in the existing SNOMED Vocabulary hierarchy. The expression itself is preserved as an entry in the CONCEPT_SYNONYM table for both the source concept and the Jackalope concept, for future re-evaluations.

Q: Can we use post-coordination, multiple "Maps to" or manually curated local standard concepts instead?
A: Yes, but Jackalope has multiple benefits compared to these approaches, mainly to do with compatibility:

Any relationship built from SNOMED expressions can be re-evaluated with every SNOMED release. Jackalope will even find an exact Standard match if one gets added.
An expression, written once, will not break "silently" the way a partially deprecated multiple mapping would.
Synonymous expressions will get coordinated to the same entity and even get assigned the same CONCEPT_CODE, trivializing deduplication between different OMOP CDM instances.
SNOMED's compositional grammar has a maintained standard specification, whereas FACT_RELATIONSHIP entries or manual standard concepts are unmaintainable "crutches".

Q: How to write expressions for Jackalope to evaluate?
A: Full disclosure: it is neither common knowledge nor an easily automatable task. There is a guide on this, released and maintained by the SNOMED authoring organization (link hub); there are commercial authoring tools, and even published research into using NLP to automate the process. We are looking into developing a similar toolset as part of Jackalope, but it is lower on the priority list. For our testing, we wrote the expressions fully manually. It may take up to 15 minutes for a person familiar with SNOMED CT internals to write an expression for a concept.

Q: What are specifications for Jackalope implementation?
A: We tried our best to separate interface from implementation. We have separate documentation describing Jackalope as an ETL process (the expected shape of the input, the expected output and the process rules), and we would love to host it on the official OHDSI resources.

Q: How does deduplication work?
A: SNOMED's internal logic makes it possible to establish semantic isomorphism of differently phrased expressions through the concept of a canonical normal form. The textual representation of this form is run through the BLAKE2b hashing algorithm to obtain a 25-byte value, written as 50 hexadecimal digits, which is then placed in the CONCEPT_CODE field. Any two concepts that are generated with the same CONCEPT_CODE are synonyms with a probability of 1-16^(-50).
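
As a rough illustration of the hashing step, here is a minimal Python sketch. It assumes the normal form is already available as a string and is UTF-8 encoded before hashing; the function name and the sample string are hypothetical stand-ins, not Jackalope's actual code or a real normal form. Python's `hashlib.blake2b` accepts a `digest_size` in bytes, so 25 bytes gives 50 hexadecimal characters.

```python
import hashlib


def concept_code_from_normal_form(normal_form: str) -> str:
    """Hash a normal-form string into a 50-character hexadecimal code.

    Assumption: the text is UTF-8 encoded and hashed as-is; the real tool
    may normalise or encode the text differently.
    """
    digest = hashlib.blake2b(normal_form.encode("utf-8"), digest_size=25)
    return digest.hexdigest()


# Hypothetical stand-in for a normal form, for illustration only.
sample = "25064002+418471000:42752001=414285001"
code = concept_code_from_normal_form(sample)
print(len(code), code)
```

Two instances that arrive at the same normal-form text would compute the same code without exchanging any data, which is what makes the code usable as a deduplication key.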

Tagging people who have shown interest or participated at various stages of development:

@Polina_Talapova @MPhilofsky @mvanzandt @Christian_Reich @cgchute @Agota_Meszaros @mikecjohn @willhalfpenny @Alexdavv @mari.kolesnyk

Andy_Kanter · November 4, 2022, 3:20pm

Eduard, this is very interesting… I am not quite sure that I fully understand what it does or how having a new concept works within OHDSI. I think I get how you are able to take a SNOMED expression and translate it into a single hashed concept that retains the maps to the individual standard concept codes. However, are these newly created codes usable in analytics or only in the original ETL?

For example, CIEL and IMO have source concepts which are mapped to multiple SNOMED and ICD codes. An enterprise that has them can already populate OMOP using multiple maps from a single source term. However, once in OMOP those concepts are independent.

For example, “Food sensitivity headache” maps to the SNOMED concept for “Headache” and to “Propensity for adverse reactions to food”. What would appear in the database? I presume there would be the same source term/code for Food sensitivity headache with two different SNOMED codes as OHDSI standard concepts. Where does the new hashed code go, and how could I use it to find headaches caused by food (as compared to any other symptom recorded along with the “propensity for adverse reactions to food” SNOMED concept)?

So I am already starting with a shared pre-coordinated term and want to instantiate the term in OMOP where I can retain the standard codes and their mapping to the single entity… Is this what Jackalope does?

Eduard_Korchmar · November 4, 2022, 4:22pm

The way Jackalope works is by creating a new local Standard concept as a mapping target for the original source concept. This concept then gets placed in the hierarchy of SNOMED vocabulary by being evaluated against semantic attribute definitions of similar concepts. It gets corresponding entries in CONCEPT, CONCEPT_RELATIONSHIP and CONCEPT_ANCESTOR tables.
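
To make the CONCEPT_ANCESTOR part concrete, here is a minimal Python sketch of how rows for a new leaf concept could be derived once its parents are known. This is an illustration, not Jackalope's actual code: the function name is hypothetical, the rows are plain tuples of (ancestor_concept_id, descendant_concept_id, min_levels_of_separation, max_levels_of_separation), and it assumes the existing table already holds a self-referencing row at distance 0 for every concept.

```python
def ancestor_rows_for_new_concept(new_id, parent_ids, concept_ancestor):
    """Derive CONCEPT_ANCESTOR rows for a new leaf concept from its parents.

    Every ancestor of a parent (including the parent itself, via its
    self-row) becomes an ancestor of the new concept, one level further away.
    """
    parents = set(parent_ids)
    levels = {new_id: (0, 0)}  # the new concept is its own ancestor
    for ancestor, descendant, min_sep, max_sep in concept_ancestor:
        if descendant not in parents:
            continue
        candidate = (min_sep + 1, max_sep + 1)
        known = levels.get(ancestor)
        if known is None:
            levels[ancestor] = candidate
        else:
            # Several paths may reach the same ancestor: keep shortest and longest.
            levels[ancestor] = (min(known[0], candidate[0]),
                                max(known[1], candidate[1]))
    return sorted((anc, new_id, lo, hi) for anc, (lo, hi) in levels.items())


# Toy hierarchy with made-up ids: 1 is the root, 2 and 3 are its children.
existing = [(1, 1, 0, 0), (2, 2, 0, 0), (3, 3, 0, 0), (1, 2, 1, 1), (1, 3, 1, 1)]
for row in ancestor_rows_for_new_concept(1000000001, [2, 3], existing):
    print(row)
```

Because the new concept is a leaf, no existing rows need to change; only rows with the new concept as the descendant are added.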

As this will be a local concept, there is as yet no easy and convenient way to make it work directly in network studies. It can still get included in concept sets, since ATLAS concept sets are defined by SNOMED parents and abstracted from actual descendants. Of course, eventually ATLAS may add the possibility to share custom concepts in an automated way, but there is no precedent. Somebody has to “buy the first telephone”.

So in your example, I would coordinate the concept as follows:

===25064002 | Headache | + 418471000 | Propensity for adverse reactions to food |: 42752001 | Due to| = 414285001 | Allergy to food |

As neither the Headache nor the Propensity… concept has a rich hierarchy, the concept built from the expression will most likely just get both of these concepts assigned as parents. If we use “Allergy to food” instead, it may become a descendant of 4448006 Allergic headache. But allergy vs. propensity to allergy is a whole other discussion.

To look for it, you would write a query (or a concept set) that finds all entries in CONDITION_OCCURRENCE whose concepts are hierarchical descendants of both the 4303802 Propensity… and the 378253 Headache concepts. I think having branching “Maps to” is actually more complicated.
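
A query of that kind could look roughly like the sketch below. It is a self-contained illustration using Python's `sqlite3` with toy in-memory tables that carry only the columns needed here. The function names, the local concept_id 1000000001 and the sample rows are all hypothetical; only the two ancestor concept_ids come from the example above.

```python
import sqlite3


def occurrences_under_all(conn, ancestor_ids):
    """Return condition_occurrence_ids whose concept descends from every given ancestor."""
    placeholders = ", ".join("?" for _ in ancestor_ids)
    sql = f"""
        SELECT co.condition_occurrence_id
        FROM condition_occurrence AS co
        JOIN concept_ancestor AS ca
          ON ca.descendant_concept_id = co.condition_concept_id
        WHERE ca.ancestor_concept_id IN ({placeholders})
        GROUP BY co.condition_occurrence_id
        HAVING COUNT(DISTINCT ca.ancestor_concept_id) = ?
        ORDER BY co.condition_occurrence_id
    """
    params = [*ancestor_ids, len(set(ancestor_ids))]
    return [row[0] for row in conn.execute(sql, params)]


def demo_connection():
    """Build a toy database: one local concept placed under both parents."""
    headache, propensity, local = 378253, 4303802, 1000000001
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concept_ancestor (
            ancestor_concept_id INTEGER, descendant_concept_id INTEGER);
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER, person_id INTEGER,
            condition_concept_id INTEGER);
    """)
    conn.executemany("INSERT INTO concept_ancestor VALUES (?, ?)", [
        (headache, headache), (propensity, propensity), (local, local),
        (headache, local), (propensity, local)])
    conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?)", [
        (1, 10, local), (2, 11, headache), (3, 12, propensity)])
    return conn


print(occurrences_under_all(demo_connection(), [378253, 4303802]))  # [1]
```

The GROUP BY / HAVING form extends to any number of required ancestors, which is the "descendant of all of these" condition rather than the usual "descendant of any".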

Andy_Kanter · November 29, 2022, 2:54pm

I am starting to look more closely at this. The challenge I heard on the vocabulary call recently was that a cohort definition which requires an AND between the SNOMED codes for a particular condition is not easy to create in Atlas. When looking for conditions which have both terms, you get a large number of patients who have a descendant of one or the other without necessarily having a single condition that has both. One can argue that it is more robust to create cohort definitions that are built up of multiple hierarchies, but when I am looking only for diseases which affect one structure, or specifically a disease that has a single type of complication, it would seem to be much easier to just connect the SNOMED codes in the instance data. Why not create the cohort definition by using the local standard code? I presume the reason is that no one else has that code. What if they could? This is especially true if the source data was already pre-coordinated. Are the Jackalope concepts identified as local by some means? For example, a prefix code or source ID that marks the standard code as locally maintained?

Eduard_Korchmar · November 29, 2022, 7:07pm

To answer in order:

Jackalope will probably not be useful just to fix Atlas queries. It is applied at the ETL stage, to create a new standard concept. If you have to “split” an entry in, say, the CONDITION_OCCURRENCE table to accommodate two or more different Standard concept_ids for the same event, and then try to glue them back together in ATLAS, well, that can be fixed with Jackalope. But the changes have to be introduced at the ETL stage.
If you and your colleague at a different institution use Jackalope to encode the same concept, Jackalope will generate an entry with the same CONCEPT_CODE, as long as the expressions you wrote are synonymous in SNOMED semantics. It is a very basic safeguard against logical duplication that avoids data sharing completely, and as Jackalope gets more widespread use, new functionality will have to be added. But as of right now, if you want to keep network study compatibility, the use of Jackalope concepts should probably be avoided. I think ATLAS does not currently have a way to include concepts that are descendants of all of an arbitrary number of concepts, but it can be done manually. Or perhaps the definition of such concept sets should actually be a part of what Jackalope offers, and could then be imported into ATLAS. I’ll look into it.
Jackalope concepts are currently created with a custom vocabulary_id = “Jackalope”, a concept_code that contains the hash of their semantic form, and a simple sequential concept_id in the space between 1 billion and 2 billion (with vocabulary concepts being below, and fully manual concepts starting above). It is not yet an ‘official’ OHDSI convention and may be subject to change, but I think it is a good solution as is. It may be a better idea to also assign concept_id as a hash, so that the same concepts created independently get the same concept_id through the magic of the hash function, but I do not have the math to prove that it is statistically impossible for two semantically different expressions to collide in a range just 1 billion long.
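
On that last point, a quick birthday-bound estimate may help. The Python sketch below uses the standard approximation P ≈ 1 - exp(-n(n-1)/(2N)) for n values drawn from a space of size N, under the assumption that hashed ids would be spread uniformly over the range; the function name is hypothetical.

```python
import math


def collision_probability(n_concepts, space_size):
    """Approximate chance that at least two of n uniformly hashed values collide."""
    exponent = n_concepts * (n_concepts - 1) / (2 * space_size)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


# concept_id hashed into a range 1 billion long
print(collision_probability(10_000, 10**9))
# CONCEPT_CODE as 50 hexadecimal digits (16**50 possible values)
print(collision_probability(1_000_000, 16**50))
```

Under that assumption, 10,000 independently hashed concepts in a 1-billion range would already collide with a probability of a few percent, and even odds are reached at roughly 37,000 concepts, whereas the 50-hex-digit CONCEPT_CODE space keeps the probability negligible even for millions of concepts.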