In 2015, a rule was introduced that every concept should have at least one synonym. For concepts that have no real synonyms, we simply use the concept_name as a placeholder.
Currently, only 1,463 concepts have no synonyms at all, and they’re mostly in the Metadata/Visit/Payer/Cost domains.
However, the approach has not been applied consistently, and the current picture is:
- 7,693,299 concepts have only one synonym that is an imputed concept_name;
- 1,073,940 concepts have >1 synonym, where one of them is imputed or matches the concept_name;
- 239,314 concepts have one or more synonyms, but none of them is an imputed concept_name.
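For anyone who wants to reproduce this kind of breakdown on their own vocabulary copy, here is a rough sketch using Python's sqlite3 with a tiny in-memory stand-in for the concept and concept_synonym tables. The sample rows are made up, and treating "imputed" as an exact string match with concept_name is an assumption (the real check may need to ignore case or whitespace).

```python
import sqlite3

# Tiny in-memory stand-in for the vocabulary tables; rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
CREATE TABLE concept_synonym (
    concept_id INTEGER, concept_synonym_name TEXT, language_concept_id INTEGER);
INSERT INTO concept VALUES
    (1, 'Aspirin'), (2, 'Myocardial infarction'),
    (3, 'Fracture of femur'), (4, 'Inpatient Visit');
INSERT INTO concept_synonym VALUES
    (1, 'Aspirin', 4180186),
    (2, 'Myocardial infarction', 4180186),
    (2, 'Heart attack', 4180186),
    (3, 'Femoral fracture', 4180186);
""")

# Assumption: a synonym counts as imputed when it equals concept_name exactly.
query = """
SELECT category, COUNT(*) FROM (
    SELECT c.concept_id,
           CASE
               WHEN COUNT(s.concept_id) = 0 THEN 'no synonyms'
               WHEN SUM(s.concept_synonym_name = c.concept_name) = 0
                   THEN 'real synonyms only'
               WHEN COUNT(s.concept_id) = 1 THEN 'only imputed'
               ELSE 'imputed plus real'
           END AS category
    FROM concept c
    LEFT JOIN concept_synonym s ON s.concept_id = c.concept_id
    GROUP BY c.concept_id
)
GROUP BY category
"""
counts = dict(conn.execute(query).fetchall())
print(counts)
```

The same CASE logic should port to other SQL dialects with minor changes, since the boolean-to-integer SUM is an SQLite convenience.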
Because of that, the results of the search algorithms used in Athena or other tools may be affected: an additional match within the synonyms increases the matching score even though the synonym is merely imputed.
Here is the proposal to be implemented:
- Do not impute synonyms. Leave concepts without synonyms if the source doesn’t provide any.
- Do not allow synonyms that match the concept_name, unless they’re in a national (non-English) language, i.e. language_concept_id <> ‘4180186’ (English language).
- Drop the existing synonyms that violate these rules.
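To make the drop rule concrete, here is a minimal sketch of how it might look, again using Python's sqlite3 on an in-memory toy copy of the tables. The function name drop_redundant_synonyms, the sample rows, and the non-English language id 999 are all hypothetical, and exact string equality is assumed as the matching criterion. It is not the actual vocabulary release script.

```python
import sqlite3

ENGLISH = 4180186      # English language concept, as stated in the proposal
OTHER_LANGUAGE = 999   # hypothetical placeholder for any non-English language


def drop_redundant_synonyms(conn):
    """Delete English synonyms that exactly repeat the concept_name.

    Returns the number of deleted rows. Non-English synonyms are kept
    even when they match the concept_name.
    """
    cur = conn.execute(
        """
        DELETE FROM concept_synonym
        WHERE language_concept_id = ?
          AND concept_synonym_name = (
              SELECT concept_name FROM concept
              WHERE concept.concept_id = concept_synonym.concept_id)
        """,
        (ENGLISH,),
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
CREATE TABLE concept_synonym (
    concept_id INTEGER, concept_synonym_name TEXT, language_concept_id INTEGER);
INSERT INTO concept VALUES (1, 'Aspirin'), (2, 'Myocardial infarction'), (3, 'Insulin');
""")
conn.executemany(
    "INSERT INTO concept_synonym VALUES (?, ?, ?)",
    [
        (1, "Aspirin", ENGLISH),          # imputed: would be dropped
        (2, "Heart attack", ENGLISH),     # real synonym: kept
        (3, "Insulin", OTHER_LANGUAGE),   # matches the name but not English: kept
    ],
)

removed = drop_redundant_synonyms(conn)
remaining = conn.execute(
    "SELECT concept_id, concept_synonym_name FROM concept_synonym ORDER BY concept_id"
).fetchall()
print(removed, remaining)
```

After this, concept 1 has no synonyms at all, which is exactly the situation the first rule accepts.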
Once this is implemented, the only reliable way to string-search concepts will be to search both tables (concept + concept_synonym). Currently, there may be false confidence that searching concept_synonym alone is enough.
We’d like to hear from the community how this could affect the string searches you use, especially within OHDSI tools (Athena, Atlas, USAGI).
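As an illustration of what searching both tables could look like, here is a small sketch with Python's sqlite3. The function search_concepts and the sample rows are hypothetical, and a plain substring LIKE match stands in for whatever matching a real tool uses. It is not how Athena, Atlas, or USAGI actually implement search.

```python
import sqlite3


def search_concepts(conn, term):
    """Return concept_ids whose name or any synonym contains the term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        """
        SELECT concept_id FROM concept WHERE concept_name LIKE ?
        UNION
        SELECT concept_id FROM concept_synonym WHERE concept_synonym_name LIKE ?
        ORDER BY concept_id
        """,
        (pattern, pattern),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
CREATE TABLE concept_synonym (
    concept_id INTEGER, concept_synonym_name TEXT, language_concept_id INTEGER);
-- Concept 1 has no synonyms (nothing imputed); concept 2 has one real synonym.
INSERT INTO concept VALUES (1, 'Aspirin'), (2, 'Myocardial infarction');
INSERT INTO concept_synonym VALUES (2, 'Heart attack', 4180186);
""")

by_name = search_concepts(conn, "aspirin")
by_synonym = search_concepts(conn, "heart")

# A search limited to concept_synonym would miss concept 1 entirely.
synonym_only = conn.execute(
    "SELECT concept_id FROM concept_synonym WHERE concept_synonym_name LIKE ?",
    ("%aspirin%",),
).fetchall()
print(by_name, by_synonym, synonym_only)
```

The UNION also removes duplicates, so a concept matched by both its name and a synonym is returned once.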
Tagging @Chris_Knoll @anthonysena @Yaroslav @schuemie @MaximMoinat @acumarav @Christian_Reich @Dymshyts