HERMES Alpha Release

Frank · December 23, 2014, 9:27pm

I have released the initial version of HERMES to GitHub. HERMES is a web based vocabulary exploration tool that depends on the recently released WebAPI. The readme file has instructions on how to setup HERMES and configure it to leverage a local installation of the WebAPI.

Vojtech_Huser · January 9, 2015, 1:44pm

Hermes is a great tool! Thank you for creating an alpha version.

I want to provide feedback:

I would like to suggest that we change the API to support case insensitive search (via new function) or change the existing function.

In the IMEDS install, if I search for “natural mother” I get no hits but if I change it to “Natural mother” - I get the desired concept http://ohdsi.org/web/hermes/index.html#/concept/4277283

I believe we could find a SQL function that would translate well to all dialects (add it to the RenderSQL R package translation matrix) to make lowercase of all search strings and do case-insensitive search.

There are also SNOMED synonyms defined (e.g., Biological Mother)
see here for alternative ((in this case (better) browser)) http://schemes.caregraf.info/snomed#!65656005

I believe we have SNOMED synonyms in the CDMV loaded (@Christian_Reich)

(unrelated question)
(what is the forum syntax to make the links in my post “trackable” (indicating number of clicks they received in a grey circle next to a link?)

(and how can I make italic more than 2 lines of text)

Frank · January 9, 2015, 6:49pm

Wouldn’t this be better addressed on the database? For example, on SQL Server you control whether or not text queries are case sensitive on a database, table or specific column.

I believe the link tracking just happens by default.

jon_duke · January 9, 2015, 10:19pm

Looking at the documentation for T-SQL, Postgres, and Oracle, they all use
the standard lower(string) function.

So why not just set the query to lowercase in Java before passing on then
change search.sql to:

select CONCEPT_ID, CONCEPT_NAME, CONCEPT_CLASS, VOCABULARY_ID
from CONCEPT
where lower(CONCEPT_NAME) LIKE ‘%@query%’ and VALID_END_DATE > GETDATE()
order by CONCEPT_NAME ASC select * where lower(con

Should work without changes to SqlRender, no?

Frank · January 10, 2015, 3:19am

Wouldn’t this be better addressed on the database?

jon_duke · January 11, 2015, 3:29pm

There seem to be three valid ways to skin this cat: 1) At the database level, 2) at the session level, and 3) at the query level.

At the database level, the approach seems to vary but for example for SQL Server you could set the COLLATION such as in this example.

ALTER DATABASE MY_CDM
   COLLATE Latin1_General_CS_AS

At the session level, for example with Oracle, you can set the NLS_SORT and NLS_COMP to create case insensitive comparisons for the session, as in this example.

alter session set nls_sort=BINARY_CI
alter session set nls_comp=LINGUISTIC

Or finally at the query level, you can just use your standard LOWER() and UPPER() functions as described in the previous message. Change the query to lowercase in Java then

select CONCEPT_ID, CONCEPT_NAME, CONCEPT_CLASS, VOCABULARY_ID from CONCEPT where lower(CONCEPT_NAME) LIKE '%@query%' and VALID_END_DATE > GETDATE() 
order by CONCEPT_NAME ASC

While I think personally that every database should be case insensitive by default, I don’t think we should rest the successful functioning of Hermes on whether or not this has been done. It varies for each dialect, may require DBA performance at some sites, and may affect other CDM users who have scripts created already for other purposes.

So at a practical level, both the database-wide and session-specific solutions seems overly involved compared to the two lines of dialect-independent code required to change this at the query level (which I would be glad to contribute if this approach is acceptable).

This will set a convention for future approaches so I figured it was worth thinking through in detail to get other perspectives.

Jon

Patrick_Ryan · January 11, 2015, 3:39pm

I agree, it seems using LOWER() as a convention within our queries whenever
we’re doing string searches seems like a reasonable convention that we can
follow and apply consistently across all the environments we’re trying to
support. The consequence will be that we can’t do case-sensitive searches,
and I’m ok with us having this ‘limitation’, because as Vojtech expressed,
assuming case-insensitivity and not getting it is more dangerous than the
other way around.

Patrick

Frank · January 11, 2015, 5:13pm

By changing the query to force a case insensitive search means that if anyone wants case sensitivity, and setup their database to be case sensitive, then we prevent it from happening. It also performs poorly compared to changing collation.

Brandon_Ulrich · January 12, 2015, 12:36pm

For what it’s worth, the following conventions are used for SNOMED CT:

Descriptions are normalized based on the SNOMED case-sensitivity properties (the children of 900000000000447004 Case significance). E.g. if the property is Initial character case insensitive, as in Vojtech’s example, then the initial character would be lowercased when storing the primary search term: Natural mother to natural mother
Textual searches are case insensitive by default. Natural mother, natural mother, and NaTuRaL mother would return the same result.
Every token that starts with two or more uppercase characters is individually treated as case-sensitive. This use case is mostly about acronyms (AIDS vs Sharing of sex aids) and pharma trade names in countries where they are capitalized (AMOX should return AMOXIL but not Amoxicillin)
Enclosing a token in quotes requires an exact match. That means that it is case sensitive (“NaP” does not match Nape) and excludes synonym expansions (breast “mass” present does not match breast lump present)

Brandon

Frank · January 12, 2015, 4:03pm

One other thought on this - there is further work to be done to improve search that suggests that we will be looking to facilitate search through a different technology, @schuemie had suggested Lucene and there are other possibilities here as well. I suggest we start by finalizing the architecture for our search implementation.

jon_duke · January 12, 2015, 4:53pm

There is probably no one else on this green earth who loves Lucene more than I do, and yes we could set up a SOLR or ElasticSearch instance and rock a new vocabulary API using that. But I think there is more value in this current stage of our software development to sticking for now with the clean example that @Frank has put forward with Hermes / WebAPI. To introduce a new technology stack, even a great one like Lucene, seems like it would cause more complexity than necessary at the moment. (I do think a time will come when we really will want to go nuts with Lucene across the entire CDM, not just the vocab, but I’m resisting for now.)

As a use-case dependent operation, the only concrete real-world issue I’m hearing is that we can’t find stuff due to 1) case matching (diabetes mellitus will not return Diabetes Mellitus) and 2) token order (type 2 diabetes will not return diabetes type 2). I think dynamic case-matching type stuff (making it so that AIDS will not return ‘Sharing of sex aids’) is a nice-to-have but not a blocker to anyone finding what they’re looking for.

So I say let’s rally the flag around a few tweaks to the WebAPI and Hermes but otherwise keep the baby in the bathwater.