What is the recommended collation for Microsoft SQL server?

Moving this to the forums instead of the CDM github

I came across a problem when loading the vocabularies (CONCEPT_SYNONYM):

Violation of UNIQUE KEY constraint 'uq_concept_synonym'. Cannot insert duplicate key in object 'omop.concept_synonym'. The duplicate key value is (9188, final report, 4180186)

After some searching the duplicate entries where:

9188	Final report	4180186
9188	final report	4180186

This lead me to the problem that my SQL server is Case Insensitive (collation = SQL_Latin1_General_CP1_CI_AS).
After changing to SQL_Latin1_General_CP1_CS_AS, this problem goes away.

In the book of OHDSI the only thing I found is:

Case Sensitivity in String Comparisons
Some DBMS platforms such as SQL Server always perform string comparisons in a case-insensitive way, while others such as PostgreSQL are always case sensitive. It is therefore recommended to always assume case-sensitive comparisons, and to explicitly make comparisons case-insensitive when unsure about the case. For example, instead of (…)

Is there a recommended collation to use when setting up new CDM databases for our clients (when using SQL server)?

  • github user w0ut0
3 Likes

Sorry to resurface an old thread, but I’m curious if there was ever a decision here.

During ETL we noticed that there are duplicates in the Concept Synonym vocabulary table where the only difference was case sensitivity. Example: Concept ID: 948754 These could easily be eliminated on our end with a “SELECT DISTINCT…” or “GROUP BY…”, but wasn’t sure if eliminating the differently cased rows would cause issues in downstream analysis.