
Character encoding standard

(Martijn Schuemie) #1

We currently do not have an encoding standard for non-English characters, which has already led to some misunderstanding in Usagi and possibly Hermes.

I vote we standardize on UTF-8, which is what I always use, but can be persuaded differently with good arguments or beer.

UTF-8 characters not translating to vocabularies
(Juan M. Banda) #2

I second the vote for UTF-8.

(Christian Reich) #3

The database is in AL32UTF8 right now.

(Martijn Schuemie) #4

If I understand correctly from here, AL32UTF8 and UTF8 are, for all practical purposes, the same.

The CSV files I got from you were in ISO-8859-1. So I guess somewhere in your export routine there was a conversion?
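The mismatch is easy to demonstrate: the same accented text produces different byte sequences under the two encodings, so a file written as ISO-8859-1 but read as UTF-8 (or vice versa) comes out garbled. A minimal sketch in Python (the example string is made up):

```python
# Minimal sketch: the same text under ISO-8859-1 vs UTF-8.
text = "1-propanól"  # example vocabulary term with an accented Latin character

latin1_bytes = text.encode("iso-8859-1")  # accented characters take 1 byte
utf8_bytes = text.encode("utf-8")         # non-ASCII characters take 2+ bytes
assert latin1_bytes != utf8_bytes         # same text, different bytes on disk

# Converting a Latin-1 export to UTF-8 is a decode/encode round trip:
converted = latin1_bytes.decode("iso-8859-1").encode("utf-8")
assert converted == utf8_bytes
```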

(William Stephens) #5

UTF-8 for me

(Lee Evans) #6

I vote we standardize on some form of UTF for the vocabulary file format (the database already supports it).

Can someone please confirm whether SQL Server can load a file in UTF-8 format, and provide some examples of the bcp or BULK INSERT statements? Or does it only support UTF-16?

See this URL for bcp: http://msdn.microsoft.com/en-us/library/ms162802.aspx
and this URL for SQL bulk insert: http://msdn.microsoft.com/en-us/library/ms188365.aspx
where it says that “SQL Server does not support code page 65001 (UTF-8 encoding).”

If SQL Server requires UTF-16, the impact would be either a doubled download file size for all DBMS users (if we wanted to produce a single file), or additional code to generate UTF-16 files for SQL Server and UTF-8 files for Oracle and Postgres.
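For the UTF-16 route, the extra code would be small; a sketch (file names hypothetical) of re-encoding a UTF-8 vocabulary export as UTF-16 for SQL Server:

```python
import io

def reencode(src_path: str, dst_path: str,
             src_enc: str = "utf-8", dst_enc: str = "utf-16") -> None:
    """Rewrite a text file in a different encoding, streaming line by line
    (the vocabulary files are too large to hold in memory at once)."""
    with io.open(src_path, encoding=src_enc, newline="") as src, \
         io.open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        for line in src:
            dst.write(line)

# Hypothetical usage:
# reencode("CONCEPT_utf8.csv", "CONCEPT_utf16.csv")
```

For mostly-ASCII content this roughly doubles the file size (each ASCII character becomes two bytes in UTF-16, plus a two-byte byte-order mark), which matches the download-size concern above.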

(Martijn Schuemie) #7

Urgh! As far as I can see, SQL Server does not support UTF-8, and in fact dropped support for importing UTF-8 data ( https://connect.microsoft.com/SQLServer/feedback/details/370419/ ). By default, SQL Server uses Latin-1 (ISO 8859-1), which supports European characters but not Asian characters.
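The distinction is easy to check: ISO 8859-1 has code points for accented Latin characters but none at all for CJK text. A quick Python illustration:

```python
# ISO 8859-1 (Latin-1) covers accented Latin letters...
assert "café".encode("iso-8859-1") == b"caf\xe9"  # é maps to a single byte

# ...but has no code points for Asian characters:
try:
    "漢字".encode("iso-8859-1")
except UnicodeEncodeError:
    print("not representable in ISO 8859-1")  # this branch is taken
```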

(Christian Reich) #8

That may be OK, Martijn. We don’t have Asian characters in there; the only characters we bump into are accented or otherwise altered Latin. So when exporting we could produce a different character set depending on which platform people choose.

(Ambuj) #9

I vote for the same. Please add UTF-8 as a standard encoding concept ID.

(Christian Reich) #10


I think @schuemie said he does not want UTF-8. What are you voting “the same” for?

(Ambuj) #11

Well, I guess what I asked is off-topic here. My concern was the Notes entity, where we need the encoding_concept_id to map to a standard concept_id in the concept table, but I cannot find any encoding concept ID. Can you suggest something here?
I am sorry, I am new to OMOP.


(Christian Reich) #12

@ambuj: No, you are right. There is the field encoding_concept_id. Unbelievable. :slight_smile: It snuck in.

OK, let’s do this: we create the UTF-8 concept so you can do your job, and in the meantime I will try to convince the community to drop that field. It should never have been there to begin with. They may push back with some good reason why it is needed for NLP.

How to populate _concept_id fields in CDM Note table?
(Ambuj) #13

So does that mean I can proceed with setting encoding_concept_id = 0?
Also, if you can shed some light on notes_class_concept_id, that would be great.

Thank You.

(Martijn Schuemie) #14

(just for the record: I was advocating we adopt UTF-8 as our standard in our tools)