Character encoding standard

schuemie · January 7, 2015, 12:56am

We currently do not have an encoding standard for non-English characters, which has already led to some misunderstanding in Usagi and possibly Hermes.

I vote we standardize on UTF-8, which is what I always use, but can be persuaded differently with good arguments or beer.

Juan_Banda · January 7, 2015, 1:10am

I second the vote for UTF-8.

Christian_Reich · January 7, 2015, 2:41am

The database is in AL32UTF8 right now.

schuemie · January 7, 2015, 7:21am

If I understand correctly from here, AL32UTF8 and UTF8 are, for all practical purposes, the same.

The CSV files I got from you were in ISO-8859-1. So I guess somewhere in your export routine there was a conversion?
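For anyone hitting the same mismatch, here is a minimal Python sketch of converting ISO-8859-1 bytes to UTF-8. The function name `reencode` and the sample string are made up for illustration; this is not the actual export routine.

```python
def reencode(data: bytes, source: str = "iso-8859-1", target: str = "utf-8") -> bytes:
    """Decode bytes using the source encoding and re-encode them in the target encoding."""
    return data.decode(source).encode(target)


# "Sjögren" as it would appear in an ISO-8859-1 file: the ö is the single byte 0xF6.
latin1_bytes = "Sjögren".encode("iso-8859-1")
utf8_bytes = reencode(latin1_bytes)

# Reading the ISO-8859-1 bytes as if they were UTF-8 fails,
# which is typically how this kind of mismatch shows up.
try:
    latin1_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False
```

In UTF-8 the same character takes two bytes, so a tool that assumes the wrong encoding either errors out or shows garbled accents.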

wstephens · January 7, 2015, 12:53pm

UTF-8 for me

lee_evans · January 7, 2015, 11:01pm

I vote we standardize on some form of UTF for the vocabulary file format (the database already supports it).

Can someone please confirm that SQL Server can load a file in UTF-8 format and provide some examples of the bcp or bulk insert statements? Or does it only support UTF-16?

See this URL for bcp: http://msdn.microsoft.com/en-us/library/ms162802.aspx
and this URL for SQL bulk insert: http://msdn.microsoft.com/en-us/library/ms188365.aspx
where it says that “SQL Server does not support code page 65001 (UTF-8 encoding).”

If SQL Server requires UTF-16, the impact would be that we either double the download file size for all DBMS users (if we wanted to produce a single file) or develop additional code to generate UTF-16 files for SQL Server and UTF-8 files for Oracle and Postgres.
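To put a rough number on the size concern, here is a small Python sketch comparing the byte length of the same text in both encodings. The helper name `encoded_sizes` and the sample term are illustrative assumptions, and the byte order mark is left out of the UTF-16 count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # utf-16-le is used so the 2-byte byte order mark is not counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


# A mostly-ASCII term with one accented Latin character.
sizes = encoded_sizes("Sjögren syndrome")
print(sizes)  # {'utf-8': 17, 'utf-16': 32}
```

For text that is almost entirely ASCII, UTF-16 comes out at close to twice the size of UTF-8.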

schuemie · January 8, 2015, 2:11am

Urgh! As far as I can see, SQL Server does not support UTF-8, and in fact dropped support for importing UTF-8 data ( https://connect.microsoft.com/SQLServer/feedback/details/370419/ ). By default, SQL Server uses ISO-1 (ISO 8859-1), which does support European characters but not Asian characters.

Christian_Reich · January 8, 2015, 3:40am

That may be ok, Martijn. We don’t have Asian characters in there. The only characters we bump into are accented or otherwise altered Latin. So, when exporting, we could produce a different character set depending on which platform people choose.
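A per-platform export along those lines could be sketched in Python as follows. The mapping `PLATFORM_ENCODINGS` and the function `encode_for_platform` are hypothetical, and the encodings chosen simply mirror what was mentioned in this thread; they are not a tested recommendation.

```python
# Hypothetical mapping from target platform to export encoding (an assumption
# based on this thread, not an agreed standard).
PLATFORM_ENCODINGS = {
    "sql server": "iso-8859-1",
    "oracle": "utf-8",
    "postgresql": "utf-8",
}


def encode_for_platform(text: str, platform: str) -> bytes:
    """Encode text for the chosen platform, failing loudly on unsupported characters."""
    encoding = PLATFORM_ENCODINGS[platform.lower()]
    # str.encode is strict by default, so a character the target encoding
    # cannot represent raises UnicodeEncodeError instead of being silently lost.
    return text.encode(encoding)
```

Keeping the encoding strict means that if a non-Latin character ever does enter the vocabulary, the ISO-8859-1 export fails visibly rather than writing corrupted data.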

ambuj · February 12, 2019, 2:47pm

I vote for the same. Please add UTF-8 as a standard concept for encoding_concept_id.

Christian_Reich · February 13, 2019, 4:48pm

@ambuj:

I think @schuemie said he does not want UTF-8. What are you voting the “same” for?

ambuj · February 13, 2019, 5:13pm

Well, I guess what I asked is off topic here. My concern was about the Notes entity, where encoding_concept_id needs to map to a standard concept_id in the concept table. But I cannot find any encoding concept ID. Can you suggest something here?
I am sorry, I am new to OMOP.

TIA

Christian_Reich · February 14, 2019, 3:15pm

@ambuj: No, you are right. There is the field encoding_concept_id. Unbelievable. It snuck in.

Ok, let’s do this: We create the UTF-8 concept so you can do your job, and in the meantime I will try to convince the community to drop that field. It should never have been there to begin with. They may push back with some good reason why it is needed for NLP.

ambuj · February 14, 2019, 3:27pm

So that means I can proceed with setting encoding_concept_id = 0?
Also, if you can shed some light on notes_class_concept_id, that would be great.

Thank You.

schuemie · February 15, 2019, 3:22am

(Just for the record: I was advocating that we adopt UTF-8 as the standard in our tools.)