
New implementor questions re v5 vocabularies

Hi Chuck,

Welcome to OHDSI. You can find the CDM v5 vocabulary content here.

Jon

Small tips from our experience:

  1. Special characters are escaped, so loading the tables can be tricky under certain circumstances (we ran into this with PostgreSQL).
  2. If using Linux, there are some Windows special characters lurking around, so be sure to clean them out before loading the tables.

Some front-ends might take care of these issues, but I loaded everything from the command line and those are the little issues I had.
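
For illustration, a rough sketch of that cleanup in Python (the file names are placeholders, not the actual distribution layout):

# Minimal sketch: normalize Windows line endings (CRLF -> LF) and strip a
# UTF-8 byte-order mark before handing a table file to the loader.
from pathlib import Path

src = Path("CONCEPT.csv")            # hypothetical input file
dst = Path("CONCEPT.clean.csv")      # cleaned copy to load instead

data = src.read_bytes()
data = data.replace(b"\r\n", b"\n")  # Windows line endings -> Unix
if data.startswith(b"\xef\xbb\xbf"): # drop a UTF-8 BOM if present
    data = data[3:]
dst.write_bytes(data)
print(f"wrote {dst} ({len(data)} bytes)")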

Thanks Jon & Juan for the replies. I found the vocabulary files.

I’m curious about their encoding: it doesn’t seem to be UTF-8, Latin-1, Windows-1252, or UTF-16. The ‘ö’ in ‘Sjögren’s syndrome’ appears in CONCEPT.csv as hex

\xbf\xbf

but I don’t see that byte sequence anywhere else in the file, so it doesn’t seem to be a generic replacement for all byte sequences that need escaping.

Do the files conform to a particular encoding, or is there some special escaping scheme?
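
For what it’s worth, a quick Python probe along these lines (purely illustrative, not tied to the distributed files) shows what each candidate encoding would use for ‘ö’ and what it makes of that byte pair:

# Probe: how would each candidate encoding represent 'ö', and what does it
# make of the mystery byte pair? Just a sanity check, nothing file-specific.
mystery = b"\xbf\xbf"
for enc in ("utf-8", "latin-1", "cp1252", "utf-16-le"):
    encoded = "ö".encode(enc)
    try:
        decoded = mystery.decode(enc)
    except UnicodeDecodeError:
        decoded = "<not decodable>"
    print(f"{enc:10s} 'ö' -> {encoded!r:14} {mystery!r} -> {decoded!r}")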

Thanks again for your help.
Chuck

Hi all,

I think I’ve answered my own questions to some degree, with respect to CONCEPT.csv at least. Mostly it’s in Windows-1252, but a few lines are anomalous (8 by my count). CONCEPT_SYNONYM.csv seems to have a lot of anomalous strings.
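
(Here is a rough sketch, in Python, of one way to arrive at such counts; the path is a placeholder and the heuristics are only approximate.)

# Tally the non-ASCII lines by whether they decode cleanly as UTF-8, and flag
# the ones carrying the odd \xbf\xbf pair. Approximate, for orientation only.
from collections import Counter

counts = Counter()
with open("CONCEPT.csv", "rb") as f:
    for raw in f:
        if raw.isascii():
            continue                          # plain ASCII lines are fine
        if b"\xbf\xbf" in raw:
            counts["contains \\xbf\\xbf"] += 1
        try:
            raw.decode("utf-8")
            counts["valid UTF-8"] += 1
        except UnicodeDecodeError:
            counts["not UTF-8 (Windows-1252?)"] += 1
print(counts)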

Does this sound like something I should take to the developer forum? It seems like the vocabularies could be generated in one specified encoding.

Thanks,
Chuck

I think the CDM-Builders forum is probably the best spot. It’s for development stuff but focused on the CDM.

Hi Chuck,

The updated v5 vocabulary data file creation Perl script that I developed recently is intended to address these issues.

The file encoding for the new V5 vocabulary file is Latin1 (ISO 8859-1). Special characters (e.g. the copyright symbol) and diacritics are converted to roughly equivalent ASCII characters using the Perl unidecode() function.

A new V5 vocabulary data export file should be available soon (check with Christian Reich for timing) along with associated load scripts (Oracle CTL files, PostgreSQL COPY statements and SQL Server BULK INSERT statements).

regards,
Lee
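
For illustration only, a rough Python analogue of the transliteration step Lee describes (this is not his Perl script, and unlike unidecode() it will not map symbols such as the copyright sign):

# Decompose accented letters and drop the combining marks, keeping an ASCII
# approximation. A crude stand-in for unidecode()-style transliteration.
import unicodedata

def to_ascii(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Sjögren's syndrome"))   # -> Sjogren's syndrome
print(to_ascii("Ménière's disease"))    # -> Meniere's disease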

Chuck.

Give me another day or two. We are testing the download site. You will be able to pick your vocabularies and get a file that contains all tables in the format that Lee described.

C

Hi Lee,

Are there any plans to represent the vocabularies in UTF-8 and/or support non-Latin characters? In particular, is there a reason that the SNOMED CT descriptions were converted from UTF-8? Lots of Ménière’s disease and the like even in the English language extensions.

Thanks,
Brandon

Hi Brandon,

The goal was to generate a single v5 vocabulary data file that could be loaded into Oracle, PostgreSQL and SQL Server.

Unfortunately, SQL Server does not support UTF-8 for bulk inserts, only UTF-16. Moving to UTF-16 would have doubled the size of the dataset in bytes, and the current file is already a challenge to download for people with slower internet connections.

This approach enables straightforward loading of the data into multiple DBMSs and retains the current file size, with the compromise that the example you cited will be loaded as ‘Meniere’s disease’.

I ran a quick query to count the number of v5 concepts that contained at least one diacritic and the count was 152 rows out of a total of over 1.9 million concepts.

I believe the vocabulary tables support UTF-8 data if one wanted to load UTF-8 vocabularies locally.

Regards
Lee
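
A file-level sketch of an equivalent check (not Lee’s actual query; it assumes a tab-delimited CONCEPT.csv with concept_name in the second column, so adjust both to the real layout):

# Count concept_name values that contain at least one non-ASCII character.
import csv

total = non_ascii = 0
with open("CONCEPT.csv", encoding="latin-1", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)                      # skip the header row
    for row in reader:
        total += 1
        if not row[1].isascii():      # concept_name assumed to be column 2
            non_ascii += 1
print(f"{non_ascii} of {total} concept names contain non-ASCII characters")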

Hi Lee,

Thank you for your reply. Because the new v5 vocabulary is encoded in ISO 8859-1, and because there is an ISO 8859-1 character for the copyright symbol and for many letter/diacritic combinations, I wonder why these characters are being converted to ASCII.

Thanks,
Chuck

Hi Christian,

Will the availability of the new vocabulary site be announced in one of these forums?

Thanks,
Chuck

Hi Chuck,

It’s been a while since I worked on this, but I believe it was because there were characters outside the range of ISO 8859-1 as well, and the Perl unidecode() function was able to handle those too.

Lee.

@Christian_Reich, @jon_duke, are the files provided at this link (Vocabulary5.0-20141013) the most recent version of the v5 vocabularies?

Also, where do I find the DDLs for the v5 vocabularies?

Thanks!

And of course I immediately found the DDL files right after posting: https://github.com/OHDSI/CommonDataModel

Still wondering about the most recent version of the v5 files though.

Hi, I might be in the minority, but providing the files in UTF-8 encoding and leaving it up to the implementer to do any necessary pre-processing (to UTF-16 in SQL Server’s case) might be the way to go. UTF-8 keeps the base encoding ‘pure’, in that no characters get shoved into an ISO 8859-1 set (as happens with Perl unidecode, if my understanding is correct that things are being lost in translation, or rather trans-coding).

We can provide the instructions required to convert the raw UTF-8 encoding into the desired output encoding for a given environment (although it seems the only environment that would need this is SQL Server, since Oracle and PostgreSQL handle UTF-8?).

-Chris
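
A minimal sketch of that pre-processing step, assuming one wanted to re-encode a UTF-8 file as UTF-16 (with BOM) for a SQL Server BULK INSERT using DATAFILETYPE = 'widechar'; the file names are placeholders:

# Re-encode a UTF-8 vocabulary file as UTF-16 so SQL Server can bulk load it.
with open("CONCEPT.csv", encoding="utf-8") as src, \
     open("CONCEPT.utf16.csv", "w", encoding="utf-16") as dst:
    for line in src:
        dst.write(line)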

Sorry for the spam, but I did find this link that talks about importing UTF-8 into SQL Server:

One of the replies talks about specifying a codepage:

The solution is: just add CODEPAGE='65001' inside the WITH statement of the BULK INSERT (65001 = code page number for UTF-8). Might not work for all Unicode characters as suggested by Michael O, but at least it works perfectly for Latin-extended, Greek and Cyrillic, probably many others too.

Here’s his example script:

BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
    (
        CODEPAGE = '65001',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n'
    );

Someone want to give that a try?

-Chris

Chris:

Didn’t we say we’d go back to ASCII? No special characters?

Was that me? It doesn’t sound like me…

Like Chris, I’d prefer to have all the files in a single, well-understood encoding that doesn’t flatten out “special characters” based on the idiosyncrasies of one particular tool or another. UTF-8 sounds good to me. I realize that I’m a relative newcomer to this work, however.

Charles, Chris:

We made that decision not because you cannot handle it, but because the different database systems do not treat these characters equally when you search the data using LIKE and the like. Let’s keep it that way. The number of diacritics is really very low, and the whole effort of getting it right is not really worth it. We can come back to it later when we have a use case.
