We’re getting started with mapping our outpatient clinical & billing data to OMOP. I’ve found v5 of the Common Data Model documentation, the download link for v4 of the Unrestricted Vocabulary, and a git repo for Vocabulary v5.0, but the repo doesn’t seem to include the data files for the vocabularies (or perhaps it’s a work in progress).
It’s straightforward to build the v5 PostgreSQL tables using the CDM, but I’m wondering about the vocabularies. Are the v5 vocabularies still being developed?
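For what it’s worth, here is roughly how I’m loading the CDM DDL, as a minimal sketch only; the DDL file name and connection settings are placeholders for whatever your local setup uses:

# Minimal sketch: run the downloaded OMOP CDM v5 DDL against a local PostgreSQL database.
# The DDL file name and connection parameters below are placeholders.
import psycopg2

conn = psycopg2.connect(dbname="omop_cdm", user="postgres", host="localhost")
with conn, conn.cursor() as cur, open("OMOP_CDM_postgresql_ddl.sql") as ddl:
    cur.execute(ddl.read())  # the DDL is a series of CREATE TABLE statements
conn.close()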
Thanks Jon & Juan for the replies. I found the vocabulary files.
I’m curious about their encoding: it doesn’t seem to be UTF-8, Latin-1, Windows-1252, or UTF-16. The ‘ö’ in ‘Sjögren’s syndrome’ appears in CONCEPT.csv as hex
\xbf\xbf
but I don’t see that string occur elsewhere in that file, so it doesn’t seem to be a generic replacement for all byte sequences that need escaping.
Do the files conform to a particular encoding, or is there some special escaping scheme?
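Here is the kind of quick check I’ve been doing, in case anyone else wants to look at the raw bytes in their copy (just a Python sketch; the file name and search string are simply what I happened to inspect, so adjust to taste):

# Dump the raw bytes around the term without assuming any encoding.
with open("CONCEPT.csv", "rb") as f:
    data = f.read()
pos = data.find(b"gren's syndrome")         # the ASCII tail of the term
if pos >= 0:
    print(data[max(pos - 10, 0):pos + 25])  # in my copy the o-umlaut shows up as b'\xbf\xbf'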
I think I’ve answered my own questions to some degree, with respect to CONCEPT.csv at least. Mostly it’s in Windows-1252, but a few lines are anomalous (8 by my count). CONCEPT_SYNONYM.csv seems to have a lot of anomalous strings.
Does this sound like something I should take to the developer forum? It seems like there is a way that the vocabularies could be generated in one specified encoding.
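To give a sense of what I mean by “anomalous”, here is roughly the kind of check involved (a sketch only; it flags lines that fail strict Windows-1252 decoding or contain the \xbf\xbf byte pair mentioned above, so it may not reproduce my counts exactly):

# Flag lines that either fail strict Windows-1252 decoding or contain the odd
# b'\xbf\xbf' byte pair. A rough heuristic, not a definitive test.
suspicious = 0
with open("CONCEPT.csv", "rb") as f:
    for raw in f:
        try:
            raw.decode("cp1252")
        except UnicodeDecodeError:
            suspicious += 1
            continue
        if b"\xbf\xbf" in raw:
            suspicious += 1
print(suspicious, "suspicious lines")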
The updated Perl script that I recently developed to create the v5 vocabulary data files is intended to address these issues.
The new v5 vocabulary file is encoded as Latin-1 (ISO 8859-1). Special characters (e.g. the copyright symbol) and letters with diacritics are converted to roughly equivalent ASCII characters using the Perl unidecode() module.
A new v5 vocabulary data export file should be available soon (check with Christian Reich for timing), along with the associated load scripts (Oracle CTL files, PostgreSQL COPY statements, and SQL Server BULK INSERT statements).
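To illustrate the transliteration step, here is a rough equivalent using Python’s unidecode package as a stand-in for the Perl module the script actually uses (the two implementations may differ slightly for some characters):

# Rough illustration of the transliteration idea, using Python's unidecode
# package as a stand-in for Perl's unidecode(); outputs may differ slightly.
from unidecode import unidecode

print(unidecode("Sjögren's syndrome"))   # Sjogren's syndrome
print(unidecode("Ménière's disease"))    # Meniere's disease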
Give me another day or two. We are testing the download site. You will be able to pick your vocabularies and get a file that contains all tables in the format that Lee described.
C
Are there any plans to represent the vocabularies in UTF-8 and/or support non-Latin characters? In particular, is there a reason that the SNOMED CT descriptions were converted from UTF-8? Lots of Ménière’s disease and the like even in the English language extensions.
The goal was to generate a single v5 vocabulary data file that could be loaded into Oracle, PostgreSQL, and SQL Server.
Unfortunately SQL Server does not support UTF-8 for bulk inserts, only UTF-16. Moving to UTF-16 would have roughly doubled the size of the dataset in bytes, and the current file is already a challenge to download for people with slower internet connections.
This approach enables straightforward loading of the data into multiple DBMSs and keeps the current file size, with the compromise that the example you cited will be loaded as ‘Meniere’s disease’.
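As a rough illustration of the size trade-off, here is a quick check on a sample string (not a measurement of the actual files):

# For mostly-ASCII text, UTF-16 roughly doubles the byte count relative to UTF-8.
s = "Ménière's disease"               # 17 characters, two of them accented
print(len(s.encode("utf-8")))         # 19 bytes (only the accented letters take 2 bytes)
print(len(s.encode("utf-16-le")))     # 34 bytes (every character takes 2 bytes)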
I ran a quick query to count the v5 concepts that contain at least one diacritic: 152 rows out of more than 1.9 million concepts.
I believe the vocabulary tables support UTF-8 data if one wanted to load UTF-8 vocabularies locally.
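If anyone wants an approximate version of that count straight from the data file rather than from a database query, something like the following would do. It is a sketch only; the file name, encoding, tab delimiter, header row, and the use of “contains a non-ASCII character” as a proxy for “contains a diacritic” are all assumptions:

# Count rows whose concept_name contains a non-ASCII character (a rough proxy
# for "contains a diacritic"). File name, encoding, and delimiter are assumptions.
import csv

count = 0
with open("CONCEPT.csv", encoding="latin-1", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    name_idx = header.index("concept_name")
    for row in reader:
        if any(ord(ch) > 127 for ch in row[name_idx]):
            count += 1
print(count, "concept names with non-ASCII characters")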
Thank you for your reply. Because the new v5 vocabulary is encoded in ISO 8859-1, and because there is an ISO 8859-1 character for the copyright symbol and for many letter/diacritic combinations, I wonder why these characters are being converted to ASCII.
It’s been a while since I worked on this, but I believe it was because there were also characters outside the range of ISO 8859-1, and the Perl unidecode() module was able to handle those too.
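A small illustration of the point, again using Python’s unidecode as a stand-in for the Perl module; the Greek letter below is just an arbitrary example of a character with no ISO 8859-1 equivalent:

# A character outside ISO 8859-1 cannot be written to a Latin-1 file at all,
# while a unidecode-style transliteration still yields something ASCII-safe.
from unidecode import unidecode

s = "\u03b2-blocker"                  # Greek small letter beta
try:
    s.encode("latin-1")
except UnicodeEncodeError:
    print("not representable in ISO 8859-1")
print(unidecode(s))                   # an ASCII-only approximation of the string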
Hi, I might be in the minority, but providing the files in UTF-8 and leaving it up to the implementer to do any necessary pre-processing (to UTF-16 in SQL Server’s case) might be the way to go. I feel UTF-8 would keep the base encoding ‘pure’, in that no characters would have to be shoved into an ISO 8859-1 set (which is what happens with Perl unidecode, if my understanding is correct that information is being lost in translation, or trans-coding).
We could provide the instructions required to convert the raw UTF-8 encoding into the desired output encoding for a given environment (although it seems the only environment that would require this is SQL Server, since Oracle and PostgreSQL handle UTF-8?).
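The conversion step itself is small. As a rough Python sketch (the file names below are placeholders):

# Re-encode a UTF-8 vocabulary file as UTF-16 (with a BOM) for SQL Server bulk
# loading. File names are placeholders; newline='' preserves line endings as-is.
with open("CONCEPT.csv", encoding="utf-8", newline="") as src, \
        open("CONCEPT_utf16.csv", "w", encoding="utf-16", newline="") as dst:
    for line in src:
        dst.write(line)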
Sorry for the spam, but I did find this link that talks about importing UTF-8 into SQL Server:
One of the replies talks about specifying a codepage:
The solution is: just add CODEPAGE = '65001' inside the WITH clause of the BULK INSERT (65001 is the code page number for UTF-8). It might not work for all Unicode characters, as suggested by Michael O, but at least it works perfectly for Latin Extended, Greek, and Cyrillic, and probably many others too.
Here’s his example script:
BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
(
CODEPAGE = '65001',
FIELDTERMINATOR = '|',
ROWTERMINATOR ='\n'
);
Like Chris, I’d prefer to have all the files in a single, well-understood encoding that doesn’t flatten out “special characters” based on the idiosyncrasies of one particular tool or another. UTF-8 sounds good to me. I realize that I’m a relative newcomer to this work, however.