We’re getting started with mapping our outpatient clinical & billing data to OMOP. I’ve found v5 of the Common Data Model documentation, the download link for v4 of the Unrestricted Vocabulary, and a git repo for Vocabulary v5.0, but the repo doesn’t seem to include the data files for the vocabularies (or perhaps it’s a work in progress).
It’s straightforward to build the v5 PostgreSQL tables using the CDM, but I’m wondering about the vocabularies. Are the v5 vocabularies still being developed?
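For what it’s worth, here is roughly how I’m loading the CDM DDL, as a minimal sketch only; the DDL file name and connection settings are placeholders for whatever your local setup uses:

# Minimal sketch: run the downloaded OMOP CDM v5 DDL against a local PostgreSQL database.
# The DDL file name and connection parameters below are placeholders.
import psycopg2

conn = psycopg2.connect(dbname="omop_cdm", user="postgres", host="localhost")
with conn, conn.cursor() as cur, open("OMOP_CDM_postgresql_ddl.sql") as ddl:
    cur.execute(ddl.read())  # the DDL is a series of CREATE TABLE statements
conn.close()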
Thanks Jon & Juan for the replies. I found the vocabulary files.
I’m curious about their encoding: it doesn’t seem to be UTF-8, Latin-1, Windows-1252, or UTF-16. The ‘ö’ in ‘Sjögren’s syndrome’ appears in CONCEPT.csv as hex
\xbf\xbf
but I don’t see that string occur elsewhere in that file, so it doesn’t seem to be a generic replacement for all byte sequences that need escaping.
Do the files conform to a particular encoding, or is there some special escaping scheme?
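Here is the kind of quick check I’ve been doing, in case anyone else wants to look at the raw bytes in their copy (just a Python sketch; the file name and search string are simply what I happened to inspect, so adjust to taste):

# Dump the raw bytes around the term without assuming any encoding.
with open("CONCEPT.csv", "rb") as f:
    data = f.read()
pos = data.find(b"gren's syndrome")         # the ASCII tail of the term
if pos >= 0:
    print(data[max(pos - 10, 0):pos + 25])  # in my copy the o-umlaut shows up as b'\xbf\xbf'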
I think I’ve answered my own questions to some degree, with respect to CONCEPT.csv at least. Mostly it’s in Windows-1252, but a few lines are anomalous (8 by my count). CONCEPT_SYNONYM.csv seems to have a lot of anomalous strings.
Does this sound like something I should take to the developer forum? It seems like there is a way that the vocabularies could be generated in one specified encoding.
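To give a sense of what I mean by “anomalous”, here is roughly the kind of check involved (a sketch only; it flags lines that fail strict Windows-1252 decoding or contain the \xbf\xbf byte pair mentioned above, so it may not reproduce my counts exactly):

# Flag lines that either fail strict Windows-1252 decoding or contain the odd
# b'\xbf\xbf' byte pair. A rough heuristic, not a definitive test.
suspicious = 0
with open("CONCEPT.csv", "rb") as f:
    for raw in f:
        try:
            raw.decode("cp1252")
        except UnicodeDecodeError:
            suspicious += 1
            continue
        if b"\xbf\xbf" in raw:
            suspicious += 1
print(suspicious, "suspicious lines")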
The updated Perl script that I recently developed to create the v5 vocabulary data files is intended to address these issues.
The new v5 vocabulary file is encoded as Latin-1 (ISO 8859-1). Special characters (e.g. the copyright symbol) and letters with diacritics are converted to roughly equivalent ASCII characters using the Perl unidecode() module.
A new v5 vocabulary data export file should be available soon (check with Christian Reich for timing), along with the associated load scripts (Oracle CTL files, PostgreSQL COPY statements, and SQL Server BULK INSERT statements).
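To illustrate the transliteration step, here is a rough equivalent using Python’s unidecode package as a stand-in for the Perl module the script actually uses (the two implementations may differ slightly for some characters):

# Rough illustration of the transliteration idea, using Python's unidecode
# package as a stand-in for Perl's unidecode(); outputs may differ slightly.
from unidecode import unidecode

print(unidecode("Sjögren's syndrome"))   # Sjogren's syndrome
print(unidecode("Ménière's disease"))    # Meniere's disease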
Give me another day or two. We are testing the download site. You will be able to pick your vocabularies and get a file that contains all tables in the format that Lee described.
C
Are there any plans to represent the vocabularies in UTF-8 and/or support non-Latin characters? In particular, is there a reason that the SNOMED CT descriptions were converted from UTF-8? Lots of Ménière’s disease and the like even in the English language extensions.
The goal was to generate a single v5 vocabulary data file that could be loaded into Oracle, PostgreSQL, and SQL Server.
Unfortunately SQL Server does not support UTF-8 for bulk inserts, only UTF-16. Moving to UTF-16 would have roughly doubled the size of the dataset in bytes, and the current file is already a challenge to download for people with slower internet connections.
This approach enables straightforward loading of the data into multiple DBMSs and keeps the current file size, with the compromise that the example you cited will be loaded as ‘Meniere’s disease’.
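As a rough illustration of the size trade-off, here is a quick check on a sample string (not a measurement of the actual files):

# For mostly-ASCII text, UTF-16 roughly doubles the byte count relative to UTF-8.
s = "Ménière's disease"               # 17 characters, two of them accented
print(len(s.encode("utf-8")))         # 19 bytes (only the accented letters take 2 bytes)
print(len(s.encode("utf-16-le")))     # 34 bytes (every character takes 2 bytes)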
I ran a quick query to count the v5 concepts that contain at least one diacritic: 152 rows out of more than 1.9 million concepts.
I believe the vocabulary tables support UTF-8 data if one wanted to load UTF-8 vocabularies locally.
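If anyone wants an approximate version of that count straight from the data file rather than from a database query, something like the following would do. It is a sketch only; the file name, encoding, tab delimiter, header row, and the use of “contains a non-ASCII character” as a proxy for “contains a diacritic” are all assumptions:

# Count rows whose concept_name contains a non-ASCII character (a rough proxy
# for "contains a diacritic"). File name, encoding, and delimiter are assumptions.
import csv

count = 0
with open("CONCEPT.csv", encoding="latin-1", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    name_idx = header.index("concept_name")
    for row in reader:
        if any(ord(ch) > 127 for ch in row[name_idx]):
            count += 1
print(count, "concept names with non-ASCII characters")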
Thank you for your reply. Because the new v5 vocabulary is encoded in ISO 8859-1, and because there is an ISO 8859-1 character for the copyright symbol and for many letter/diacritic combinations, I wonder why these characters are being converted to ASCII.
It’s been a while since I worked on this, but I believe it was because there were also characters outside the range of ISO 8859-1, and the Perl unidecode() module was able to handle those too.
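A small illustration of the point, again using Python’s unidecode as a stand-in for the Perl module; the Greek letter below is just an arbitrary example of a character with no ISO 8859-1 equivalent:

# A character outside ISO 8859-1 cannot be written to a Latin-1 file at all,
# while a unidecode-style transliteration still yields something ASCII-safe.
from unidecode import unidecode

s = "\u03b2-blocker"                  # Greek small letter beta
try:
    s.encode("latin-1")
except UnicodeEncodeError:
    print("not representable in ISO 8859-1")
print(unidecode(s))                   # an ASCII-only approximation of the string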
Hi, I might be in the minority, but providing the files in UTF-8 and leaving it up to the implementer to do any necessary pre-processing (to UTF-16 in SQL Server’s case) might be the way to go. I feel UTF-8 would keep the base encoding ‘pure’, in that no characters would have to be shoved into an ISO 8859-1 set (which is what happens with Perl unidecode, if my understanding is correct that information is being lost in translation, or trans-coding).
We could provide the instructions required to convert the raw UTF-8 encoding into the desired output encoding for a given environment (although it seems the only environment that would require this is SQL Server, since Oracle and PostgreSQL handle UTF-8?).
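The conversion step itself is small. As a rough Python sketch (the file names below are placeholders):

# Re-encode a UTF-8 vocabulary file as UTF-16 (with a BOM) for SQL Server bulk
# loading. File names are placeholders; newline='' preserves line endings as-is.
with open("CONCEPT.csv", encoding="utf-8", newline="") as src, \
        open("CONCEPT_utf16.csv", "w", encoding="utf-16", newline="") as dst:
    for line in src:
        dst.write(line)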
Sorry for the spam, but I did find this link that talks about importing UTF-8 into SQL Server:
One of the replies talks about specifying a codepage:
The solution is: just add CODEPAGE = '65001' inside the WITH clause of the BULK INSERT (65001 is the code page number for UTF-8). It might not work for all Unicode characters, as suggested by Michael O, but at least it works perfectly for Latin Extended, Greek, and Cyrillic, and probably many others too.
Here’s his example script:
BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
(
CODEPAGE = '65001',
FIELDTERMINATOR = '|',
ROWTERMINATOR ='\n'
);
Like Chris, I’d prefer to have all the files in a single, well-understood encoding that doesn’t flatten out “special characters” based on the idiosyncrasies of one particular tool or another. UTF-8 sounds good to me. I realize that I’m a relative newcomer to this work, however.