
New implementor questions re v5 vocabularies

Hi Chuck,

Welcome to OHDSI. You can find the CDM v5 vocabulary content here.

Jon

Small tips from our experience:

  1. Special characters are escaped, so loading the tables can be tricky under certain circumstances (we ran into this with PostgreSQL).
  2. If using Linux, there are some Windows special characters lurking around, so be sure to clean them out before loading the tables.

Some front-ends might take care of these issues, but I loaded everything from the command line and those are the little issues I had.
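
For illustration, a rough sketch of that cleanup in Python (the file names are placeholders, not the actual distribution layout):

# Minimal sketch: normalize Windows line endings (CRLF -> LF) and strip a
# UTF-8 byte-order mark before handing a table file to the loader.
from pathlib import Path

src = Path("CONCEPT.csv")            # hypothetical input file
dst = Path("CONCEPT.clean.csv")      # cleaned copy to load instead

data = src.read_bytes()
data = data.replace(b"\r\n", b"\n")  # Windows line endings -> Unix
if data.startswith(b"\xef\xbb\xbf"): # drop a UTF-8 BOM if present
    data = data[3:]
dst.write_bytes(data)
print(f"wrote {dst} ({len(data)} bytes)")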

Thanks Jon & Juan for the replies. I found the vocabulary files.

I’m curious about their encoding: it doesn’t seem to be UTF-8, Latin-1, Windows-1252, or UTF-16. The ‘ö’ in ‘Sjögren’s syndrome’ appears in CONCEPT.csv as hex

\xbf\xbf

but I don’t see that byte sequence anywhere else in the file, so it doesn’t seem to be a generic replacement for all byte sequences that need escaping.

Do the files conform to a particular encoding, or is there some special escaping scheme?
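
For what it’s worth, a quick Python probe along these lines (purely illustrative, not tied to the distributed files) shows what each candidate encoding would use for ‘ö’ and what it makes of that byte pair:

# Probe: how would each candidate encoding represent 'ö', and what does it
# make of the mystery byte pair? Just a sanity check, nothing file-specific.
mystery = b"\xbf\xbf"
for enc in ("utf-8", "latin-1", "cp1252", "utf-16-le"):
    encoded = "ö".encode(enc)
    try:
        decoded = mystery.decode(enc)
    except UnicodeDecodeError:
        decoded = "<not decodable>"
    print(f"{enc:10s} 'ö' -> {encoded!r:14} {mystery!r} -> {decoded!r}")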

Thanks again for your help.
Chuck

Hi all,

I think I’ve answered my own questions to some degree, with respect to CONCEPT.csv at least. Mostly it’s in Windows-1252, but a few lines are anomalous (8 by my count). CONCEPT_SYNONYM.csv seems to have a lot of anomalous strings.
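
(Here is a rough sketch, in Python, of one way to arrive at such counts; the path is a placeholder and the heuristics are only approximate.)

# Tally the non-ASCII lines by whether they decode cleanly as UTF-8, and flag
# the ones carrying the odd \xbf\xbf pair. Approximate, for orientation only.
from collections import Counter

counts = Counter()
with open("CONCEPT.csv", "rb") as f:
    for raw in f:
        if raw.isascii():
            continue                          # plain ASCII lines are fine
        if b"\xbf\xbf" in raw:
            counts["contains \\xbf\\xbf"] += 1
        try:
            raw.decode("utf-8")
            counts["valid UTF-8"] += 1
        except UnicodeDecodeError:
            counts["not UTF-8 (Windows-1252?)"] += 1
print(counts)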

Does this sound like something I should take to the developer forum? It seems like the vocabularies could be generated in one specified encoding.

Thanks,
Chuck

I think the CDM-Builders forum is probably the best spot. It’s for development stuff but focused on the CDM.

Hi Chuck,

The updated v5 vocabulary data file creation Perl script that I developed recently is intended to address these issues.

The file encoding for the new V5 vocabulary file is Latin1 (ISO 8859-1). Special characters (e.g. the copyright symbol) and diacritics are converted to roughly equivalent ASCII characters using the Perl unidecode() function.

A new V5 vocabulary data export file should be available soon (check with Christian Reich for timing) along with associated load scripts (Oracle CTL files, PostgreSQL COPY statements and SQL Server BULK INSERT statements).

regards,
Lee
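
For illustration only, a rough Python analogue of the transliteration step Lee describes (this is not his Perl script, and unlike unidecode() it will not map symbols such as the copyright sign):

# Decompose accented letters and drop the combining marks, keeping an ASCII
# approximation. A crude stand-in for unidecode()-style transliteration.
import unicodedata

def to_ascii(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Sjögren's syndrome"))   # -> Sjogren's syndrome
print(to_ascii("Ménière's disease"))    # -> Meniere's disease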

Chuck.

Give me another day or two. We are testing the download site. You will be able to pick your vocabularies and get a file that contains all tables in the format that Lee described.

C

Hi Lee,

Are there any plans to represent the vocabularies in UTF-8 and/or support non-Latin characters? In particular, is there a reason that the SNOMED CT descriptions were converted from UTF-8? Lots of Ménière’s disease and the like even in the English language extensions.

Thanks,
Brandon

Hi Brandon,

The goal was to generate a single v5 vocabulary data file that could be loaded into Oracle, PostgreSQL and SQL Server.

Unfortunately, SQL Server does not support UTF-8 for bulk inserts, only UTF-16. Moving to UTF-16 would have doubled the size of the dataset in bytes, and the current file is already a challenge to download for people with slower internet connections.

This approach enables straightforward loading of the data into multiple DBMSs and retains the current file size, with the compromise that the example you cited will be loaded as ‘Meniere’s disease’.

I ran a quick query to count the number of v5 concepts that contained at least one diacritic and the count was 152 rows out of a total of over 1.9 million concepts.

I believe the vocabulary tables support UTF-8 data if one wanted to load UTF-8 vocabularies locally.

Regards
Lee
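
A file-level sketch of an equivalent check (not Lee’s actual query; it assumes a tab-delimited CONCEPT.csv with concept_name in the second column, so adjust both to the real layout):

# Count concept_name values that contain at least one non-ASCII character.
import csv

total = non_ascii = 0
with open("CONCEPT.csv", encoding="latin-1", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)                      # skip the header row
    for row in reader:
        total += 1
        if not row[1].isascii():      # concept_name assumed to be column 2
            non_ascii += 1
print(f"{non_ascii} of {total} concept names contain non-ASCII characters")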

Hi Lee,

Thank you for your reply. Because the new v5 vocabulary is encoded in ISO 8859-1, and because there is an ISO 8859-1 character for the copyright symbol and for many letter/diacritic combinations, I wonder why these characters are being converted to ASCII.

Thanks,
Chuck

Hi Christian,

Will the availability of the new vocabulary site be announced in one of these forums?

Thanks,
Chuck

Hi Chuck,

It’s been a while since I worked on this, but I believe it was because there were characters outside the range of ISO 8859-1 as well, and the Perl unidecode() function was able to handle those too.

Lee.

@Christian_Reich, @jon_duke, are the files provided at this link (Vocabulary5.0-20141013) the most recent version of the v5 vocabularies?

Also, where do I find the DDLs for the v5 vocabularies?

Thanks!

And of course I immediately found the DDL files right after posting: https://github.com/OHDSI/CommonDataModel

Still wondering about the most recent version of the v5 files though.

Hi, I might be in the minority, but providing the files in UTF-8 encoding and leaving it up to the implementer to do any necessary pre-processing (to UTF-16 in SQL Server’s case) might be the way to go. UTF-8 keeps the base encoding ‘pure’, in that no characters get shoved into an ISO 8859-1 set (as happens with Perl unidecode, if my understanding is correct that things are being lost in translation, or rather trans-coding).

We can provide the instructions required to convert the raw UTF-8 encoding into the desired output encoding for a given environment (although it seems the only environment that would need this is SQL Server, since Oracle and PostgreSQL handle UTF-8?).

-Chris
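
A minimal sketch of that pre-processing step, assuming one wanted to re-encode a UTF-8 file as UTF-16 (with BOM) for a SQL Server BULK INSERT using DATAFILETYPE = 'widechar'; the file names are placeholders:

# Re-encode a UTF-8 vocabulary file as UTF-16 so SQL Server can bulk load it.
with open("CONCEPT.csv", encoding="utf-8") as src, \
     open("CONCEPT.utf16.csv", "w", encoding="utf-16") as dst:
    for line in src:
        dst.write(line)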

Sorry for the spam, but I did find this link that talks about importing UTF-8 into SQL Server:

One of the replies talks about specifying a codepage:

The solution is: just add CODEPAGE='65001' inside the WITH statement of the BULK INSERT (65001 = code page number for UTF-8). Might not work for all Unicode characters as suggested by Michael O, but at least it works perfectly for Latin-extended, Greek and Cyrillic, probably many others too.

Here’s his example script:

BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
    (
        CODEPAGE = '65001',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n'
    );

Someone want to give that a try?

-Chris

Chris:

Didn’t we say we’d go back to ASCII? No special characters?

Was that me? It doesn’t sound like me…

Like Chris, I’d prefer to have all the files in a single, well-understood encoding that doesn’t flatten out “special characters” based on the idiosyncrasies of one particular tool or another. UTF-8 sounds good to me. I realize that I’m a relative newcomer to this work, however.

Charles, Chris:

We made that decision not because you cannot handle it, but because the different database systems do not treat these characters equally when you search the data using LIKE and the like. Let’s keep it that way. The number of diacritics is really very low, and the whole effort of getting it right is not really worth it. We can come back to it later when we have a use case.
