
Loading CDM v5 Vocabularies


(William Stephens) #1

When I attempt to load the v5 vocab data using the Postgres COPY command, it fails due to delimiters in the data files. The primary issue appears to be that the 2nd column of 4 files contains escaped commas to allow literal commas in the field values. However, according to the Postgres docs, "backslash is not a special character in the CSV format" (http://www.postgresql.org/docs/9.2/static/sql-copy.html).
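
For context, Postgres COPY in CSV mode follows the usual CSV convention: a field containing the delimiter is wrapped in double quotes, not backslash-escaped. A quick Python sketch of the difference:

```python
import csv
import io

# The vocab files backslash-escape embedded commas, e.g.:
#   Endocrine Disease,Endocrine\, nutritional and metabolic diseases,44819170
# Standard CSV (what Postgres COPY ... CSV expects) quotes the field instead:
row = ["Endocrine Disease",
       "Endocrine, nutritional and metabolic diseases",
       "44819170"]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
print(buf.getvalue(), end="")
# Endocrine Disease,"Endocrine, nutritional and metabolic diseases",44819170
```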

Number of entries that contain ‘,’ in column 2:

  1. CONCEPT.csv: 228168
  2. CONCEPT_CLASS.csv: 6
  3. CONCEPT_SYNONYM.csv: 398538
  4. VOCABULARY.csv: 7
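
Those counts can be reproduced with a short script. Here is a sketch in Python, assuming the files' backslash-escape convention (the function name is mine):

```python
import re

def count_escaped_commas(path):
    """Count data rows whose 2nd field contains a backslash-escaped comma."""
    count = 0
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header row
        for line in f:
            # split on commas NOT preceded by a backslash
            fields = re.split(r"(?<!\\),", line.rstrip("\n"))
            if len(fields) > 1 and "\\," in fields[1]:
                count += 1
    return count
```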

Is there a recommended way to load the new vocabularies into Postgres? I’ve tried a few…

Postgres COPY CSV with HEADER option
Works on files without escaped commas. Fails on escaped data.

COPY cdm.concept_class FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_CLASS.CSV' CSV HEADER;
>[Error] Script lines: 1-1 --------------------------
 ERROR: extra data after last expected column
  Where: COPY concept_class, line 67: "Endocrine Disease,Endocrine\, nutritional and metabolic diseases,44819170"
 Line: 1 

Postgres COPY with DELIMITER ',' option
Does not work on files with header rows, because the HEADER option is only available in CSV mode.

COPY public.domain FROM '/Users/wstephens/data/Vocabulary5.0-20141013/DOMAIN.CSV' DELIMITER ',' 
HEADER;
[Error] Script lines: 2-3 --------------------------
ERROR: COPY HEADER available only in CSV mode
Line: 1

Postgres COPY with ESCAPE character specification
The ESCAPE setting has no effect here: in CSV mode it only applies to characters inside quoted fields, and these values are unquoted.

COPY public.concept_class FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_CLASS.CSV' CSV HEADER ESCAPE '\';
[Error] Script lines: 23-23 ------------------------
ERROR: extra data after last expected column
  Where: COPY concept_class, line 43: "Congenital Disease,Congenital malformations\, deformations and chromosomal abnormalities,44819183"
 Line: 1 

Options

  1. Remove the header row from each of the 4 files that contain escaped commas
  2. Quote column 2 of each of the 4 files that contain escaped commas
  3. Write a different data loader

Thoughts?
Bill


(Lee Evans) #2

I’ve been working on an update to the V5 vocabulary generation script and associated loading copy/bulk insert/CTL files which should create a file that can be more easily loaded into Oracle, Postgres and SQL Server.

I am sending Christian the new script and I will ask him to run it to generate a new V5 vocab file. I also plan to add the loading files (including postgres copy statements) to github later today.

I’ll post back to this thread with the locations of the github files after they are added.

It would be great to get your feedback on the new V5 vocab file and load scripts once they are available and you have a chance to try them out.


(William Stephens) #3

Fantastic! I’m new to Postgres, so this has been a bit of a challenge. I’m happy to provide feedback.

Bill


(Chris Davidson) #4

I would be more than happy to help test out the new load scripts as well.

Kind Regards,
Chris


(William Stephens) #5

@lee_evans

Did you get an opportunity to add the loading files to Github?

Thanks,
Bill


(Charles Bailey) #6
As a **very** temporary stopgap, here's a quick filter that yields Pg-friendly CSV files:

    #!/usr/bin/env perl -CO
    # Munge pseudo-CSV into Postgres-friendly CSV

    while (<>) {
      chomp;
      my @fields =
        map { s/\\,/,/g;
              s/[\x00-\x1f]/_/g;   # T/c expanding to hex code; would need
                                   # to trim resultant string to 255 chars
              /\D/ ? qq["$_"] : $_ }
        split /(?<!\\),/, $_, -1;
      print join(',', @fields), "\n";
    }

    # To import resultant file into Postgres, consider psql
    # user => \copy table from 'table.csv' CSV HEADER ESCAPE E'\\'

(Lee Evans) #7

I believe they are still pending a merge into Github from the repository owner.

FYI.
Below is an example for Postgresql (the copy statements for all the tables look the same), but it will only work with the matching version of the v5 data file. The matching data file hasn’t been released yet.

COPY CONCEPT FROM 'C:\CDMV5VOCAB\CONCEPT.csv' WITH DELIMITER E'\t' CSV HEADER QUOTE E'\b';

(The new file is tab-delimited; setting QUOTE to E'\b', the backspace character, effectively disables quote handling, so no quoting or escaping is needed in the data.)

regards,
Lee


(William Stephens) #8

Charles,

I’m using perl 5.16. It appears that the “-CO” option was deprecated after 5.9. Here is my fix for that:

#!/usr/bin/perl -w
# Munge pseudo-CSV into Postgres-friendly CSV

# use UTF-8 output
use open OUT => ':utf8', ':std';

while (<>) {
  chomp;
  my @fields =
    map { s/\\,/,/g;
          s/[\x00-\x1f]/_/g;   # T/c expanding to hex code; would need
                               # to trim resultant string to 255 chars
          /\D/ ? qq["$_"] : $_ }
    split /(?<!\\),/, $_, -1;
  print join(',', @fields), "\n";
}

I ran this as follows:

cat VOCABULARY.csv | perl pgParser.pl > VOCABULARY.pg.csv
cat CONCEPT_CLASS.csv | perl pgParser.pl > CONCEPT_CLASS.pg.csv
cat CONCEPT_SYNONYM.csv | perl pgParser.pl > CONCEPT_SYNONYM.pg.csv
cat CONCEPT.csv | perl pgParser.pl > CONCEPT.pg.csv

Loading In Postgres:

COPY public.vocabulary FROM '/Users/wstephens/data/Vocabulary5.0-20141013/VOCABULARY.pg.csv' CSV HEADER  ESCAPE E'\\';
COPY public.domain FROM '/Users/wstephens/data/Vocabulary5.0-20141013/DOMAIN.csv' CSV HEADER;
COPY public.concept_synonym FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_SYNONYM.pg.csv' CSV HEADER ESCAPE E'\\';
COPY public.concept_relationship FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_RELATIONSHIP.csv' CSV HEADER;
COPY public.concept_ancestor FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_ANCESTOR.csv' CSV HEADER;
COPY public.concept_class FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_CLASS.pg.csv' CSV HEADER  ESCAPE E'\\';
COPY public.concept FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT.pg.csv' CSV HEADER ESCAPE E'\\';
COPY public.relationship FROM '/Users/wstephens/data/Vocabulary5.0-20141013/RELATIONSHIP.csv' CSV HEADER;
COPY public.drug_strength FROM '/Users/wstephens/data/Vocabulary5.0-20141013/DRUG_STRENGTH.csv' CSV HEADER;

NOTE
I see an interesting character in the NDFRT vocab that may be an em dash rather than a standard hyphen?

Original VOCABULARY.csv:

NDFRT,National Drug File Reference Terminology (VA)

After running this perl script:

"NDFRT","National Drug File – Reference Terminology (VA)"
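
A quick way to hunt down characters like that is to scan for anything non-ASCII; a small sketch (the function name is mine):

```python
def find_non_ascii(text):
    """Return (line, col, char, codepoint) for every non-ASCII character."""
    hits = []
    for ln, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((ln, col, ch, "U+%04X" % ord(ch)))
    return hits
```

Running it over the vocab file's text would show whether that dash is U+2013 (en dash), U+2014 (em dash), or something else entirely.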

Many thanks,
Bill


(Ming) #9

@lee_evans
There are some rows in the concept data file that are not properly formatted, e.g. missing concept_name. Does that mean I need to modify the table schema accordingly? How did you handle it? Thanks.


(Christian Reich) #10

Ming.

Not sure Lee got the chance. Plus, he is innocent.

There are two vocabularies where we noticed this problem: GPI and Read. We fixed it in the latter.

Did you see that anywhere else?

Do you need the GPIs?

Best,

Christian


(Ming) #11

Sorry, Lee. I didn’t mean to put you on the spot. :smiley:

Christian, there are some in ICD10 as well. Thanks.


(Christian Reich) #12

ICD-10 will be fixed as well.

C


(Adam Black) #13

Hi @lee_evans. I would like to load the OMOP vocab into MS SQL Server and found your import script here: https://github.com/leeevans/Vocabulary-v5.0/tree/master/Import/SQL%20Server
Do you have SQL code that creates the vocab tables, or do you know where I can find the schema/variable name/variable type information? I’m new to OHDSI, so forgive me if this is a naive question. Thanks!


(Martijn Schuemie) #14

Hi Adam. The vocab tables are part of the Common Data Model. You can find the MS SQL scripts for creating the CDM tables, constraints and indices here.


(Andrew Williams) #15

Adam! Nice to see you on the forums. :grinning: Let me know if you have any problems with this. I assume the version that was still in place earlier in the year on the dev server has been nixed, or you wouldn’t be needing to do this over. Doing it over is easy; there’s a little license request. When it comes to doing the vocab downloads, you probably want to check a couple of non-default vocab options (MedDRA comes to mind). The forums are great. You can always call me or Robert if you don’t get what you need here.

