
Loading CDM v5 Vocabularies

When I attempt to load the v5 vocab data using the Postgres “copy” command, it fails due to delimiters in the data files. The primary issue appears to be that the 2nd column of 4 files contains escaped commas, to allow commas within field values. However, according to the Postgres docs, the “backslash is not a special character in the CSV format” (http://www.postgresql.org/docs/9.2/static/sql-copy.html).
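
The behavior is easy to reproduce without the vocab files. A throwaway demo, runnable in psql (the table name is made up; the sample row is from CONCEPT_CLASS.csv):

    -- In CSV mode the backslash is ordinary data, so "\," still counts as a
    -- field separator: the row below parses as four fields, not three, and
    -- COPY raises "extra data after last expected column".
    CREATE TEMP TABLE csv_demo (a text, b text, c text);
    COPY csv_demo FROM STDIN CSV;
    Endocrine Disease,Endocrine\, nutritional and metabolic diseases,44819170
    \.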

Number of entries that contain ‘,’ in column 2:

  1. CONCEPT.csv: 228168
  2. CONCEPT_CLASS.csv: 6
  3. CONCEPT_SYNONYM.csv: 398538
  4. VOCABULARY.csv: 7

Is there a recommended way to load the new vocabularies into Postgres? I’ve tried a few…

Postgres COPY CSV with HEADER option
Works on files without escaped commas. Fails on escaped data.

COPY cdm.concept_class FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_CLASS.CSV' CSV HEADER;
>[Error] Script lines: 1-1 --------------------------
 ERROR: extra data after last expected column
  Where: COPY concept_class, line 67: "Endocrine Disease,Endocrine\, nutritional and metabolic diseases,44819170"
 Line: 1 

Postgres COPY with DELIMITER ‘,’ option
Does not work on files with header rows because the HEADER option is only available in CSV mode.

COPY public.domain FROM '/Users/wstephens/data/Vocabulary5.0-20141013/DOMAIN.CSV' DELIMITER ',' 
HEADER;
[Error] Script lines: 2-3 --------------------------
ERROR: COPY HEADER available only in CSV mode
Line: 1

Postgres COPY with ESCAPE character specification
In CSV mode the ESCAPE character only applies within quoted fields, so the unquoted backslash escapes are still not honored.

COPY public.concept_class FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_CLASS.CSV' CSV HEADER ESCAPE '\';
[Error] Script lines: 23-23 ------------------------
ERROR: extra data after last expected column
  Where: COPY concept_class, line 43: "Congenital Disease,Congenital malformations\, deformations and chromosomal abnormalities,44819183"
 Line: 1 

Options

  1. Remove the header row from each of the 4 files that contain escaped commas
  2. Quote column 2 of each of the 4 files that contain escaped commas
  3. Write a different data loader (a sketch of this approach follows the list)
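
(A rough sketch of option 3, assuming PostgreSQL, the three-column CONCEPT_CLASS file, and that only column 2 contains escaped commas; table and path names are illustrative. Stage each raw line into a one-column table using a delimiter and quote character that never occur in the data, then split on the first and last comma in SQL:)

    -- Stage raw lines; E'\x01' and E'\b' never appear in the data, so each
    -- whole line lands in the single text column.
    CREATE TABLE concept_class_raw (line text);
    COPY concept_class_raw FROM '/path/to/CONCEPT_CLASS.CSV'
      DELIMITER E'\x01' CSV HEADER QUOTE E'\b';

    -- Columns 1 and 3 contain no commas, so the first and last commas are
    -- the real separators; un-escape "\," in the middle field. Rows that
    -- fail the pattern are dropped silently, so compare row counts after.
    INSERT INTO cdm.concept_class
      (concept_class_id, concept_class_name, concept_class_concept_id)
    SELECT m[1], replace(m[2], E'\\,', ','), m[3]::integer
    FROM (SELECT regexp_matches(line, '^([^,]*),(.*),([^,]*)$') AS m
          FROM concept_class_raw) s;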

Thoughts?
Bill

I’ve been working on an update to the V5 vocabulary generation script and associated loading copy/bulk insert/CTL files which should create a file that can be more easily loaded into Oracle, Postgres and SQL Server.

I am sending Christian the new script and I will ask him to run it to generate a new V5 vocab file. I also plan to add the loading files (including postgres copy statements) to github later today.

I’ll post back to this thread with the locations of the github files after they are added.

It would be great to get your feedback on the new V5 vocab file and load scripts once they are available and you have a chance to try them out.


Fantastic! I’m new to Postgres so this has been a bit of a challenge. I’m happy to provide feedback.

Bill

I would be more than happy to test out the new load scripts as well.

Kind Regards,
Chris

@lee_evans

Did you get an opportunity to add the loading files to Github?

Thanks,
Bill

As a **very** temporary stopgap, here's a quick filter that yields Pg-friendly CSV files:

    #!/usr/bin/env perl -CO
    # Munge pseudo-CSV into Postgres-friendly CSV

    while (<>) {
      chomp;
      my @fields =
        map { s/\\,/,/g;
              s/[\x00-\x1f]/_/g;   # T/c expanding to hex code; would need
                                   # to trim resultant string to 255 chars
              /\D/ ? qq["$_"] : $_ }
        split /(?<!\\),/, $_, -1;
      print join(',', @fields), "\n";
    }

    # To import resultant file into Postgres, consider psql
    # user => \copy table from 'table.csv' CSV HEADER ESCAPE E'\\'

I believe they are still pending a merge into Github from the repository owner.

FYI.
Below is an example for PostgreSQL (the copy statements for all the tables look the same), but it will only work with the matching version of the v5 data file. The matching data file hasn’t been released yet.

COPY CONCEPT FROM 'C:\CDMV5VOCAB\CONCEPT.csv' WITH DELIMITER E'\t' CSV HEADER QUOTE E'\b';

The E'\b' (backspace) quote character never occurs in the data, so it effectively disables CSV quoting while still allowing the HEADER option.
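
One caveat worth noting: COPY with a file path reads the file on the database server and requires superuser privileges; from a client machine, the psql \copy equivalent (a sketch, same options assumed) would be:

    \copy concept from 'C:\CDMV5VOCAB\CONCEPT.csv' with delimiter E'\t' csv header quote E'\b'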

regards,
Lee

Charles,

I’m using perl 5.16. It appears that the “-CO” option was deprecated after 5.9. Here is my fix for that:

    #!/usr/bin/perl -w
    # Munge pseudo-CSV into Postgres-friendly CSV

    # use UTF-8 output
    use open OUT => ':utf8', ':std';

    while (<>) {
      chomp;
      my @fields =
        map { s/\\,/,/g;
              s/[\x00-\x1f]/_/g;   # T/c expanding to hex code; would need
                                   # to trim resultant string to 255 chars
              /\D/ ? qq["$_"] : $_ }
        split /(?<!\\),/, $_, -1;
      print join(',', @fields), "\n";
    }

I ran this as follows:

cat VOCABULARY.csv | perl pgParser.pl > VOCABULARY.pg.csv
cat CONCEPT_CLASS.csv | perl pgParser.pl > CONCEPT_CLASS.pg.csv
cat CONCEPT_SYNONYM.csv | perl pgParser.pl > CONCEPT_SYNONYM.pg.csv
cat CONCEPT.csv | perl pgParser.pl > CONCEPT.pg.csv

Loading In Postgres:

COPY public.vocabulary FROM '/Users/wstephens/data/Vocabulary5.0-20141013/VOCABULARY.pg.csv' CSV HEADER  ESCAPE E'\\';
COPY public.domain FROM '/Users/wstephens/data/Vocabulary5.0-20141013/DOMAIN.csv' CSV HEADER;
COPY public.concept_synonym FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_SYNONYM.pg.csv' CSV HEADER ESCAPE E'\\';
COPY public.concept_relationship FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_RELATIONSHIP.csv' CSV HEADER;
COPY public.concept_ancestor FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_ANCESTOR.csv' CSV HEADER;
COPY public.concept_class FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_CLASS.pg.csv' CSV HEADER  ESCAPE E'\\';
COPY public.concept FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT.pg.csv' CSV HEADER ESCAPE E'\\';
COPY public.relationship FROM '/Users/wstephens/data/Vocabulary5.0-20141013/RELATIONSHIP.csv' CSV HEADER;
COPY public.drug_strength FROM '/Users/wstephens/data/Vocabulary5.0-20141013/DRUG_STRENGTH.csv' CSV HEADER;

NOTE
I see an interesting character in the NDFRT vocab that may be an em dash rather than a standard hyphen?

Original VOCABULARY.csv:

NDFRT,National Drug File – Reference Terminology (VA)

After running this perl script:

"NDFRT","National Drug File – Reference Terminology (VA)"

Many thanks,
Bill

@lee_evans
There are some rows in the concept data file that are not properly formatted, e.g. missing concept_name. Does that mean I need to modify the table schema accordingly? How did you handle it? Thanks.
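
(In CSV mode an unquoted empty field is read as NULL, which trips the NOT NULL constraint on concept_name. A hedged interim workaround, until fixed files are released: load into a staging copy of the table with that constraint relaxed, then inspect the offenders. The staging table name and cdm schema are assumptions.)

    -- LIKE copies the NOT NULL constraints, hence the ALTER that follows.
    CREATE TABLE concept_stage (LIKE cdm.concept);
    ALTER TABLE concept_stage ALTER COLUMN concept_name DROP NOT NULL;

    -- After running the same COPY against concept_stage, list the bad rows:
    SELECT concept_id, vocabulary_id, concept_code
    FROM concept_stage
    WHERE concept_name IS NULL;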

Ming.

Not sure Lee got the chance. Plus, he is innocent.

There are two vocabularies where we noticed this problem: GPI and Read. We fixed it in the latter.

Did you see that anywhere else?

Do you need the GPIs?

Best,

Christian

Sorry, Lee. I didn’t mean to put you on the spot. :smiley:

Christian, there are ICD10 as well. Thanks.

ICD-10 will be fixed as well.

C

Hi @lee_evans. I would like to load the OMOP vocab into MS SQL Server and found your import script here: https://github.com/leeevans/Vocabulary-v5.0/tree/master/Import/SQL%20Server
Do you have SQL code that creates the vocab tables, or at least know where I can find the schema/variable name/variable type information? I’m new to OHDSI, so forgive me if this is a naive question. Thanks!

Hi Adam. The vocab tables are part of the Common Data Model. You can find the MS SQL scripts for creating the CDM tables, constraints and indices here.
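
For a sense of what those scripts contain, the CONCEPT_CLASS definition looks roughly like this (consult the linked DDL for the authoritative version, including constraints and indices):

    CREATE TABLE concept_class (
      concept_class_id          VARCHAR(20)   NOT NULL,
      concept_class_name        VARCHAR(255)  NOT NULL,
      concept_class_concept_id  INTEGER       NOT NULL
    );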


Adam! Nice to see you on the forums. :grinning: Let me know if you have any problems with this. I assume the version that was still in place earlier in the year on the dev server has been nixed, or you wouldn’t need to do this over. Doing it over is easy. There’s a little license request. When it comes to the vocab downloads, you probably want to check a couple of non-default vocab options; MedDRA comes to mind. The forums are great. You can always call me or Robert if you don’t get what you need here.


This may be an old issue, but other than the SQL Server bulk vocabulary import code, is there a community example of vocab loads with bcp? We are hitting errors using bcp, possibly due to nulls at the end of lines.

The files show as plain-text ASCII on an Ubuntu VM hosting RStudio, which connects to the CDM hosted in the same Azure environment (connection tested):

$ file -bi CONCEPT.csv
text/plain; charset=us-ascii

The Athena files are labelled .csv but use tab column separators.

I attempted the -k flag and specified the EOL format:

$ /opt/mssql-tools18/bin/bcp "cdm.concept" in "concept.csv" -S <myserver,port> -U -P -d cdm -u -F 2 -k -e errlog.txt -r\n

I can only get rows to parse as text with the -c flag, into an all-VARCHAR staging table, but that seemed like an unnecessary step to add. Switching to a pipe-delimited file also didn’t help with direct parsing. Any tips would be appreciated.
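
For reference, a sketch of the staging pattern described above in T-SQL (assumes SQL Server 2012+ for TRY_CAST; the staging table name, column widths, and the YYYYMMDD date format of the Athena files are assumptions):

    -- All-VARCHAR staging table, loaded with: bcp cdm.concept_stage in CONCEPT.csv -c ...
    CREATE TABLE cdm.concept_stage (
        concept_id       varchar(20),
        concept_name     varchar(1000),
        domain_id        varchar(20),
        vocabulary_id    varchar(20),
        concept_class_id varchar(20),
        standard_concept varchar(1),
        concept_code     varchar(50),
        valid_start_date varchar(8),
        valid_end_date   varchar(8),
        invalid_reason   varchar(1)
    );

    -- Cast into the typed CDM table; style 112 parses YYYYMMDD strings and
    -- NULLIF turns empty strings into proper NULLs.
    INSERT INTO cdm.concept
      (concept_id, concept_name, domain_id, vocabulary_id, concept_class_id,
       standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason)
    SELECT TRY_CAST(concept_id AS int),
           concept_name,
           domain_id,
           vocabulary_id,
           concept_class_id,
           NULLIF(standard_concept, ''),
           concept_code,
           CONVERT(date, valid_start_date, 112),
           CONVERT(date, valid_end_date, 112),
           NULLIF(invalid_reason, '')
    FROM cdm.concept_stage;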

Holt Oliver
