When I attempt to load the v5 vocab data using the Postgres “copy” command it fails due to delimiters in the data files. The primary issue appears to be that the 2nd column of 4 files contain escaped commas to enable commas in the field values. However, according to the Postgres docs the “backslash is not a special character in the CSV format”(http://www.postgresql.org/docs/9.2/static/sql-copy.html)
Number of entries that contain ‘,’ in column 2:
- CONCEPT.csv: 228168
- CONCEPT_CLASS.csv: 6
- CONCEPT_SYNONYM.csv: 398538
- VOCABULARY.csv: 7
Is there a recommended way to load the new vocabularies into Postgres? I’ve tried a few…
Postgres COPY CSV with HEADER option
Works on files without escaped commas. Fails on escaped data.
COPY cdm.concept_class FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_CLASS.CSV' CSV HEADER;
>[Error] Script lines: 1-1 --------------------------
ERROR: extra data after last expected column
Where: COPY concept_class, line 67: "Endocrine Disease,Endocrine\, nutritional and metabolic diseases,44819170"
Line: 1
Postgres COPY DELIMITER ‘,’" option
Does not work on files with header rows because HEADER option is not supported with DELIMETER
COPY public.domain FROM '/Users/wstephens/data/Vocabulary5.0-20141013/DOMAIN.CSV' DELIMITER ','
HEADER;
[Error] Script lines: 2-3 --------------------------
ERROR: COPY HEADER available only in CSV mode
Line: 1
Postgres COPY with ESCAPE character specification
ESCAPE configuration is ignored for CSV
COPY public.concept_class FROM '/Users/wstephens/data/Vocabulary5.0-20141013/CONCEPT_CLASS.CSV' CSV HEADER ESCAPE '\';
[Error] Script lines: 23-23 ------------------------
ERROR: extra data after last expected column
Where: COPY concept_class, line 43: "Congenital Disease,Congenital malformations\, deformations and chromosomal abnormalities,44819183"
Line: 1
Options
- Remove the header row from each of the 4 files that contain escaped commas
- Quote column 2 of each of the 4 files that contain escaped commas
- Write a different data loader
Thoughts?
Bill