Athena csv files contain embedded quotes

ColinOrr · December 5, 2017, 12:13pm

I am loading concept data into an Oracle database from csv files downloaded from Athena. Most of the data has loaded successfully but there are about 10k records that failed to load. Most of the data that failed to load contains quotes embedded in the records

e.g.

CONCEPT_ID ;CONCEPT_NAME ; DOMAIN_ID; VOCABULARY_ID ; CONCEPT_CLASS_ID ; STANDARD_CONCEPT ; CONCEPT_CODE ; VALID_START_DATE
; VALID_END_DATE ; INVALID_REASON
4.0491863E7 ; 'Linked to" reference set attribute’ ; Metadata ; SNOMED ; Model Comp ; ; 447257003 ; 20120131 ; 20991231

As you can see from the data, the concept_id is also formatted incorrectly but again, just for these 10k records. Is this a known issue or any recommendations on the easiest way to work around it? Is the data available as a txt file. Note, I am working in a windows environment.

maria.pozhidaeva · December 5, 2017, 12:43pm

Hello,
thank you for reporting - we will check the same case and inform you about workaround or created/fixed task.
Also anyway I don’t see problem to open a .csv file as a text file in a text editor.

BR

ColinOrr · December 5, 2017, 2:15pm

@maria.pozhidaeva, thanks for your response. Yes, I could find an editor that can deal with such large files and update the quotes (") issue but a significant portion of the same data has concept_ids in exponential format (e.g.) 4.0491E7. Slightly more difficult to replace because these IDs are used in other tables.

maria.pozhidaeva · December 6, 2017, 10:30am

@ColinOrr, I’ve checked all steps for loading concepts to Oracle db and all looks fine.
My actions:

Downloading from http://athena.ohdsi.org/vocabulary/list SNOMED vocabulary (including concept with concept_id = 40491863)
In CONCEPT.csv I see line (without “;” as separator and exponential notation):
40491863 “Linked to” reference set attribute Metadata SNOMED Model Comp 447257003 20120131 2
Created concept table (ddl - https://github.com/OHDSI/CommonDataModel/blob/master/Oracle/OMOP%20CDM%20ddl%20-%20Oracle.sql)
Using concept.ctl loaded all data to concept table (https://github.com/OHDSI/CommonDataModel/tree/master/Oracle/VocabImport). Example of command is here - https://github.com/OHDSI/CommonDataModel/blob/master/Oracle/VocabImport/OMOP%20CDM%20vocabulary%20load%20-%20Oracle.bat
Checked existing of concept with concept_id=40491863 with all fields (concept_name is “Linked to” reference set attribute)

Please confirm that you did the same steps or what difference are?
I am confused by the format of your csv file.

BR

ColinOrr · December 6, 2017, 11:02am

@maria.pozhidaeva, thanks for providing those details. It appears that the tool I am using (SQL Developer) is the source of my problems. I was letting the tool create the control file and it was inserting
OPTIONALLY ENCLOSED BY ‘"’ AND ‘"’

this cause the problem both with the embedded quotes and the scientific notation when it reported the problem. I have update the control file removing this option and the data is now loading.

Thanks for your assistance.