Has anyone experienced any issues while uploading? I tried tab, comma, ;, and | as separators, but nothing works. Can somebody let us know whether the issue is with the files? I could upload other files with the tab delimiter, but concept.csv doesn't work with either the tab or the comma separator.
I use pgAdmin to upload the files. These are the settings that I use:
I was expecting to see the data in a column-wise format (after double-clicking), cleanly separated.
@Dymshyts - Thanks for the links, I had a look at it.
Quick question regarding constraints: do we have to drop the constraints before adding the vocab files? I see certain constraints being violated, and a few other posts in the forum discuss the same thing. So: drop constraints -> add vocabs -> set constraints again?
I also faced another issue; just thought of reporting it.
@Dymshyts - Actually, I tried to follow the order given here, but it still threw a constraint error.
I started with the drug_strength table, as shown in the code, but it expects the concepts to already be present in the concept table, so it resulted in a foreign key constraint violation error.
If we look at the order in the script that you shared, the concept table is loaded after drug_strength.
But unfortunately it still throws the error shown below.
ERROR: insert or update on table "drug_strength" violates foreign key constraint
"fpk_drug_strength_concept_1"
DETAIL: Key (drug_concept_id)=(37594088) is not present in table "concept".
Am I missing anything here? Can you correct me if I am making any mistakes?
Vocabulary data files are tab-separated, so you should use the '\t' delimiter. Also, make sure you set the quote character; without it, some concepts from the 'GCN_SEQNO' vocabulary fail to load (like on your 2nd screenshot).
Verified this with Jan 2020 vocabularies on a fresh PostgreSQL 10 install using the script that @Dymshyts pointed you to:
-- QUOTE E'\b' sets the quote character to backspace, which never occurs in the
-- data, so embedded double quotes in concept names are loaded verbatim.
COPY omop_voc.concept
FROM '/data/vocabs/CONCEPT.csv'
WITH DELIMITER E'\t' CSV HEADER QUOTE E'\b';
Not sure why DRUG_STRENGTH is supposed to be loaded first; I'd propose starting with the CONCEPT table and loading DRUG_STRENGTH last. Create the tables using the provided DDL, load data into them, then apply the foreign key constraints.
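A minimal sketch of that sequence, reusing the schema and paths from the example above (the full set of vocabulary files is abbreviated):

-- 1. Create the tables from the OHDSI DDL, without foreign keys yet.
-- 2. Load CONCEPT before any table that references it:
COPY omop_voc.concept
FROM '/data/vocabs/CONCEPT.csv'
WITH DELIMITER E'\t' CSV HEADER QUOTE E'\b';
-- ...load the remaining vocabulary files the same way...
-- 3. Load DRUG_STRENGTH last, since it references CONCEPT:
COPY omop_voc.drug_strength
FROM '/data/vocabs/DRUG_STRENGTH.csv'
WITH DELIMITER E'\t' CSV HEADER QUOTE E'\b';
-- 4. Only now apply the foreign key constraints from the DDL.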
Hi @rookie_crewkie, I have already loaded my data into the CDM tables and would like to update my vocab version to Jan 2020 now. Mine is not a fresh PostgreSQL install.
Since my vocab tables already had the Aug 2019 data, I have truncated them now, so the constraints are still in place.
So now, to re-upload the new vocab files, are you suggesting that I drop the constraints and add them again after the upload?
Or will just uploading the concept table first, instead of drug_strength, resolve the issues?
You'll have to drop the foreign key constraints at least, because for some tables they reference each other. For example, see 'fpk_concept_domain' and 'fpk_domain_concept' here: since you've truncated both tables already, Postgres won't allow you to load either CONCEPT (because it would break 'fpk_concept_domain') or DOMAIN (which breaks 'fpk_domain_concept'). The readme file mentions in #5 that foreign key constraints should be applied after loading the data; the 'Note' section there is actually misleading, because it's not feasible to load the data with the constraints in place.
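For example, breaking the CONCEPT/DOMAIN cycle could look like this (a sketch assuming the standard CDM v5 constraint definitions and the omop_voc schema from above; check your DDL for the exact names and columns):

-- Drop the mutually referencing constraints before loading:
ALTER TABLE omop_voc.concept DROP CONSTRAINT fpk_concept_domain;
ALTER TABLE omop_voc.domain DROP CONSTRAINT fpk_domain_concept;
-- ...load CONCEPT and DOMAIN via COPY as shown above...
-- Re-apply them once both tables are populated:
ALTER TABLE omop_voc.concept
ADD CONSTRAINT fpk_concept_domain
FOREIGN KEY (domain_id) REFERENCES omop_voc.domain (domain_id);
ALTER TABLE omop_voc.domain
ADD CONSTRAINT fpk_domain_concept
FOREIGN KEY (domain_concept_id) REFERENCES omop_voc.concept (concept_id);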
A side note on vocabulary updates: since you use ATLAS, you might want to keep the vocabularies that were used during the ETL next to the clinical data, and keep them read-only. The reason is that CDMs are tightly coupled with the vocabularies they were built with, especially when you use not only the OHDSI-provided concepts but also custom ones created specifically for your data.
But vocabularies constantly change: new concepts are added, some old ones are deprecated, some are made non-standard, some change domains, etc. (you can track the details of the changes here). Therefore, it's usually safer to rerun the ETL with the new vocabs if you want to use the latest ones, because you won't run into hard-to-debug situations where, e.g., a concept has changed domain but the records in your CDM have not. You can, of course, write a migration script which will update the CDM according to the vocabulary changes, but this is a more error-prone path in general (though it might be preferable if your ETL takes ages).
I think this whole thing was described somewhere in the Book of OHDSI; you might want to check it out.
So, instead of truncating the tables, consider the following approach:
1. Create a new, separate schema for the new version of the vocabularies and load the new data there (see the sketch after this list).
2. Rerun the ETL of your clinical data using the new vocabularies instead of the ones used before.
3. If you don't need the old vocabularies, drop them and use the updated ones going forward.
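A quick sketch of the first step (the schema name here is just an illustration):

-- Create a separate schema for the new vocabulary version:
CREATE SCHEMA omop_voc_202001;
-- Run the OHDSI DDL against this schema to create the tables, then load:
COPY omop_voc_202001.concept
FROM '/data/vocabs/CONCEPT.csv'
WITH DELIMITER E'\t' CSV HEADER QUOTE E'\b';
-- ...repeat for the other vocabulary files, then apply the constraints...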
This process may look too cumbersome, but it saves you a lot of potential headache with the updates.