Duplicate Entries in Concept Synonym

DTorok · June 20, 2016, 12:43pm

select count(*) from concept_synonym
– return 4469624 rows

SELECT count()
FROM
(
select distinct concept_synonym.
FROM concept_synonym
)
– return 4423742

That is 45,882 non unique rows

Suggest adding PK
ALTER TABLE concept_synonym ADD CONSTRAINT pk_concept_synonym PRIMARY KEY (concept_id,concept_synonym_name,language_concept_id);

Timur · June 20, 2016, 9:37pm

fixed in latest release

MarioVillageMD · September 25, 2017, 5:27pm

I loaded the latest data model (5.2) and I do not see any unique constraints for the concept_synonym table?

I did notice that the concept_synonym.csv export did not contain duplicates this time compared to the first time we downloaded it. The first time we selected more vocabularies compared two only the CVX & ICD10 vocabularies this time.

I’m wondering what was fixed.

Christian_Reich · September 25, 2017, 5:34pm

@MarioVillageMD:

Well, the constraint was added, so duplicate entries cannot be loaded anymore. Was that your question?

MarioVillageMD · September 25, 2017, 6:57pm

@Christian_Reich , thank you for your reply!

I dont see a UNIQUE constraint in the SQL scripts for postgres.
OmopConstraints5.2ForPostgresql

The only constraint I see is to ensure the concept_id exists in the concepts table:

ALTER TABLE concept_synonym ADD CONSTRAINT fpk_concept_synonym_concept FOREIGN KEY (concept_id)  REFERENCES concept (concept_id);

Am I missing something?

I’m looking into this because the original CONCEPT_SYNONYM file we loaded (a year ago) had duplicate identical rows. The file I pulled down from the vocabulary download page (this week) did not have any duplicates. I am wondering if the downloads in the past had a bug that caused the files to contain duplicate CONCEPT_SYNONYMS.

Christian_Reich · September 26, 2017, 1:11pm

@MarioVillageMD:

I see. Well, here is what happens: The vocabulary gets built in a separate process to be called PALLAS and the constraint was introduced there. PALLAS doesn’t use the public DDL files. Which means, the constraint never made it into the DDLs. But since no user in his right mind builds the vocabs, but instead downloads it from ATHENA, it doesn’t really matter that much. An additional constraint in people’s instances would be another good quality check, though.

But: Since it is a Open Source Project - would you mind putting a pull request in and adding the constraint?

MarioVillageMD · September 26, 2017, 2:08pm

@Christian_Reich:
I would love to contribute. Thank you for the information. I will look into making the pull request.

in case anybody hits this thread in the future, this is how i removed the duplicates in Postgresql:

CREATE TABLE concept_synonym_temp as SELECT * FROM concept_synonym; 
TRUNCATE TABLE concept_synonym;
INSERT INTO (concept_id, concept_synonym_name, language_concept_id)
    SELECT concept_id, concept_synonym_name, language_concept_id 
    FROM concept_synonym_temp 
    GROUP BY concept_id, concept_synonym_name, language_concept_id;
DROP TABLE concept_synonym_temp;

Since there is no primary key on concept_synonym, I added a unique constraint on all columns:

ALTER TABLE concept_synonym ADD CONSTRAINT uq_concept_synonym
UNIQUE (concept_id, concept_synonym_name, language_concept_id);