A colleague new to the community pointed out that in the latest vocabulary, there are records in the CONCEPT table that are seemingly ‘redundant’, in that they have identical values for all fields, but are assigned different concept_ids. I independently wrote code to look into it, and I only see 3 such instances.
So, comment #1: I think in these 3 instances, we probably want to deprecate one of the records.
Comment #2: it might be useful to use this check as a diagnostic in our vocabulary testing procedure.
Comment #3: several concepts look ‘pretty close’ but are slightly different (ex: concept codes being case sensitive, valid start dates being different), so heads up to all those using the vocabulary that it’s important to look across the full record when evaluating a concept.
Here was my script to check for ‘duplicates’:
--look for duplicate conceptids
select * from
(
select concept_id, count(concept_id) as num_records
from concept
group by concept_id
) t1
where num_records > 1;
–find concepts with similar characteristics (except for concept_id)
select * from
(
select vocabulary_id, concept_name, domain_id, concept_class_id, standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason, count(concept_id) as num_records
from concept
group by vocabulary_id, concept_name, domain_id, concept_class_id, standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason
) t1
where num_records > 1
order by vocabulary_id, concept_name, domain_id, concept_class_id, standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason;
–find all records for concepts with similar characteristics
select c1.* from
(
select vocabulary_id, concept_name, domain_id, concept_class_id, standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason, count(concept_id) as num_records
from concept
group by vocabulary_id, concept_name, domain_id, concept_class_id, standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason
) t1
inner join
concept c1
on t1.vocabulary_id = c1.vocabulary_id and t1.concept_name = c1.concept_name and c1.concept_code = t1.concept_code
where num_records > 1
order by c1.vocabulary_id, c1.concept_name;
And here are the 3 ‘problem children’ concepts I identified:
concept_id concept_name domain_id vocabulary_id concept_class_id standard_concept concept_code valid_start_date valid_end_date invalid_reason
45756774 Female Pelvic Medicine and Reconstructive Surgery Provider Specialty ABMS Specialty S OMOP generated 1970-01-01 2099-12-31 NULL
45756773 Female Pelvic Medicine and Reconstructive Surgery Provider Specialty ABMS Specialty S OMOP generated 1970-01-01 2099-12-31 NULL
44819282 Canonical Unit Metadata Concept Class Concept Class NULL OMOP generated 1970-01-01 2099-12-31 NULL
44819053 Canonical Unit Metadata Concept Class Concept Class NULL OMOP generated 1970-01-01 2099-12-31 NULL
46233686 Patient moved to Metadata Relationship Relationship S OMOP generated 1970-01-01 2099-12-31 NULL
46233680 Patient moved to Metadata Relationship Relationship S OMOP generated 1970-01-01 2099-12-31 NULL
46233687 Patient moved to Metadata Relationship Relationship S OMOP generated 1970-01-01 2099-12-31 NULL