I did some checking and found that the reason for the duplications is as follows: SemMed codes all named entities using UMLS concepts, and many UMLS concepts can be mapped to terms in multiple vocabularies. This leads to multiple ways to get to an RxNorm drug and SNOMED HOI. E.g., in the example below, the UMLS concept is C0033953 (sexual dysfunction), which is mapped to:
---------------------
 concept_id | concept_name                       | domain_id | vocabulary_id | concept_class_id | standard_concept | concept_code | valid_start_date | valid_end_date | invalid_reason
------------+------------------------------------+-----------+---------------+------------------+------------------+--------------+------------------+----------------+----------------
 36919181   | Psychosexual disorder              | Condition | MedDRA        | PT               | C                | 10037222     | 1970-01-01       | 2099-12-31     |
 45611093   | Sexual Dysfunctions, Psychological | Condition | MeSH          | Main Heading     |                  | D020018      | 1970-01-01       | 2099-12-31     |
-----------------
Thus, the SPARQL query that pulls the count data should apply DISTINCT to the drug, HOI, modality, and study type. It currently does not, because I include the oa id (?an), which can produce different results depending on how the UNION operates:
---------------------
PREFIX ohdsi:<http://purl.org/net/ohdsi#>
PREFIX oa:<http://www.w3.org/ns/oa#>
PREFIX meddra:<http://purl.bioontology.org/ontology/MEDDRA/>
PREFIX ncbit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX poc: <http://purl.org/net/nlprepository/ohdsi-pubmed-semmed-poc#>
SELECT count(distinct ?an) ?drug ?hoi ?modality ?studyType
FROM <http://purl.org/net/nlprepository/ohdsi-pubmed-semmed-poc>
WHERE {
?an a ohdsi:SemMedDrugHOIAnnotation;
oa:hasTarget ?target;
oa:hasBody ?body.
?target ohdsi:MeshStudyType ?studyType.
?body poc:modality ?modality.
{?body ohdsi:ImedsDrug ?drug.}
UNION { ?body ohdsi:adeAgents ?agents. ?agents ohdsi:ImedsDrug ?drug. }
{?body ohdsi:ImedsHoi ?hoi.}
UNION { ?body ohdsi:adeEffects ?effects. ?effects ohdsi:ImedsHoi ?hoi. }
}
------------
I will have to fix this, but there is no time before the F2F. In the meantime, I will remove duplicates at the WebAPI level until I run this corrected query and reload the SemMed data:
-----------------------
PREFIX ohdsi:<http://purl.org/net/ohdsi#>
PREFIX oa:<http://www.w3.org/ns/oa#>
PREFIX meddra:<http://purl.bioontology.org/ontology/MEDDRA/>
PREFIX ncbit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX poc: <http://purl.org/net/nlprepository/ohdsi-pubmed-semmed-poc#>
SELECT count(distinct ?drug ?hoi ?modality ?studyType)
FROM <http://purl.org/net/nlprepository/ohdsi-pubmed-semmed-poc>
WHERE {
?an a ohdsi:SemMedDrugHOIAnnotation;
oa:hasTarget ?target;
oa:hasBody ?body.
?target ohdsi:MeshStudyType ?studyType.
?body poc:modality ?modality.
{?body ohdsi:ImedsDrug ?drug.}
UNION { ?body ohdsi:adeAgents ?agents. ?agents ohdsi:ImedsDrug ?drug. }
{?body ohdsi:ImedsHoi ?hoi.}
UNION { ?body ohdsi:adeEffects ?effects. ?effects ohdsi:ImedsHoi ?hoi. }
}
------------
Btw, the reason for the difference in counts is that there is one article tagged with UMLS C0020594 (hypoactive sexual desire disorder) that gets mapped to MeSH D020018. Right now, it gets picked up by one oa but not the other. This will be merged into the distinct set when the query is corrected.
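
For anyone curious what the interim de-duplication could look like, here is a minimal sketch in Python. It is not the actual WebAPI code: the `CountRow` type, its field names, and the `dedupe_rows` helper are hypothetical, and the sample values are placeholders rather than real data. It also assumes that when duplicate rows disagree on the count, the larger count is the one to keep, which is a guess on my part about what "merged" should mean until the data is reloaded.

```python
from typing import Iterable, List, NamedTuple


class CountRow(NamedTuple):
    """One row of count data (hypothetical field names)."""
    drug: str
    hoi: str
    modality: str
    study_type: str
    count: int


def dedupe_rows(rows: Iterable[CountRow]) -> List[CountRow]:
    """Collapse rows that share (drug, hoi, modality, study_type).

    Assumption: when duplicates disagree on count, keep the larger one.
    """
    best = {}  # key -> CountRow; dicts keep first-insertion order
    for row in rows:
        key = (row.drug, row.hoi, row.modality, row.study_type)
        kept = best.get(key)
        if kept is None or row.count > kept.count:
            best[key] = row
    return list(best.values())


# Placeholder values for illustration only, not real SemMed output.
rows = [
    CountRow("drug-A", "hoi-X", "positive", "Case Reports", 3),
    CountRow("drug-A", "hoi-X", "positive", "Case Reports", 4),
    CountRow("drug-B", "hoi-X", "positive", "Case Reports", 1),
]
deduped = dedupe_rows(rows)
```

Keying on the four-field tuple mirrors the DISTINCT that the corrected query is meant to apply, so the workaround can be dropped once the data is reloaded.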