LAERTES Discussion Thread

Vojtech_Huser · July 23, 2015, 3:58pm

There is a LAERTES Minutes thread, and I thought a good way to share updates with the LAERTES team would be to start a general thread for LAERTES here.

I have been working on integrating ClinicalTrials.gov data into LAERTES.
I recently made some more incremental progress.

We can extract links from trials to CDM-Vocabulary (CDMV) coded drugs. For example here:

drug_CID    HOI_CID    trial    drug_name
1338512    0    NCT00003907    doxorubicin
1389036    0    NCT00003907    mitomycin
1344381    0    NCT00004054    bicalutamide
1378382    0    NCT00004054    paclitaxel
1350504    0    NCT00004054    etoposide
1356461    0    NCT00004054    flutamide
19012585    0    NCT00004228    asparaginase
1310317    0    NCT00004228    cyclophosphamide
1437379    0    NCT00004228    thioguanine
1436650    0    NCT00004228    mercaptopurine
1311078    0    NCT00004228    cytarabine
1518254    0    NCT00004228    dexamethasone
1305058    0    NCT00004228    methotrexate
1551099    0    NCT00004228    prednisone
1310317    0    NCT00004563    cyclophosphamide

We can have 3000+ rows like these, with cleanly NLP-extracted concepts (CID = concept_id).
(All rows come from trials that have results (hence AEs), and all contain concepts that we reliably linked to CDMV CIDs.)

The problems with the HOI part (health outcome of interest) are the following: (based on manual review of some of the rows)

It is difficult to reliably “computationally parse” the HOIs in CT.gov.

For example, the problems are:

drug1 and drug2

An arm (or a whole single-arm trial) can have 2 drugs being given.
Then it is hard (impossible) to know which drug exactly caused the AE.

drug1 or drug2

The trial arm is titled: Ciprofloxacin or Ofloxacin
(both drugs are detected by NLP), so it is impossible to assign the AE to either one.
https://clinicaltrials.gov/ct2/show/results/NCT00002850?sect=X30156#evnt

drug and radiation

The AE can be due to a non-drug intervention (radiation), and each arm
has drug + radiation.
https://clinicaltrials.gov/ct2/show/results/NCT00003377?sect=X30156#evnt

WHEN IT IS EASY: (and possible)

clinical trial with one intervention (no non-drug) and with one
group defined
https://clinicaltrials.gov/ct2/show/results/NCT00001941?sect=X30156#evnt
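
The hard and easy cases listed above could be flagged automatically before any AE is attributed to a drug. Below is a minimal sketch, assuming each arm has already been reduced to a list of NLP-detected drug names and a list of non-drug interventions. The `Arm` class, its field names, and the status labels are hypothetical and are not part of LAERTES or the CT.gov schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Arm:
    """Hypothetical, simplified view of one CT.gov trial arm."""
    title: str
    drugs: List[str] = field(default_factory=list)      # drug names detected by NLP
    non_drug: List[str] = field(default_factory=list)   # e.g. radiation


def attribution_status(arm: Arm) -> str:
    """Say whether an AE reported for this arm can be tied to one drug."""
    if not arm.drugs:
        return "no_drug"
    if arm.non_drug:
        return "drug_plus_non_drug"   # the "drug and radiation" case
    if len(arm.drugs) > 1:
        return "multiple_drugs"       # the "drug1 and drug2" / "drug1 or drug2" cases
    return "single_drug"


def trial_is_easy(arms: List[Arm]) -> bool:
    """The easy case: exactly one group, one drug, no non-drug intervention."""
    return len(arms) == 1 and attribution_status(arms[0]) == "single_drug"


cipro_arm = Arm("Ciprofloxacin or Ofloxacin", drugs=["ciprofloxacin", "ofloxacin"])
print(attribution_status(cipro_arm))  # multiple_drugs
```

Only arms with status `single_drug` would yield a drug-HOI row automatically. Everything else would need manual review or would be dropped.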

CONCLUSION

For LAERTES, we may either have to give up on CT.gov as a data provider, since it cannot always reliably produce a drug-HOI[-trial] data row.

Or we could introduce a “semi-automatic” mode of evidence review: we give you
links to disambiguate manually (and even a human will not be able to disambiguate
in many cases, e.g., radiation + drug).

We would have a new table in the LAERTES schema for that, with only 2 key columns
(drug_CID, link_out_link (for manual review; leading to the CT.gov AE tables and trial arm structure)).
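
As a rough illustration of the two-column table proposed above, here is a sketch using an in-memory SQLite database. The table name `ctgov_review_link` is made up for this example. The URL pattern is copied from the example links earlier in this post and is only assumed to hold for other trials.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ctgov_review_link (      -- hypothetical table name
        drug_cid      INTEGER NOT NULL,   -- CDM Vocabulary concept_id of the drug
        link_out_link TEXT    NOT NULL,   -- URL to the CT.gov AE tables for manual review
        PRIMARY KEY (drug_cid, link_out_link)
    )
    """
)


def review_link(nct_id: str) -> str:
    """Build a link to the AE section of a trial's results page (assumed pattern)."""
    return f"https://clinicaltrials.gov/ct2/show/results/{nct_id}?sect=X30156#evnt"


# Two of the drug-trial rows from the extract above.
pairs = [(1338512, "NCT00003907"), (1389036, "NCT00003907")]
conn.executemany(
    "INSERT INTO ctgov_review_link VALUES (?, ?)",
    [(cid, review_link(nct)) for cid, nct in pairs],
)
conn.commit()
```

A reviewer would then query by `drug_cid` and follow each link to inspect the arms and AE tables by hand.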

Christian_Reich · August 15, 2015, 1:30pm

@Vojtech_Huser:

Why would this problem be any different from the other data sources? For example, for combination products you cannot assign AEs to either of the ingredients. It is probably a good approximation to assume that the vast majority of AEs are caused by a single ingredient, and that an AE caused only by a drug-drug combination is a rare exception.

Why wouldn’t we limit Laertes to singletons for the time being?
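
Limiting to singletons could look roughly like the sketch below, which keeps only rows from trials linked to exactly one drug concept. The rows are a subset of the extract at the top of the thread. The function name is hypothetical, and a trial that looks like a singleton in an excerpt may have more drugs in the full table.

```python
from collections import defaultdict

# (drug_CID, trial, drug_name) rows taken from the extract in the first post.
rows = [
    (1338512, "NCT00003907", "doxorubicin"),
    (1389036, "NCT00003907", "mitomycin"),
    (1344381, "NCT00004054", "bicalutamide"),
    (1378382, "NCT00004054", "paclitaxel"),
    (1310317, "NCT00004563", "cyclophosphamide"),
]


def singleton_rows(rows):
    """Keep rows whose trial is linked to exactly one distinct drug concept."""
    drugs_by_trial = defaultdict(set)
    for drug_cid, trial, _name in rows:
        drugs_by_trial[trial].add(drug_cid)
    return [row for row in rows if len(drugs_by_trial[row[1]]) == 1]


print(singleton_rows(rows))
# [(1310317, 'NCT00004563', 'cyclophosphamide')]
```

This filter only looks at drug counts. It does not exclude trials where a single drug is combined with a non-drug intervention such as radiation.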

rkboyce · August 17, 2015, 8:56pm

Let’s keep this as a regular discussion item at the Laertes WG meetings. I would like to have the singletons in place for the next release, set for 10/1.

Vojtech, have you checked with Hua Xu about how they handled these issues and if we can use any of their tools or data?

Vojtech_Huser · August 27, 2015, 1:40pm

A reply from a team at U of Texas was:

They produce the CATTLE database at this URL

We do not extract AE from clinicaltrials.gov. We did some work on
identification of trials about drugs-treat-cancer, which has similar problems
about multi-drug mentions for cancer trials.

More replies:

We had a similar problem
in our project to collect all the drugs for cancer treatment from
ClinicalTrials.gov.

For the drugs in the ‘intervention’ section of a cancer treatment trial,
it is not always true that all the drugs are for cancer
treatment. Some of the drugs might be for treating pain or side
effects. Take a look at the following trials for example:

ClinicalTrials.gov (single
drug example): the drug ONO-7746 is to deal with
thrombocytopenia caused by chemotherapy and not to treat cancer.

ClinicalTrials.gov
(multiple drug example): the drug tacrolimus is an
immunosuppressive drug for care after organ transplant. All other
drugs are for cancer treatment.

We developed a simple scoring system that assigns scores to the drugs according
to their likelihood to be for cancer treatment. The system utilizes
co-occurrences of cancer and drug names in the title and purpose sections of
the trials, and prior knowledge about the drugs’ original indication.
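
The reply does not give the details of the scoring system, so the following is only a guess at its general shape: one point for each section (title, purpose) where the drug and a cancer term co-occur, plus one point if the drug is already known as a cancer drug. The term list, the weights, the function names, and the example text are all invented for illustration.

```python
import re

# Illustrative only; a real system would use a proper cancer vocabulary.
CANCER_TERMS = ("cancer", "carcinoma", "leukemia", "lymphoma", "tumor")


def mentions(text: str, term: str) -> bool:
    """Case-insensitive whole-word search for term in text."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def cancer_treatment_score(drug, title, purpose, known_cancer_drugs):
    """Crude score for how likely it is that the drug is the cancer treatment."""
    score = 0.0
    for section in (title, purpose):
        # Co-occurrence at the section level: drug and any cancer term both present.
        if mentions(section, drug) and any(mentions(section, t) for t in CANCER_TERMS):
            score += 1.0
    if drug.lower() in known_cancer_drugs:   # prior knowledge of original indication
        score += 1.0
    return score


# Invented trial text, not taken from a real record.
title = "Paclitaxel for advanced prostate cancer"
purpose = "Evaluate paclitaxel in prostate cancer; ondansetron is given for nausea."
known = {"paclitaxel"}

print(cancer_treatment_score("paclitaxel", title, purpose, known))   # 3.0
print(cancer_treatment_score("ondansetron", title, purpose, known))  # 1.0
```

The supportive-care drug still picks up a point from section-level co-occurrence, which shows why prior knowledge about the original indication is needed as a second signal.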

They asked a follow-up question:

It seems that your
current goal is to find out ‘HOI’ for each drug in the trials. Could you
elaborate on your current goal, so we can discuss more about the
problem? I’m sure that similar problem will occur for other categories of
‘HOI’ as well.

Vojtech: It is interesting that you also ran into the problem of drug-generalHOI vs drug-supposedToTreatDx(HOI) (in your case, supposedToTreatCancer). I think we need to distinguish these two types of HOIs in LAERTES somehow as well (which we currently don’t).

Getting the trial-drug pair right is phase 1 for us.
Getting the drug-AE pair right is phase 2. For phase 2, many of the AEs are strings that exactly match MedDRA terms, and MedDRA is in the CDM Vocabulary, so the NLP part may be easier there.
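
For phase 2, the exact-match step could be as simple as a normalized dictionary lookup, as sketched below. The function names are hypothetical and the concept IDs are placeholders, not real CDM Vocabulary values. In practice the index would be built from the MedDRA rows of the vocabulary's concept table.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(term.casefold().split())


def build_index(meddra_terms):
    """Map normalized MedDRA term name -> concept_id."""
    return {normalize(name): cid for name, cid in meddra_terms}


def match_ae_strings(ae_strings, index):
    """Split AE strings into exact matches and leftovers that need NLP or review."""
    matched, unmatched = {}, []
    for ae in ae_strings:
        cid = index.get(normalize(ae))
        if cid is None:
            unmatched.append(ae)
        else:
            matched[ae] = cid
    return matched, unmatched


# Placeholder concept IDs, for illustration only.
meddra_terms = [("Nausea", 900001), ("Neutropenia", 900002)]
index = build_index(meddra_terms)

matched, unmatched = match_ae_strings(["nausea", "Neutropenia ", "low white cells"], index)
print(matched)    # {'nausea': 900001, 'Neutropenia ': 900002}
print(unmatched)  # ['low white cells']
```

Strings left in `unmatched` are the ones that would still need NLP or manual mapping.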

herrcerd · August 27, 2015, 2:10pm

For what it is worth: we have recently finished an integration with the CT.gov dataset (not with CDM - internal system). The following are a few items we discovered that were tricky with respect to MedDRA AE matching

For historical trial data - MedDRA versions can be quite old (12.1, 13.1, 14.1 are common). This causes term currency issues when attempting to map only to PTs.
AE data are reported at both SOC and PT level (again, versioning can create a situation where a PT at time results were recorded has now been demoted to LLT).
We found some cases where the category.title of the event_list were not actual SOCs (and not the standard Total… labels).

For drugs, we used the intervention_browse table in the csv dataset (which is pretty poor quality, honestly) to help us build a classifier for mesh => rxnorm cui to at least “hint” at proper mappings. We still had to do A TON of manual curation; as you stated, the free-form text nature of the arm_groups / outcome groups introduces a ton of problems for NLP solutions (negation, y or x, placebo, etc.).