
DrugBank

On the LAERTES project we are trying to bring the VigiBase data into the evidence base. VigiBase uses WHO Drug, and there are licensing concerns to work around. However, there seems to be a way that we could use DrugBank to get around the issue. Many other projects would also benefit from having DrugBank drug IDs loaded into the Standard Vocabulary. This has been discussed before, and I think it was decided to move forward, but the work was then set aside because of other priorities. Here I am officially requesting that the DrugBank drug IDs be loaded, and also suggesting how.

My lab has worked with DrugBank extensively on our Linked SPLs project (https://dbmi-icode-01.dbmi.pitt.edu/linkedSPLs/). As a result, we have a high-quality mapping of DrugBank drug IDs to RxNorm, created using the FDA’s UNII system, InChI keys, and some heuristics. A mapping created with DrugBank 4.1 (current DrugBank is 4.3) is available in the LAERTES GitHub repository:

The process can be more or less automated so that it runs whenever DrugBank provides updated XML files. I can provide the details to help get this process going.

best,
-R

Rich:

You got it. We’ll put it in. We’ll probably have a few questions when we start.

@rkboyce

Actually, here is the first one: it looks like you only published the final mapping tables, not the scripts. True? If so, they will inevitably get stale. Usually, we publish on GitHub all the scripts that make the mapping tables, so they can be re-run. Sometimes there are manual steps in them (adding new mappings between certain things). Thoughts?

I posted the link to the old Google Code repository with the mapping scripts. I will ask my programmer to filter that down to the essential components and post them on GitHub. Would you like them under https://github.com/OHDSI/Vocabulary-v5.0 ?

@rkboyce:

Got it. We need to figure out how to incorporate that into the vocab building process. How? Reverse engineering is not popular. :smile:

Where is your process documented? It will probably be quicker for me to learn your process and port our scripts than to have your team reverse-engineer them.

@Christian_Reich - Well, I felt bad for about a second when I did not get a reply and thought to myself, “maybe Christian does not want to answer because it is clear on the Vocab GitHub site how the vocab is generated…” So, I checked the link to the process on the GitHub page and was greeted with “You’ve followed a link to a topic that doesn’t exist yet.” :smile:

So, scanning a few sub-folders in the Vocab repository tells me that your preferred approach is heavily SQL-driven and requires that a new vocab source be loadable into a DBMS and then mapped through SQL commands. Our current process for creating the RxNorm-to-DrugBank mapping is not hard (see below).
I could imagine that, to get into the Vocab, the same process could be used with an additional step that loads the file output of step 4 into the Vocab and then uses the RxNorm codes in the table to map to DrugBank IDs via OMOP concept source codes:

  1. Get the DrugBank update - it is provided as a large XML file: http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip

  2. Get the FDA Substance Registration System UNII table update - it is provided as a CSV file: http://fdasis.nlm.nih.gov/srs/jsp/srs/uniiListDownload.jsp

  3. Using the following Python script, we load the UNII data and DrugBank data and then create the mapping using InChI keys and exact string matches. The script can be run from the command line with an option to create mappings using drug name synonyms, or not: http://swat-4-med-safety.googlecode.com/svn/trunk/linkedSPLs/ChEBI-DrugBank-bio2rdf-mapping/scripts/parseDBIdBySynsInchiName.py

  4. Some simple Linux command-line parsing produces the final CSV data file; e.g., see http://swat-4-med-safety.googlecode.com/svn/trunk/linkedSPLs/ChEBI-DrugBank-bio2rdf-mapping/UNII-data/README
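For readers who want a feel for the matching in step 3, here is a minimal sketch of the idea (this is NOT the actual parseDBIdBySynsInchiName.py script; the XML layout, CSV column names, and sample rows below are simplified assumptions for illustration, and the real DrugBank schema and UNII file are much richer):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for drugbank.xml: one <drug> per compound with its
# DrugBank ID, preferred name, and InChIKey (the real schema differs).
DRUGBANK_XML = """<drugbank>
  <drug>
    <drugbank-id>DB00316</drugbank-id>
    <name>Acetaminophen</name>
    <inchikey>RZVAJINKPMORJF-UHFFFAOYSA-N</inchikey>
  </drug>
  <drug>
    <drugbank-id>DB01050</drugbank-id>
    <name>Ibuprofen</name>
    <inchikey>HEFNNWSXXWATRW-UHFFFAOYSA-N</inchikey>
  </drug>
</drugbank>"""

# Hypothetical stand-in for the FDA SRS UNII list (UNII, preferred term,
# InChIKey); real column names in the download may differ.
UNII_CSV = """UNII,PT,INCHIKEY
362O9ITL9D,ACETAMINOPHEN,RZVAJINKPMORJF-UHFFFAOYSA-N
WK2XYI10QM,IBUPROFEN,HEFNNWSXXWATRW-UHFFFAOYSA-N
"""

def map_drugbank_to_unii(drugbank_xml, unii_csv):
    """Return {drugbank_id: unii}, preferring InChIKey matches and
    falling back to case-insensitive exact name matches."""
    by_inchikey, by_name = {}, {}
    for row in csv.DictReader(io.StringIO(unii_csv)):
        by_inchikey[row["INCHIKEY"]] = row["UNII"]
        by_name[row["PT"].lower()] = row["UNII"]

    mapping = {}
    for drug in ET.fromstring(drugbank_xml).iter("drug"):
        db_id = drug.findtext("drugbank-id")
        inchikey = drug.findtext("inchikey")
        name = (drug.findtext("name") or "").lower()
        # InChIKey match first; exact name match as a fallback.
        unii = by_inchikey.get(inchikey) or by_name.get(name)
        if unii is not None:
            mapping[db_id] = unii
    return mapping

print(map_drugbank_to_unii(DRUGBANK_XML, UNII_CSV))
# → {'DB00316': '362O9ITL9D', 'DB01050': 'WK2XYI10QM'}
```

Once DrugBank IDs are keyed to UNIIs (and from there to RxNorm), step 4 is just flattening that dictionary into the final CSV.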

There you have it. What are your thoughts about how we should proceed?

thanks,
-R

@rkboyce:

Yikes. The page moved. But where? :slight_smile: I’ll get back to you.

Christian was just travelling around like nuts. But this page actually is pretty clean and does explain how to do it.

Actually, let us take a look. You are right, the current process is heavily SQL-based. Reason is that we use a ton of standard scripts to do the housekeeping: proper life cycle (deprecation, upgrades), mapping rules, referential integrity of various kinds. If we had to redo this for every vocab we would drown. So, I think we should take a look at what you are doing and then talk about options.
