
Regex for sig

(Melanie Philofsky) #1

Colorado’s OMOP instance is a fully identified dataset, but we need to deliver a de-identified dataset to one of our researchers. The sig is necessary for their research. We have tried to de-identify the sig, but it still contains PHI. The sig follows a general grammatical pattern: Verb (take, apply, inhale, etc.), Amount (1, 15, 100), sometimes a Unit for the number (mg, puffs, drops), Frequency (daily, twice a day, once a week), etc. Knowing this, we could whitelist keywords for each category (begged, borrowed, or reused), plus prepositions and other “safe” words. Has anyone attempted to do this? Any lessons learned? Pointers? General help?
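For readers who want to picture the whitelist idea, here is a minimal Python sketch. It is not anyone's production approach: the word list, the `[REDACTED]` placeholder, and the rule that only numbers of up to four digits are kept are all assumptions made for illustration. Numbers can themselves be PHI (dates, phone numbers), so a real whitelist would need review against the actual data.

```python
import re

# Hypothetical whitelist; a real one would be far larger and hand-reviewed.
SAFE_WORDS = {
    "take", "apply", "inhale",                       # verbs
    "mg", "puffs", "drops", "tablet", "pill",        # units and forms
    "daily", "twice", "once", "a", "day", "week",    # frequency words
    "every", "hours", "by", "mouth", "for", "as", "needed",
}

# Assumption: short numbers are doses or counts; longer digit runs are redacted.
NUMBER = re.compile(r"\d{1,4}(?:\.\d+)?")
# Split into words, numbers, and single punctuation characters.
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|\S")


def whitelist_sig(sig, placeholder="[REDACTED]"):
    """Keep whitelisted words, short numbers, and punctuation; redact the rest."""
    kept = []
    for tok in TOKEN.findall(sig):
        is_safe_word = tok.lower() in SAFE_WORDS
        is_short_number = NUMBER.fullmatch(tok) is not None
        is_punctuation = not tok.isalnum()
        if is_safe_word or is_short_number or is_punctuation:
            kept.append(tok)
        else:
            kept.append(placeholder)
    return " ".join(kept)
```

Anything not explicitly allowed is replaced, so a missing keyword costs some utility but does not leak text.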

(Mark Danese) #2

I am pretty sure that this is how CPRD does their prescription data. You might search for information about how they do it.

(Sigfried Gold) #3

My Master’s thesis was a tool for extracting structured script info from sigs based on a big hierarchical regex. It’s been cited regularly over the years, and some of the citing papers might have better techniques, but it is open source, including the big yaml regex: https://github.com/Sigfried/merki/blob/master/drugParseRules.yaml.

Let me know if you have any questions about it.

(Christian Reich) #4


It looks cool, but do you have some kind of description of what it does and how it does it? It starts with the words “trumps” and “trumper”. I only know “Nevertrumper”. :slight_smile:

(Chris Roeder) #5

@Sigfried_Gold That’s an interesting project. Is the gold standard data available?

(Sigfried Gold) #6


@Christian_Reich: the linked YAML file is a set of hierarchical regular expressions – but not in a particularly helpful order for reading. Those two trump entries mean: 1) if the same chunk of text matches the expressions for both drug and possible drug, process it as a drug; and 2) if the same chunk of text matches both date and number, date should win.

Probably the best place to start reading would be at line 171:

   - drug
   - possibleDrug
   - context

This instructs the parser to process only text chunks that match the expressions drug, possibleDrug, and context. drug is then defined at line 82 as

- name: drug
    - 'drugname(\b\s*drugInfo)?'
    - 'doseOf\b\s*?drugname(\b\s*instructions)?'

which says that a drug can be a drugname followed by zero or one drugInfo OR a doseOf followed by a drugname followed by zero or one instructions. (The lazy \s*? applies to the whitespace between doseOf and drugname, not to doseOf itself.)
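To make the idea of a hierarchical regex concrete, here is a small Python sketch that expands named rules into one plain regular expression. The original tool is written in Perl and may work quite differently; apart from the two `drug` patterns and the `drugname` pattern quoted in this thread, every rule body below is invented for illustration.

```python
import re

# Toy rule set. Only the "drug" and "drugname" patterns come from the thread;
# the other definitions are made-up stand-ins.
RULES = {
    "drug": [
        r"drugname(\b\s*drugInfo)?",
        r"doseOf\b\s*?drugname(\b\s*instructions)?",
    ],
    "drugname": [r"D\d+D+"],
    "drugInfo": [r"\d+\s*mg"],
    "doseOf": [r"\d+\s*mg\s+of"],
    "instructions": [r"(?:once|twice)\s+daily"],
}

# Match a rule name only when it is a whole word and not part of an escape
# such as \b or \s (hence the backslash in the lookbehind).
NAME_RE = re.compile(
    r"(?<![\\A-Za-z])(" + "|".join(map(re.escape, RULES)) + r")(?![A-Za-z])"
)


def expand(name, rules=RULES):
    """Recursively replace rule names with their alternatives."""
    alternatives = [
        NAME_RE.sub(lambda m: expand(m.group(1), rules), pattern)
        for pattern in rules[name]
    ]
    return "(?:" + "|".join(alternatives) + ")"


drug_regex = re.compile(expand("drug"))
```

With these toy rules, `drug_regex.fullmatch("D42D 200 mg")` matches through the first alternative and `drug_regex.fullmatch("50 mg of D42D twice daily")` through the second.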

Based on that explanation, if you understand regular expressions, you should be able to understand most of that file, EXCEPT drugname, which is mysteriously defined as

drugname:           ['D\d+D+']

This is because drugnames matched from the supplied druglist (extracted from RxNorm in 2008) are replaced in a first pass through the text being processed with a more easily matchable drug ID and that weird expression matches one of those drug IDs. (If you’re curious, some explanation is available in the parsing code.)
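The first-pass substitution can be pictured with a short Python sketch. The three-entry drug list and the exact token shape (`D`, a number, `D`) are assumptions chosen so that the tokens match the `D\d+D+` expression above; the real tool's IDs and lookup logic may differ.

```python
import re

# Hypothetical drug list; the original tool used one extracted from RxNorm.
DRUG_IDS = {"aspirin": 1, "metformin": 2, "lisinopril": 3}

# Longest names first so a longer name wins over any shorter name it contains.
_names = sorted(map(re.escape, DRUG_IDS), key=len, reverse=True)
DRUG_RE = re.compile(r"\b(" + "|".join(_names) + r")\b", re.IGNORECASE)


def tag_drugs(text):
    """Replace each known drug name with a token shaped like D<number>D."""
    return DRUG_RE.sub(lambda m: f"D{DRUG_IDS[m.group(1).lower()]}D", text)


tagged = tag_drugs("Aspirin 81 mg daily and metformin 500 mg")
# The later grammar can now find drugs with the simple drugname pattern.
drug_tokens = re.findall(r"D\d+D+", tagged)
```

After this pass the grammar never has to know individual drug names; it only needs to recognize the token shape.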

There’s also some helpful info in the readme and the paper.

@CRoeder, the gold standard data was a bunch of narrative text chunks from EHRs which were marked up by two physicians. I’m pretty sure we didn’t have an IRB and I believe we were working from a de-identified clinical data warehouse, so it’s possible that the data wasn’t protected by HIPAA. I just did a cursory search and couldn’t find any traces of it, and I can’t remember looking at it after finishing the project. So, probably not. …


(Chris Roeder) #7

@Sigfried_Gold Thanks for looking. It wouldn’t be too hard to come up with some short test demo data. @MPhilofsky would probably have to develop one for her data/corpus anyway.

(Chris Roeder) #8

Here are links to two tools that might be useful here.

The UMLS annotator could be used to identify things you want to let through by choosing the appropriate ontologies and leaving others out. https://bioportal.bioontology.org/annotator

Amazon has a service “detect PHI” that could be useful. https://docs.aws.amazon.com/comprehend/latest/dg/how-medical-phi.html

(Christian Reich) #9

We have largely ignored the de-identification problem of personal health data in the OHDSI community. There is a reason for that: The design of the OMOP CDM is a fully normalized relational model, where all data are represented through concepts. These concepts are defined centrally and distributed through the mandatory reference tables of the OMOP Standardized Vocabularies. Any personal information in a dataset like “Dr Reich” or “Huron Avenue” cannot be in that reference table, because folks who build that have no access to such data and therefore cannot possibly encode them into concepts.

Of course that is not entirely true. There are a few places in the model where we let those original snippets of information sneak through:

  • All the fields ending in source_value
  • The NOTE table
  • All the fields of the location table except location_id
  • The provider_name, npi and dea fields in the PROVIDER table
  • The value_as_string of the OBSERVATION table
  • And the SIG field in the DRUG_EXPOSURE table

Instead of trying to remove or redact that information, which always leaves behind some remnant worry of “Did I really manage to get it all out?”, we should think of closing those loopholes. That would be a very cool way to solve that problem and generate a de-identified dataset for anybody who has to. It would be easy: drop all the contaminating fields above and the NOTE table. And resolve the SIG field. I’d claim that the utility of the data would suffer only marginally, as analytics tools like ATLAS don’t use them at all to create meaningful research.
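As a rough illustration of closing those loopholes, the Python sketch below drops the fields named in the list above from a single row. It is only a sketch: the function name `scrub_row` and the dict-per-row representation are assumptions, and it simply removes the sig rather than resolving it into structured fields first.

```python
# Rules transcribed from the list above; names are lower-cased for comparison.
DROP_TABLES = {"note"}
KEEP_IN_LOCATION = {"location_id"}
DROP_COLUMNS = {
    "provider": {"provider_name", "npi", "dea"},
    "observation": {"value_as_string"},
    "drug_exposure": {"sig"},
}


def scrub_row(table, row):
    """Return a copy of row without free-text fields, or None if the table is dropped."""
    table = table.lower()
    if table in DROP_TABLES:
        return None
    clean = {}
    for column, value in row.items():
        name = column.lower()
        if name.endswith("source_value"):
            continue  # all *_source_value fields carry original snippets
        if table == "location" and name not in KEEP_IN_LOCATION:
            continue  # keep only location_id
        if name in DROP_COLUMNS.get(table, set()):
            continue
        clean[column] = value
    return clean
```

For example, a DRUG_EXPOSURE row keeps its IDs and concept fields but loses `sig` and `drug_source_value`.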

So, coming back to the debate here about resolving the SIG field: @Jeremy_Warner has worked on a nice model for SIG-type information in the context of regimen information in HemOnc. I am thinking we should make this generic and add it to the CDM. The rendering of the information from the free-text SIG fields, like @Sigfried_Gold has attempted, would be part of that. As a side effect we would also solve the drug dose problem we still have. Any appetite?

(Sigfried Gold) #10

I have no idea if my tool is still worth using, but I will say, in case anyone is considering it: it was built to operate on free-text narrative records and to extract embedded sigs and convert them into structured script fields. The problem of working directly on sigs is, of course, easier, and the hierarchical regular expression model is probably an excellent way of handling it. If anyone wants to make that code part of an OHDSI-compatible library somehow, I’d be delighted to help – either as a volunteer consultant to help whoever’s doing the work understand the code, or as a paid consultant if they’d like me to do the work.

The paper has been cited a lot, and I used to get occasional queries about using the code, but I don’t know if any of those went anywhere. It’s licensed under GPL3, so if anyone does want to include it in a non-GPL context (like OHDSI), they’ll at least need my permission – and I’ve been waiting all these years for someone to ask… Anyway, I’d be very excited to see new life breathed into it.

(Christian Reich) #11


This thing is so cool! You built a whole context-free grammar in yaml to be utilized in a Perl parser. Old memories are coming up.

So, you are splitting out ‘drug’, ‘possibleDrug’, ‘context’, ‘dose’, ‘route’, ‘freq’, ‘prn’, ‘date’. If we wanted to model that information out into a computable SIG model, what would we capture:

  • drug and possibleDrug we would have to map to an RxNorm Concept and store in drug_concept_id.
  • Context we’d probably throw away, right?
  • route would have to be mapped to a Route of Administration Concept and put in route_concept_id.
  • date would go to drug_exposure_start_date.

Which leaves us with dose, freq, prn? Is that all?

BTW: does dose also capture liquid forms?

(Melanie Philofsky) #12

OHDSI is an amazing community :slight_smile: Thank you all for the replies.

@Sigfried_Gold I read the ReadMe. Programming languages aren’t my strong suit, so this may be a naive question. The ReadMe states, " $parser->twoLevelParser goes over its input twice: once to extract drugs, possible drugs, and contexts; and a second time to find, within each drug or possible drug, the dose, route, frequence, prn and dates".

Is it required for the parser to find a drug/possible drug in order to find the dose, route, frequency, prn and dates? Or will it extract the dose, route, frequency, prn and dates without a drug name?

I haven’t seen any drug names in our sig data, but I haven’t looked at all of the hundreds of millions of rows. And I don’t think a drug name would be in the sig, because the data already contains the drug code & name. Our sig is more along the lines of “Take one pill every 8 hours for 10 days” or “Apply twice daily until symptoms are gone”.

(Sigfried Gold) #13

Hi @Christian_Reich and @MPhilofsky.

Partly answering parts of both your questions: the way a drugname is identified is just by its presence in a drug list. Just identifying a string as a drugname is sufficient to identify it, and any surrounding text that matches other parts of the regex, as a drug. A possibleDrug only matches by finding clues in the surrounding text that indicate that that part of the text is where a drugname would appear if it did match something in the drug list.

So, @MPhilofsky, your sigs just have all the information except the drug name? My system should definitely help with that since it’s an easier problem to solve than when you’re trying to find possibly misspelled drug names in the midst of narrative text. If you could share a sampling of your sigs, I could play with them a bit and see what happens.

@Christian_Reich, I can’t remember how much context the parser was really able to identify, but I think it was able to say if the drug appeared in a ‘Discontinued’ section, a med list, maybe a history section. The eventual goal (never finished) was to include the parser in a more general system for doing medication reconciliation on medical records, which was difficult partly because of the extreme redundancy of med lists scattered throughout all the different physician notes–mostly cut and pasted from a past list and sometimes modified.

Date could go in the drug_exposure date, but there might be better evidence of the date in the structured data than what appears in the text. – Oh, but I’m also remembering, there were a lot of date ranges, which might have been saying when a patient had been on the drug.

To both Christian and Melanie: we could try the parser as-is on whatever kinds of data you have, but it might also make sense to scavenge it for parts if the jobs you want it for are simpler. You’ll see a lot of possible patterns that are commented out around line 90. Some of these might have been commented out because they were producing too many false positives in the ‘training’ runs. The point is, this language was meant to be flexible and allow configuration according to your needs and data.

(Christian Reich) #14

I totally get it. And we will have to finish your unfinished job! :slight_smile: You have to.

But what I am asking now is a step before that: What do we need to be able to capture from those records if we want to get rid of the free-text (and therefore pretty useless) sig field, and change the model accordingly? Is the above list right? Or is there more? And what are the Concepts we need?

(Sigfried Gold) #15

Ok, so you want to capture every smidgeon of meaningful information in the sig before discarding it.

The answer (as usual) is still “it depends.” You can configure the expression you use to capture anything you anticipate being able to capture from the sig – and you may also want to capture any remaining text that wasn’t matched by the fields you anticipated finding. The commented-out lines below here show a number of patterns that I ended up not using in the study but that might be useful depending on what shows up in your sigs.

But, let’s just take the non-commented-out patterns, and assume

  1. the sig doesn’t contain any context information, and
    A. it doesn’t contain the drug name (@MPhilofsky’s use case, I think?), or
    B. it does contain a drug name that’s a string match for some drug you have a concept_id for, or
    C. it contains a drug name not in your list (which you’re willing to capture without a concept_id)

then, with all the patterns in lines 84, 85, 93, 94, and 95, you could be matching any of the following (read ==> as ‘resolves to’):

drugInfo ==>
    dose ==> 
        form ==> [tablet, cap, drop, ointment, solution, etc] (yes to your q about liquid forms, btw)
    instructions ==>
        route ==>
            manner ==> [po, iv, drip, sl, npo ...]
            where ==> [ad, right ear, as, left ear ...]
        freq ==>
        prn ==>

Sorry, I don’t have time to finish this right now, but, yes: route, dose, freq, prn pretty much covers it – the only thing I think you’re missing would be instructions regarding dates or duration. And I think my intention in capturing date was not so much about start date as about these sorts of instructions.

And I don’t know what Concepts we’d need. Are you hoping to break all this down to the point that you could do actual calculations and analyses on all these components? For instance, is it enough to capture dose as a string (e.g., 200mg tab), or would you want separate fields: [amount=200, unit=mg, form=tab]? You can do it either way – but a lot of testing and refinement would need to happen before anyone should expect to get all those subfields really reliable.
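The difference between the two options can be sketched in Python. This is an illustration only: the `Dose` class, the short unit and form lists, and the single regex are assumptions and cover far fewer cases than a tested grammar would.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dose:
    amount: float
    unit: str
    form: Optional[str] = None


# Hypothetical, deliberately small vocabularies for units and forms.
DOSE_RE = re.compile(
    r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g|ml)\b"
    r"(?:\s+(?P<form>tab|tablet|cap|capsule|drop|ointment|solution)s?\b)?",
    re.IGNORECASE,
)


def parse_dose(text):
    """Split a dose string into amount, unit, and optional form; None if no match."""
    match = DOSE_RE.search(text)
    if match is None:
        return None
    form = match.group("form")
    return Dose(
        amount=float(match.group("amount")),
        unit=match.group("unit").lower(),
        form=form.lower() if form else None,
    )
```

Here `parse_dose("200mg tab")` yields the separate fields amount=200.0, unit="mg", form="tab", while text with no recognizable dose returns `None` and could be kept as an unparsed remainder.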

Anyway, do look at the list of context things I linked above just to verify that you don’t expect to find any of that stuff in the sig. Otherwise, I don’t think there’s more.

It’s fun to get to think about all this again.

(Christian Reich) #16

Exactly. That allows us to do daily dose. We’d solve that problem for good.
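To show why separate fields matter for daily dose, here is a tiny Python sketch. The frequency table is hypothetical and incomplete, and it assumes the amount and the frequency phrase have already been parsed out of the sig.

```python
# Hypothetical mapping from a normalized frequency phrase to doses per day.
FREQ_PER_DAY = {
    "daily": 1,
    "once a day": 1,
    "twice a day": 2,
    "every 8 hours": 3,
}


def daily_dose(amount, freq):
    """Daily dose = amount per administration x administrations per day."""
    return amount * FREQ_PER_DAY[freq.lower()]
```

With amount=200 (mg) and freq="twice a day" this gives 400 (mg per day); an unknown frequency raises a `KeyError` rather than guessing.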

I’d love to have a parser based on some defined grammar, and over time we would get it better and better.