NLP Type Concept ids

Hi Friends!

I am working with NLP data that originated as Provider Notes. These data will be mapped to the appropriate Clinical Event tables. I would like to distinguish the provenance of these data from other EHR data. However, I do not see appropriate type_concept_ids for this. Is this something we can add? It’s important for the researchers analyzing the data to know the provenance!

Good-good-good question! I looked at the Types and wasn’t able to find anything appropriate.
Agree that we need to distinguish them: NLP-derived data are not as reliable as EHR-derived records, but far better than self-reported conditions/observations.
So I’d definitely add types for all of the domains.
@Christian_Reich does “Provider Note derived” sound good to you?

I’m not sure this is what you meant, @MPhilofsky, but I proposed adding two columns to the Note table to relate each note to its provenance.

Thanks for the suggestion, @SCYou, but we are taking the parsed NLP data from provider notes, mapping them to the semantically equivalent concept_id and mapping the concept_ids into the appropriate tables as directed by domain_id of the concept_id. So, a parsed NLP provider note with “family history of breast cancer” would map to the OBSERVATION table.
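
A minimal sketch of the routing described above, assuming the parsed NLP hit already carries the domain_id of its mapped concept. The `target_table` helper and the example hit (including its concept_id) are hypothetical and only illustrate the idea:

```python
# Map an OMOP domain_id to the clinical event table that stores it.
DOMAIN_TO_TABLE = {
    "Condition": "condition_occurrence",
    "Observation": "observation",
    "Measurement": "measurement",
    "Procedure": "procedure_occurrence",
    "Drug": "drug_exposure",
}


def target_table(domain_id: str) -> str:
    """Return the event table for a concept's domain_id (hypothetical helper)."""
    try:
        return DOMAIN_TO_TABLE[domain_id]
    except KeyError:
        raise ValueError(f"no target table configured for domain {domain_id!r}")


# Hypothetical parsed NLP hit; the concept_id is a placeholder, not a real concept.
hit = {
    "snippet": "family history of breast cancer",
    "concept_id": 1,
    "domain_id": "Observation",
}
print(target_table(hit["domain_id"]))
```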

In order to keep this post relevant to my original topic, please see my suggestion to your GitHub post here

I would suggest having a type called “NLP derived”, and we could use the fact relationship table to link it back to the note table and note_nlp table.

I haven’t heard any objections to my proposal. So, I put it on GitHub. Please continue the conversation there.

I don’t think moving extracted information from notes into dedicated fact tables is the best idea.

  1. Information extracted from notes comes with a tool and a performance estimate. The OMOP fact tables (such as MEASUREMENT) cannot store that information, which can mislead analysts.
  2. Extracted information sometimes refers to family history or patient history, and this cannot fit into the fact tables.

That’s why I would find it more valuable to extend the note_nlp table so that it can store structured information the way any domain fact table can, by adding columns such as value_as_number, estimated_date, estimated performance, etc.
In addition, if one reads the note_nlp specification carefully (section 4.1 of the documentation [1]), one can see that the value_as_number column was initially in the note_nlp table and, for some obscure reason, is not in the table now.

[1] https://docs.google.com/document/d/1ykYVJTQ5MuI7eh_Nk7xzt44EzNjVs71nq2LIsC_RlOg/edit

@parisni
Can’t agree with your second statement: family history and personal history fit the Observation table perfectly.

Melanie’s case is different. She has structured data (the only difference from the other tables is the absence of standard codes like ICD10 or CPT4). So it can be treated like ordinary free text that we convert into OMOP tables.
To be able to distinguish it from the other “reliable” facts, we populate types. That’s why this field is mandatory.

@parisni:

A couple of points:

Unfortunately, that is true for structured information as well. What we get from the normal EHR is just as flawed and also comes with less than 100% precision and sensitivity. So, an analyst who doesn’t want to trust the NLP can easily exclude it through the Type Concept.
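
As a rough illustration of excluding NLP-derived records through the Type Concept, here is a sketch against a toy in-memory table. The two type concept ids are hypothetical placeholders (the real ids would have to be looked up in Athena), and `conditions_excluding` is an illustrative helper, not part of any OHDSI tool:

```python
import sqlite3

# Hypothetical placeholder ids; look up the real type concepts in Athena.
NLP_DERIVED_TYPE = -1
EHR_TYPE = -2

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE condition_occurrence (
        condition_occurrence_id   INTEGER PRIMARY KEY,
        person_id                 INTEGER,
        condition_concept_id      INTEGER,
        condition_type_concept_id INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
    [
        (1, 10, 100, EHR_TYPE),
        (2, 10, 100, NLP_DERIVED_TYPE),  # same condition, found by NLP
        (3, 11, 200, EHR_TYPE),
    ],
)


def conditions_excluding(conn, excluded_type):
    """Return ids of condition records whose type is not the excluded one."""
    rows = conn.execute(
        "SELECT condition_occurrence_id FROM condition_occurrence "
        "WHERE condition_type_concept_id <> ? "
        "ORDER BY condition_occurrence_id",
        (excluded_type,),
    )
    return [row[0] for row in rows]


print(conditions_excluding(conn, NLP_DERIVED_TYPE))
```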

But you are right, there are facts like numbers, and they disappeared from what was released. Ask @noemie, @HuaXu or @rimma.

The “tool” is stored in nlp_system.

Not true. The CDM documentation states how to handle those: as Observations.

Hm. That table contains all the interim results the NLP creates. If we add all the other tables to this one, we are blowing the model out of proportion, and if an analyst wants to use that information together with the structured data (which is the goal anyway), they need to go to two places.

What’s the use case you are looking at anyway?

Can you please elaborate on this? It’s not obvious to me.
How would this statement fit in Observation:

“Father had type 2 diabetes”?
Moreover, what about the performance of the tool and the versioning of the extraction?

I believe the NOTE_NLP table is flawed in its violation of the central OHDSI/OMOP tenet that every fact normalized to a standardized vocabulary belongs in one and only one domain/table. NOTE_NLP is a tolerated heretic within the OHDSI/OMOP church. I think we should modify the NOTE_NLP table by removing the note_nlp_concept_id and note_nlp_source_concept_id columns and replacing them with a polymorphic foreign key (like the COST table’s cost_event_id/cost_domain_id):

note_nlp_event_id (NOT NULL, integer) : A foreign key identifier to the event (e.g. Condition, Observation, Measurement, Procedure, Visit etc) record that the nlp note represents.

note_nlp_domain_id (NOT NULL, varchar(20)): The concept representing the domain of the note nlp event, from which the corresponding table can be inferred that contains the entity for which note nlp event information is recorded.

And then use @MPhilofsky’s proposal to track provenance within the type_concept_id column in each corresponding clinical event table.

This would confine the NOTE_NLP table to its proper function of recording the metadata details of the NLP extraction/derivation and leave the representation of clinical events to its proper compatriots.
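
A minimal sketch of how such a polymorphic key could be resolved, assuming the two proposed columns exist. The trimmed-down table layouts, the sample rows, and the `resolve_event` helper are hypothetical and are not part of the released CDM:

```python
import sqlite3

# Whitelist of domains to (table, primary key); only two domains for brevity.
DOMAIN_TO_TABLE = {
    "Condition": ("condition_occurrence", "condition_occurrence_id"),
    "Observation": ("observation", "observation_id"),
}

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE observation (
        observation_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        observation_concept_id INTEGER
    );
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        condition_concept_id INTEGER
    );
    -- note_nlp reduced to metadata plus the proposed polymorphic key
    CREATE TABLE note_nlp (
        note_nlp_id INTEGER PRIMARY KEY,
        note_id INTEGER,
        nlp_system TEXT,
        note_nlp_event_id INTEGER NOT NULL,
        note_nlp_domain_id TEXT NOT NULL
    );
    INSERT INTO observation VALUES (7, 10, 1234);
    INSERT INTO note_nlp VALUES (1, 500, 'example-nlp 1.0', 7, 'Observation');
    """
)


def resolve_event(conn, note_nlp_id):
    """Follow note_nlp_event_id/note_nlp_domain_id to the clinical event row."""
    row = conn.execute(
        "SELECT note_nlp_event_id, note_nlp_domain_id "
        "FROM note_nlp WHERE note_nlp_id = ?",
        (note_nlp_id,),
    ).fetchone()
    if row is None:
        return None
    event_id, domain_id = row
    # Table and key names come from the whitelist, never from raw input.
    table, key = DOMAIN_TO_TABLE[domain_id]
    event = conn.execute(
        f"SELECT * FROM {table} WHERE {key} = ?", (event_id,)
    ).fetchone()
    return table, event


print(resolve_event(conn, 1))
```

The trade-off of a polymorphic key is that the database cannot enforce it as a real foreign key, so the lookup logic has to live in code like the helper above.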

This would be a step in the direction of making OMOP/OHDSI less focused on putting ‘reliable’ EHR/Claims data in one place and ‘unreliable’ NLP/abstracted/curated data in some dark corner.

All right. The documentation is clear enough that family/past history conditions go into Observation.
Indeed, there is a subset of ICD codes for which domain_id = Observation (6705 codes versus 196701 that point to Condition).

Good point. Then that should be clearly stated in the documentation. Right now, there is no mention of whether note_nlp should hold staging/interim results or final ones.

Then how about linking the facts (observation, measurement and so on) that derive from note_nlp through the fact_relationship table? This would allow getting the performance, tool, and versioning information back.
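
A sketch of what that link could look like, using the fact_relationship column names from the CDM on toy in-memory tables. The domain and relationship concept ids are hypothetical placeholders, the note_nlp table is cut down to a few columns, and `nlp_provenance` is an illustrative helper:

```python
import sqlite3

# Hypothetical placeholder concept ids, not real OMOP concepts.
OBSERVATION_DOMAIN = -10
NOTE_NLP_DOMAIN = -11
DERIVED_FROM = -12

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE observation (
        observation_id INTEGER PRIMARY KEY,
        person_id INTEGER
    );
    CREATE TABLE note_nlp (
        note_nlp_id INTEGER PRIMARY KEY,
        note_id INTEGER,
        nlp_system TEXT,
        nlp_date TEXT
    );
    CREATE TABLE fact_relationship (
        domain_concept_id_1 INTEGER,
        fact_id_1 INTEGER,
        domain_concept_id_2 INTEGER,
        fact_id_2 INTEGER,
        relationship_concept_id INTEGER
    );
    INSERT INTO observation VALUES (7, 10);
    INSERT INTO note_nlp VALUES (1, 500, 'example-nlp 1.0', '2018-01-01');
    """
)
# Link observation 7 back to the note_nlp row it was derived from.
conn.execute(
    "INSERT INTO fact_relationship VALUES (?, ?, ?, ?, ?)",
    (OBSERVATION_DOMAIN, 7, NOTE_NLP_DOMAIN, 1, DERIVED_FROM),
)


def nlp_provenance(conn, observation_id):
    """Return (nlp_system, nlp_date) for each note_nlp row behind an observation."""
    return conn.execute(
        """
        SELECT n.nlp_system, n.nlp_date
        FROM fact_relationship fr
        JOIN note_nlp n ON n.note_nlp_id = fr.fact_id_2
        WHERE fr.domain_concept_id_1 = ?
          AND fr.fact_id_1 = ?
          AND fr.domain_concept_id_2 = ?
        """,
        (OBSERVATION_DOMAIN, observation_id, NOTE_NLP_DOMAIN),
    ).fetchall()


print(nlp_provenance(conn, 7))
```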

Yes, that’s a use case we wouldn’t support right now. The problem is that this information is hard to standardize. If you want to do it inside your organization, you don’t need the standard; if you want to do it across the network (which could indeed be a useful exercise), you need to talk to the NLP group and bring them on board.

Remember that NOTE_NLP was designed as a temporary table, say two years, while we assess in-the-field accuracy of NLP results for research before we clutter everyone’s core clinical tables. Interesting thought to keep NOTE_NLP to hold metadata long term. But not clear anyone really wants to do research on the metadata. Once we feel comfortable with NLP output, might just drop the metadata (just as we currently drop all current extra metadata for structured data).

Fine, but there is no mention of this “temporary” nature of the table in the documentation (should this documentation enhancement be done by the NLP working group?). I have use cases where that metadata must be kept. Think text mining and reproducible research. It is crucial that the table is not emptied regularly, and that a program written against some version of the metadata can still be run.

We use the NLP-derived concept_types for drug, condition, procedure, and measurement. For some reason there is not one for observation. You can find them on Athena.

Oh yeah. When adding them I totally forgot about observation. Will fix.
