NOTE_NLP guidance

Piper-Ranallo · December 5, 2025, 9:26pm

Hello,

We’re extracting patient-reported data from notes and are wondering what OHDSI guidance is around the following:

Should the type concept id used in the corresponding clinical tables be ‘NLP’ (OMOP4976931), ‘patient self-reported’ (OMOP4976938), or ‘EHR note’ (OMOP4976904)?
Is there an expectation that the relationship between the record in the clinical table and the record in the NOTE_NLP table be recorded (for example in the FACT_RELATIONSHIP table)?

Thanks in advance for any guidance,
Piper

Christian_Reich · December 8, 2025, 3:56am

The purpose of the Type Concepts is not to describe your data or to allow backtracking. Their purpose is to support an educated guess about the quality of the data: sensitivity, specificity and timeliness. Different sources have different track records, but it is still a fuzzy thing.

Feel free to declare FACT_RELATIONSHIPs as much as you want. But the typical phenotyping will ignore those entirely (with the possible exception of mother-child linkage).
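
For anyone who does want to record the link asked about above, here is a minimal sketch of what a FACT_RELATIONSHIP row between a clinical record and its NOTE_NLP source might look like. The table and column names follow the CDM, but the tables are cut down to a few columns, and all three concept IDs (`CONDITION_DOMAIN`, `NOTE_NLP_DOMAIN`, `EXTRACTED_FROM`) are made-up placeholders, not real vocabulary entries. The helper `source_mentions` is hypothetical too.

```python
import sqlite3

# Placeholder concept IDs for illustration only (assumptions, not real
# vocabulary entries); look up the real ones in your vocabulary tables.
CONDITION_DOMAIN = 1001   # stands in for the domain concept of condition_occurrence
NOTE_NLP_DOMAIN = 1002    # stands in for the domain concept of note_nlp
EXTRACTED_FROM = 2001     # stands in for a relationship concept

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE note_nlp (
        note_nlp_id INTEGER PRIMARY KEY,
        note_id INTEGER,
        lexical_variant TEXT
    );
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        condition_concept_id INTEGER
    );
    CREATE TABLE fact_relationship (
        domain_concept_id_1 INTEGER,
        fact_id_1 INTEGER,
        domain_concept_id_2 INTEGER,
        fact_id_2 INTEGER,
        relationship_concept_id INTEGER
    );
    """
)

# One NLP mention and the clinical record derived from it.
conn.execute("INSERT INTO note_nlp VALUES (?, ?, ?)", (501, 77, "social anxiety"))
# condition_concept_id is left at 0 here because no real concept was looked up.
conn.execute("INSERT INTO condition_occurrence VALUES (?, ?, ?)", (9001, 42, 0))

# The cross-link itself: clinical record (side 1) -> NOTE_NLP record (side 2).
conn.execute(
    "INSERT INTO fact_relationship VALUES (?, ?, ?, ?, ?)",
    (CONDITION_DOMAIN, 9001, NOTE_NLP_DOMAIN, 501, EXTRACTED_FROM),
)


def source_mentions(condition_occurrence_id):
    """Return (note_nlp_id, lexical_variant) pairs linked to a condition record."""
    return conn.execute(
        """
        SELECT n.note_nlp_id, n.lexical_variant
        FROM fact_relationship fr
        JOIN note_nlp n ON n.note_nlp_id = fr.fact_id_2
        WHERE fr.domain_concept_id_1 = ?
          AND fr.fact_id_1 = ?
          AND fr.domain_concept_id_2 = ?
        """,
        (CONDITION_DOMAIN, condition_occurrence_id, NOTE_NLP_DOMAIN),
    ).fetchall()


print(source_mentions(9001))
```

As noted above, typical phenotype definitions would not read these rows, so a link like this mainly serves local provenance checks.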

Piper-Ranallo · December 8, 2025, 1:40pm

Thanks @Christian_Reich, this is helpful.

I have a related question about how to represent ideas extracted from notes for which there is no single pre-coordinated concept.

For example, ‘symptom x is side effect of drug y’ AND ‘symptom x is attenuated by drug y’.

(As reported by the patient - not indications and known side effects or adverse events - but rather observations about what a patient experiences)

What is OHDSI guidance for how to best represent this in both the NOTE_NLP table and clinical tables?

The drug and symptom are straightforward: one record each in NOTE_NLP, and one record each in the clinical tables (drug_exposure and condition_occurrence).

How do we define not just that there is a relationship between the two clinical observations, but also the nature of that relationship: specifically, that the relationship is ‘has side effect of’ and its converse is ‘is side effect of’?

Here’s my best guess, but I know @MPhilofsky would say this is an “off label” use of the FACT_RELATIONSHIP table. I am also struggling to understand best practices around use of the modifier_list field. I suspect it’s not “legal” to create syntactic expressions to represent the nature of the relationship between two concepts in the field:

I see in this thread that @rtmill (Mar 2023) proposes extending the range of allowable values for the relationship_id field in FACT_RELATIONSHIP. Of the many methods we tried, what he’s proposing is the only one that can represent the relationships ‘has side effect’ / ‘is side effect of’ (and ‘attenuates’ / ‘is attenuated by’) between a substance and a sign-symptom-disorder.

Again, any advice greatly appreciated!
Piper

Christian_Reich · December 11, 2025, 4:58pm

@Piper-Ranallo:

Yes, you can FACT_RELATE a whole lot of things. But nobody will care, most likely. Why? Because those are not facts. They are assumptions. In fact, our RWE will generate associations, causal or not, between them. So, whether a drug has a side effect is unknowable to the provider (even though often there is only one plausible interpretation of the situation). Only empirical estimation will establish this properly by adjusting for all confounding factors, of which there can be many, known and unknown.

So, again, feel free. But if I were you I would make my life easier and just ignore those cross-links.

Piper-Ranallo · December 11, 2025, 6:10pm

Thanks @Christian_Reich

This makes sense for side effects.

What if our goal is to capture:

observations about symptoms or conditions a patient perceives as being directly caused by a substance
observations about symptoms or conditions a patient perceives a substance as ameliorating (e.g., patient uses substance alcohol to relieve social anxiety; patient uses illegally obtained Adderall to relieve fatigue and depressed mood)

Again, the data we’re capturing is more about patient perception - specifically patient perception about positive and negative effects of a substance.

That substance could be a medication prescribed for the patient, a social drug (alcohol, tobacco), recreational or street drug, a dietary supplement, or even a prescription medication not prescribed to the patient but being used by the patient.

We’d like to capture patient reported substances, and observations about patient-reported perceived positive and negative effects of these substances.

Edit:

Using the observation table, we could potentially do the following:

observation_concept_id = new or existing concepts for the ideas of ‘patient-reported condition attenuated by substance’ OR ‘patient-reported side effect of substance’
value_as_concept_id = clinical finding attenuated or caused by substance
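
As a rough sketch of the idea above, the observation row and the extra cross-link it would still need could look like this. It assumes the finding goes in `value_as_concept_id` (the OBSERVATION column that holds a concept-valued result). Every concept ID below is a hypothetical placeholder, chosen above 2 billion only because that range is conventionally reserved for local concepts. The function `perceived_effect` and the two small classes are illustrative, not part of any OHDSI tool.

```python
from dataclasses import dataclass

# Hypothetical local concept IDs (assumptions for illustration only).
PATIENT_REPORTED_ATTENUATED_BY = 2_000_000_001
PATIENT_REPORTED_SIDE_EFFECT_OF = 2_000_000_002
SOCIAL_ANXIETY = 2_000_000_101         # placeholder for a standard finding concept
OBSERVATION_DOMAIN = 2_000_000_201     # placeholder domain concept
DRUG_DOMAIN = 2_000_000_202            # placeholder domain concept
RELATES_TO_SUBSTANCE = 2_000_000_301   # placeholder relationship concept


@dataclass(frozen=True)
class Observation:
    observation_id: int
    person_id: int
    observation_concept_id: int  # which kind of perceived effect
    value_as_concept_id: int     # the finding that is attenuated or caused


@dataclass(frozen=True)
class FactRelationship:
    domain_concept_id_1: int
    fact_id_1: int
    domain_concept_id_2: int
    fact_id_2: int
    relationship_concept_id: int


def perceived_effect(observation_id, person_id, effect_concept_id,
                     finding_concept_id, drug_exposure_id):
    """Build the observation row plus the link from it to the drug record."""
    obs = Observation(observation_id, person_id,
                      effect_concept_id, finding_concept_id)
    link = FactRelationship(OBSERVATION_DOMAIN, observation_id,
                            DRUG_DOMAIN, drug_exposure_id,
                            RELATES_TO_SUBSTANCE)
    return obs, link


# Example: the patient reports that a substance relieves social anxiety.
obs, link = perceived_effect(1, 42, PATIENT_REPORTED_ATTENUATED_BY,
                             SOCIAL_ANXIETY, drug_exposure_id=7001)
print(obs)
print(link)
```

The observation row alone says what was perceived and about which finding. The substance only enters through the second row.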

But again we’re back to the issue of needing to link the drug to the observation.

The existing tables just don’t provide the structures we need to create a single observation record to express the ideas we are trying to express.

Piper

Vojtech_Huser · December 12, 2025, 8:23pm

You are asking the right questions.
Use of NLP outputs in phenotypes is rare. OHDSIers did not post enough examples of constructing cohorts or running studies on top of NLPed constructs.

Please post the choices you eventually decide to go with, e.g., “The confidence threshold for putting rows in event tables was 80+%. We chose this type concept.”
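
To make the threshold example concrete, here is a small sketch of such a filter. It assumes your NLP pipeline emits a per-mention `confidence` score between 0 and 1. That field is an assumption about the pipeline output, not a NOTE_NLP column, and `rows_for_event_tables` is a hypothetical helper name.

```python
# Assumed cut-off, matching the "80+%" example above.
CONFIDENCE_THRESHOLD = 0.80


def rows_for_event_tables(nlp_rows, threshold=CONFIDENCE_THRESHOLD):
    """Keep only the mentions confident enough to be copied into event tables."""
    return [row for row in nlp_rows if row["confidence"] >= threshold]


# Toy pipeline output; `confidence` is a hypothetical pipeline field.
mentions = [
    {"note_nlp_id": 501, "lexical_variant": "social anxiety", "confidence": 0.93},
    {"note_nlp_id": 502, "lexical_variant": "fatigue", "confidence": 0.61},
    {"note_nlp_id": 503, "lexical_variant": "alcohol", "confidence": 0.80},
]

kept = rows_for_event_tables(mentions)
print([row["note_nlp_id"] for row in kept])
```

Rows below the cut-off can still stay in NOTE_NLP. Only the copy into the event tables is gated.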

I know that some may actually choose to keep NLP data outside of event tables (e.g., legacy phenotypes don’t specify a claims type concept, so NLP rows would interfere with past cohort and study definitions). But I think fixing legacy definitions is a better solution to this challenge than keeping NLP out of the event tables.

OHDSI needs more multi site studies on top of NLPed content to better solidify the conventions.

Also, Achilles lacks measures for the NLP domain: it only has measures 2200 and 2201 for the note table, and nothing for note_nlp (or for episode, btw).

See Achilles/inst/csv/achilles/achilles_analysis_details.csv on the main branch of OHDSI/Achilles on GitHub.

I think the Evidence Network does not collect NLP-related data either.