Clarification on Ambiguity Between PROCEDURE_OCCURRENCE and MEASUREMENT Domains During ETL and NLP-based Entity Extraction

lokesh · October 16, 2025, 12:31pm

We are implementing an NLP-driven ETL pipeline to populate OMOP CDM v5.4 and have encountered ambiguity between the PROCEDURE_OCCURRENCE and MEASUREMENT domains. The main challenge arises when clinical text mentions a test or procedure (e.g., “biopsy,” “urine culture,” “echocardiography,” “glucose measurement”) without an accompanying result, or when both the act and the result appear together (e.g., “Echocardiography shows LVEF 40%”).

While the OHDSI convention says that “Lab tests are not a procedure; if something is observed with an expected resulting amount and unit, it should be a Measurement,” real-world cases are often mixed or incomplete. We seek community guidance on the best practice for such situations: whether to record these acts as Procedures, Measurements, or both, and how to handle cases where results are implicit.

Also, we would like to confirm whether it can be concluded that, if results are not mentioned, the record should not appear in the MEASUREMENT table, i.e., it should be represented only in PROCEDURE_OCCURRENCE until a measurable outcome is given.

Any guidance, references, or examples from existing ETL implementations will be deeply appreciated.

Thank you for your time and valuable insights!

Christian_Reich · October 16, 2025, 1:12pm

@lokesh:

I think you hit the nail on the head. We have a gray zone with diagnostic procedures that can yield a result (or more). They really are both. We need to create a convention for that. @MPhilofsky?

In my opinion, the solution is to put the procedure into PROCEDURE_OCCURRENCE as is, and to place a MEASUREMENT record for each of the results. The big question here is: What should the measurement_concept_id be?

MPhilofsky · October 16, 2025, 2:51pm

Agree, @Christian_Reich, we need to write a convention for this, and for NLP-generated data in general, since these data come as uncoded data within a note.

@lokesh
A Procedure is “records of activities or processes ordered by, or carried out by, a healthcare provider on the patient with a diagnostic or therapeutic purpose.”
A Measurement is “structured values (numerical or categorical) obtained through systematic and standardized examination or testing of a Person or Person’s sample.”

The MEASUREMENT table “contains both orders and results of such Measurements as laboratory tests, vital signs, quantitative findings from pathology reports, etc.”

The difference as stated on the CDM v5.4 conventions for the Measurement table as linked above: “Measurements are stored as attribute value pairs, with the attribute as the Measurement Concept and the value representing the result. The value can be a Concept (stored in VALUE_AS_CONCEPT), or a numerical value (VALUE_AS_NUMBER) with a Unit (UNIT_CONCEPT_ID). The Procedure for obtaining the sample is housed in the PROCEDURE_OCCURRENCE table, though it is unnecessary to create a PROCEDURE_OCCURRENCE record for each measurement if one does not exist in the source data.”

If all you have is the procedure per your example:

Then create a Procedure record IF you have a use case for knowing the procedure was ordered or completed. If you don’t have a use case, don’t add a record. It’s just clutter: it adds rows to the table, slowing down ETL, data quality checks, and analysis; it takes person time to debug ETL issues, data quality failures, and unexpected analysis results; and, most importantly, it isn’t useful, since you don’t have a use case for these data.

If you have the procedure and the result per your example, then create both a Procedure and a Measurement record. If you have only the result, there is no need to “infer” the procedure, since we already know a person had to have the blood draw procedure in order to have results for the blood sample.
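The rules above can be sketched as a small routing function. This is only one reading of the posts in this thread, not an OHDSI convention; the `Extraction` class, its fields, and `target_tables` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Extraction:
    """One NLP finding. Hypothetical shape, not an OMOP structure."""
    act: Optional[str]        # e.g. "echocardiography"; None if no act was mentioned
    results: List[str]        # e.g. ["LVEF 40%"]; empty if no result was stated
    procedure_use_case: bool  # does any analysis need to know the act happened?


def target_tables(x: Extraction) -> List[str]:
    """Return one CDM table name per record the ETL would create."""
    tables: List[str] = []
    # Record the act only if it was mentioned AND there is a use case for it.
    if x.act is not None and x.procedure_use_case:
        tables.append("PROCEDURE_OCCURRENCE")
    # One MEASUREMENT record per stated result; a procedure is never
    # inferred from a result alone.
    tables.extend("MEASUREMENT" for _ in x.results)
    return tables
```

Under these assumptions, a bare “biopsy” mention with no use case produces no records at all, while “Echocardiography shows LVEF 40%” with a use case produces one row in each table.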

Per @Christian_Reich’s question:

The measurement_concept_id is the standard concept_id for the NLP text. This is going to be a little difficult for the NLP because many standard concepts for lab tests are very granular (see this list of standard LOINCs), and you should never map to a concept more granular than the source text, especially with labs, since the result is dependent on the specific lab.

For Lokesh’s example:

The top fuzzy string matches in Athena are too granular for the whole term. But if we divide it into the procedure and the measurement result, I find one acceptable standard concept_id for the procedure = echocardiography, and the measurement result, which needs to be spelled out since the ‘LVEF’ acronym doesn’t yield good results, gives us a good match with this standard concept_id. And once again, if you have a use case to know the LVEF was measured via echocardiography versus a transluminal device, then create the Procedure record. Otherwise, skip it. Matching uncoded, free-text data to a standard concept_id will be difficult enough; don’t double your workload.
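As a rough illustration of the split described above, the sketch below separates a mention into a procedure term and a measurement term and spells out the acronym before any vocabulary lookup. Everything in it is an assumption: the `ACRONYMS` table, the “<act> shows <attribute> <number><unit>” pattern, and the `split_mention` helper are hypothetical, and real notes will need far more robust NLP. No concept_ids are assigned here; the resulting strings would still have to be searched in Athena.

```python
import re
from typing import Dict, Optional, Union

# Hypothetical acronym table; a real pipeline would need a curated one.
ACRONYMS = {"LVEF": "left ventricular ejection fraction"}

# Assumed sentence shape: "<act> shows <attribute> <number><unit>"
PATTERN = re.compile(
    r"^(?P<act>.+?)\s+shows\s+(?P<attr>.+?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S+)$",
    re.IGNORECASE,
)


def split_mention(text: str) -> Optional[Dict[str, Union[str, float]]]:
    """Split a mention into procedure and measurement search terms.

    Returns None when the text does not fit the assumed pattern,
    for example a bare act such as "urine culture".
    """
    m = PATTERN.match(text.strip())
    if m is None:
        return None
    attr = m.group("attr")
    return {
        "procedure_term": m.group("act").lower(),
        # Spell out known acronyms so the string search has a better chance.
        "measurement_term": ACRONYMS.get(attr.upper(), attr),
        "value_as_number": float(m.group("value")),
        "unit_source_value": m.group("unit"),
    }
```

Calling `split_mention("Echocardiography shows LVEF 40%")` yields the two search terms “echocardiography” and “left ventricular ejection fraction”, plus the numeric value and the raw unit string.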