
Best practices for evaluating impact of ETL design choices

Dear OHDSI Community,

I’m currently working with the Eye Care and Vision WG to design ETLs for semi-structured fields in ophthalmology. Specifically, we are designing parsers to transform free-text fields into structured representations. The fields are generally well behaved but, as expected, there are abnormalities in their use that occur with varying frequency.

We would like to evaluate the impact of specific choices we make, e.g. a simple parser vs. a more complex one. Are there any best practices in terms of metrics and methods for evaluating ETL (or NLP) implementations?

To highlight our thoughts so far: 1) there are some simple things we can do, such as row counts, prevalence agreement, etc.; 2) the impact will depend on the downstream observational studies that are run. We’ve considered a form of sensitivity analysis: running the same analysis on two different versions of the ETL output and comparing the results. It’s also worth noting that we already know the rates at which particular kinds of entries occur in the source field.
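To make (1) concrete, this is roughly the kind of check we had in mind - a minimal sketch, assuming PostgreSQL and two hypothetical staging tables, measurement_v1 and measurement_v2, holding the output of each parser version:

  -- Compare per-concept record counts between the two parser outputs;
  -- swapping COUNT(*) for COUNT(DISTINCT person_id) would give prevalence instead.
  SELECT
      COALESCE(v1.measurement_concept_id, v2.measurement_concept_id) AS concept_id,
      COALESCE(v1.n, 0) AS n_v1,
      COALESCE(v2.n, 0) AS n_v2,
      COALESCE(v2.n, 0) - COALESCE(v1.n, 0) AS n_diff
  FROM (SELECT measurement_concept_id, COUNT(*) AS n
          FROM measurement_v1 GROUP BY measurement_concept_id) v1
  FULL OUTER JOIN
       (SELECT measurement_concept_id, COUNT(*) AS n
          FROM measurement_v2 GROUP BY measurement_concept_id) v2
    ON v1.measurement_concept_id = v2.measurement_concept_id
  ORDER BY ABS(COALESCE(v2.n, 0) - COALESCE(v1.n, 0)) DESC;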

Any advice from the collective expertise would be greatly appreciated!

If the Pareto principle holds true, and in my experience (your mileage may vary) it does with semi-structured data, you should be able to capture 80% of your data with a series of regexes. "Remove" the valid data found with the regexes and see whether you need to do any additional work.
If your records are like our medical records, I suspect there will be a non-linear increase in the work needed to capture the remaining data; test and find out.

As a caveat, our behavioral records do not follow this general rule, but that is to be expected.


Thanks Mark!

You mentioned testing to find out. Do you have a recommended approach for doing this? We’re not sure which metrics would be best to use.

Before I try to answer that, are you doing your ETL in SQL? If so, what flavor? If in a programming language, which one?

At the moment we’re actually just looking at extraction algorithms for specific rows in the source database - we’re not loading data into a target schema yet. We have the set of values entered into a given field in our data, and are writing code to transform these into OMOP fields (i.e. it’s currently language-agnostic).

For example, in our visual acuity field we might have text like “20/20” or “20/20 + 1” or “CF”. Our task at present is defining the function from string to value_as_x.
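To sketch what we mean (PostgreSQL syntax; the table and column names here are placeholders), the Snellen fractions would go to value_as_number, while qualitative entries like “CF” would be mapped to value_as_concept_id in a separate step:

  -- Illustrative only: convert Snellen fractions to a decimal value_as_number;
  -- non-numeric entries such as 'CF' fall through to NULL for separate handling.
  SELECT
      raw_value,
      CASE
          WHEN raw_value ~ '^\d+/\d+' THEN
              split_part(substring(raw_value from '^\d+/\d+'), '/', 1)::numeric
              / split_part(substring(raw_value from '^\d+/\d+'), '/', 2)::numeric
          ELSE NULL
      END AS value_as_number
  FROM visual_acuity_raw;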

In that case, let’s think in SQL terms (unless it is an older version of SQL Server, which lacks common RDBMS functions); true programming languages are more powerful and faster, but take longer to code. This should work with any language; you may just have to use files instead of temp tables.

  1. Make temp table a_1 and load in several thousand rows of randomized data (enough to see trends without being too slow). Make sure it has a unique key.
  2. Make temp table a_2 with the unique key from a_1 as a dependency, plus a sequence, a regex_term column, and a data column.
  3. Make temp table r_1 holding the regex terms.
  4. For each term in r_1, extract from a_1 into a_2.

Once these steps are complete, examine what data is left in temp table a_1. This should show you where to add more regexes to r_1, or even whether the regex method will work at all.
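A minimal sketch of those steps, assuming PostgreSQL (the source table visual_acuity_raw and the example patterns are just placeholders):

  -- 1. Sample of raw field values with a unique key.
  CREATE TEMP TABLE a_1 AS
  SELECT row_number() OVER () AS id, raw_value
  FROM (SELECT raw_value FROM visual_acuity_raw ORDER BY random() LIMIT 5000) s;

  -- 2. Extracted values, keyed back to a_1, recording which regex matched.
  CREATE TEMP TABLE a_2 (
      id        bigint,   -- unique key from a_1
      regex_id  int,      -- which term in r_1 produced the match
      extracted text
  );

  -- 3. The regex terms.
  CREATE TEMP TABLE r_1 (regex_id int, pattern text);
  INSERT INTO r_1 VALUES
      (1, '^\d+/\d+'),  -- Snellen fraction, e.g. '20/20'
      (2, '^CF');       -- count fingers

  -- 4. For each term in r_1, extract matches from a_1 into a_2
  --    (done set-wise here via a join).
  INSERT INTO a_2 (id, regex_id, extracted)
  SELECT a.id, r.regex_id, substring(a.raw_value from r.pattern)
  FROM a_1 a
  JOIN r_1 r ON a.raw_value ~ r.pattern;

  -- Then examine what is left unmatched in a_1.
  SELECT a.raw_value, COUNT(*) AS n
  FROM a_1 a
  LEFT JOIN a_2 m ON a.id = m.id
  WHERE m.id IS NULL
  GROUP BY a.raw_value
  ORDER BY n DESC;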

Thank you, this makes sense. My question was actually more specifically about the metrics for evaluation: in this instance, you have mentioned “examine what data is left in temp table a_1”. Is this then examining the set of strings missed by the filter?

This we can certainly do, but we were wondering whether there are further approaches to interrogating such missingness, i.e. how we can characterize the subsets of strings that are / aren’t mapped in order to evaluate their clinical utility.
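For example, the kind of metric we can already compute from your a_1 / a_2 setup (PostgreSQL syntax, just a rough sketch) is row-level versus distinct-string coverage:

  -- Share of rows and of distinct strings captured by the current regex set.
  SELECT
      COUNT(*) AS total_rows,
      COUNT(*) FILTER (WHERE matched) AS matched_rows,
      ROUND(100.0 * COUNT(*) FILTER (WHERE matched) / COUNT(*), 2) AS pct_rows_matched,
      COUNT(DISTINCT raw_value) AS distinct_strings,
      COUNT(DISTINCT raw_value) FILTER (WHERE matched) AS distinct_strings_matched
  FROM (
      SELECT a.id, a.raw_value,
             EXISTS (SELECT 1 FROM a_2 m WHERE m.id = a.id) AS matched
      FROM a_1 a
  ) t;

What we’re less sure about is how to judge whether the strings that fall outside this coverage actually matter for downstream studies.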

A qualitative example: researchers may be unlikely to ever use the modifiers in visual acuity entries like “20/20 + 1”, so we could just parse out the first part to “20/20” and be done (for vision researchers reading this, I’m not saying this is the case!). Are there best practices for doing this type of thing, or something similar, systematically?

It may be that there are no standardized approaches to this, just wanted to check!

Yes, exactly!

This question seems like it would be better answered by someone who works with optometry data, as I do not know that domain.
