Significant Digits

Trampas · April 29, 2022, 3:35pm

I have noticed that OMOP has no concept of significant digits. For example, a patient might be asked when they were diagnosed with a disease and answer “1984”. Barring the obvious problems with date/time in OMOP, there is no way to store a date and indicate that it is only significant to the year level.

This also comes into play with measurements: every piece of measurement equipment has a number of significant digits, and when you store data as a floating point number you lose that. For example, a “0.10” measurement typically means the value is known to +/-0.005, but stored as a floating point number it could be 0.0999999999999, which loses all indication of the significant digits.
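A minimal sketch of the point above, assuming the reading arrives as text before it is converted. The helper name `implied_half_unit` is hypothetical, and treating the last recorded digit as exact to half a unit is the convention described in the post, not something OMOP defines.

```python
from decimal import Decimal


def implied_half_unit(reading: str) -> Decimal:
    """Return half of the last recorded decimal place, e.g. '0.10' -> 0.005.

    Hypothetical helper: assumes a plain finite decimal string.
    """
    exponent = Decimal(reading).as_tuple().exponent  # -2 for '0.10'
    return Decimal(5).scaleb(exponent - 1)


as_float = float("0.10")
as_decimal = Decimal("0.10")

print(repr(as_float))             # the trailing zero is no longer recoverable
print(str(as_decimal))            # the trailing zero survives
print(implied_half_unit("0.10"))  # half of the last recorded place
```

A decimal type keeps the recorded digits, while a float column would need a companion field to carry the same information.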

Can a significant digits field be added to dates and measurements?

Trampas

Eduard_Korchmar · April 29, 2022, 4:01pm

Hello, Trampas! Is there a use case for significant digits that is not already covered by introducing intervals (both for date-time and measurement values)?

Trampas · April 29, 2022, 4:07pm

Not sure, where can I find more information about ‘intervals’?

zhuk · May 2, 2022, 12:42pm

Hello, @Trampas

You can get more information about intervals here, on Wiki: OMOP CDM v5.4

Check range_low and range_high fields.

Trampas · May 4, 2022, 4:08pm

So the intervals do not help with significant digits. For example, imagine you have a person’s weight, and the range_low is 0lbs and the range_high is 1000lbs. This does not tell you whether the weight of a person who weighs 100lbs was measured to a precision/accuracy of +/-1lb or +/-100lbs.

Here are more details on significant digits: Significant figures - Wikipedia

Here is an example use case: a patient says they were diagnosed with HIV in the summer of 2002. This is an observation, as no measurement was made, so what do you put down for the date? There is no way to put down “summer 2002” using intervals.

Another example: a measurement is done for a parameter like cholesterol in mg/dL, and the interval should be the limits for the high-cholesterol threshold. For example, total cholesterol of 180mg/dL where the limit is 170mg/dL. However, you do not know whether the measurement is accurate to +/-10mg/dL or not. This is where you need significant digits: they are a metric of the capability of the machine doing the measurement.
Note that for cholesterol it may not be a huge relative error, but for other measurements it can be very significant, especially when it comes to research.

Trampas

Chris_Knoll · May 4, 2022, 7:17pm

It is and it isn’t. I agree that significant digits indicate the number of digits that are used (so 3 significant digits for the speed of light is 3.00x10^8 m/s, while 5 would be 2.9979x10^8 m/s). So you could say that significant digits are some indication of uncertainty, but the intervals are (in my view) really what you are talking about.
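As a small illustration of the speed-of-light example, the same value can be rendered at a chosen number of significant digits with Python's standard format specifiers. The function name `to_sig_digits` is made up for this sketch.

```python
def to_sig_digits(value: float, digits: int) -> str:
    """Format value in scientific notation with `digits` significant digits."""
    # One digit sits before the decimal point, so digits - 1 follow it.
    return f"{value:.{digits - 1}e}"


c = 299_792_458.0  # speed of light in m/s

print(to_sig_digits(c, 3))  # 3.00e+08
print(to_sig_digits(c, 5))  # 2.9979e+08
```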

For example, you state that the measurement was done +/- 10mg/dL and therefore some use of significant digits, but would you make the same claim if it was +/- 7 mg/dL?

I also agree that there’s some problem with unit and measurement standardization in the CDM in that: when you look at a body measurement, is it always in KG (a standardized unit), and how many significant digits are you using (so you don’t have to bother going with 10.39482 if everything is standardized to 2 decimal places). But I don’t think that’s what you’re really going after here.

I think the intervals you’re using describe confidence around the actual value. To use your ‘sometime in the summer of 2002’ example, I’d represent that with an interval centered on the midpoint of summer, +/- half the number of days in summer. That interval covers every day in summer… you don’t have confidence about when it is, but if you said ‘June 1’ or ‘August 1’, both would be ‘in the range’ to hit this observation.
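The midpoint-plus-half-width idea can be sketched as follows. This assumes "summer" means June 1 through August 31 (a meteorological, northern-hemisphere definition chosen only for illustration); `as_midpoint` and `covers` are hypothetical names, and nothing here corresponds to an existing OMOP field.

```python
from datetime import date, datetime, time, timedelta


def as_midpoint(start: date, end_exclusive: date) -> tuple[datetime, timedelta]:
    """Turn [start, end_exclusive) into a midpoint and a +/- half-width."""
    start_dt = datetime.combine(start, time.min)
    end_dt = datetime.combine(end_exclusive, time.min)
    half_width = (end_dt - start_dt) / 2
    return start_dt + half_width, half_width


def covers(midpoint: datetime, half_width: timedelta, day: date) -> bool:
    """True if `day` falls inside midpoint +/- half_width (end excluded)."""
    moment = datetime.combine(day, time.min)
    return midpoint - half_width <= moment < midpoint + half_width


# Assumed definition of summer 2002: June 1 up to, but not including, Sept 1.
mid, half = as_midpoint(date(2002, 6, 1), date(2002, 9, 1))

print(mid, half)
print(covers(mid, half, date(2002, 6, 1)))  # a day inside summer
print(covers(mid, half, date(2002, 9, 1)))  # the first day after summer
```

Using datetimes internally keeps the half-width exact even when the span has an odd number of days.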

I think considering the confidence interval along with the significant digits is a bit entangling. As I said above, you can specify that we only care about 2 decimal places when recording weight, and that’s not making any statement about the confidence interval, except that you could have a .005 error. You could then say that all of our weight intervals are +/- .005, but I think the point of the uncertainty is to account for the device that took the measurement.

Disclaimer: I am not a professional lab tech / material scientist, so everything above is just the musings of a data scientist. But I do handle a lot of complicated code, and comparing values with arbitrary precision/confidence intervals greatly complicates the data model and the processing logic needed to deal with those data considerations.

Trampas · May 4, 2022, 7:23pm

The problem with range_high and range_low is they are not confidence intervals, they are documented as thresholds for normal measurements. Therefore, if this is what is meant by “intervals” it does not apply as they are thresholds for normal range of measurement.

As far as I know there is no way to provide a confidence interval on a date or a measurement value. If there was a way to provide confidence intervals it would work for what I need.

Note that confidence intervals, significant digits, quantization levels, and standard deviations are all methods of trying to define the error of a numeric value based on a noise model. There are many more methods, depending on the type (model) of the error. The point is that OMOP needs a way of qualifying the error.

Trampas

zhuk · May 4, 2022, 8:19pm

There is a way to put it down: use historical concepts. Since you don’t know the real date of the diagnosis, and it was 20 years ago, it has become history. So use History of clinical finding in subject, with HIV as the value. You can also play with these and these types of OMOP Extension concepts, which were introduced to represent a range of possible dates for historical events. You will just need some calculations.

Well, this is really a characterisation of the test. So you may play with statistical values from SNOMED, for example, and create a fact relationship between the real test and its characteristic. Not an easy way, but still a way.

Also, you may change the types of range_high and range_low to varchar and store values like ‘100.00’ there. Probably the worst solution, but still doable: it is up to you to decide. However, I am not sure whether analytical packages would be affected by this solution.

You may also be interested in the wide mapping table. It is currently under construction, but it is expected to have exactly what you are asking for: a field named Error.

Hope this will help you with your use case.

Trampas · May 4, 2022, 8:50pm

@zhuk Thanks for the information.

Although the data is a characterization of the test (measurement), it is important when presenting data. For example, if the measurement value is 999.9999, should the user interface round to 1000 or 1000.00, or show 999.9999? As an example, if the measurement is something like a white blood cell count, then the value should be shown as a whole number (i.e. a count). Therefore OMOP currently requires external knowledge of the measurements to present them properly.

For a measurement we could create custom concepts for significant digits, store them as concepts, and then put in a concept relationship for each measurement. However, that would be a bad assumption on our part, as it assumes every measurement with that concept_id has the same significant digits. For example, user weight would always be shown in whole kg, which would be bad for an infant’s weight.

For dates it is a much harder problem. If we need to store dates with a range, there is no good solution I have found other than to maybe use the date and date_time fields to make a range. That is, maybe store the date as 1-1-2022 and the date_time as 2-1-2022 to show that the date was sometime in January 2022. This would not be ideal.
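Whatever columns end up holding the range, the conversion from a partial date to a start/end pair is straightforward. The sketch below assumes the partial date is written as `YYYY`, `YYYY-MM`, or `YYYY-MM-DD`, and returns an inclusive range; the function name `partial_date_range` and the input format are assumptions made for illustration, not OMOP conventions.

```python
import calendar
from datetime import date


def partial_date_range(text: str) -> tuple[date, date]:
    """Expand 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' into an inclusive date range."""
    parts = [int(piece) for piece in text.split("-")]
    if len(parts) == 1:  # only the year is known
        (year,) = parts
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:  # year and month are known
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    if len(parts) == 3:  # fully specified date
        exact = date(*parts)
        return exact, exact
    raise ValueError(f"unsupported partial date: {text!r}")


print(partial_date_range("1984"))
print(partial_date_range("2022-01"))
```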

Another option may be to create custom concept codes for various date ranges and link them with fact relationships. This gets messy as well, and again no one else can really follow it.

Maybe the best option is to just add significant digit fields to our database. Then we can support the standard set of OMOP fields. I am wondering if Atlas and other OMOP tools still work when you have extra fields in tables?

Trampas

zhuk · May 4, 2022, 9:20pm

Store it in grams, then. One more solution: convert units so that you won’t have any digits after the decimal separator.

To be fair, I still don’t understand why you need significant digits for all the cases you’ve described. I do understand the value of significant digits in different fields.

For HIV, if you don’t know a date, it is a history. And history can be covered quite precisely with all those ‘…within 3 months’ and ‘…no longer than 6 months’ concepts: between these two concepts there is a 3-month window.

I hope you will find a solution and post it here for all of us to learn something new. Good luck!

Trampas · May 5, 2022, 2:58pm

So the application is to have forms in OMOP for a patient. Here the desire is to have the forms where they do not need to know the exact date. As an example “when did you break your leg?” “Fall of 2000.” So we have to store the date with some way to note that the date is not exact.

Another issue is when we have a user interface and we ask for patient information. Take weight, for example: the user might enter weight in lbs. For infants they might enter 6.40lbs; for adults they might enter 120lbs. When they return to the form and review the data, we want to show realistic significant figures. Floating point numbers are not exact, so 6.4 could be 6.39999 and 120 could be 120.00001. We could always show one or two digits after the decimal place. However, to make the UI clean we should store information about the significant figures.
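One way to picture this is to keep the number of decimals the user typed next to the stored float and use it when rendering. This is only a sketch: `decimals_entered` and `display` are hypothetical names, the input is assumed to use a `.` decimal point, and where the extra value would live in the database is left open.

```python
def decimals_entered(text: str) -> int:
    """Count the digits typed after the decimal point: '6.40' -> 2, '120' -> 0."""
    _, _, fraction = text.strip().partition(".")
    return len(fraction)


def display(value: float, decimals: int) -> str:
    """Render a stored float with the number of decimals originally entered."""
    return f"{value:.{decimals}f}"


infant_decimals = decimals_entered("6.40")
adult_decimals = decimals_entered("120")

# Values as they might come back from floating point storage.
print(display(6.39999, infant_decimals))   # 6.40
print(display(120.00001, adult_decimals))  # 120
```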

This applies to measurements as well. For some measurements people want to apply a threshold, for example whether the value is over zero. Floating point might store a number as 0.0000001, which is technically over zero. However, if we know the significant figures, we might round back to zero, and now the threshold of greater than zero is false.
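The threshold case can be sketched like this, assuming the number of meaningful decimal places is known for the measurement. The name `exceeds` is hypothetical.

```python
def exceeds(value: float, threshold: float, decimals: int) -> bool:
    """Compare only after rounding both sides to the meaningful decimals."""
    return round(value, decimals) > round(threshold, decimals)


stored = 0.0000001  # noise left over from floating point storage

print(stored > 0)               # naive comparison
print(exceeds(stored, 0.0, 2))  # comparison at two meaningful decimals
print(exceeds(0.02, 0.0, 2))    # a value that is over zero at that precision
```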

I realize that most people never have this problem, and if you pick the unit multiplier correctly (kg versus grams, etc.) then it is usually good enough to show two decimal places on values. However, this is not always the case.

Basically anytime you store a floating point number, you really need to have some means of noting how the number should be rounded (ie the significant figures), because floating point numbers are not ‘precise’. This is why you should never compare floating point numbers as being equal. 12.9999999 is not the same as 13.000000. However if you know that the possible error is 0.01 then within the error tolerance they are the same.
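In Python, for instance, this kind of tolerance-based comparison is available through `math.isclose`. The 0.01 tolerance is the figure from the paragraph above; passing `rel_tol=0.0` makes the absolute tolerance the only criterion.

```python
import math

a = 12.9999999
b = 13.0
tolerance = 0.01  # the known possible error

print(a == b)                                               # exact comparison
print(math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance))   # within tolerance
print(math.isclose(12.9, b, rel_tol=0.0, abs_tol=tolerance))  # outside tolerance
```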

Christian_Reich · May 6, 2022, 5:15am

@Trampas:

Obviously you totally have a point. We have no concept of precision. The problem is that we are secondary users of data collected during healthcare, and therefore have no influence over its generation. We are just passing on the data from the EHR or other sources and ETLing them into the OMOP CDM. The “poor ETL schmock” (as they are sometimes compassionately called) has no way of fixing this, even if we introduced some mechanism. So it is up to the analyst or analysis tool to make assumptions.

The good news is that precision in medicine is generally low. Data are crude. There is no meaningful difference between 6.4 and 6.41 lbs. And all users of the data make assumptions about this low precision when they do their work.

The date situation is different, though. We do need to think of how to handle this better, because our standard resolution is one day, and today we cannot go lower (we can go higher with datetime). There is a discussion already under way. Please join.