@Dymshyts As you might have guessed, this example comes from NAACCR. So it is less than perfect:
NAACCR #2800 CS Tumor Size
See here for its origin:
CS Tumor Size | CS Data SEER*RSA
I think it is poorly worded and actually means “between 1 cm and 2 cm”. So I don’t think it is trying to represent all numbers.
As you can see from the link, this is a question that has a mix of categorical and numeric answers.
| Code | Description |
|---|---|
| 000 | No mass/tumor found |
| 001-988 | 001 - 988 millimeters (mm) (Exact size to nearest mm) |
| 989 | 989 mm or larger |
| 990 | Microscopic focus or foci only and no size of focus given |
| 991 | Described as “less than 1 centimeter (cm)” |
| 992 | Described as “less than 2 cm,” or “greater than 1 cm,” or “between 1 cm and 2 cm” |
| 993 | Described as “less than 3 cm,” or “greater than 2 cm,” or “between 2 cm and 3 cm” |
| 994 | Described as “less than 4 cm,” or “greater than 3 cm,” or “between 3 cm and 4 cm” |
| 995 | Described as “less than 5 cm,” or “greater than 4 cm,” or “between 4 cm and 5 cm” |
| 999 | Unknown; size not stated; size of tumor cannot be assessed; not documented in patient record |
When NAACCR #2800 CS Tumor Size has a categorical value as its answer, I think we should give up on trying to represent it numerically in the MEASUREMENT table and simply record the answer in the value_as_concept_id column. When the answer is in the range 001-988, I think we should store it as a number (in millimeters) in the value_as_number field.
If we try to represent categorical values like ‘992’ numerically, we will need to manually curate and annotate the NAACCR vocabulary to an unmanageable degree, and that burden will only grow over time.
Further, these kinds of edge-case answers seem to fall into uncertain territory where the value of the underlying data is likely highly suspect.
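
To make the proposed split concrete, here is a rough Python sketch of how an ETL step might route a CS Tumor Size code. It is only an illustration: the function name `map_cs_tumor_size` and the `concept_lookup` callback are hypothetical, and the sketch assumes the concept ID for each categorical answer is resolved elsewhere (for example, from the NAACCR vocabulary), so no real concept IDs appear here.

```python
from typing import Callable, Dict, Optional


def map_cs_tumor_size(
    code: str,
    concept_lookup: Callable[[str], int],
) -> Dict[str, Optional[float]]:
    """Route a NAACCR #2800 code to value_as_number or value_as_concept_id.

    `concept_lookup` is a hypothetical callback that returns the concept ID
    for a categorical answer code; how it is implemented is an assumption.
    """
    # NAACCR #2800 codes are three-character, zero-padded digit strings.
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"Unexpected CS Tumor Size code: {code!r}")

    size = int(code)

    # 001-988: exact size in millimeters, stored as a number.
    if 1 <= size <= 988:
        return {"value_as_number": float(size), "value_as_concept_id": None}

    # Everything else (000, 989, 990-995, 999, ...) is treated as a
    # categorical answer and stored only as a concept.
    return {"value_as_number": None, "value_as_concept_id": concept_lookup(code)}
```

With this split, ‘015’ would land in value_as_number as 15 (mm), while ‘992’ would carry only whatever concept the lookup returns, with no attempt to derive a number from it.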