@Andrew: This is a good example of a common misperception about ‘information loss’ from vocabulary mapping. @hripcsa’s recent JAMIA paper nicely summarizes this issue and demonstrates that it’s not nearly the problem people worry it is.
To the specific example in diabetes: let me illustrate the idea with ICD-9-CM 250.12 - “Diabetes with ketoacidosis, type II or unspecified type, uncontrolled”, which currently maps to 2 SNOMED standard concepts: 40482801 - Type II diabetes mellitus uncontrolled, and 443734 - Ketoacidosis in type 2 diabetes mellitus. Here, there is no information loss; the ICD9 concept has simply been disentangled into its two components. To your question, there are certainly clinical use cases where a clinician only wants to study ‘patients with T2DM’, and others where a clinician only wants to study ‘patients with diabetic ketoacidosis’. Both use cases are fully supported, whether you run your analysis on the raw data with ICD9 codelists or on the CDM with standard conceptsets.
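To make the one-to-many mapping concrete, here is a minimal Python sketch. It is a toy, not the actual vocabulary tables: the `MAPS_TO` dictionary, the patient records, and the helper names are all made up for illustration, and only the concept IDs are the ones quoted in this post.

```python
# Toy stand-in for the source-to-standard mapping (illustrative only).
# Concept IDs are the ones quoted in this post; the structure is an assumption.
MAPS_TO = {
    "250.12": [40482801, 443734],  # uncontrolled T2DM + ketoacidosis in T2DM
    "362.0": [4174977],            # diabetic retinopathy
}

# Hypothetical source records: (person_id, ICD-9-CM code).
source_records = [(1, "250.12"), (2, "362.0")]


def to_standard(records, maps_to):
    """Expand each source record into one row per mapped standard concept,
    keeping the original source code next to it."""
    rows = []
    for person_id, code in records:
        for concept_id in maps_to.get(code, []):
            rows.append((person_id, concept_id, code))
    return rows


def persons_with(rows, concept_ids):
    """Persons having at least one row whose standard concept is in concept_ids."""
    return {person_id for person_id, concept_id, _ in rows if concept_id in concept_ids}


cdm_rows = to_standard(source_records, MAPS_TO)

# Use case 1: uncontrolled T2DM. Use case 2: ketoacidosis in T2DM.
t2dm_uncontrolled = persons_with(cdm_rows, {40482801})
ketoacidosis = persons_with(cdm_rows, {443734})

# The same patient is still found by a source-code lookup.
source_25012 = {person_id for person_id, code in source_records if code == "250.12"}
```

In this toy, person 1 shows up in both standard-concept cohorts and in the source-code lookup, which is the sense in which the single ICD9 code was split rather than lost.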
Now another example: ICD-9-CM 362.0 - Diabetic retinopathy. This maps to only one SNOMED standard concept: 4174977 - Diabetic retinopathy. Again, no information loss. But importantly, if I’m a clinician interested in the first use case, finding ‘patients with T2DM’, then I should consider whether a diabetic complication is sufficient evidence that the patient has diabetes. That’s just a phenotype evaluation problem, and it is no different whether you run a source analysis defining diabetes as ‘250x’ (which would miss some complications) or create a standard conceptset of ‘Diabetes mellitus’ and all descendants (which, in SNOMED, will not ‘roll up’ all associated complications).
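The phenotype point can be sketched the same way. Everything below is a toy under stated assumptions: the `DIABETES` ID is a hypothetical placeholder rather than a real concept ID, the `PARENTS` hierarchy is invented to mirror the situation described above (retinopathy not sitting under ‘Diabetes mellitus’), and the two patients are made up.

```python
DIABETES = 9000001  # hypothetical placeholder ID for 'Diabetes mellitus'

# Toy child -> parents hierarchy (assumed for illustration, not real SNOMED).
PARENTS = {
    40482801: [DIABETES],  # Type II diabetes mellitus uncontrolled
    443734: [DIABETES],    # Ketoacidosis in type 2 diabetes mellitus
    4174977: [],           # Diabetic retinopathy: assumed NOT under DIABETES
}


def descendants(root, parents):
    """Return root plus every concept that reaches root through parent links."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for child, child_parents in parents.items():
            if child not in found and any(p in found for p in child_parents):
                found.add(child)
                changed = True
    return found


# Hypothetical patients, shown once as source codes and once as standard concepts.
patients_source = {1: ["250.12"], 2: ["362.0"]}
patients_standard = {1: [40482801, 443734], 2: [4174977]}

# Source definition: any ICD-9-CM code starting with '250'.
source_cohort = {
    person for person, codes in patients_source.items()
    if any(code.startswith("250") for code in codes)
}

# Standard definition: 'Diabetes mellitus' and all descendants.
conceptset = descendants(DIABETES, PARENTS)
standard_cohort = {
    person for person, concepts in patients_standard.items()
    if any(concept in conceptset for concept in concepts)
}
```

Under these assumptions both definitions select person 1 and miss person 2, who only has the retinopathy code. Whether person 2 belongs in the cohort is a decision about the phenotype definition, and it has to be made on either side of the mapping.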