This is a great example to illustrate the difference between classification
and prediction I think.
As phrased: among a target population of women of child-bearing age who
visit a doctor and have not yet been diagnosed or treated for
endometriosis, which patients will go on to be diagnosed with endometriosis
in the next year? the question appears to be a prediction question. But is
it? (@noemie, @Patrick_Ryan)
Do we really care about our ability to predict that another human (the
doctor) will figure out that the patient already has the condition and
will document it as such. i.e. ‘will diagnose with endometriosis next
year’. If we solve that problem, we set up an index date, and only use data
prior to the index date to make our prediction about the doctor figuring
out the existence of the underlying disease.
I’d argue that the notion of an index date is a distraction in this
setting, and we should use all the data about an individual to figure out
if they have the condition (endometriosis) and we have not yet figured that
out. i.e. solve a classification problem using the entire patient timeline
to find patients that are being missed.
Thoughts?