If we all lived in @Kevin_Haynes’s magical world, where everything is recorded in the data from the very beginning of life until death, I would say scenario 2 is the way to go: we could develop reusable phenotypes purely by modeling a clinical idea. In fact, we might not need any phenotyping at all in that world. The truth would be in the CONDITION_OCCURRENCE table, and we could just use that. Until we get there, I am with @Kevin_Haynes on side 1 of the argument.
The journey from a clinical idea to a fully validated and reusable phenotype is not straightforward. It requires a lot of decisions, each with its own consequences. It actually reminds me of the Black Mirror movie: each decision can put you on a different path, and different permutations of the decisions made along the road lead to different versions of a phenotype. I believe that the development and evaluation of reusable phenotypes should be informed by a pre-defined set of inputs: the clinical idea and the condition-specific information (onset and duration of the condition, severity, etc.), the use case/flavour, the context of use, and the type of data that will be used for the development, evaluation, and further use of the phenotype. Otherwise, we risk developing rather complicated algorithms with limited use in network studies.
To show you where I am coming from, I would like to walk you through my train of thought when developing a phenotype:
Where do I start?
I know the clinical idea, and I have learned from the phenotype development tutorials that it all begins with 1) identifying the Lego bricks (conceptsets) and 2) writing a building instruction (logic). So how do I identify the Legos and come up with the building instruction? I use a cheat sheet, a list of questions that help me find my way. This list is not exhaustive and definitely does not capture everything, but it is good enough to put me on the right track.
1. Lego selection:
- What conceptsets to create? For example, is a disease conceptset enough? Or do I need to create conceptsets for symptoms, diagnostic procedures, treatments, differential diagnoses, visit context, or predisposing conditions? And of course, the desire to include any of these raises a series of follow-up questions to further clarify the path. What symptoms? What treatments? What procedures? and …
- What concepts to include in the conceptset? For example, should I use history of disease, or should I avoid it? How about complications of the disease? How high should I go up the hierarchy?
2. Building instruction/logic:
- What is the index event? The disease, symptoms of the disease, the treatments, or the diagnostic procedure?
- What inclusion or exclusion criteria? For example, should I include or exclude differential diagnoses? Do I need a second occurrence of the index event during follow-up? Within what timeframe from the first occurrence do I expect to see the second one? How about a procedure work-up prior to the condition? What is the minimum acceptable length of diagnostic work-up before diagnosis and re-diagnosis?
- What is the exit date? For example, for how long after the index date should a person be considered to have the disease?
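To make the cheat sheet concrete, here is a minimal sketch of how the same Lego bricks can yield two different phenotypes depending on the building instruction. Everything in it is an assumption for illustration: the `Event` record, the `build_cohort` function, the concept IDs (1001 and 1002 are made up, not real vocabulary IDs), the 90-day confirmation window, and the 365-day exit rule. It is not how any OHDSI tool implements cohort logic.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    person_id: int
    concept_id: int
    event_date: date


def build_cohort(events, index_concepts, confirm_within_days=None, exit_after_days=365):
    """Return {person_id: (index_date, exit_date)}.

    index_concepts      : the conceptset that defines the index event
    confirm_within_days : if set, require a second occurrence on a later date
                          within this many days of the index date
    exit_after_days     : fixed-duration exit rule relative to the index date
    """
    dates_by_person = {}
    for ev in events:
        if ev.concept_id in index_concepts:
            dates_by_person.setdefault(ev.person_id, []).append(ev.event_date)

    cohort = {}
    for person_id, dates in dates_by_person.items():
        dates.sort()
        index_date = dates[0]  # earliest qualifying event is the index event
        if confirm_within_days is not None:
            window_end = index_date + timedelta(days=confirm_within_days)
            # same-day repeats do not count as a confirming occurrence
            if not any(index_date < d <= window_end for d in dates[1:]):
                continue
        cohort[person_id] = (index_date, index_date + timedelta(days=exit_after_days))
    return cohort


# Hypothetical conceptset and toy events
disease_concepts = {1001, 1002}
events = [
    Event(1, 1001, date(2020, 1, 1)),
    Event(1, 1002, date(2020, 1, 31)),  # second occurrence 30 days later
    Event(2, 1001, date(2020, 3, 15)),  # single occurrence only
]

sensitive = build_cohort(events, disease_concepts)
specific = build_cohort(events, disease_concepts, confirm_within_days=90)
print(sorted(sensitive), sorted(specific))
```

With identical conceptsets, the first call keeps both persons while the second keeps only the person with a confirming code, so the choice of building instruction alone changes who is in the cohort.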
All of a sudden, I am faced with so many different Lego options and many more building instructions. Each of these choices should be made for a reason: what am I trying to optimize for? Am I aiming for a sensitive definition or a specific one? Do I care about index date misspecification, and how much of it can I live with? Or do I need a very clear cohort exit date? To come up with the most appropriate combination, I need to know this optimization strategy. The clinical idea alone does not give me enough information. For example, without knowing that the phenotype will be used to identify patients with incident breast cancer in a set of cancer registries, I would not know whether to include concepts such as history of breast cancer, whether to create a conceptset for diagnostic procedures for the disease and use it as an inclusion criterion, or whether to require two diagnosis codes for the disease. This information comes from the use case, the context of use, and the type of data being used for the development, evaluation, and use of the phenotype. Without it, I am left with many possibilities and little idea of when I am done and what is good enough. This impacts the efficiency and transparency of the process and the further reusability of the phenotype I am creating. Likewise, as a future user of specific phenotypes, I will face a huge degree of uncertainty when it comes to selecting the right phenotype and assessing its utility in correctly identifying the clinical idea in my data source(s).
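One way to make "what am I optimizing for" tangible is to compare candidate definitions against a reference standard. The sketch below assumes such a reference exists (for example, chart review) and that person IDs are available as Python sets; the function name `validation_metrics` and all numbers are hypothetical toy values, not results from any database.

```python
def validation_metrics(phenotype_ids, reference_ids, population_ids):
    """Compare a phenotype's members with a reference standard.

    All three arguments are sets of person IDs; population_ids is everyone
    who could have been classified.
    """
    tp = len(phenotype_ids & reference_ids)   # flagged and truly have the condition
    fp = len(phenotype_ids - reference_ids)   # flagged but do not have it
    fn = len(reference_ids - phenotype_ids)   # missed by the phenotype
    tn = len(population_ids - phenotype_ids - reference_ids)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "ppv": tp / (tp + fp) if tp + fp else None,
    }


# Toy population of 10 persons, 4 of whom truly have the condition
population = set(range(1, 11))
reference = {1, 2, 3, 4}

broad = validation_metrics({1, 2, 3, 4, 5, 6}, reference, population)  # one code is enough
strict = validation_metrics({1, 2}, reference, population)             # two codes required
print(broad)
print(strict)
```

In this toy case the broad definition finds every true case but also flags two persons who do not have the condition, while the strict one flags only true cases and misses half of them. Which of the two is "good enough" depends on the use case, not on the clinical idea.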
That being said, I agree with @Patrick_Ryan that option (c), using different algorithms on different databases to answer one question across those databases, is problematic. I think we can develop phenotypes using a systematic and structured approach, with a clear description of what they are trying to capture based on all these inputs. At the time of the study, researchers can use this information to assess whether the phenotype can be used in their databases and to choose the appropriate databases to include in the study. This way, the reasons for excluding certain databases can be reported clearly, and any tweaks needed to the existing phenotype can be identified easily.
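As a rough sketch of what such a structured description could look like, the snippet below records the inputs alongside the phenotype and checks a database against them. The classes `PhenotypeDescription` and `DatabaseProfile`, their fields, and the example values are all assumptions made for illustration; they are not an existing OHDSI schema, and comparing a typical observation length with a minimum is a deliberately crude heuristic.

```python
from dataclasses import dataclass, field


@dataclass
class PhenotypeDescription:
    name: str
    clinical_idea: str
    use_case: str
    optimized_for: str                      # e.g. "sensitivity" or "specificity"
    required_domains: set = field(default_factory=set)
    min_observation_days: int = 0


@dataclass
class DatabaseProfile:
    name: str
    available_domains: set
    typical_observation_days: int


def reasons_for_exclusion(phenotype, database):
    """Return a list of human-readable reasons; an empty list means no objection."""
    reasons = []
    missing = phenotype.required_domains - database.available_domains
    if missing:
        reasons.append("missing domains: " + ", ".join(sorted(missing)))
    if database.typical_observation_days < phenotype.min_observation_days:
        reasons.append(
            f"typical observation of {database.typical_observation_days} days "
            f"is below the required {phenotype.min_observation_days}"
        )
    return reasons


# Hypothetical phenotype and databases
phenotype = PhenotypeDescription(
    name="incident breast cancer (work-up required)",
    clinical_idea="breast cancer",
    use_case="incident cases",
    optimized_for="specificity",
    required_domains={"condition", "procedure"},
    min_observation_days=365,
)
registry = DatabaseProfile("cancer registry", {"condition"}, 1200)
claims = DatabaseProfile("claims database", {"condition", "procedure", "drug"}, 900)

registry_reasons = reasons_for_exclusion(phenotype, registry)
claims_reasons = reasons_for_exclusion(phenotype, claims)
print(registry_reasons, claims_reasons)
```

Here the registry would be reported as excluded because it lacks procedure data, which also points to the tweak needed (a variant without the diagnostic work-up criterion) rather than leaving the mismatch implicit.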