At some point, they'll crop up, depending on what your DBMS setup is and how much you can use things like partitioning and indexing to help you out. (For instance, if you have lots of records for the same drug that document dose changes, it may be slower to retrieve specific doses of that drug than if you had an equal number of total doses spread across more drugs.) But I'd expect that to happen in the >100M row territory on decent hardware.
This is a great question, and there've been discussions on and off about "flavors" of the CDM that address needs that don't make sense for everyone, but we're not at any kind of standard practice. In PEDSnet, what we try to do is make additions that leave us compatible with the canonical CDM (sounds like effective_drug_dose may become one of these) so as not to interfere with logic that expects the regular CDM. Early on, we shied away from even adding columns to canonical tables, but as time has passed, we've become more comfortable with that; there aren't a lot of OHDSI practices that depend on extra stuff not being there. We can sometimes add content in a similar way. For instance, drug_era records at the ingredient level aren't adequate for some of our core use cases (where in/on the patient did that prednisone go?), but it doesn't interfere with CDM logic to add era records based on Clinical Drug Forms (and if Clinical Dose Form Groups ever show up in the vocabulary, they'll be even better), because canonical logic looking for ingredient-level drug_concept_ids just won't see them.
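To make the "just won't see them" point concrete, here's a minimal sketch using Python's built-in sqlite3. The drug_era table is cut down to a few columns, and both concept IDs are made-up placeholders rather than real vocabulary IDs; it only illustrates the idea and isn't how PEDSnet actually builds eras.

```python
import sqlite3

# Hypothetical concept IDs, invented for this sketch (not real vocabulary IDs).
INGREDIENT_ID = 1001  # stands in for the prednisone ingredient concept
DRUG_FORM_ID = 2001   # stands in for a prednisone Clinical Drug Form concept

conn = sqlite3.connect(":memory:")
# Simplified drug_era: the real table has more columns than shown here.
conn.execute(
    """
    CREATE TABLE drug_era (
        drug_era_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        drug_concept_id INTEGER,
        drug_era_start_date TEXT,
        drug_era_end_date TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO drug_era VALUES (?, ?, ?, ?, ?)",
    [
        (1, 42, INGREDIENT_ID, "2020-01-01", "2020-01-10"),  # canonical era
        (2, 42, DRUG_FORM_ID, "2020-01-01", "2020-01-10"),   # added era
    ],
)


def eras_for(concept_id):
    """Return the drug_era_ids whose drug_concept_id matches exactly."""
    rows = conn.execute(
        "SELECT drug_era_id FROM drug_era "
        "WHERE drug_concept_id = ? ORDER BY drug_era_id",
        (concept_id,),
    ).fetchall()
    return [r[0] for r in rows]


# Logic keyed on the ingredient concept only matches the canonical row.
ingredient_eras = eras_for(INGREDIENT_ID)
# Local logic that knows about the extension can ask for the form-level row.
form_eras = eras_for(DRUG_FORM_ID)
```

The caveat is that this only holds for logic that filters on specific ingredient concepts; anything that counts or sums over every row in drug_era would pick up the extra eras too.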
The harder problems, of course, are when you make choices that're incompatible with the canonical CDM spec. To pick one we've been discussing recently in PEDSnet, we've just decided to not follow some OHDSI domain "routing" for diagnoses, because we were seeing diagnosis codes like "Elevated blood glucose" routed to measurement, and, well, they're diagnoses, not measurements, and putting them in measurement because they were a judgment about some other measurement created a bunch of problems for us. Another case is where we deploy a data element in PEDSnet and OHDSI later picks up that or a similar element (such as birth_datetime; for some historical reasons PEDSnet had first deployed that as time_of_birth). We'll migrate over to the canonical definition over time, but it requires retooling ETL pipelines incrementally, and in the interim we'll be incompatible.
In these latter cases, I see best practice as being driven by the analyses you plan to do. If your goal is to be able to use tools like Achilles and Atlas, but you're not participating in a bunch of OHDSI community studies, then your compatibility requirements are narrower. If you want to participate in studies with other folks who have OHDSI CDMs, you can agree on common extensions (this is effectively what PEDSnet does across sites), or materialize the canonical and custom versions as you note, or sometimes split the difference using tricks like views to make available the flavor with the less granular definition. This kind of analytic balance also affects where you put new stuff: on a pragmatic level, the question may become whether it's less pain on balance to add stuff in new columns/tables and take a join when you want to get to it, or to put it into the existing structures and accept that canonical logic from elsewhere won't be able to deal with it.
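As a rough illustration of the view trick, here's a sketch (again Python's built-in sqlite3) for the simplest case, a pure rename like time_of_birth vs. birth_datetime. The table name person_local and the sample row are invented for the example, and the person table is trimmed to three columns; a view that collapses a more granular local definition into the canonical one would follow the same pattern with more logic in the SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical local table that still carries the older column name.
conn.execute(
    """
    CREATE TABLE person_local (
        person_id INTEGER PRIMARY KEY,
        year_of_birth INTEGER,
        time_of_birth TEXT
    )
    """
)
conn.execute(
    "INSERT INTO person_local VALUES (1, 2015, '2015-03-04 08:30:00')"
)

# The view presents the canonical column name, so queries written against
# the canonical layout can run while the underlying ETL is being retooled.
conn.execute(
    """
    CREATE VIEW person AS
    SELECT person_id,
           year_of_birth,
           time_of_birth AS birth_datetime
    FROM person_local
    """
)

# A query written against the canonical name works through the view.
row = conn.execute(
    "SELECT person_id, birth_datetime FROM person"
).fetchone()
```

Whether a view or a materialized copy is the better choice depends on your DBMS and data volume; the view costs nothing to keep in sync, while the copy keeps your local naming out of the path of external tools.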