Because we need to do a lot of post-processing of the data, once we have grabbed it from the (remote) database. For example, before fitting an outcome model we may want to get rid of data of non-informative strata, data that we did need just a few minutes earlier when fitting a propensity model. Yes, we could perform all these operations on the server, but we would get a massive increase in database IO, which currently isn’t our bottleneck, but would surely become so immediately.
Other reasons I like to do analytic steps locally:
- I like one data pull to get all the data from the remote server, and then do all processing locally. This vastly increases stability (I think).
- We officially support 4 or 5 different database platforms, unofficially a dozen or so, with very different SQL dialects. We can make our life a lot easier if most complex operations are done locally, in a uniform environment, rather than in the remote database jungle.