I think Impala is pretty stable; however, at Iqvia we have found that running Atlas cohort definition queries against large data sets in Impala can take a long time.
For example, a relatively simple cohort definition (warfarin use in people over 65 with atrial fibrillation) takes around 1.5 hours to run against our largest data set (600m people, 80bn events) on an installation of around 30 servers running Impala.
That data is a few TB in compressed Parquet form. When you say your data is 10 TB, what form is it in?
The query duration is dominated by the time taken to scan the initial data. Impala does not support indexing, and there is no sensible partitioning strategy for this kind of query, so you end up doing full table scans. Additionally, because the domains live in separate tables, there is no way to ensure that a given person's data is co-located.
The performance issue is exacerbated for more complex cohort definitions, which require more scans. Shuffling the intermediate results can also take significant time, though I think this depends mostly on the specificity of the concept sets used.
We have begun experimenting with a nested OMOP data model in which each person's data is co-located, which should reduce both scanning and shuffling. It seems a promising approach, and once we have reliable results I will publish them.
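To give a feel for the idea, here is a minimal Python sketch of what a nested, person-co-located record might look like, and why cohort logic over it reduces to a single pass per person with no joins or shuffles. The record layout, field names, and concept IDs below are purely illustrative, not our actual schema:

```python
from datetime import date

# Illustrative concept IDs (placeholders, not necessarily the ones we use)
AFIB, WARFARIN = 313217, 1310149

# Hypothetical nested records: all of a person's events are stored inline
# with the person, so no cross-table join is needed to evaluate the cohort.
people = [
    {"person_id": 1, "year_of_birth": 1948,
     "events": [{"concept_id": AFIB, "date": date(2014, 7, 12)},
                {"concept_id": WARFARIN, "date": date(2014, 8, 1)}]},
    {"person_id": 2, "year_of_birth": 1990,
     "events": [{"concept_id": WARFARIN, "date": date(2015, 3, 1)}]},
]

def in_cohort(person, index_year=2015):
    """Single pass over one person's co-located data: age check plus
    membership tests against the person's own event list."""
    age = index_year - person["year_of_birth"]
    concepts = {e["concept_id"] for e in person["events"]}
    return age > 65 and AFIB in concepts and WARFARIN in concepts

cohort = [p["person_id"] for p in people if in_cohort(p)]
print(cohort)  # [1]
```

With a flat, multi-table layout the same logic needs scans of each table plus joins keyed on person_id, which is where the shuffling comes from; with nesting, each row group already contains everything needed to decide membership for the people it holds.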