Stability of Impala or Hadoop-based DB

SCYou · July 23, 2018, 11:41pm

Hi all,

Someone with large-size health care database asked me to use possibility of hadoop-based CDM.
I answer that Impala will be the best option for this.

Still, they’re concerned with stability of Impala.
Does anyone have experince using Impala with large-size data? (more than 10 TB?).

Thank you

Mark_Danese · July 24, 2018, 12:20am

We have found that the newer releases of Impala are very good. Run-time filtering seems to have sped things up quite a bit. We use Impala on some ~6TB data sources. However, we are doing extremely complicated SQL, and Impala seems to do well with it. As with any rapidly evolving technology, it is important to stay reasonably current, and to engage some experienced data engineers to address specific use cases.

In full disclosure, this is our own software, and not OHDSI tools. @shawndolley will know more since he works for Cloudera and created the Impala working group within OHDSI. I know that he and others have done some good work on adapting the OHDSI tools to work with Impala.

SCYou · July 24, 2018, 1:27am

Thank you @Mark_Danese. It’s really helpful!

michael.davies · July 26, 2018, 2:35pm

Hi @SCYou

I think that Impala is pretty stable, however, at Iqvia we have found that running Atlas cohort definition queries can take a long time on large data sets in Impala.

For example, a relatively simple cohort definition (Warfarin use for people with Atrial fibrillation over 65 years old) takes around 1.5 hours to run against our largest data set (600m people, 80bn events), on an installation with around 30 servers running Impala.

This data is a few TB in compressed parquet form, when you say your data is 10TB what form is that data in?

The query duration is dominated by the time taken to scan the initial data. Impala does not support indexing, and there is no sensible partitioning strategy that can be used in this case, so you end up doing full table scans. Additionally, it is not possible to ensure person data is co-located when using separate tables.

The performance issue is exacerbated for more complex cohort definitions where more scans are required. It is also possible that shuffling the intermediary results can take significant time, but I think this depends mostly on the specificity of the concepts sets used.

We have begun some experimentation with a nested OMOP data model, where person data is co-located, which should reduce scans and shuffling. This seems to be a promising approach and when we have some reliable results I can publish them.

Mick

parisni · July 26, 2018, 2:55pm

Korean National Health Insurance asked me to adopt hadoop-based DB for CDM.
I think Impala is the best option for this. But the government is
concerned about the stability of Impala because the data size would be
so huge.

We are using apache hive on top of ORC (an indexed format alternative to apache parquet) based on hadoop, and it has proven it’s
robustness over years and now is ACID compliant (just like impala/kudu
AFAIK). All three LLAP, facebook presto (known to use this internally) and apache spark deliver a in
memory fonctionality on top of apache hive (ORC format) that let
envisage a very broad future for that entirely opensource set of technologies.

Cheers,

Mark_Danese · July 26, 2018, 6:18pm

@michael.davies

We have also experimented with a nested data model on Impala. This can change the queries. We have also then written views on top of the nested data model so that the original queries work. Unfortunately, we ran into issues in creating large nested datasets on our tiny cluster, so we couldn’t do any real performance testing. But our preliminary findings on small data were favorable. We would love to hear more about your results since we have not had the time to really get into this.

Mark

Chris_Knoll · July 27, 2018, 4:47am

On our MPP systems (Redshift, PDW), the co-location of data is critical. SqlRender is even built to create DISTRIBUTEION=HASH(PERSON_ID) commands on create tables (with a PERSON_ID) so that data can be co-located. If you can figure out a mechanism to get the data for a person co-located then that should help immensely.

I am not sure I heard correctly at one of our F2F, but I got the impression that the implementation using Hadoop/Impala was modeling their data strictly around the CDM’s domain table structure (like clustering out the tables in some way, but not recognizing the connection between the domain tables and the person table). Instead, I think the best design is to consider a person a ‘hub’ in a star-configuration, and the other domain tables are references back to PERSON, and you partition all these satellite domain tables around individual (or a range) of people. then, finding patient-related data is always local.

michael.davies · July 27, 2018, 10:48am

@Chris_Knoll There is no mechanism in Impala to distribute tables so that related data in different tables are co-located. This is one of the reasons people use nested types. https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html contains a section describing this.

Nesting person related data can reduce shuffling but as @Mark_Danese says it requires non-trivial changes to the query SQL.

We are putting resources into looking at this now and so we will hopefully have some results later this year.

Are other people running OMOP tools on large data sets in Impala? I don’t really have a feel for how much it has been used

gregk · July 27, 2018, 11:00am

Impala is a fantastic capability to have. It allow to optimize the data management process quite a bit

Here are some findings from our side:

For Impala to work well in Big data space, one should be very good with platform architecture and ensure an appropriate size cluster that would fit the data needs. If not, it will not be forgiving…
For Impala to work well in Big data space, one should understand how SQL queries should be built to take advantage of the cluster architecture. It is not a rocket science and if done properly - it works really well. If not, Impala will not be forgiving… Simply taking a SQL statement built for another traditional RDBMS will most not likely just work. As a part of ATLAS / SQLRenderer work, the Odysseus team has been continuously working on optimizing the SQL generated for Impala

gregk · July 27, 2018, 11:06am

hey Michael - we have done multiple ATLAS deployments on Impala - there are multiple customers in Pharma that use it with very large data sets.

As a result, we are continuously working on optimizing ATLAS Impala queries. As you correctly statet, these are often non-trivial changes and, unfortunately, with the approach ATLAS / WebAPI to SQL renderer it might impact other SQL dialects. So, we are carefully looking at every SQL and trying to find a good balance. Should we team up with you on this?

michael.davies · July 27, 2018, 1:34pm

@gregk Yes that would be great. I’ll get in contact. Thanks

pavgra · July 27, 2018, 3:53pm

@michael.davies, some of issues which I’ve found while trying to make cohort SQL running using small Impala cluster and big dataset were posted here: https://github.com/OHDSI/circe-be/issues/45. Three of them (all except for the first one) are more general logic issues which affected all DBs and they were resolved by @Chris_Knoll, while the first one doesn’t seem to be solvable in a generic way (if we do what I proposed there, it will ruin usual RDBMS performance).

I also thought of nested data structure, because today, even in classic RDBMS, significant amount of time is spent for joining on time intervals. So if we could enable such structure where a person nests observations and those nest drug_exposure, condition_occurence, etc, I believe we’d get real performance gain. What is more, the same data structure could be used e.g. for Solr / Mongo.

Mark_Danese · July 27, 2018, 6:17pm

For what it is worth, we have had good success by using temp tables to reduce the complexity of queries. We actually run most of our protocols using a memory limit of 30 GB, even on the 5 TB MarketScan data. This is on a ~20 node cluster. Just to provide additional context, we extract the data for most protocols in 30 minutes to 2 hours. This starts at the raw data (OMOP v 4.5 but no cohort table), creates the cohort, gathers baseline variables and outcome variables and saves all the data to 2 output tables (one with one record per person and the other with multiple records per person). In fact, much of this works decently on a 6-node test cluster, albeit a little slower.

michael.davies · July 31, 2018, 8:11am

@Mark_Danese - you said at the top of the thread that you are running your own software on the data and not OHDSI tools. Is that because you are doing specialized stuff or because of performance?

clairblacketer · July 31, 2018, 12:43pm

If you all have any suggestions on how to improve the Impala ddl, please let me know or post an issue to the github. Also, in upcoming versions of the CDM, would any of you that use Impala be willing to be a tester prior to release?

Mark_Danese · July 31, 2018, 1:55pm

We have our own software because we approach the design and implementation of protocols differently. The software has been a project that has been going on for 5 years, before the OHDSI tools even existed.

Christian_Reich · August 1, 2018, 11:44am

@clairblacketer: This is going to be a longer thread until these guys can land something. Not ready for Github. But we should watch over the shoulder.

That’s like before the Big Bang. Nothing knowable exists from those prehistoric times.

michael.davies · October 24, 2018, 8:48am

Here is an update on some results we had running cohort generation on a cluster of 3 machines each with 24 hyperthreaded cores and 256g memory.

The OMOP dataset we tested against had 88M persons and about 5BN events in total, and was a little over 200GB in compressed parquet files.

Using a nested model and a purpose build UDF in Spark to generate the cohort we could generate cohorts in around 10 to 15s. The compares with using Impala on flat tables which took around 10x longer.

The resulting cohort from Impala differed slightly to the Spark cohort, and we tracked this down to how Impala treats timestamp fields. See https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_timestamp.html for details, when we set the flag -convert_legacy_hive_parquet_utc_timestamps=true to fix this issue, Impala cohort generation performance degraded significantly, cohort generation taking over 30 minutes. This is a known issue - see https://issues.apache.org/jira/browse/IMPALA-3316. So it is clear if you are using Impala you need to be very careful about how you generate your source parquet files.

Mark_Danese · October 30, 2018, 1:57pm

Hi Michael:
Can you describe how you created the nested data structure (or provide a little code)? We have done it, but we typically can’t do anything other than tiny datasets due to memory issues. Maybe its the cluster, or maybe we are just doing something silly. I think the nested structure has a lot of promise on Impala and would like to test some larger databases using it.