Connecting Data Quality Dashboard with Impala or Spark

ranatech · January 11, 2024, 9:15am

Hi All,

I need your help and advice on setting up the DQD (Data Quality Dashboard) with Impala or Spark. I referred to the “Connecting to Various Database Platforms • DatabaseConnector” page but could not find much setup information for Impala or Spark. My first preference is Impala, and my second is Spark. It would be great if anyone could share their views on this.

Please note that our cluster is Kerberized, so the setup needs to work in that environment. We are using the Cloudera CDP product.

Any help in this regard is much appreciated.

Best,
Rana

katy-sadowski · January 12, 2024, 1:07am

Hi! See the DQD Getting Started page for setup instructions: Getting Started • DataQualityDashboard. Fill in the arguments to createConnectionDetails according to the instructions for your platform in DatabaseConnector here: Connecting to a database • DatabaseConnector

ranatech · January 13, 2024, 9:57am

Thanks for the pointers, Katy. However, I am running into an issue: I get the error below when trying to connect via the JDBC URL.

Unable to connect JDBC to jdbc:hive2://server.ip.address.com:port_no/DB_Name;principal=hive/_Host@server;serviceDiscoverMode=zookeeper;ssl=true;zookeeperNamespace=hiveserver2
JDBC ERROR: Could not open client transport for any of the Server URI's in Zookeeper:javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException:unable to find valid certification path to requested path.

Any idea how to resolve this? I am trying to set up the DQD tool in a Kerberized environment within CDSW. Thanks in advance for the help.
Best,
Rana
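
A note on the error above: a PKIX "unable to find valid certification path" failure generally means the JVM making the connection does not trust the certificate chain presented by the server. One common approach, sketched below, is to point the Hive JDBC driver at a truststore that contains the cluster's CA certificate through the `sslTrustStore` and `trustStorePassword` URL parameters. This is only a sketch under assumptions: the class `HiveJdbcUrl`, the host, database, Kerberos principal, truststore path, and password are all hypothetical placeholders, and whether this resolves the problem inside CDSW has not been verified. Note also that Hive's documentation spells the discovery parameter `serviceDiscoveryMode=zooKeeper`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles a HiveServer2 JDBC URL naming an explicit truststore. */
final class HiveJdbcUrl {

    private HiveJdbcUrl() {}

    /**
     * Builds "jdbc:hive2://hosts/database;key=value;..." from its parts.
     * Session parameters are appended in the order they were inserted.
     */
    static String build(String hosts, String database, Map<String, String> sessionParams) {
        StringBuilder url = new StringBuilder("jdbc:hive2://")
                .append(hosts).append('/').append(database);
        for (Map.Entry<String, String> p : sessionParams.entrySet()) {
            url.append(';').append(p.getKey()).append('=').append(p.getValue());
        }
        return url.toString();
    }

    /** Session parameters for a Kerberized, TLS-enabled cluster using ZooKeeper discovery. */
    static Map<String, String> kerberizedTlsParams(String principal,
                                                   String trustStorePath,
                                                   String trustStorePassword) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("principal", principal);
        params.put("serviceDiscoveryMode", "zooKeeper");
        params.put("zooKeeperNamespace", "hiveserver2");
        params.put("ssl", "true");
        // Truststore holding the CA certificate that signed the server certificate.
        params.put("sslTrustStore", trustStorePath);
        params.put("trustStorePassword", trustStorePassword);
        return params;
    }

    public static void main(String[] args) {
        // All values below are placeholders (assumptions); substitute your own.
        Map<String, String> params = kerberizedTlsParams(
                "hive/_HOST@EXAMPLE.COM", "/path/to/truststore.jks", "changeit");
        System.out.println(build("zk1.example.com:2181", "cdm", params));
    }
}
```

The resulting string could then be supplied as the `connectionString` argument of `createConnectionDetails` in DatabaseConnector. Embedding the truststore password in the URL is a convenience for testing only; an alternative is to import the CA certificate into the truststore the JVM already uses.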

katy-sadowski · January 14, 2024, 5:19pm

Unfortunately, I’m unfamiliar with Impala and the infrastructure you’re referencing, so I won’t be much help in debugging this. You might want to search the forums and GitHub issues for others mentioning Impala, or hope an Impala person sees your post here.

I also forgot to flag that Impala is now considered a deprecated database platform and is no longer officially supported in the HADES ecosystem (Supported Database Platforms). So you might want to try Spark instead!

ranatech · January 15, 2024, 2:08am

Thanks so much for looking into this, Katy. I appreciate it!

Also, I will certainly look into the Spark configuration for the OHDSI DQD tool within my CDSW R 4.1 environment, but it would be great if you could share any pointers I should be aware of before exploring this option. Thanks again for your input.

Best,
Rana