Why use an SSH tunnel?
If you are researching the topic of using an SSH tunnel to access Azure HDInsight, you would have stumbled upon this article on why and how to set it up.
Here is another reason why you would want to SSH tunnel in HDInsight. Let’s say that you want to establish an ODBC/JDBC connection via SSH Tunnel to the Hive Server 2 endpoint on your HDInsight cluster. My reason for doing this is very specific to an issue – 502 Bad Gateway error returned from HDInsight cluster when I execute a Hive query. If you have the same issue, please continue reading.
What is the root cause of 502 Bad Gateway Error returned from HDInsight?
502 Bad Gateway Error can be returned from the HDInsight internal central gateway. It is an expected behaviour from the HDInsight central gateway when there is a connection timeout of 2 minutes before it disconnects a connection that is considered inactive. When the Hive client initiates a request, say via a Hive ODBC/JDBC driver, the central gateway acts as a reverse proxy and routes the request to the Hive server component in the HDInsight cluster. Once the Hive server starts processing the request, a response (not the query results) needs to be sent back to the client. If that does not happen within the timeout of 2 minutes, a 502 Bad Gateway error is returned to the client.
Due to the intermittent nature of the issue I faced, it is possible that some queries result in Hive server are taking more than 2 minutes to respond. Data-refresh operations can take some time depending on a number of factors including data volume, level of optimisation using partitions, etc. Running data refresh through the Hive ODBC driver could result in often unreliable, long-running HTTPs connections.
The 502 Bad Gateway Error can be mitigated by bypassing the HDInsight internal central gateway. Bypassing the HDInsight central gateway is possible by establishing an SSH tunnel between the host which runs the client (via the Hive ODBC Driver) and the HDInsight head node. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution. There is an article from Hortonworks which explains how establish an ODBC/JDBC connection via SSH tunnel.
The Hive query will continue without 502 Bad Gateway error when it bypasses the HDInsight central gateway via the SSH tunnel. In order to perform query optimisation, a reference is available here: https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
HOW TO create SSH tunneling
Find out what is the transport mode of your HDInsight Hive Server 2
Go to Ambari -> Hive -> Configs -> Advanced
Look for hive.server2.transport.mode. If the value is HTTP this means Hive Server 2 is running on port 10001 on the head node. If transport mode is binary, it is running on port 10000 on the head node. This is the Hive Server 2 port used for both ODBC and JDBC connections.
Please make sure that hive.server2.transport.mode = http because the Hive ODBC/JDBC driver uses HTTP to connect to the Hive Server 2.
Create an SSH tunnel from Hive ODBC client host to the HDInsight head node by doing this
ssh -L 10001:[insert head node FQDN or IP address here]:10001 sshuser@
[insert head node FQDN or IP address here]
In the ODBC Data Sources (64-bit), configure a user DSN as the following:
In order to disable SSL, you must select “User Name and Password” as the authentication mechanism.
You must disable SSL because you are connecting to a private/internal endpoint of your HDInsight head node
You have just bypassed the HDInsight internal gateway using SSH tunneling.