Scripting Hive Commands with Python
In the previous posts, we touched upon basic data processing using Hive. But it is all interactive. We will discuss how to script these Hive commands using Python.
We will start with a very basic python script and add more functionality to it by the time we reach the end of this post.
We will ensure the environment is setup correctly before getting into the scripting.
If running Python 3 or below then install the following packages
pip install pyhs2
pip install thrift_sasl=0.2.1
pip install sasl==0.2.1
if running python 3 and above you might face SASL errors , in this case turn SASL off in hive and follow the below method :)
On Error - Tsocket error with Python 3+ , Or Could not start SASL , Or TProtocol error related to SASL...
pip3 install sasl
pip3 install thrift
pip3 install thrift-sasl
pip3 install PyHive
Turn SASL off in Hive and use the pyhive python libraries
edit the hive site xml file guide path : /etc/hive/conf.dist/hive-site.xml
add the following entry below in the file and save.
<property>
<name>hive.server2.authentication</name>
<value>NOSASL</value>
</property>
Restart the Hive 2 Server so that the new setting can take affect.
To start HiveServer2:
$ sudo service hive-server2 start
To stop HiveServer2:
$ sudo service hive-server2 stop
We are good to start our scripting part
For this post, we have created a Hive table table_test with two columns and loaded with very few records.
The first script we will attempt to create will do – Connect to Hive, Execute a Hive command – a select statement and show the output on screen
We are using Python 3.4.2 version and will be utilising pyhive for python
Hive Output to Screen
from pyhive import hive
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_test")
for result in cursor.fetchall():
print(result)
The first line of the code imports the python library that we are going to be using for this post and going forward.
Using this connection handle we then execute a simple Hive Command (Select Query in this case)
That is our very first Python script. Of course it is very simple and shows the results on to the screen. Now in real life situations this is of very limited use. We typically want to store the output of the query to a file. Well, we will add functionality to our code to achieve this
Hive Output to Text File
from pyhive import hive
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_test")
with open('test.txt', 'w') as f:
for results in cursor.fetchall():
f.write("%s\n" % str(results))
Notice there is no change to import the library, create the connection handler (lines 1 – 4).
This time we will open a file (in the current directory, but if you specify the full path, you can create the file in any location you wish) and loop through the results and save line by line – Lines 5-7
When I execute this code as python3 hive_data_to_file_simple.py I see a file created in my current directory with the data
Hive Output to csv File using pandas
pip3 pandas
Save the following lines of code as hive_data_to_file_pandas.py file on Linux
from pyhive import hive
import pandas as pd
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
df = pd.read_sql("SELECT * FROM table_test", conn)
df.to_csv('test.csv', index=False, header=False)
When you run
this script the output will be stored as csv in the current directory
Pandas is one of the most widely used Python libraries in
data science. It is mainly used for data wrangling as it is very powerful and
flexible, among many other things.
So now we know how to script Hive commands. You can export the
results to file system, Google Bucket (we will cover this more in depth in
future posts) or any other data source if you have the right libraries.
Bye for now from Krishna and Shahed
Note: only a member of this blog may post a comment.