Tuesday, 31 October 2017

Scripting Hive Commands with Python

October 31, 2017 Hadoop and Hive

Scripting Hive Commands with Python

In the previous posts, we touched upon basic data processing using Hive. But it is all interactive. We will discuss how to script these Hive commands using Python.

We will start with a very basic python script and add more functionality to it by the time we reach the end of this post.

We will ensure the environment is setup correctly before getting into the scripting.

If running Python 3 or below then install the following packages

pip install pyhs2
pip install thrift_sasl=0.2.1
pip install sasl==0.2.1

if running python 3 and above you might face SASL errors , in this case turn SASL off in hive and follow the below method :)

On Error - Tsocket error with Python 3+ , Or Could not start SASL , Or TProtocol error related to SASL...

pip3 install sasl 
pip3 install thrift 
pip3 install thrift-sasl 
pip3 install PyHive

Turn SASL off in Hive and use the pyhive python libraries

edit the hive site xml file guide path : /etc/hive/conf.dist/hive-site.xml

add the following entry below in the file and save.

<property>
   <name>hive.server2.authentication</name>
   <value>NOSASL</value>
</property>

Restart the Hive 2 Server so that the new setting can take affect.

To start HiveServer2:

$ sudo service hive-server2 start

To stop HiveServer2:

$ sudo service hive-server2 stop

We are good to start our scripting part

For this post, we have created a Hive table table_test with two columns and loaded with very few records.

The first script we will attempt to create will do – Connect to Hive, Execute a Hive command – a select statement and show the output on screen

We are using Python 3.4.2 version and will be utilising pyhive for python

Hive Output to Screen

I have saved the following code as hive_data_print.py file on Linux

from pyhive import hive
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_test")
for result in cursor.fetchall():
  print(result)

The first line of the code imports the python library that we are going to be using for this post and going forward.

We define a connection handler to the Hive Database poc_hadoop and initiate an instance of this connection (lines 2 & 3)

Using this connection handle we then execute a simple Hive Command (Select Query in this case)

We loop and print row by row (We could do this as the data set is small) – Lines 5 & 6. Note the Print syntax could be different depending on the version of Python you are using

I run the script as python3 hive_data_print.py

And you should see the rows displayed on the screen

That is our very first Python script. Of course it is very simple and shows the results on to the screen. Now in real life situations this is of very limited use. We typically want to store the output of the query to a file. Well, we will add functionality to our code to achieve this

Hive Output to Text File

Save the following lines of code as hive_data_to_file_simple.py file on Linux

from pyhive import hive
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_test")
with open('test.txt', 'w') as f:
  for results in cursor.fetchall():
    f.write("%s\n" % str(results))

Notice there is no change to import the library, create the connection handler (lines 1 – 4).

This time we will open a file (in the current directory, but if you specify the full path, you can create the file in any location you wish) and loop through the results and save line by line – Lines 5-7

When I execute this code as python3 hive_data_to_file_simple.py I see a file created in my current directory with the data

Hive Output to csv File using pandas

Now we will try to enhance our script to be more versatile using panda data frames. You need to install pandas for this, which Shahed has done using

pip3 pandas

Save the following lines of code as hive_data_to_file_pandas.py file on Linux

from pyhive import hive
import pandas as pd 
conn = hive.Connection(host="43.32.0.85", auth='NOSASL',database='poc_hadoop')
df = pd.read_sql("SELECT * FROM table_test", conn)
df.to_csv('test.csv', index=False, header=False)

When you run this script the output will be stored as csv in the current directory

Pandas is one of the most widely used Python libraries in data science. It is mainly used for data wrangling as it is very powerful and flexible, among many other things.

So now we know how to script Hive commands. You can export the results to file system, Google Bucket (we will cover this more in depth in future posts) or any other data source if you have the right libraries.

Bye for now from Krishna and Shahed

Note: only a member of this blog may post a comment.

Translate

Tuesday, 31 October 2017

Scripting Hive Commands with Python