Thursday, 26 September 2019

Yarn - Long Running Jobs Alerts for Hive TEZ , Application Containers

Hadoop / Hive / Yarn / Applications Long running Jobs alert script

Currently at a client where their are some long running jobs in hive that hog the resources , You can see the logs in the application history manager or yarn etc. But we needed an alert mechanism. So we wrote this python script to detect any over-runs , it will spit out a file with the relevant records that are over running . The script can be run with any method really.Airflow . ODI , Cron Job just add a step to send a mail using xmail or what ever suits you.

We use Airflow and if the file produced contains any records we send an email to the Admins to see whats going on , Alternatively you could always issue a yarn application kill command.

The script is within our Github Python.

Over and Out

Shahed and Krishna

Thursday, 29 August 2019

HADOOP Balancing the cluster

Hadoop Balancing the Cluster

Balancing an hadoop cluster i feel is very important and the nodes should have a deviation of no more than 1%. I just feel as though this helps in having a healthy hadoop cluster , Schedule a job to run maybe once a week on a quiet day. The reason im saying quiet is the command i use stretches the balancer to run very fast 100x faster than the normal way of just running the balancer

The following command - speeds up the balancer to 100x

sudo as hdfs - or the hdfs user.

Firstly set this

hdfs dfsadmin -setBalancerBandwidth 100000000 

then run the following command - you can run as nohup or background job if you want. roughly takes care of 350gb in 5 minutes.

hdfs balancer -Ddfs.balancer.movedWinWidth=5400000 -Ddfs.balancer.moverThreads=1000 -Ddfs.balancer.dispatcherThreads=200 -Ddfs.datanode.balance.max.concurrent.moves=5 -Ddfs.balance.bandwidthPerSec=100000000 -Ddfs.balancer.max-size-to-move=10737418240 -threshold 1

Wednesday, 6 March 2019

Apache Airflow 1.10.2 Maintenance Dags

Apache Airflow 1.10 / 1.10.2 + Maintenance Dags (Logs)

We have noticed on Airflow 1.10.+ , The maintenance dags to manage database and file system logs need to be updated to accomodate extra tables etc.

We have uploaded the new dags to our GitHub Repo :

Friday, 16 November 2018

Google Drive API Google Sheets Extract to CSV with SERVICE KEY.JSON

Recent Client had some Google Sheets stored on their Google Drive and needed them processing into the Hadoop Cluster and available in presto for reporting.

As Google Drive uses Oauth authentication we needed to use a Service Key instead to access the files needed as we are processing them using airflow and a python script on a linux server.

We hunted around and could not find any solid example of a script that works on with the latest libraries. 

This script will extact the google workbook and each sheet will be a seperate csv file. within a given location path.

There is also an alternative script available that can loop over a .txt file and process google sheet ID's one after the next . 

Mandatory Parameters

Service Key and a Google Sheet ID and a path
Service Keys can be generated for a project within google. Then the email address of the service key needs to be used for sharing of the spreadsheet within google sheets

For :
Requires only service key.json file name, The google sheet names are derived from a.txt file you place in the same directory


Make sure your python library for google api is this version ( can be higher but not tested on higher versions)
pip install google-api-python-client==1.6.2
Also make sure all other libraries are installed using pip these are visible in the script under import.
Download Python Script to a directory on Linux.
Ammend the Following
Replace YOURSERVICEKEY.json with the filename of the service key you generated in Google. service_account_file = os.path.join(os.getcwd(), 'YOURSERVICEKEY.json')
This will be your output Path for your csv files (1 per worksheet within a workbook) path = "/GCSDrive/output/"

For : Requires only service key.json file name, The google sheet names are derived from a.txt file you place in the same directory

Run Command

The parameter is the google sheet ID which is visible in most url's when viewing the sheet in google.
--------------- For running 1 workbook at a time 
python3 1gWxy05uEcO8a8fNAfUSIdV1OcRWxH7RnjXezoHImJLE
--------------- For running Multiple Workbooks one after another 

Git Hub Location for Code :

We will follow through with this post and blog the framework we created to Process these files directly to HDFS and through to Hive and Presto.

Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!


Shahed Munir & Krishna Udathu

Thursday, 8 November 2018

Hive HPLSQL setup on Google DataProc

Hive HPLSQL setup on Google DataProc

Google Dataproc Hadoop and Hive
Hive Version : 2.3.2 (Version Supports Tez Engine)

More Information on hplsql

As always we are trying to implement some functionality and its not working with a security setup on Hive aswell as the Tez execution engine etc.

Here are the Steps to get it going.

To enable HPLSQL . Cant find any documentation on this for google dataproc anywhere !!

As we have security etc setup on hive you have to edit the hplsql-site.xml to connect to the thriftserver2 for hive and to do this you need to follow the steps below

Find the following File


Copy the file to a folder and unjar it 

jar -xvf hive-hplsql-2.3.2.jar

vi the file hplsql-site.xml (You Will see it when you unjar the above file)

Ammend the hiveconn2 connection and add username and password etc
I have also ammended the config so i can run querys on tez within my specific DB. Its all quite straightforward.

  <name>hplsql.conn.hive2conn</name>  <value>org.apache.hive.jdbc.HiveDriver;jdbc:hive2://localhost:10000;myuser;mypassword</value>
  <description>HiveServer2 JDBC connection</description>
     set hive.execution.engine=tez;
     use myspecificdb;
  <description>Statements for execute after connection to the database</description>

Remove the jar file currently in folder and run the jar command within the folder you are editing the hplsql-site.xml file

jar cvf hive-hplsql-2.3.2.jar *

Copy the jar file back to the original location

cp hive-hplsql-2.3.2.jar /usr/lib/hive/lib/.

Command to test

/usr/lib/hive/bin# ./hplsql -e "select * from mydb.mytablename"

Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!

Shahed Munir & Krishna Udathu

Tuesday, 6 November 2018

PowerBI Custom Direct Query ODBC Connector

PowerBI Custom Direct Query ODBC Connector

At DELIVERBI we have various challenges presented to us and a client of ours uses Power BI as their main reporting and analytics tool and wanted Direct Query on any ODBC Source.
We implemented Presto , Hadoop , hive etc as we outlined in an earlier blog briefly. To compliment this solution we needed to be able to query Presto on a Direct Query for transactional level data on the fly.

We created a Power BI custom connector for this that sits on top of a windows ODBC connection. Its quite simple really but we have compiled the .mez which is the custom connector file that you will require to start importing or Direct Querying data straight away.
Its available to download from our github Location

The Connector Supports all ODBC Datasources and has been optimized for Presto and Hive too. We used the treasure data ODBC driver for Presto as that's what was needed and a mapr 2.15 ODBC driver for Hive. This Connector is generic so any ODBC Driver can be used. Also the connector has been tested on most versions of PowerBI.
Remember to place the .mez file within your windows documents folder under 2 named folders which is the default location for Powerbi to pick up custom connectors.
It will appear under Other within the Powerbi application.

If you would like to know more about compiling the connector or have a custom connector created and delivered to your inbox that's skinned for a corporate look and feel then please contact us for more information.

Enjoy the connector and helps like it did for us. We bypassed the source code etc so business users can enjoy the benefit of a fully compiled Connector , Just store it in the location and start using it for Direct Querying any ODBC Datasource.

Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!

Shahed Munir & Krishna Udathu

Tuesday, 30 October 2018

Hive Tez vs Presto 12 Billion Records and 250 Columns Performance

Hive Tez vs Presto as a query Engine & Performance12 BILLION ROWS AND a 250 Column Table with Dimensions. 

Here at DELIVERBI we have been implementing quite a few Big Data Projects. At one of our more recent clients we required speedy analytics using a query engine and technology that could query almost 15 Billion rows of data over various partitions. The main table has 250 columns so is quite wide and the data was roughly 20TB for the one table, On top of this their were joins to various dimensions. This is a more than real world example that will have a user base of around 1000 users querying data over 3 years worth of daily partitions.

We tried various approaches including spark , impala , drill , Hive MR and various other tech stacks.

We settled on the Below Technologies which work seamlessly and have produced outstanding results for our client.

Hadoop, Hive , Presto , Yarn , Airflow for Orchestration of Loading.

20 Node Cluster - All Nodes - 8cpu - 52gb Memory 
We went for the divide and conquer technique and it works !!. 

Lets Begin , I will also include some tuning tips as we go along. Their is a lot more involved but we will skim over the major points.

Hive Version 2.3+ Upwards - We required the TEZ engine here to load data daily and throughout the day to various ORC tables within the Hadoop Filesystem.

TEZ - Its quick , quicker than MR and we used it to load data into the Hive tables. TEZ is prominent over map reduce by using hadoop containers efficiently, multiple reduce phases without map phases and effective use of HDFS. Make sure the tez container size fits within the Yarn Container sizes.

TEZ Container Sizes are important - Roughly 80% of the Yarn Container size. These can be tuned further otherwise you will get out of memory errors.

--------------------------------------------------------------- yarn-site.xml

    <source>Dataproc Cluster Properties</source>

---------------------------------------------------------------  hive-site.xml

After the above settings all memory errors were gone. :)

Also these settings help within the hive-site.xml and we had an increased performance. We have used many more that are in line with our clients requirements.


We have loads of other settings but if you will be using hive as a query engine too. 

Pre warm some Tez Hive containers for future queries and re usability and remove old sessions when they expire etc the settings below will help with this . There are loads more settings that can be used.


ORC table Definition - This one works very well with large data sets.

partitioned by (ETL_DATE date)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY" , "orc.stripe.size"="536870912" 
, "orc.row.index.stride"="50000","auto.purge"="true");

The ORC table with a compression of SNAPPY works very well when loading/querying either in hive or presto and produces results quick.

Google Stripe and Stride for more info . The above settings worked well for us with the data volumes we used and above.

The partitions help when you need to reload data and so does gathering statistics for the tables and columns.

To replace data within a partition we used a Date column.
INSERT OVERWRITE TABLE mytablename PARTITION(ETL_DATE) (Will automatically Wipe the partition(s) and reload data), Insert into 

Before we load we enable the gathering of Table level statistics automatically by setting the below in our hive session

set hive.stats.autogather=true

We also needed to gather statistics for the column level data within partitions we load

analyze table mytablename(etl_date='2017-06-01') compute statistics for columns;

Presto Configurations are soooo easy just set the JVM memory on the Co-ordinator and worker nodes to roughly 70% of the cluster mem size and restart everything.

With the above settings and data volumes the timings are as follows

1. Fact table query with sum and group by whole table - less than 1 minute
2. Fact Table with a date filter - less than 3 seconds.
3. Fact Table with 4 Dimensional Joins and Filters and Group By on a Date - 5 seconds

As you can see the timings are phenomenal compared to using traditional RDBMS databases with these kind of volumes and a super wide table.


Its slower but we are using it for Batch Jobs using Airflow for orchestration as manage our yarn queue's and resource capacity on who can run which etl job. We will cover capacity-management within another blog. Their is just soooo much to cover on what we implemented.

1. Fact table query with sum and group by whole table - less than 5 minutes
2. Fact Table with a date filter - less than 2 Minutes.
3. Fact Table with 4 Dimensional Joins and Filters and Group By on a Date - 5 Minutes

Pre-Warming the containers take 30 seconds off each query roughly.

We know Big Data as we have been tuning and have built our solutions on various platforms. Its the way forward

Presto is a clear winner and using Hive for our user base that is familiar with traditional sql but with a rocket under the bonnet made it clear winner.

On the next series of blogs we will go deeper into the technical aspects as this blog just skims the surface but outlines the main tips for performance.

Shahed Munir & Krishna Udathu @ DELIVERBI

Yarn - Long Running Jobs Alerts for Hive TEZ , Application Containers

Hadoop / Hive / Yarn / Applications Long running Jobs alert script Currently at a client where their are some long running jobs...