Translate

Tuesday, 30 October 2018

Shahed

Hive Tez vs Presto 12 Billion Records and 250 Columns Performance

Hive Tez vs Presto as a query Engine & Performance12 BILLION ROWS AND a 250 Column Table with Dimensions. 






Here at DELIVERBI we have been implementing quite a few Big Data Projects. At one of our more recent clients we required speedy analytics using a query engine and technology that could query almost 15 Billion rows of data over various partitions. The main table has 250 columns so is quite wide and the data was roughly 20TB for the one table, On top of this their were joins to various dimensions. This is a more than real world example that will have a user base of around 1000 users querying data over 3 years worth of daily partitions.

We tried various approaches including spark , impala , drill , Hive MR and various other tech stacks.

We settled on the Below Technologies which work seamlessly and have produced outstanding results for our client.

Hadoop, Hive , Presto , Yarn , Airflow for Orchestration of Loading.

20 Node Cluster - All Nodes - 8cpu - 52gb Memory 
We went for the divide and conquer technique and it works !!. 


Lets Begin , I will also include some tuning tips as we go along. Their is a lot more involved but we will skim over the major points.

Hive Version 2.3+ Upwards - We required the TEZ engine here to load data daily and throughout the day to various ORC tables within the Hadoop Filesystem.

TEZ - Its quick , quicker than MR and we used it to load data into the Hive tables. TEZ is prominent over map reduce by using hadoop containers efficiently, multiple reduce phases without map phases and effective use of HDFS. Make sure the tez container size fits within the Yarn Container sizes.

TEZ Container Sizes are important - Roughly 80% of the Yarn Container size. These can be tuned further otherwise you will get out of memory errors.

--------------------------------------------------------------- yarn-site.xml

 <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>6144</value>
    <final>false</final>
    <source>Dataproc Cluster Properties</source>
  </property>

---------------------------------------------------------------  hive-site.xml
<name>tez.am.resource.memory.mb</name>
<value>5120</value>
</property>
<property>
<name>hive.tez.container.size</name>
<value>6144</value>
</property>

After the above settings all memory errors were gone. :)

Also these settings help within the hive-site.xml and we had an increased performance. We have used many more that are in line with our clients requirements.

<property>
     <name>hive.exec.dynamic.partition.mode</name>
     <value>nonstrict</value>
</property>
<property>
     <name>hive.cbo.enable</name>
     <value>true</value>
</property>
<property>
     <name>hive.compute.query.using.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.stats.fetch.column.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.stats.fetch.partition.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.vectorized.execution.enabled</name>
     <value>true</value>
</property>
<property>
     <name>hive.exec.parallel</name>
     <value>true</value>
</property>
<property>

We have loads of other settings but if you will be using hive as a query engine too. 

Pre warm some Tez Hive containers for future queries and re usability and remove old sessions when they expire etc the settings below will help with this . There are loads more settings that can be used.

<property>
 <name>hive.server2.idle.session.timeout</name>
 <value>3300000</value>
</property>
<property>
 <name>hive.server2.session.check.interval</name>
 <value>3600000</value>
</property>
<property>
 <name>hive.server2.idle.operation.timeout</name>
 <value>7200000</value>
</property>
<property>
<name>hive.server2.tez.sessions.per.default.queue</name>
<value>4</value>
</property>
<property>
<name>hive.server2.tez.initialize.default.sessions</name>
<value>true</value>
</property>
<property>
<name>hive.prewarm.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.prewarm.numcontainers</name>
<value>2</value>
</property>
<property>
<name>tez.am.container.reuse.enabled</name>
<value>true</value>
</property>


ORC table Definition - This one works very well with large data sets.

CREATE TABLE MytableFact(
)  
partitioned by (ETL_DATE date)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY" , "orc.stripe.size"="536870912" 
, "orc.row.index.stride"="50000","auto.purge"="true");


The ORC table with a compression of SNAPPY works very well when loading/querying either in hive or presto and produces results quick.

Google Stripe and Stride for more info . The above settings worked well for us with the data volumes we used and above.

The partitions help when you need to reload data and so does gathering statistics for the tables and columns.

To replace data within a partition we used a Date column.
INSERT OVERWRITE TABLE mytablename PARTITION(ETL_DATE) (Will automatically Wipe the partition(s) and reload data), Insert into 


Before we load we enable the gathering of Table level statistics automatically by setting the below in our hive session

set hive.stats.autogather=true


We also needed to gather statistics for the column level data within partitions we load

analyze table mytablename(etl_date='2017-06-01') compute statistics for columns;

Presto Configurations are soooo easy just set the JVM memory on the Co-ordinator and worker nodes to roughly 70% of the cluster mem size and restart everything.

With the above settings and data volumes the timings are as follows


Presto 
1. Fact table query with sum and group by whole table - less than 1 minute
2. Fact Table with a date filter - less than 3 seconds.
3. Fact Table with 4 Dimensional Joins and Filters and Group By on a Date - 5 seconds

As you can see the timings are phenomenal compared to using traditional RDBMS databases with these kind of volumes and a super wide table.


Hive 

Its slower but we are using it for Batch Jobs using Airflow for orchestration as manage our yarn queue's and resource capacity on who can run which etl job. We will cover capacity-management within another blog. Their is just soooo much to cover on what we implemented.

1. Fact table query with sum and group by whole table - less than 5 minutes
2. Fact Table with a date filter - less than 2 Minutes.
3. Fact Table with 4 Dimensional Joins and Filters and Group By on a Date - 5 Minutes

Pre-Warming the containers take 30 seconds off each query roughly.


We know Big Data as we have been tuning and have built our solutions on various platforms. Its the way forward

Presto is a clear winner and using Hive for our user base that is familiar with traditional sql but with a rocket under the bonnet made it clear winner.

On the next series of blogs we will go deeper into the technical aspects as this blog just skims the surface but outlines the main tips for performance.


Shahed Munir & Krishna Udathu @ DELIVERBI


About Authors

Shahed Munir

Krishna Udathu

Shahed and Krishna are Oracle / Big Data Experts