Hive TEZ vs Presto as a Query Engine: Performance on 12 Billion Rows and a 250 Column Table with Dimensions
Here at DELIVERBI we have been implementing quite a few Big Data
projects. At one of our more recent clients we required speedy analytics using
a query engine and technology stack that could query almost 15 billion rows of data
over various partitions. The main table has 250 columns, so it is quite wide, and
the data was roughly 20TB for that one table. On top of this there were joins to
various dimensions. This is a very real-world example, with a
user base of around 1,000 users querying data over 3 years' worth of daily
partitions.
We tried various approaches, including Spark, Impala, Drill, Hive on MapReduce
and various other tech stacks.
We settled on the technologies below, which work seamlessly together and have
produced outstanding results for our client:
Hadoop, Hive, Presto, YARN, and Airflow for orchestration of loading.
20 Node Cluster - All Nodes - 8 CPUs - 52GB Memory
We went for the divide-and-conquer technique, and it works!
Let's begin. I will also include some tuning tips as we go along. There is a lot more involved, but we will cover the major points.
Hive version 2.3+ - We required the TEZ engine here to load data
daily, and throughout the day, into various ORC tables within the Hadoop
filesystem.
TEZ - It's quick, quicker than MapReduce, and we used it to load data into the
Hive tables. TEZ improves on MapReduce by using Hadoop
containers efficiently, running multiple reduce phases without intervening map phases, and making effective
use of HDFS. Make sure the TEZ container size fits within the YARN container
sizes.
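Switching Hive onto TEZ is a single setting; a minimal example, shown here as a per-session override, though it can equally be set in hive-site.xml:

-- Run Hive on the TEZ engine instead of MapReduce for this session
SET hive.execution.engine=tez;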
TEZ container sizes are important - roughly 80% of
the YARN container size. These can be tuned further; otherwise you will get out-of-memory
errors.
---------------------------------------------------------------
yarn-site.xml
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>6144</value>
<!-- set via Dataproc cluster properties -->
</property>
---------------------------------------------------------------
hive-site.xml
<property>
<name>tez.am.resource.memory.mb</name>
<value>5120</value>
</property>
<property>
<name>hive.tez.container.size</name>
<value>6144</value>
</property>
After applying the above settings, all memory errors were
gone. :)
The settings below in hive-site.xml also helped, and gave us increased
performance. We have used many more that are in line with our client's requirements.
<property>
<!-- Allow fully dynamic partition inserts (no static partition key required) -->
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<!-- Enable the cost-based optimizer -->
<name>hive.cbo.enable</name>
<value>true</value>
</property>
<property>
<!-- Answer simple aggregates (e.g. count(*)) directly from statistics -->
<name>hive.compute.query.using.stats</name>
<value>true</value>
</property>
<property>
<!-- Let the optimizer use column statistics -->
<name>hive.stats.fetch.column.stats</name>
<value>true</value>
</property>
<property>
<!-- Let the optimizer use partition-level statistics -->
<name>hive.stats.fetch.partition.stats</name>
<value>true</value>
</property>
<property>
<!-- Process rows in batches rather than one at a time -->
<name>hive.vectorized.execution.enabled</name>
<value>true</value>
</property>
<property>
<!-- Run independent stages of a query in parallel -->
<name>hive.exec.parallel</name>
<value>true</value>
</property>
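Most of these can also be toggled per session while experimenting, before committing them to hive-site.xml, for example from beeline:

-- Session-level equivalents of the hive-site.xml settings above
SET hive.cbo.enable=true;
SET hive.vectorized.execution.enabled=true;
SET hive.exec.parallel=true;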
We have loads of other settings, but if you will be
using Hive as a query engine too, pre-warm some TEZ Hive containers for future
queries and reusability, and remove old sessions when they expire. The
settings below will help with this; there are plenty more that can be
used.
<property>
<!-- Close idle HiveServer2 sessions after 55 minutes (ms) -->
<name>hive.server2.idle.session.timeout</name>
<value>3300000</value>
</property>
<property>
<!-- How often to check for expired sessions (ms) -->
<name>hive.server2.session.check.interval</name>
<value>3600000</value>
</property>
<property>
<!-- Cancel idle operations after 2 hours (ms) -->
<name>hive.server2.idle.operation.timeout</name>
<value>7200000</value>
</property>
<property>
<!-- Pool of TEZ sessions kept ready on the default queue -->
<name>hive.server2.tez.sessions.per.default.queue</name>
<value>4</value>
</property>
<property>
<!-- Start the TEZ session pool when HiveServer2 starts -->
<name>hive.server2.tez.initialize.default.sessions</name>
<value>true</value>
</property>
<property>
<!-- Pre-warm TEZ containers for future queries -->
<name>hive.prewarm.enabled</name>
<value>true</value>
</property>
<property>
<!-- Number of containers to pre-warm -->
<name>hive.prewarm.numcontainers</name>
<value>2</value>
</property>
<property>
<!-- Reuse TEZ containers across tasks instead of tearing them down -->
<name>tez.am.container.reuse.enabled</name>
<value>true</value>
</property>
ORC table definition - This one works very well with large data sets.
CREATE TABLE MytableFact (
  -- 250 fact columns elided
)
PARTITIONED BY (ETL_DATE date)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="536870912",
  "orc.row.index.stride"="50000",
  "auto.purge"="true");
The ORC table with SNAPPY compression works
very well when loading/querying in either Hive or Presto and produces results
quickly.
Google ORC stripe and stride for more info. The above
settings worked well for us at the data volumes we used and above.
The partitions help when you need to reload data,
and so does gathering statistics for the tables and columns.
To replace data within a partition we used a date
column:
INSERT OVERWRITE TABLE mytablename PARTITION(ETL_DATE) will automatically wipe the affected partition(s) and reload the data; use INSERT INTO to append instead of overwriting.
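As a minimal sketch of the pattern, assuming a hypothetical staging_fact table that holds the day's data (column names are illustrative):

-- Dynamic partition overwrite; relies on the nonstrict mode set in hive-site.xml above.
-- The partition column must be selected last.
INSERT OVERWRITE TABLE mytablename PARTITION (ETL_DATE)
SELECT
  col1,
  col2,
  -- ...remaining fact columns...
  etl_date
FROM staging_fact
WHERE etl_date = DATE '2017-06-01';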
Before we load, we enable automatic gathering of table-level
statistics by setting the following in our Hive session:
set hive.stats.autogather=true;
We also needed to gather statistics for the column-level data within the
partitions we load:
analyze table mytablename partition(etl_date='2017-06-01') compute statistics for
columns;
Presto configuration is soooo easy: just set the JVM memory on the
coordinator and worker nodes to roughly 70% of each node's memory and
restart everything.
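As a rough sketch of what that looks like on 52GB nodes (the hostname and exact values are illustrative, assuming a standard Presto etc/ layout):

etc/jvm.config on every node - heap at roughly 70% of a 52GB machine:
-server
-Xmx36G
-XX:+UseG1GC

etc/config.properties on a worker:
coordinator=false
http-server.http.port=8080
query.max-memory-per-node=28GB
discovery.uri=http://coordinator-host:8080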
With the above settings and data volumes, the timings are as follows:
Presto
1. Fact table query with a sum and group by over the whole table - less than 1
minute
2. Fact table with a date filter - less than 3 seconds
3. Fact table with 4 dimensional joins, filters and a group by on a
date - 5 seconds
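For a sense of the query shape behind timing 3 (table and column names here are hypothetical):

SELECT d.calendar_date,
       s.store_name,
       SUM(f.sales_amount) AS total_sales
FROM MytableFact f
JOIN dim_date d     ON f.etl_date = d.calendar_date
JOIN dim_store s    ON f.store_id = s.store_id
JOIN dim_product p  ON f.product_id = p.product_id
JOIN dim_customer c ON f.customer_id = c.customer_id
WHERE f.etl_date BETWEEN DATE '2017-06-01' AND DATE '2017-06-30'
  AND p.category = 'ELECTRONICS'
GROUP BY d.calendar_date, s.store_name;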
As you can see, the timings are phenomenal compared to traditional
RDBMS databases with these kinds of volumes and a super-wide table.
Hive
It's slower, but we are using it for batch jobs, with Airflow for
orchestration, and we manage our YARN queues and resource capacity to control who can run which ETL job. We will cover capacity management in another blog; there is just soooo much to cover on what we implemented.
1. Fact table query with a sum and group by over the whole table - less than 5
minutes
2. Fact table with a date filter - less than 2 minutes
3. Fact table with 4 dimensional joins, filters and a group by on a
date - 5 minutes
Pre-warming the containers takes roughly 30 seconds off each query.
We know Big Data: we have been tuning and building our solutions on
various platforms. It's the way forward.
Presto is the clear winner: our user base writes the traditional SQL it is
familiar with, but with a rocket under the bonnet.
In the next series of blogs we will go deeper into the technical aspects; this post just skims the surface but outlines the main tips for performance.
Shahed Munir & Krishna Udathu @ DELIVERBI