
DELIVERBI Blog - OBIEE, OBIA, ETL & Big Data

Friday, 16 November 2018


Google Drive API Google Sheets Extract to CSV with SERVICE KEY.JSON


A recent client had some Google Sheets stored on their Google Drive and needed them processed into the Hadoop cluster and made available in Presto for reporting.


As Google Drive uses OAuth authentication, we needed to use a service key instead to access the files, since we are processing them with Airflow and a Python script on a Linux server.

We hunted around and could not find any solid example of a script that works with the latest libraries.

This script will extract the Google workbook, writing each sheet as a separate CSV file within a given location path.

There is also an alternative script available that loops over a .txt file and processes Google Sheet IDs one after the next.

Mandatory Parameters

A service key, a Google Sheet ID and an output path.
Service keys can be generated for a project within Google. The email address of the service key then needs to be used when sharing the spreadsheet within Google Sheets.
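If you are unsure which email address to share the sheet with, it is held in the client_email field of the service key JSON. A quick way to check it (a small helper snippet of ours, not part of the downloadable scripts; it assumes the key file is named YOURSERVICEKEY.json):

import json

# Print the service account email address that the Google Sheet must be shared with.
with open('YOURSERVICEKEY.json') as key_file:
    print(json.load(key_file)['client_email'])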

For: gdrivedownloadsheetsLoop.py
Requires only the service key .json file name; the Google Sheet IDs are read from a .txt file you place in the same directory.

Setup

Make sure your Python library for the Google API is this version (it can be higher, but we have not tested higher versions):
pip install google-api-python-client==1.6.2
Also make sure all other libraries are installed using pip; these are visible in the script under its import statements.
Download the Python script gdrivedownloadsheets.py to a directory on Linux.
Amend the following:
Replace YOURSERVICEKEY.json with the filename of the service key you generated in Google: service_account_file = os.path.join(os.getcwd(), 'YOURSERVICEKEY.json')
Set the output path for your CSV files (1 per worksheet within a workbook): path = "/GCSDrive/output/"
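
For reference, here is a minimal sketch of the kind of logic the script uses (a simplified illustration rather than the exact code in the repository, assuming the Sheets API v4 and the service_account_file and path settings above):

import csv
import os
import sys

from googleapiclient import discovery
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']
service_account_file = os.path.join(os.getcwd(), 'YOURSERVICEKEY.json')
path = "/GCSDrive/output/"

def export_workbook(sheet_id):
    # Authenticate with the service key instead of interactive OAuth.
    credentials = ServiceAccountCredentials.from_json_keyfile_name(service_account_file, SCOPES)
    service = discovery.build('sheets', 'v4', credentials=credentials)
    # Fetch the workbook metadata to find every worksheet it contains.
    workbook = service.spreadsheets().get(spreadsheetId=sheet_id).execute()
    for sheet in workbook['sheets']:
        title = sheet['properties']['title']
        rows = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range=title).execute().get('values', [])
        # Write one CSV file per worksheet into the output path.
        with open(os.path.join(path, title + '.csv'), 'w', newline='') as out:
            csv.writer(out).writerows(rows)

if __name__ == '__main__':
    export_workbook(sys.argv[1])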


Run Command

The parameter is the Google Sheet ID, which is visible in the URL when viewing the sheet in Google.
--------------- For running 1 workbook at a time 
python3 gdrivedownloadsheets.py 1gWxy05uEcO8a8fNAfUSIdV1OcRWxH7RnjXezoHImJLE
--------------- For running Multiple Workbooks one after another 
python3 gdrivedownloadsheetsLoop.py
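
gdrivedownloadsheetsLoop.py wraps the same logic in a loop. A rough sketch of the idea (again an illustration only; sheet_ids.txt is a hypothetical name for the .txt file holding one Google Sheet ID per line, and export_workbook is the function from the sketch above):

# Read one Google Sheet ID per line and export each workbook in turn.
with open('sheet_ids.txt') as id_file:
    for line in id_file:
        sheet_id = line.strip()
        if sheet_id:
            export_workbook(sheet_id)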



Git Hub Location for Code : https://github.com/deliverbi/Gdrive-API-Google-Sheets


We will follow up on this post and blog about the framework we created to process these files directly to HDFS and through to Hive and Presto.

Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!

Cheers

Shahed Munir & Krishna Udathu




Thursday, 8 November 2018

Hive HPLSQL setup on Google DataProc





Google Dataproc Hadoop and Hive
Hive version: 2.3.2 (this version supports the Tez engine)

More information on HPL/SQL:

http://www.hplsql.org/
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156


As always, we were trying to implement some functionality and it was not working with a security setup on Hive, as well as the Tez execution engine etc.


Here are the Steps to get it going.




To enable HPL/SQL: we can't find any documentation on this for Google Dataproc anywhere!


As we have security set up on Hive, you have to edit hplsql-site.xml to connect to HiveServer2 (the Thrift server). To do this, follow the steps below.



Find the following File

/usr/lib/hive/lib/hive-hplsql-2.3.2.jar

Copy the file to a folder and unjar it 

jar -xvf hive-hplsql-2.3.2.jar


vi the file hplsql-site.xml (you will see it when you unjar the above file)


Amend the hive2conn connection and add the username and password etc.
I have also amended the config so I can run queries on Tez within my specific DB. It's all quite straightforward.

<property>
  <name>hplsql.conn.hive2conn</name>
  <value>org.apache.hive.jdbc.HiveDriver;jdbc:hive2://localhost:10000;myuser;mypassword</value>
  <description>HiveServer2 JDBC connection</description>
</property>
<property>
  <name>hplsql.conn.init.hive2conn</name>
  <value>
     set hive.execution.engine=tez;
     use myspecificdb;
  </value>
  <description>Statements for execute after connection to the database</description>
</property>


Remove the original jar file from the folder, then run the jar command within the folder where you edited the hplsql-site.xml file:

jar cvf hive-hplsql-2.3.2.jar *


Copy the jar file back to the original location

cp hive-hplsql-2.3.2.jar /usr/lib/hive/lib/.


Command to test

/usr/lib/hive/bin# ./hplsql -e "select * from mydb.mytablename"


Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!


Cheers
Shahed Munir & Krishna Udathu

Tuesday, 6 November 2018

PowerBI Custom Direct Query ODBC Connector




At DELIVERBI we have various challenges presented to us. A client of ours uses Power BI as their main reporting and analytics tool and wanted DirectQuery on any ODBC source.
We implemented Presto, Hadoop, Hive etc., as we briefly outlined in an earlier blog. To complement this solution we needed to be able to query Presto via DirectQuery for transactional-level data on the fly.

We created a Power BI custom connector for this that sits on top of a Windows ODBC connection. It's quite simple really, but we have compiled the .mez, which is the custom connector file you will require to start importing or DirectQuerying data straight away.
It's available to download from our GitHub location.


The connector supports all ODBC data sources and has been optimized for Presto and Hive too. We used the Treasure Data ODBC driver for Presto, as that's what was needed, and a MapR 2.15 ODBC driver for Hive. This connector is generic, so any ODBC driver can be used. The connector has also been tested on most versions of Power BI.
Remember to place the .mez file within your Windows Documents folder under the two named folders (Power BI Desktop\Custom Connectors), which is the default location for Power BI to pick up custom connectors.
It will appear under Other within the Power BI application.


If you would like to know more about compiling the connector, or would like a custom connector created and delivered to your inbox, skinned for a corporate look and feel, then please contact us for more information.

Enjoy the connector; we hope it helps you like it did for us. We bypassed the source code step so business users can enjoy the benefit of a fully compiled connector: just store it in the location above and start using it for DirectQuerying any ODBC data source.

Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!


Cheers
Shahed Munir & Krishna Udathu




Tuesday, 30 October 2018

Hive Tez vs Presto 12 Billion Records and 250 Columns Performance

Hive Tez vs Presto as a query engine: performance on 12 billion rows and a 250-column table with dimensions.






Here at DELIVERBI we have been implementing quite a few big data projects. At one of our more recent clients we required speedy analytics using a query engine and technology that could query almost 15 billion rows of data over various partitions. The main table has 250 columns, so it is quite wide, and the data was roughly 20TB for that one table; on top of this there were joins to various dimensions. This is a more than real-world example, with a user base of around 1,000 users querying data over 3 years' worth of daily partitions.

We tried various approaches including Spark, Impala, Drill, Hive MR and various other tech stacks.

We settled on the technologies below, which work seamlessly and have produced outstanding results for our client.

Hadoop, Hive, Presto, YARN, and Airflow for orchestration of loading.

20-node cluster - all nodes with 8 CPUs and 52 GB memory.
We went for the divide-and-conquer technique and it works!


Let's begin. I will also include some tuning tips as we go along. There is a lot more involved, but we will skim over the major points.

Hive version 2.3 upwards - we required the Tez engine here to load data daily and throughout the day into various ORC tables within the Hadoop filesystem.

Tez - it's quick, quicker than MR, and we used it to load data into the Hive tables. Tez improves over MapReduce by using Hadoop containers efficiently, allowing multiple reduce phases without intermediate map phases, and making effective use of HDFS. Make sure the Tez container size fits within the YARN container sizes.

Tez container sizes are important - roughly 80% of the YARN container size (in the configs below, the Tez AM gets 5120 MB inside a 6144 MB YARN container). These can be tuned further; otherwise you will get out-of-memory errors.

--------------------------------------------------------------- yarn-site.xml

 <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>6144</value>
    <final>false</final>
    <source>Dataproc Cluster Properties</source>
  </property>

---------------------------------------------------------------  hive-site.xml

<property>
<name>tez.am.resource.memory.mb</name>
<value>5120</value>
</property>
<property>
<name>hive.tez.container.size</name>
<value>6144</value>
</property>

After the above settings all memory errors were gone. :)

These settings within hive-site.xml also help, and we saw increased performance. We have used many more that are in line with our client's requirements.

<property>
     <name>hive.exec.dynamic.partition.mode</name>
     <value>nonstrict</value>
</property>
<property>
     <name>hive.cbo.enable</name>
     <value>true</value>
</property>
<property>
     <name>hive.compute.query.using.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.stats.fetch.column.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.stats.fetch.partition.stats</name>
     <value>true</value>
</property>
<property>
     <name>hive.vectorized.execution.enabled</name>
     <value>true</value>
</property>
<property>
     <name>hive.exec.parallel</name>
     <value>true</value>
</property>
We have loads of other settings, which help further if you will be using Hive as a query engine too.

Pre-warm some Tez Hive containers for future queries and reusability, and remove old sessions when they expire. The settings below will help with this; there are loads more settings that can be used.

<property>
 <name>hive.server2.idle.session.timeout</name>
 <value>3300000</value>
</property>
<property>
 <name>hive.server2.session.check.interval</name>
 <value>3600000</value>
</property>
<property>
 <name>hive.server2.idle.operation.timeout</name>
 <value>7200000</value>
</property>
<property>
<name>hive.server2.tez.sessions.per.default.queue</name>
<value>4</value>
</property>
<property>
<name>hive.server2.tez.initialize.default.sessions</name>
<value>true</value>
</property>
<property>
<name>hive.prewarm.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.prewarm.numcontainers</name>
<value>2</value>
</property>
<property>
<name>tez.am.container.reuse.enabled</name>
<value>true</value>
</property>


ORC table Definition - This one works very well with large data sets.

CREATE TABLE MytableFact (
  -- column definitions omitted
)
partitioned by (ETL_DATE date)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY", "orc.stripe.size"="536870912",
"orc.row.index.stride"="50000", "auto.purge"="true");


The ORC table with SNAPPY compression works very well when loading/querying in either Hive or Presto and produces results quickly.

Google ORC stripe and stride for more info. The above settings worked well for us at the data volumes we used and above.

The partitions help when you need to reload data and so does gathering statistics for the tables and columns.

To replace data within a partition we used a date column:
INSERT OVERWRITE TABLE mytablename PARTITION(ETL_DATE) ... will automatically wipe the affected partition(s) and reload the data; use INSERT INTO instead to append without wiping.


Before we load, we enable the automatic gathering of table-level statistics by setting the below in our Hive session:

set hive.stats.autogather=true


We also needed to gather statistics for the column-level data within the partitions we load:

analyze table mytablename partition(etl_date='2017-06-01') compute statistics for columns;

Presto configuration is easy: just set the JVM memory on the coordinator and worker nodes to roughly 70% of each node's memory size and restart everything.

With the above settings and data volumes the timings are as follows


Presto 
1. Fact table query with sum and group by whole table - less than 1 minute
2. Fact Table with a date filter - less than 3 seconds.
3. Fact Table with 4 Dimensional Joins and Filters and Group By on a Date - 5 seconds

As you can see, the timings are phenomenal compared to using traditional RDBMS databases with these kinds of volumes and a super-wide table.


Hive 

It's slower, but we are using it for batch jobs, with Airflow for orchestration, as we manage our YARN queues and resource capacity to control who can run which ETL job. We will cover capacity management in another blog. There is just so much to cover on what we implemented.

1. Fact table query with sum and group by whole table - less than 5 minutes
2. Fact Table with a date filter - less than 2 Minutes.
3. Fact Table with 4 Dimensional Joins and Filters and Group By on a Date - 5 Minutes

Pre-warming the containers takes roughly 30 seconds off each query.


We know big data, as we have been tuning and building our solutions on various platforms. It's the way forward.

Presto is the clear winner for interactive queries, and for our user base, which is familiar with traditional SQL, it feels like the same experience but with a rocket under the bonnet.

In the next series of blogs we will go deeper into the technical aspects, as this blog just skims the surface but outlines the main tips for performance.


Shahed Munir & Krishna Udathu @ DELIVERBI


Monday, 9 April 2018

Google Big Query Clone / Copy DataSets / Tables - PROD --> TEST --> DEV Projects

Google BigQuery Cloning of Datasets & Tables across Google Cloud Projects



We searched the internet and could not find a simple way of cloning/copying tables and datasets from PROD --> TEST --> DEV in Google BigQuery. We needed a utility that had the option of copying complete datasets across projects within Google BigQuery. There is an option within the BigQuery UI to copy one table at a time, but that would take us forever. The client is using an SDLC approach and working through 3 environments, and we needed to clone data from Production to our Test environment for shakedown testing of Airflow deliverables (DAGs etc.). There are probably a lot of customers out there who have started loading their data into an initial environment and now need to copy the datasets to other projects. Whatever the reason may be, we have this script that can be run to leverage some of the work, and it can easily be amended and scheduled to run on weekends or evenings.

Technologies Used
  • Linux
  • Python3
  • Pip library: google-cloud-bigquery (from google.cloud import bigquery)

Please make sure the client libraries are installed: pip install --upgrade google-cloud-bigquery google-cloud-storage

Notes on API.
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html

Make sure the GOOGLE_APPLICATION_CREDENTIALS environment variable is set to the service key you have downloaded from your Production project:

export GOOGLE_APPLICATION_CREDENTIALS=/u01/yourservicekey.json

If you don't have a service key, one can be generated from the Google Cloud console web GUI. A service key is an alternative to a user ID.

Make sure your service key has all the permissions required to run tasks on BigQuery, i.e. querying tables, creating tables and deleting tables, in the DEV & TEST projects. You can add the service key's email address to the DEV & TEST projects under IAM permissions to allow the script to copy datasets etc. Here you can see we have given the service account access to DEV with the BigQuery Admin role.

Check that the bq command is working on your Google Cloud Linux instance.

Try: bq ls to confirm you can list the datasets on the Linux VM instance.

The .py script has a lock so that it won't clone anything back to the Production environment, and we suggest you use it.

As for best practice, PRODUCTION → TEST → DEV is the best approach for cloning. One-off tables and structures are also catered for, and in that case DEV → TEST → PRODUCTION makes sense, but we recommend CI/CD pipelines or alternatives for this.


 
Run the script using the following command:
python3 bq_dataset_migrator.py source_project source_dataset target_project target_dataset
Variables
1 - Source Project
2 - Source Data Set
3 - Target Project
4 - Target Data Set
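
For reference, here is a minimal sketch of the kind of copy logic such a script can use with a reasonably recent google-cloud-bigquery client (a simplified illustration of the approach, not the exact bq_dataset_migrator.py code; the production check is shown only as an example guard):

import sys

from google.cloud import bigquery

def clone_dataset(source_project, source_dataset, target_project, target_dataset):
    # Example guard so nothing is ever cloned back into a Production project.
    if 'prod' in target_project.lower():
        sys.exit('Refusing to clone into a production project.')

    client = bigquery.Client(project=source_project)

    # Create the target dataset if it does not already exist.
    client.create_dataset(bigquery.Dataset('{}.{}'.format(target_project, target_dataset)),
                          exists_ok=True)

    # Copy every table in the source dataset to the target dataset.
    for table in client.list_tables('{}.{}'.format(source_project, source_dataset)):
        source_id = '{}.{}.{}'.format(source_project, source_dataset, table.table_id)
        target_id = '{}.{}.{}'.format(target_project, target_dataset, table.table_id)
        client.copy_table(source_id, target_id).result()  # Wait for each copy job to finish.
        print('Copied', source_id, '->', target_id)

if __name__ == '__main__':
    clone_dataset(*sys.argv[1:5])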
