Friday, 22 November 2019


GCS HADOOP/Hive to Google Big Query Migrate Data Quickly ORC Format

GCS Dataproc Hadoop(Hive) to GCS Big Query (BQ) Quick Transfer ORC File Formats

Software Used

Google Dataproc - Google's Hadoop cluster with add ons such as spark etc.
GCS (Google Cloud Storage) - Buckets to store data
Google Big Query - Googles data query engine

The Problem

Customer needed to Move over 50tb of data in orc file format from GCS Dataproc Cluster to Google Big Query quickly with a small footprint.

The Solution

On GCS - Created a Bucket that is linked to the same project as the DataProc Instance

On Dataproc (Hadoop Master Node) - Invoked a distributed copy command to copy the data and all of its contents to the bucket i created earlier within GCS Storage.

hadoop distcp hdfs://deliverbi:8020/user/hive/warehouse/deliverbi_db.db/big_table/* gs://deliverbi_bucket/deliverbi_all_tables/big_table/

The command will copy all the data from the underlying HDFS for the Hive table to the Google Storage bucket. Remember to take all the subfolders or partition folders as these will be handled as columns in Big Query using a CLI command to import the data. The Orc Import in Big Query is quite clever and recognises the Hive folder layout such as a partition folder called 

"my_date=20190103" -> "my_sub_partition_field=1"

These partition folders will get created as fields within the BQ Table with the cli command to import data into Big Query. This way Hive functionality of field querying is maintained within BQ.

Once the Data is in the Bucket then goto Big Query and create a data set if needed in your project. The table will get created automatically based on the ORC File format.

Open a cloud shell window - or Google Cloud SDK to issue a BQ command.

This command will import the data into Big Query from the Orc files we exported into the GCS bucket using distcp.

BQ Command to import a Partitioned Table 
All files will get imported within subfolders etc 
-- url_prefix -- give the parent folder that contains the data for your table
-- BQ Table Name
-- GCS Files location

bq load --source_format=ORC --hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=gs://deliverbi_bucket/deliverbi_all_tables/big_table/ \
deliverbi_project1:deliverbi_dataset.big_table gs://deliverbi_bucket/deliverbi_all_tables/big_table/big_table_files/*

BQ Command to import a non Partitioned Table
All files will get imported within subfolders etc 
-- BQ Table Name
-- GCS Files location

bq load --source_format=ORC deliverbi_project1:deliverbi_dataset.big_table gs://deliverbi_bucket/deliverbi_all_tables/big_table/non_paritioned_data_files_folder/*

Over and Out 

Shahed Munir

About Authors

Shahed Munir

Krishna Udathu

Shahed and Krishna are Oracle / Big Data Experts