Friday, 22 November 2019

GCS HADOOP/Hive to Google Big Query Migrate Data Quickly ORC Format

November 22, 2019 Hadoop and Hive

GCS Dataproc Hadoop(Hive) to GCS Big Query (BQ) Quick Transfer ORC File Formats

Software Used

Google Dataproc - Google's Hadoop cluster with add ons such as spark etc.

GCS (Google Cloud Storage) - Buckets to store data

Google Big Query - Googles data query engine

The Problem

Customer needed to Move over 50tb of data in orc file format from GCS Dataproc Cluster to Google Big Query quickly with a small footprint.

The Solution

On GCS - Created a Bucket that is linked to the same project as the DataProc Instance

On Dataproc (Hadoop Master Node) - Invoked a distributed copy command to copy the data and all of its contents to the bucket i created earlier within GCS Storage.

hadoop distcp hdfs://deliverbi:8020/user/hive/warehouse/deliverbi_db.db/big_table/* gs://deliverbi_bucket/deliverbi_all_tables/big_table/

The command will copy all the data from the underlying HDFS for the Hive table to the Google Storage bucket. Remember to take all the subfolders or partition folders as these will be handled as columns in Big Query using a CLI command to import the data. The Orc Import in Big Query is quite clever and recognises the Hive folder layout such as a partition folder called

"my_date=20190101"

"my_date=20190102"

"my_date=20190103" -> "my_sub_partition_field=1"

These partition folders will get created as fields within the BQ Table with the cli command to import data into Big Query. This way Hive functionality of field querying is maintained within BQ.

Once the Data is in the Bucket then goto Big Query and create a data set if needed in your project. The table will get created automatically based on the ORC File format.

Open a cloud shell window - or https://cloud.google.com/sdk/ Google Cloud SDK to issue a BQ command.

This command will import the data into Big Query from the Orc files we exported into the GCS bucket using distcp.

BQ Command to import a Partitioned Table
All files will get imported within subfolders etc

-- url_prefix -- give the parent folder that contains the data for your table
-- BQ Table Name
-- GCS Files location

bq load --source_format=ORC --hive_partitioning_mode=AUTO \

--hive_partitioning_source_uri_prefix=gs://deliverbi_bucket/deliverbi_all_tables/big_table/ \

deliverbi_project1:deliverbi_dataset.big_table gs://deliverbi_bucket/deliverbi_all_tables/big_table/big_table_files/*

BQ Command to import a non Partitioned Table

All files will get imported within subfolders etc

-- BQ Table Name

-- GCS Files location

bq load --source_format=ORC deliverbi_project1:deliverbi_dataset.big_table gs://deliverbi_bucket/deliverbi_all_tables/big_table/non_paritioned_data_files_folder/*

Over and Out

Shahed Munir

1 comments:

Write comments

Unknown

AUTHOR

10 March 2022 at 21:00 delete

Gcs Hadoop/Hive To Google Big Query Migrate Data Quickly Orc Format - Deliverbi Blog Gcp Google >>>>> Download Now

>>>>> Download Full

Gcs Hadoop/Hive To Google Big Query Migrate Data Quickly Orc Format - Deliverbi Blog Gcp Google >>>>> Download LINK

>>>>> Download Now

Gcs Hadoop/Hive To Google Big Query Migrate Data Quickly Orc Format - Deliverbi Blog Gcp Google >>>>> Download Full

>>>>> Download LINK

Note: only a member of this blog may post a comment.

Translate

Friday, 22 November 2019

GCS HADOOP/Hive to Google Big Query Migrate Data Quickly ORC Format

GCS Dataproc Hadoop(Hive) to GCS Big Query (BQ) Quick Transfer ORC File Formats

1 comments:

Blog Archive

Shahed Munir

Krishna Udathu

DELIVER BI Big Data Experts

DELIVER BI Content Search

Have a Look around

Google Big Query (BQ) Results to GCS Storage Bucket Folder

Popular Posts

Pageviews

Krishna Mohan Udathu

Shahed Munir