Dataproc Hadoop (Hive) to Google Big Query (BQ) via GCS - Quick Transfer of ORC File Formats
Google Dataproc - Google's managed Hadoop cluster, with add-ons such as Spark
GCS (Google Cloud Storage) - Buckets to store data
Google Big Query - Google's data query engine
The Problem
The customer needed to move over 50 TB of data in ORC file format from a Dataproc cluster to Google Big Query, quickly and with a small footprint.
The Solution
On GCS - Created a bucket in the same project as the Dataproc instance.
On Dataproc (Hadoop master node) - Invoked a distributed copy (distcp) command to copy the data, with all of its contents, to the GCS bucket I created earlier.
hadoop distcp hdfs://deliverbi:8020/user/hive/warehouse/deliverbi_db.db/big_table/* gs://deliverbi_bucket/deliverbi_all_tables/big_table/
The command copies all the data from the Hive table's underlying HDFS location to the Google Storage bucket. Remember to include all the subfolders, i.e. the partition folders, as these will be handled as columns in Big Query when the data is imported via a CLI command. The ORC import in Big Query is quite clever and recognises the Hive folder layout, such as partition folders named
"my_date=20190101"
"my_date=20190102"
"my_date=20190103" -> "my_sub_partition_field=1"
These partition folders will be created as fields within the BQ table by the CLI command that imports the data into Big Query. This way, Hive-style querying on those fields is maintained within BQ.
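For a transfer of this size, the distcp step above may also be worth tuning. A minimal sketch, assuming your cluster can sustain the extra parallelism; -m (number of map tasks) and -bandwidth (MB/s per map) are standard DistCp options, and the values here are illustrative, not recommendations:
hadoop distcp -m 100 -bandwidth 100 \
hdfs://deliverbi:8020/user/hive/warehouse/deliverbi_db.db/big_table/* \
gs://deliverbi_bucket/deliverbi_all_tables/big_table/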
Once the data is in the bucket, go to Big Query and create a dataset in your project if needed (a CLI sketch follows below). The table itself will get created automatically based on the ORC file format.
Open a Cloud Shell window, or use the Google Cloud SDK (https://cloud.google.com/sdk/), to issue a BQ command.
The following commands import the data into Big Query from the ORC files we copied into the GCS bucket using distcp.
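If the dataset does not already exist, it can be created from the same shell. A minimal sketch, using the example project and dataset names from the load commands below:
bq mk --dataset deliverbi_project1:deliverbi_dataset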
BQ Command to import a Partitioned Table
All files, including those within subfolders, will get imported.
-- url_prefix -- the parent folder that contains the data for your table
-- BQ Table Name
-- GCS Files location
bq load --source_format=ORC --hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=gs://deliverbi_bucket/deliverbi_all_tables/big_table/ \
deliverbi_project1:deliverbi_dataset.big_table gs://deliverbi_bucket/deliverbi_all_tables/big_table/big_table_files/*
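Once the load completes, the partition folders should appear as ordinary columns on the table. A quick check, using the example table above; note that hive_partitioning_mode=AUTO infers the partition field's type, so my_date may come back as an integer rather than a string:
bq show --schema deliverbi_project1:deliverbi_dataset.big_table
bq query --use_legacy_sql=false \
'SELECT my_date, COUNT(*) AS row_count FROM `deliverbi_project1.deliverbi_dataset.big_table` GROUP BY my_date ORDER BY my_date'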
BQ Command to import a Non-Partitioned Table
All files, including those within subfolders, will get imported.
-- BQ Table Name
-- GCS Files location
bq load --source_format=ORC deliverbi_project1:deliverbi_dataset.big_table gs://deliverbi_bucket/deliverbi_all_tables/big_table/non_partitioned_data_files_folder/*
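To confirm the rows landed, bq head previews a few rows without running a query; the table name is the example one from above:
bq head -n 10 deliverbi_project1:deliverbi_dataset.big_table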
Over and Out
Shahed Munir