
Monday, 9 April 2018

Shahed

Google BigQuery Clone / Copy Datasets / Tables - PROD --> TEST --> DEV Projects

Cloning Google BigQuery datasets and tables across GCP projects.



We searched the internet and could not find a simple script for cloning / copying tables and datasets from PROD --> TEST --> DEV in Google BigQuery. We needed a utility that could copy complete datasets across projects within BigQuery. The BigQuery UI lets you copy one table at a time, but that would take forever. Our client follows an SDLC approach and works through three environments, and we needed to clone data from Production into our Test environment for shake-down testing of Airflow deliverables (DAGs etc.).

There are probably a lot of customers out there who started loading data into an initial environment and now need to copy those datasets to other projects. Whatever the reason may be, this script can be run to do much of the work, and it can easily be amended and scheduled to run on evenings or weekends.

Technologies Used
  • Linux
  • Python3
  • Pip library: google-cloud-bigquery (used as: from google.cloud import bigquery)

Please make sure the required client libraries are installed: pip install --upgrade google-cloud-bigquery google-cloud-storage

Notes on the API:
https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html
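
To give a feel for the client library, here is a minimal sketch (our own illustration, with placeholder project, dataset and table names) that copies a single table from a PROD project to a TEST project:

from google.cloud import bigquery

# Credentials are picked up from GOOGLE_APPLICATION_CREDENTIALS (see below).
client = bigquery.Client(project="my-prod-project")

# Placeholder source and target table references.
source_table = bigquery.DatasetReference("my-prod-project", "sales_ds").table("orders")
target_table = bigquery.DatasetReference("my-test-project", "sales_ds").table("orders")

# Submit a copy job and wait for it to complete.
copy_job = client.copy_table(source_table, target_table)
copy_job.result()
print("Copied", source_table.table_id, "to project", target_table.project)

The same copy_table call works across projects as long as the credentials behind the client can read the source table and write to the target dataset.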

Make sure GOOGLE_APPLICATION_CREDENTIALS is set to the service key you downloaded from your Production project.

export GOOGLE_APPLICATION_CREDENTIALS=/u01/yourservicekey.json

If you don’t have a service key, one can be generated from the GCP Console web UI. A service account key is an alternative to authenticating with your own user ID.
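
If you prefer not to rely on the environment variable, the client can also be built directly from the downloaded key file; a minimal sketch using the same key path as above:

from google.cloud import bigquery

# Build the client straight from the service key downloaded from the Production project.
client = bigquery.Client.from_service_account_json("/u01/yourservicekey.json")
print(client.project)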

Make sure your service key has all the permissions required to run tasks on BigQuery against the DEV & TEST projects, i.e. querying tables, creating tables and deleting tables. You can add the service account's email address to the DEV & TEST projects under IAM permissions to allow the script to copy datasets. Here you can see we have given the service account access to DEV with the role BigQuery Admin.
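
A quick way to confirm the permissions are in place (our own suggestion, with a placeholder project id) is to point a client at the DEV or TEST project and list its datasets before running anything heavier:

from google.cloud import bigquery

# This should succeed once the service account has a BigQuery role on the target project.
client = bigquery.Client(project="my-test-project")
for dataset in client.list_datasets():
    print(dataset.dataset_id)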

Check that the bq command is working on your GCP Linux instance

Try: bq ls to list the datasets in your default project (or bq ls -p to list the projects you can access) on the GCP Linux VM instance.

The .py script has a lock so that it won't clone anything back into the Production environment, and we suggest you keep it in place.

As for best practice, PRODUCTION --> TEST --> DEV is the best approach for cloning. One-off tables and structures are also catered for, in which case DEV --> TEST --> PRODUCTION makes sense, but we recommend CI/CD pipelines or similar alternatives for that direction.

Run the script using the following command                          
python3 bq_dataset_migrator.py source_project source_dataset target_project target_dataset
Arguments
1 - Source Project
2 - Source Dataset
3 - Target Project
4 - Target Dataset
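
The script itself is not reproduced here, but the core loop is straightforward. Below is a minimal sketch of this kind of migrator (our assumptions, not necessarily how bq_dataset_migrator.py is implemented: it creates the target dataset if missing, overwrites existing tables, and refuses any target project whose id contains "prod"):

import sys
from google.cloud import bigquery
from google.api_core.exceptions import Conflict

# The four positional arguments described above.
source_project, source_dataset, target_project, target_dataset = sys.argv[1:5]

# Simple production lock: refuse to clone into anything that looks like PROD.
if "prod" in target_project.lower():
    sys.exit("Refusing to clone into a production project: " + target_project)

client = bigquery.Client(project=source_project)

source_ref = bigquery.DatasetReference(source_project, source_dataset)
target_ref = bigquery.DatasetReference(target_project, target_dataset)

# Create the target dataset if it does not already exist.
try:
    client.create_dataset(bigquery.Dataset(target_ref))
except Conflict:
    pass  # dataset already exists

# Copy every table in the source dataset into the target dataset.
job_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")
for table_item in client.list_tables(source_ref):
    copy_job = client.copy_table(
        source_ref.table(table_item.table_id),
        target_ref.table(table_item.table_id),
        job_config=job_config,
    )
    copy_job.result()  # wait for each copy job to finish
    print("Copied", table_item.table_id)

Cross-project copy jobs require both datasets to be in the same location, and the service account needs the roles described above on both the source and target projects.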




About Authors

Shahed Munir

Krishna Udathu

Shahed and Krishna are Oracle / Big Data Experts