Translate

Friday, 16 November 2018

Shahed


Google Drive API Google Sheets Extract to CSV with SERVICE KEY.JSON


Recent Client had some Google Sheets stored on their Google Drive and needed them processing into the Hadoop Cluster and available in presto for reporting.


As Google Drive uses Oauth authentication we needed to use a Service Key instead to access the files needed as we are processing them using airflow and a python script on a linux server.

We hunted around and could not find any solid example of a script that works on with the latest libraries. 

This script will extact the google workbook and each sheet will be a seperate csv file. within a given location path.

There is also an alternative script available that can loop over a .txt file and process google sheet ID's one after the next . 

Mandatory Parameters

Service Key and a Google Sheet ID and a path
Service Keys can be generated for a project within google. Then the email address of the service key needs to be used for sharing of the spreadsheet within google sheets

For : gdrivedownloadsheetsLoop.py
Requires only service key.json file name, The google sheet names are derived from a.txt file you place in the same directory

Setup

Make sure your python library for google api is this version ( can be higher but not tested on higher versions)
pip install google-api-python-client==1.6.2
Also make sure all other libraries are installed using pip these are visible in the script under import.
Download Python Script gdrivedownloadsheets.py to a directory on Linux.
Ammend the Following
Replace YOURSERVICEKEY.json with the filename of the service key you generated in Google. service_account_file = os.path.join(os.getcwd(), 'YOURSERVICEKEY.json')
This will be your output Path for your csv files (1 per worksheet within a workbook) path = "/GCSDrive/output/"

For : gdrivedownloadsheetsLoop.py Requires only service key.json file name, The google sheet names are derived from a.txt file you place in the same directory

Run Command

The parameter is the google sheet ID which is visible in most url's when viewing the sheet in google.
--------------- For running 1 workbook at a time 
python3 gdrivedownloadsheets.py 1gWxy05uEcO8a8fNAfUSIdV1OcRWxH7RnjXezoHImJLE
--------------- For running Multiple Workbooks one after another 
python3 gdrivedownloadsheetsLoop.py



Git Hub Location for Code : https://github.com/deliverbi/Gdrive-API-Google-Sheets


We will follow through with this post and blog the framework we created to Process these files directly to HDFS and through to Hive and Presto.

Power to the future of Big Data @ DELIVERBI

We actually Know Big Data !!

Cheers

Shahed Munir & Krishna Udathu




About Authors

Shahed Munir

Krishna Udathu

Shahed and Krishna are Oracle / Big Data Experts