Google Drive API Google Sheets Extract to CSV with SERVICE KEY.JSON
Recent Client had some Google Sheets stored on their Google Drive and needed them processing into the Hadoop Cluster and available in presto for reporting.
As Google Drive uses Oauth authentication we needed to use a Service Key instead to access the files needed as we are processing them using airflow and a python script on a linux server.
We hunted around and could not find any solid example of a script that works on with the latest libraries.
This script will extact the google workbook and each sheet will be a seperate csv file. within a given location path.
There is also an alternative script available that can loop over a .txt file and process google sheet ID's one after the next .
There is also an alternative script available that can loop over a .txt file and process google sheet ID's one after the next .
Mandatory Parameters
Service Key and a Google Sheet ID and a path
Service Keys can be generated for a project within google. Then the email address of the service key needs to be used for sharing of the spreadsheet within google sheets
For : gdrivedownloadsheetsLoop.py
Requires only service key.json file name, The google sheet names are derived from a.txt file you place in the same directory
Setup
Make sure your python library for google api is this version ( can be higher but not tested on higher versions)
pip install google-api-python-client==1.6.2
Also make sure all other libraries are installed using pip these are visible in the script under import.
Download Python Script gdrivedownloadsheets.py to a directory on Linux.
Ammend the Following
Replace YOURSERVICEKEY.json with the filename of the service key you generated in Google. service_account_file = os.path.join(os.getcwd(), 'YOURSERVICEKEY.json')
This will be your output Path for your csv files (1 per worksheet within a workbook) path = "/GCSDrive/output/"
For : gdrivedownloadsheetsLoop.py Requires only service key.json file name, The google sheet names are derived from a.txt file you place in the same directory
For : gdrivedownloadsheetsLoop.py Requires only service key.json file name, The google sheet names are derived from a.txt file you place in the same directory
Run Command
The parameter is the google sheet ID which is visible in most url's when viewing the sheet in google.
--------------- For running 1 workbook at a time
python3 gdrivedownloadsheets.py 1gWxy05uEcO8a8fNAfUSIdV1OcRWxH7RnjXezoHImJLE
--------------- For running Multiple Workbooks one after another
python3 gdrivedownloadsheetsLoop.py
Git Hub Location for Code : https://github.com/deliverbi/Gdrive-API-Google-Sheets
We will follow through with this post and blog the framework we created to Process these files directly to HDFS and through to Hive and Presto.
Power to the future of Big Data @ DELIVERBI
We actually Know Big Data !!
Cheers
Shahed Munir & Krishna Udathu
Note: only a member of this blog may post a comment.