I solved this. Since it's for a job application, I'm not going to provide the full solution, but I can point you in the right direction. I don't think you need to complicate things with the Google Docs APIs: if you look at the response from the URL, you'll notice that the data you want is in a table tag and that there is only one table element in the document. You can use an HTML parser to extract this data pretty easily. I suggest you look at html.parser.
For me, the steps were:
1. Create a parser that finds the table data you want and extracts the values from each row (start by just printing them to the console). You'll need a way to track when you hit the start tag and the end tag of the table. Also account for the fact that the first row contains no data, only column headers.
2. Create a data object that will be populated by the parser; I used a list of lists. You'll want to call a method that updates it after reading each row of data from the table (every 3 values). You'll also need to handle padding: if a value "skips" columns or lines, pad the skipped lines with empty lines and the skipped columns with spaces.
3. Write a method to print this data object. Note that the (0,0) value is at the bottom left of the image.
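The padding and printing steps above might look roughly like the sketch below. It is not my actual solution, just an illustration on made-up sample cells. The names place and render, and the (x, character, y) order of the three values, are assumptions; check the order against the real table.

```python
def place(grid, x, y, char):
    """Put char at column x of row y, padding the grid where needed."""
    # Pad with empty rows until row y exists.
    while len(grid) <= y:
        grid.append([])
    row = grid[y]
    # Pad with spaces until column x exists.
    while len(row) <= x:
        row.append(" ")
    row[x] = char


def render(grid):
    """Return the grid as text, with row 0 printed last (bottom left origin)."""
    return "\n".join("".join(row) for row in reversed(grid))


# Made-up sample cells as (x, character, y); not data from the real document.
cells = [(0, "#", 0), (2, "#", 0), (1, "#", 2)]

grid = []
for x, char, y in cells:
    place(grid, x, y, char)

print(render(grid))
```

In this sample, row 1 is never written to, so it stays an empty list and prints as a blank line between the two populated rows.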
This should be enough to get you started:
import requests
from html.parser import HTMLParser


class GoogleDocDecoderHTMLParser(HTMLParser):
    """Prints every piece of text found inside the document's table."""

    inTable = False

    def handle_starttag(self, tag, attrs):
        # Entering the (single) table in the document.
        if tag == "table":
            self.inTable = True

    def handle_endtag(self, tag):
        # Leaving the table; text after this point is ignored.
        if tag == "table":
            self.inTable = False

    def handle_data(self, data):
        if self.inTable:
            print(data)


class GoogleDocDecoder:
    URL = "https://docs.google.com/document/d/e/2PACX-1vQGUck9HIFCyezsrBSnmENk5ieJuYwpt7YHYEzeNJkIb9OSDdx-ov2nRNReKQyey-cwJOoEKUhLmN9z/pub"

    @staticmethod
    def decodeFromUrl(url):
        # Fetch the published document and feed its HTML to the parser.
        response = requests.get(url)
        parser = GoogleDocDecoderHTMLParser()
        parser.feed(response.text)


GoogleDocDecoder.decodeFromUrl(GoogleDocDecoder.URL)
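One possible next step from printing, sketched under some assumptions: instead of tracking only the table tag, track tr and td tags so each row comes back as a list of cell strings, then drop the header row. The class name TableRowParser, the sample HTML, its header names, and the (x, character, y) column order are all made up for illustration; the real document's markup may differ.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.current_row = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag in ("td", "th"):
            # Start a new, empty cell in the current row.
            self.in_cell = True
            self.current_row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr":
            self.rows.append(self.current_row)

    def handle_data(self, data):
        # Text may arrive in several pieces (nested p/span tags), so append.
        if self.in_cell:
            self.current_row[-1] += data.strip()


# Made-up sample markup; not copied from the real document.
sample_html = """
<table>
  <tr><td>x-coordinate</td><td>Character</td><td>y-coordinate</td></tr>
  <tr><td><p><span>0</span></p></td><td>#</td><td>1</td></tr>
  <tr><td>2</td><td>#</td><td>0</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)

# The first row holds the column headers, so it is split off here.
header, *data_rows = parser.rows
cells = [(int(x), char, int(y)) for x, char, y in data_rows]
print(cells)
```

To run this against the real page, feed it response.text instead of sample_html.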
response = requests.get('https://docs.google.com/document/d/e/2PACX-1vSHesOf9hv2sPOntssYrEdubmMQm8lwjfwv6NPjjmIRYs_FOYXtqrYgjh85jBUebK9swPXh_a5TJ5Kl/pub')
I can see the table in the response (response.content). You'll need to parse that with something; beautifulsoup is my first choice, but there are others.

df = pd.read_html("https://docs.google.com/document/d/e/2PACX-1vSHesOf9hv2sPOntssYrEdubmMQm8lwjfwv6NPjjmIRYs_FOYXtqrYgjh85jBUebK9swPXh_a5TJ5Kl/pub")?
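As a rough illustration of the BeautifulSoup route, here is a minimal sketch that pulls the rows out of a table. It assumes the bs4 package is installed and runs on a made-up HTML snippet rather than the real document, so the header names and values are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; with the real page you would pass response.text.
sample_html = """
<table>
  <tr><td>x-coordinate</td><td>Character</td><td>y-coordinate</td></tr>
  <tr><td>0</td><td>#</td><td>1</td></tr>
  <tr><td>2</td><td>#</td><td>0</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")  # the document is said to contain one table

# One list of cell strings per table row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    for tr in table.find_all("tr")
]

data_rows = rows[1:]  # skip the header row
print(data_rows)
```

If you go the pandas route instead, keep in mind that pd.read_html returns a list of DataFrames, so the single table would be the first element of that list.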