
I've got this google doc URL that I need to extract the contents from, specifically the contents in the table.

https://docs.google.com/document/d/e/2PACX-1vSHesOf9hv2sPOntssYrEdubmMQm8lwjfwv6NPjjmIRYs_FOYXtqrYgjh85jBUebK9swPXh_a5TJ5Kl/pub

I need to do this in python but I'm lost on how to extract that data.

I've tried the requests module and I've looked into the Docs API but I'm not understanding how to use them correctly since they just throw me errors.

  • In your situation, do you know the document ID of the Google Document you want to retrieve? Also, is the Google Document publicly shared? Commented Sep 6, 2024 at 7:05
  • Post what you have tried and a minimal reproducible example. Stack Overflow can provide much higher quality help with a specific problem. Commented Sep 6, 2024 at 7:06
  • To get you started, I was able to get the page with response = requests.get('https://docs.google.com/document/d/e/2PACX-1vSHesOf9hv2sPOntssYrEdubmMQm8lwjfwv6NPjjmIRYs_FOYXtqrYgjh85jBUebK9swPXh_a5TJ5Kl/pub'). I can see the table in the response (response.content). You'll need to parse that with something, beautifulsoup is my first choice, but there are others. Commented Sep 6, 2024 at 7:10
  • I also suggest inspecting the page with your browser's tools to find what you want (Chrome -> More Tools -> Developer Tools -> Elements is an example). It provides a nice GUI for inspecting it. It's much easier to grok than the raw response. Commented Sep 6, 2024 at 7:15
  • As a simple approach, how about df = pd.read_html("https://docs.google.com/document/d/e/2PACX-1vSHesOf9hv2sPOntssYrEdubmMQm8lwjfwv6NPjjmIRYs_FOYXtqrYgjh85jBUebK9swPXh_a5TJ5Kl/pub")? Commented Sep 6, 2024 at 7:50

2 Answers


As @augursol stated in their answer, this question comes from a developer aptitude test/job application, so it can't be answered directly here. Instead, I'm going to provide instructions in English, which you can implement in code.

  1. Create a curses screen for character grid display. Ideally, use curses.wrapper(callbackFunction) for error handling and clean-up. (You might want to create a window/panel with a status bar line for debug output, but this isn't necessary.)

  2. From within the screen setup code, call a getTableData() function that gets the URL of the Google Doc from argv[1]. (Remember that argv[0] is the name of the script.)

  3. Use the requests module/package to GET the Google Doc.

  4. Use BeautifulSoup to parse the HTML, searching for the table element.

  5. Loop through the rows, getting the cell contents as an array.

  6. Return the data from the getTableData() function.

  7. Back in the curses code, process the row data in order to plot the characters within the grid. (You'll need to add handling code that checks whether the X and Y coordinate data is numeric, using str.isnumeric().)
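The numeric check in step 7 could look roughly like the sketch below. It is not the answerer's code: the function name parse_cells, the sample rows, and the assumed column order [x, character, y] are all made up for illustration, so adjust the indices to match the real table.

```python
def parse_cells(rows):
    """Convert raw table rows into (x, y, character) tuples.

    Assumption: each row is a list of three strings ordered as
    [x, character, y]. Change the unpacking if the table differs.
    """
    cells = []
    for row in rows:
        if len(row) != 3:
            continue  # skip rows that do not have exactly three cells
        x_text, char, y_text = (value.strip() for value in row)
        # The header row holds column names, not digits, so it is dropped here.
        if not (x_text.isnumeric() and y_text.isnumeric()):
            continue
        cells.append((int(x_text), int(y_text), char))
    return cells


# Made-up sample data, including a header row that should be ignored.
rows = [
    ["x-coordinate", "Character", "y-coordinate"],
    ["0", "#", "0"],
    ["2", "*", "1"],
]
print(parse_cells(rows))
```

One caveat: str.isnumeric() also accepts characters such as "½" that int() rejects, so str.isdecimal() is a stricter alternative if the input might contain them.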



I solved this. Since it's for a job application, I'm not going to provide the full solution, but I can point you in the right direction. I don't think you need to complicate things with the Google Docs APIs: if you look at the response from the URL, you'll notice that the data you want is in a table tag and that there is only one table element in the document. You can use an HTML parser to fetch this data pretty easily. I suggest you look at html.parser.

For me, the steps were:

  1. Create a parser that finds the table data you want and extracts the values from each row (start by just printing them to the console). You'll need a way to track when you hit a start tag for a table and also an end tag for a table. Also account for the fact that the first row contains the column headers rather than data.

  2. Create some data object that will be populated by the parser - I used a List of List objects. You'll want to call a method that updates this after reading a row of data from the table (every 3 values). You'll also need to handle padding - if you get a value that "skips" columns or lines, the skipped lines should be padded with empty lines and the skipped columns should be padded with spaces.

  3. You'll need a method to print this data object. Note that the (0,0) value is in the bottom left of the image.
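Steps 2 and 3 can be sketched as follows. This is only one possible shape, not the code I actually submitted: build_grid, render, and the sample cells are hypothetical names and data, and the sketch sizes the grid up front instead of padding it as each value arrives.

```python
def build_grid(cells):
    """Place (x, y, character) tuples into a list of lists padded with spaces."""
    if not cells:
        return []
    width = max(x for x, _, _ in cells) + 1
    height = max(y for _, y, _ in cells) + 1
    # Pre-fill with spaces so skipped columns and skipped lines are already padded.
    grid = [[" "] * width for _ in range(height)]
    for x, y, char in cells:
        grid[y][x] = char
    return grid


def render(grid):
    """Join the rows into text, last row first, so y = 0 ends up at the bottom."""
    return "\n".join("".join(row) for row in reversed(grid))


# Made-up sample data: (x, y, character).
cells = [(0, 0, "A"), (2, 0, "B"), (1, 2, "C")]
print(render(build_grid(cells)))
```

Sizing the grid from the largest x and y means the order in which rows arrive does not matter, at the cost of holding all cells in memory before anything is printed.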

This should be enough to get you started:

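import requests
from html.parser import HTMLParser


class GoogleDocDecoderHTMLParser(HTMLParser):
    """Prints every piece of text found inside a <table> element."""

    def __init__(self):
        super().__init__()
        self.inTable = False  # True while the parser is between <table> and </table>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.inTable = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.inTable = False

    def handle_data(self, data):
        if self.inTable:
            print(data)


class GoogleDocDecoder:
    URL = "https://docs.google.com/document/d/e/2PACX-1vQGUck9HIFCyezsrBSnmENk5ieJuYwpt7YHYEzeNJkIb9OSDdx-ov2nRNReKQyey-cwJOoEKUhLmN9z/pub"

    @staticmethod
    def decodeFromUrl(url):
        # Download the published document and stop early on an HTTP error.
        response = requests.get(url)
        response.raise_for_status()

        # Feed the decoded HTML to the parser, which prints the table text.
        parser = GoogleDocDecoderHTMLParser()
        parser.feed(response.text)


if __name__ == "__main__":
    GoogleDocDecoder.decodeFromUrl(GoogleDocDecoder.URL)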
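The parser above only prints the table text. A possible next step, sketched here under assumptions, is to collect the cells into a list of lists. TableRowParser and sample_html are hypothetical, and this variant groups values by tr/td tags rather than counting every 3 values; the real published document may nest extra tags inside each cell, which the sketch tolerates by joining all text it sees within a cell.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)


# Made-up markup standing in for the downloaded page.
sample_html = (
    "<table>"
    "<tr><td><p><span>x</span></p></td><td>char</td><td>y</td></tr>"
    "<tr><td>0</td><td>#</td><td>1</td></tr>"
    "</table>"
)
parser = TableRowParser()
parser.feed(sample_html)
parser.close()
print(parser.rows)
```

With a real page, response.text from requests.get would take the place of sample_html.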

2 Comments

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From Review
