
Example webpage: https://subwaystats.com/status-1-train-on-2017-11-27.

In the page source there's a variable called "data," which contains two lists (labels and data) that will become my "columns" in the .csv.

<script>
...
var data = {
labels: ['12am', '00:05', '00:10', '00:15', '00:20', '00:25', ...],
...,    
data: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...],
....}
</script>

How can I get these two lists into a .csv? Any help is appreciated as I'm very new to web scraping.


3 Answers


The only reliable way to parse JavaScript is to use a real parser like SlimIt. With SlimIt, you can define a visitor to visit the JavaScript elements you're interested in. In your case, you want one that visits object literals. Here is a visitor that finds every property named labels or data whose value is an array, and prints the elements of that array:

from slimit.visitors.nodevisitor import ASTVisitor
from slimit.ast import Array

class MyVisitor(ASTVisitor):
    def visit_Object(self, node):
        """Visit object literal."""
        for prop in node:
            name = prop.left.value
            if name in ['labels', 'data'] and isinstance(prop.right, Array):
                # found a matching property: collect the array's element values
                elements = [child.value for child in prop.right.children()]
                print('{}: {}'.format(name, elements))
            else:
                # not a match: recurse, in case the target is nested deeper
                self.visit(prop)

Notice how it recurses into the children of a node when the node isn't one you're looking for; this lets it find the target properties at any nesting level (in your case, the data property sits one level deeper than labels).

To use the visitor, download the page with requests, parse it with Beautiful Soup, and then apply the visitor to each script element:

from requests import get
from bs4 import BeautifulSoup
from slimit.parser import Parser

def main():
    url = 'https://subwaystats.com/status-1-train-on-2017-11-27'
    response = get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'})
    soup = BeautifulSoup(response.text, "html.parser")
    scripts = soup.find_all('script')
    parser = Parser()
    visitor = MyVisitor()
    for script in scripts:
        tree = parser.parse(script.text)
        visitor.visit(tree)

if __name__ == '__main__':
    main()

Notice that I've set the User-Agent header to a string from a common browser. This is because the website won't return the page if it detects that the user agent is a script.


2 Comments

Thanks! This is very helpful. Looks like the website has multiple variables named "labels" and "data." Any idea on how to just get the ones I want?
You need to keep count of how many you've seen, or how deep you are (how many recursive calls).
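For example, a sketch of a variant of the visitor above that keeps only the first labels and data arrays it encounters (the class name FirstMatchVisitor is my own, not from the answer):

from slimit.visitors.nodevisitor import ASTVisitor
from slimit.ast import Array

class FirstMatchVisitor(ASTVisitor):
    def __init__(self):
        # name -> list of element values; only the first match per name is kept
        self.found = {}

    def visit_Object(self, node):
        """Visit object literal."""
        for prop in node:
            name = prop.left.value
            if (name in ('labels', 'data') and name not in self.found
                    and isinstance(prop.right, Array)):
                self.found[name] = [c.value for c in prop.right.children()]
            else:
                self.visit(prop)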
0

You can get the <script> element using BeautifulSoup or lxml, but after that you have to extract the lists with standard string functions and/or regex.

A simple example:

data = '''<script>
...
var data = {
labels: ['12am', '00:05', '00:10', '00:15', '00:20', '00:25', ...],
...,    
data: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...],
....}
</script>'''

rows = data.split('\n')

for r in rows:
    r = r.strip()

    if r.startswith('labels'):
        # cut the "labels: [" prefix (9 chars) and the trailing "],"
        text = r[9:-2]
        print('text:', text)
        # strip the quotes around each label and skip the "..." placeholder
        labels = [x[1:-1] for x in text.split(', ') if x != '...']
        print('labels:', labels)

    elif r.startswith('data'):
        # cut the "data: [" prefix (7 chars) and the trailing "],"
        text = r[7:-2]
        print('text:', text)
        data = [int(x) for x in text.split(',') if x != '...']
        print('data:', data)

pairs = list(zip(labels, data))
print(pairs)

After that you can use the csv module to write the pairs to a file.
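For example, a minimal sketch that writes the pairs list from above (the filename output.csv and the header row are my own choices):

import csv

# write the (label, value) pairs collected above to a CSV file
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['label', 'data'])  # header row
    writer.writerows(pairs)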



If you don't mind hardcoding things, you can get the results with fewer lines of code. Give this a shot and see what it does:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://subwaystats.com/status-1-train-on-2017-11-27', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.text, "lxml")
# the 11th <script> tag on the page happens to hold the chart data
items = soup.select('script')[10]
# slice out everything between "labels: [" and the closing "],"
labels = items.text.split("labels: ")[1].split("datasets:")[0].split("[")[1].split("],")[0]
# likewise for the values between "data: [" and "],"
data = items.text.split("data: ")[1].split("spanGaps:")[0].split("[")[1].split("],")[0]
print(labels, data)
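Note that labels and data come back here as raw comma-separated strings. A sketch of turning them into Python lists, assuming every value in data is an integer (this post-processing is my own addition, not part of the answer):

# labels keep their surrounding quotes, so strip whitespace and quotes
label_list = [x.strip().strip("'") for x in labels.split(',')]
# int() tolerates the surrounding whitespace left by split(',')
data_list = [int(x) for x in data.split(',')]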

2 Comments

I tried this and it worked perfectly, thank you. However, after running it a few times it stopped working. Am I being blocked by the site as a bot? Is that going to kill my whole project?
If the script suddenly returns no results or raises an error, you can rest assured that your IP address has been banned. Different sites have different policies, so it depends on when they lift the ban. Meanwhile, you can use a proxy to bypass that restriction. Thanks.
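For reference, requests supports routing a request through a proxy via its proxies argument; a minimal sketch (the proxy address below is a placeholder, not a working server):

import requests

# placeholder proxy address; substitute a proxy you actually have access to
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}
res = requests.get('https://subwaystats.com/status-1-train-on-2017-11-27',
                   headers={'User-Agent': 'Mozilla/5.0'},
                   proxies=proxies)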
