
Example webpage: https://subwaystats.com/status-1-train-on-2017-11-27.

In the page source there's a variable called "data," which contains two lists (labels and data) that will become my "columns" in the .csv.

<script>
...
var data = {
labels: ['12am', '00:05', '00:10', '00:15', '00:20', '00:25', ...],
...,    
data: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...],
....}
</script>

How can I get these two lists into a .csv? Any help is appreciated as I'm very new to web scraping.


3 Answers


The only reliable way to parse JavaScript is to use a real parser like SlimIt. With SlimIt, you can define a visitor to visit the JavaScript elements you're interested in. In your case, you want one that visits object literals. Here is a visitor that finds every property named labels or data whose value is an array, and prints the elements of that array:

from slimit.visitors.nodevisitor import ASTVisitor
from slimit.ast import Array

class MyVisitor(ASTVisitor):
    def visit_Object(self, node):
        """Visit object literal."""
        for prop in node:
            name = prop.left.value
            if name in ['labels', 'data'] and isinstance(prop.right, Array):
                # found a matching property: collect the array's element values
                elements = [child.value for child in prop.right.children()]
                print('{}: {}'.format(name, elements))
            else:
                # not a match: recurse, in case the target is nested deeper
                self.visit(prop)

Notice how it recurses into the children of a node when the node isn't one you're looking for; this lets it find the target properties at any nesting level (in your case, the data property sits one level deeper than labels).

To use the visitor, download the page with requests, parse it with Beautiful Soup, and then apply the visitor to each script element:

from requests import get
from bs4 import BeautifulSoup
from slimit.parser import Parser

def main():
    url = 'https://subwaystats.com/status-1-train-on-2017-11-27'
    response = get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'})
    soup = BeautifulSoup(response.text, "html.parser")
    scripts = soup.find_all('script')
    parser = Parser()
    visitor = MyVisitor()
    for script in scripts:
        tree = parser.parse(script.text)
        visitor.visit(tree)

if __name__ == '__main__':
    main()

Notice that I've set the User-Agent header to a string from a common browser. This is because the website won't return the page if it detects that the user agent is a script.


2 Comments

Thanks! This is very helpful. Looks like the website has multiple variables named "labels" and "data." Any idea on how to just get the ones I want?
You need to keep count of how many you've seen, or how deep you are (how many recursive calls).
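For example, a sketch of a variant of the visitor above that keeps only the first labels and data arrays it encounters (the class name FirstMatchVisitor is my own, not from the answer):

from slimit.visitors.nodevisitor import ASTVisitor
from slimit.ast import Array

class FirstMatchVisitor(ASTVisitor):
    def __init__(self):
        # name -> list of element values; only the first match per name is kept
        self.found = {}

    def visit_Object(self, node):
        """Visit object literal."""
        for prop in node:
            name = prop.left.value
            if (name in ('labels', 'data') and name not in self.found
                    and isinstance(prop.right, Array)):
                self.found[name] = [c.value for c in prop.right.children()]
            else:
                self.visit(prop)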
0

You can get the <script> element using BeautifulSoup or lxml, but after that you have to extract the lists with standard string functions and/or regex.

A simple example:

data = '''<script>
...
var data = {
labels: ['12am', '00:05', '00:10', '00:15', '00:20', '00:25', ...],
...,    
data: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...],
....}
</script>'''

rows = data.split('\n')

for r in rows:
    r = r.strip()

    if r.startswith('labels'):
        # cut the "labels: [" prefix (9 chars) and the trailing "],"
        text = r[9:-2]
        print('text:', text)
        # strip the quotes around each label and skip the "..." placeholder
        labels = [x[1:-1] for x in text.split(', ') if x != '...']
        print('labels:', labels)

    elif r.startswith('data'):
        # cut the "data: [" prefix (7 chars) and the trailing "],"
        text = r[7:-2]
        print('text:', text)
        data = [int(x) for x in text.split(',') if x != '...']
        print('data:', data)

pairs = list(zip(labels, data))
print(pairs)

After that you can use the csv module to write the pairs to a file.
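For example, a minimal sketch that writes the pairs list from above (the filename output.csv and the header row are my own choices):

import csv

# write the (label, value) pairs collected above to a CSV file
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['label', 'data'])  # header row
    writer.writerows(pairs)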



If you don't mind hardcoding things, you can get the results with fewer lines of code. Give this a shot and see what it does:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://subwaystats.com/status-1-train-on-2017-11-27', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.text, "lxml")
# the 11th <script> tag on the page happens to hold the chart data
items = soup.select('script')[10]
# slice out everything between "labels: [" and the closing "],"
labels = items.text.split("labels: ")[1].split("datasets:")[0].split("[")[1].split("],")[0]
# likewise for the values between "data: [" and "],"
data = items.text.split("data: ")[1].split("spanGaps:")[0].split("[")[1].split("],")[0]
print(labels, data)
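Note that labels and data come back here as raw comma-separated strings. A sketch of turning them into Python lists, assuming every value in data is an integer (this post-processing is my own addition, not part of the answer):

# labels keep their surrounding quotes, so strip whitespace and quotes
label_list = [x.strip().strip("'") for x in labels.split(',')]
# int() tolerates the surrounding whitespace left by split(',')
data_list = [int(x) for x in data.split(',')]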

2 Comments

I tried this and it worked perfectly, thank you. However, after running it a few times it stopped working. Am I being blocked by the site as a bot? Is that going to kill my whole project?
If the script suddenly returns no results or raises an error, you can rest assured that your IP address has been banned. Different sites have different policies, so it depends on when they lift the ban. Meanwhile, you can use a proxy to bypass that restriction. Thanks.
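For reference, requests supports routing a request through a proxy via its proxies argument; a minimal sketch (the proxy address below is a placeholder, not a working server):

import requests

# placeholder proxy address; substitute a proxy you actually have access to
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}
res = requests.get('https://subwaystats.com/status-1-train-on-2017-11-27',
                   headers={'User-Agent': 'Mozilla/5.0'},
                   proxies=proxies)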
