How to scrape <script text/javascript>

Question

I am trying to figure out how I can scrape the contents of a JavaScript tag using regex, which I believe might be the easiest way.

The tag looks like:

<script type="text/javascript">

var spConfig=newApex.Config({
  "attributes": {
    "199": {
      "id": "199",
      "code": "legend",
      "label": "Weapons",
      "options": [
        {
          "label": "10",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "10.5",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "11",
          "priceInGame": "0",          
          "id": [
            "66659"
          ]
        },
        {
          "label": "11.5",
          "priceInGame": "0",          
          "id": [            
          ]
        },
        {
          "label": "12",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "12.5",
          "priceInGame": "0",          
          "id": [           
          ]
        },
        {
          "label": "13",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "4",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "4.5",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "5",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "5.5",
          "priceInGame": "0",        
          "id": [

          ]
        },
        {
          "label": "6",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "6.5",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "7",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "7.5",
          "priceInGame": "0",         
          "id": [

          ]
        },
        {
          "label": "8",
          "priceInGame": "0",          
          "id": [
            "66672"
          ]
        },
        {
          "label": "8.5",
          "priceInGame": "0",          
          "id": [
            "66673"
          ]
        },
        {
          "label": "9",
          "priceInGame": "0",          
          "id": [

          ]
        },
        {
          "label": "9.5",
          "priceInGame": "0",        
          "id": [
            "66675"
          ]
        }
      ]
    }
  },
  "weaponID": "66733",
  "chooseText": "Apex Legends",
  "Config": {
    "includeCoins": false,
  }
});

</script>

and I want to scrape all the "label" values.

What I tried to do is:

        for nosto_sku_tag in bs4.find_all('script', {'type': 'text/javascript'}):
            try:
                test = re.findall('var spConfig = (\{.*}?);', nosto_sku_tag.text.strip())
                print(test)
            except:  # noqa
                continue

but it only returned an empty list, [].

So I am asking here: what can I do to scrape the labels?

Please note that type="text/javascript" is no longer required (since HTML5, I think), so if you are going to crawl the web it won't be there on every page.
– inetphantom, Commented Jul 17, 2019 at 8:40

abdusco · Accepted Answer · 2022-12-24 05:29:56Z

2

You need to specify the attribute using attr=value or attrs={'attr': 'value'} syntax.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments

import json
import re

from bs4 import BeautifulSoup

if __name__ == '__main__':
    html = '''
<script type="text/javascript">

var spConfig=newApex.Config({
  "attributes": {
    "199": {
      "id": "199",
      "code": "legend",
      "label": "Weapons",
      "options": [
        { "label": "10", "priceInGame": "0", "id": [] },
        { "label": "10.5", "priceInGame": "0", "id": [] },
        { "label": "11", "priceInGame": "0", "id": [ "66659" ] },
        { "label": "7.5", "priceInGame": "0", "id": [] },
        { "label": "8", "priceInGame": "0", "id": ["66672"] }
      ]
    }
  },
  "weaponID": "66733",
  "chooseText": "Apex Legends",
  "taxConfig": {
    "includeCoins": false,
  }
});

</script>    
    '''

    soup = BeautifulSoup(html, 'html.parser')
    # this one works too
    # script = soup.find('script', attrs={'type':'text/javascript'})
    script = soup.find('script', type='text/javascript')
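    js: str = script.text.replace('\n', '')
    # raw string, so that \( and \. are regex escapes rather than Python string escapes
    raw_json = re.search(r'var spConfig=newApex\.Config\((\{.*\})\);', js).group(1)
    # strict JSON rejects the trailing comma after "includeCoins": false,
    # so drop any comma that sits directly before a closing } or ]
    raw_json = re.sub(r',\s*([}\]])', r'\1', raw_json)
    data = json.loads(raw_json)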
    labels = [opt['label'] for opt in data['attributes']['199']['options']]
    print(labels)

output:

['10', '10.5', '11', '7.5', '8'] ... some removed for brevity
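
As for why the attempt in the question prints []: the sketch below shows the likely cause, assuming the script text is exactly as posted. The pattern there expects "var spConfig = {" with spaces around "=" and no newApex.Config( wrapper, and "." does not cross newlines unless re.DOTALL is set. The js sample and the extract_config helper are made up for illustration and use only the standard library.

```python
import json
import re

# Shortened stand-in for the <script> text from the question (hypothetical sample).
js = '''
var spConfig=newApex.Config({
  "attributes": {
    "199": {
      "options": [
        { "label": "10", "id": [] },
        { "label": "11", "id": ["66659"] }
      ]
    }
  },
  "Config": {
    "includeCoins": false,
  }
});
'''

# Pattern from the question: expects spaces around "=" and a "{" right after it.
old_pattern = r'var spConfig = (\{.*}?);'
# Pattern that matches the text as posted, including the constructor call.
new_pattern = r'var spConfig=newApex\.Config\((\{.*\})\);'


def extract_config(script_text: str) -> dict:
    """Hypothetical helper: pull the object literal out and parse it as JSON."""
    # re.DOTALL lets "." span newlines, so the text need not be flattened first.
    match = re.search(new_pattern, script_text, flags=re.DOTALL)
    if match is None:
        raise ValueError('spConfig assignment not found')
    # Strict JSON forbids trailing commas, so drop any comma before } or ].
    cleaned = re.sub(r',\s*([}\]])', r'\1', match.group(1))
    return json.loads(cleaned)


print(re.findall(old_pattern, js))  # []
config = extract_config(js)
print([opt['label'] for opt in config['attributes']['199']['options']])  # ['10', '11']
```

The comma clean-up is a blunt text substitution; it would also alter a string value that happens to contain ",}" or ",]", so treat it as a workaround for this payload rather than a general JSON repair.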

edited Dec 24, 2022 at 5:29

answered Jul 17, 2019 at 8:33

abdusco


9 Comments

AKX Over a year ago

json.loads() would be more appropriate than ast.literal_eval() here.

abdusco Over a year ago

I agree. But ast.literal_eval() is more forgiving of syntax errors (extra commas etc.) and non-JSON strings, though.

abdusco Over a year ago

"includeCoins": False, is a syntax error for example (possibly a typo), so json.loads doesn't work here

Thrillofit86 Over a year ago

Hello! I am just curious, because I am not as knowledgeable as you, but wouldn't it be better to just use a regex to match newApex.Config, grab the JSON inside the tag and then use json.loads on it? What is the difference between doing it through regex and the way you did it? @abdusco

Thrillofit86 Over a year ago

@abdusco Right, you are actually right. I think I get now how it is supposed to work, and I think I am finished here thanks to you!
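
The trade-off discussed in the comments above can be checked directly. This is a small sketch using only the standard library; the three input strings and the try_parse helper are made up for illustration.

```python
import ast
import json

strict = '{"includeCoins": false}'         # valid JSON
trailing_comma = '{"includeCoins": false,}'  # JSON-style literal, extra comma
python_style = '{"includeCoins": False,}'    # Python-style literal, extra comma


def try_parse(parser, text):
    """Return the parsed value, or the exception class name if parsing fails."""
    try:
        return parser(text)
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__


print(try_parse(json.loads, strict))                # {'includeCoins': False}
print(try_parse(json.loads, trailing_comma))        # JSONDecodeError
print(try_parse(ast.literal_eval, python_style))    # {'includeCoins': False}
print(try_parse(ast.literal_eval, trailing_comma))  # ValueError
```

In other words, ast.literal_eval tolerates the trailing comma but only understands Python literals such as False, while json.loads understands false but rejects the trailing comma, so neither parses the payload from the question unmodified.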


Tiernan · Answer · 2019-07-17 08:41:30Z

0

If you are just looking for every "label" field in the JSON object, use the following pattern:

("label":) "([^"]+)",

Then if you want to return the actual value, just use

\2

to pull back the second group
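
A rough sketch of how such a pattern could be applied in Python, assuming the script text is already in a string (js below is a shortened, made-up sample). With a single capture group, re.findall returns the values directly, so no backreference is needed. Two things differ from the pattern above and are my own adjustments: \s* replaces the literal space so spacing differences don't matter, and the trailing comma is dropped from the pattern.

```python
import re

# Shortened, hypothetical stand-in for the script text.
js = '''
"label": "Weapons",
"options": [
  { "label": "10", "priceInGame": "0", "id": [] },
  { "label": "10.5", "priceInGame": "0", "id": [] },
  { "label": "11", "priceInGame": "0", "id": ["66659"] }
]
'''

# One capture group: findall returns just the captured label values.
labels = re.findall(r'"label":\s*"([^"]+)"', js)
print(labels)  # ['Weapons', '10', '10.5', '11']

# The attribute's own label ("Weapons") is matched too, so keep only numeric ones.
sizes = [label for label in labels if re.fullmatch(r'\d+(?:\.\d+)?', label)]
print(sizes)  # ['10', '10.5', '11']
```

This avoids parsing the JSON at all, at the cost of matching every "label" key wherever it appears, which is why the extra filtering step is there.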

answered Jul 17, 2019 at 8:41

Tiernan

