Scrape JSON Data

Question

I'm trying to scrape https://understat.com/team/Arsenal/2019 (and other EPL team pages) with BeautifulSoup4 to get links to all the player pages and eventually scrape those pages for individual player data, but have gotten stuck as I am unfamiliar with JSON data.

I've gotten as far as locating the part of the webpage I'm interested in, but my current output looks like this: var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x22318\x22,\x22player_name\x22\x3A\x22Pierre\x2DEmerick\x20Aubameyang\x22,...,\x22xGBuildup\x22\x3A\x220\x22\x7D\x5D');

I can't find any information about JSON data in this format, and was wondering if someone could help me get the data from this page, ideally into a pandas DataFrame.

They're just obfuscating the JSON by using hex codes for punctuation characters. \x5B is [, \x7B is {, \x22 is ", etc.
– Barmar, Commented Sep 11, 2019 at 22:34
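To check the mapping described in the comment, here is a small sketch that converts a few of the two-digit hex codes from the question's output into their characters. The name hex_codes and the choice of codes are only for illustration.

```python
# Each \xNN escape is a two-digit hexadecimal character code.
# The codes below are taken from the output shown in the question.
hex_codes = ["5B", "7B", "22", "3A", "2D", "20", "7D", "5D"]

# Map each code to the character it stands for.
decoded = {code: chr(int(code, 16)) for code in hex_codes}

for code, char in decoded.items():
    print("\\x" + code, "->", repr(char))
```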

Barmar · Accepted Answer · 2019-09-11 23:46:02Z


All those \x sequences are hex encodings of punctuation characters like [, {, and ", to obfuscate the JSON. Python uses the same notation in its string literals, so you can decode it with ast.literal_eval().

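import re
import ast
import json

# Sample line, shortened from the output in the question
line = r"var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x22318\x22,\x22player_name\x22\x3A\x22Pierre\x2DEmerick\x20Aubameyang\x22,\x22xGBuildup\x22\x3A\x220\x22\x7D\x5D');"
# Grab everything between "JSON.parse(" and the closing ");", quotes included
literal = re.search(r"(?<=JSON\.parse\().*(?=\);$)", line).group(0)
# The quoted text is a valid Python string literal, so this decodes the \x escapes
json_string = ast.literal_eval(literal)
# The decoded text is ordinary JSON
players_data = json.loads(json_string)
print(players_data)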
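The question asked for a pandas DataFrame, so here is a rough sketch of how the same decoding might be applied to the text of a whole script block and handed to pandas. The helpers extract_json_var and to_dataframe and the script_text sample are made up for illustration. The sketch assumes the encoded payload contains no unescaped single quote, and to_dataframe assumes pandas is installed. With BeautifulSoup, the script text would presumably come from the contents of the page's script tags.

```python
import ast
import json
import re

# Hypothetical stand-in for the text of one <script> tag, using the
# shortened sample line from the answer above.
script_text = r"""
var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x22318\x22,\x22player_name\x22\x3A\x22Pierre\x2DEmerick\x20Aubameyang\x22,\x22xGBuildup\x22\x3A\x220\x22\x7D\x5D');
"""


def extract_json_var(text, var_name):
    """Return the decoded value assigned via var <var_name> = JSON.parse('...')."""
    # Capture the quoted argument, quotes included.
    # Assumption: the payload itself contains no unescaped single quote.
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*JSON\.parse\(('.*?')\)"
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"{var_name} not found in script text")
    # The quoted argument is also a valid Python string literal, so
    # literal_eval decodes the \x escapes into the plain JSON text.
    json_string = ast.literal_eval(match.group(1))
    return json.loads(json_string)


def to_dataframe(records):
    """Build a DataFrame from the decoded list of dicts (requires pandas)."""
    import pandas as pd  # imported lazily so the decoding step works without pandas
    return pd.DataFrame(records)


players = extract_json_var(script_text, "playersData")
print(players[0]["player_name"])
```

Calling to_dataframe(players) should then give one row per player, with the JSON keys as columns. The values come through as strings, so numeric columns would still need converting.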

edited Sep 11, 2019 at 23:46

answered Sep 11, 2019 at 22:43

Barmar
