I'm trying to scrape https://understat.com/team/Arsenal/2019 (and other EPL team pages) with BeautifulSoup4 to get the links to all the player pages, and eventually to scrape those pages for individual player data. I've gotten stuck because I'm unfamiliar with JSON data.

I've traced the data to the part of the page I'm interested in, but my current output looks like this: var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x22318\x22,\x22player_name\x22\x3A\x22Pierre\x2DEmerick\x20Aubameyang\x22,...,\x22xGBuildup\x22\x3A\x220\x22\x7D\x5D');

I can't find any information about JSON data in this format. Could someone help me get the data from this page, ideally into a Pandas DataFrame?

  • They're just obfuscating the JSON by using hex codes for punctuation characters. \x5B is [, \x7B is {, \x22 is ", etc. Commented Sep 11, 2019 at 22:34

1 Answer

All those \x sequences are hex escapes for punctuation characters such as [, {, and ", used to obfuscate the JSON. Python uses the same notation in its string literals, so you can decode the quoted string with ast.literal_eval() and then parse the result with json.loads().

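import ast
import json
import re

# One line from the page's script block, kept as a raw string so the \x escapes stay literal.
line = r"var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x22318\x22,\x22player_name\x22\x3A\x22Pierre\x2DEmerick\x20Aubameyang\x22,\x22xGBuildup\x22\x3A\x220\x22\x7D\x5D');"
# Capture the quoted JavaScript string passed to JSON.parse(), quotes included.
literal = re.search(r"(?<=JSON\.parse\().*(?=\);$)", line).group(0)
# The \x escapes are also valid in Python string literals, so literal_eval decodes them.
json_string = ast.literal_eval(literal)
# Parse the decoded JSON text into a list of dicts.
players_data = json.loads(json_string)
print(players_data)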
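
To apply the same decoding to a whole page rather than a single line, one option is to search the downloaded HTML for the assignment by variable name. This is only a minimal sketch and has not been checked against the live site: extract_json_var and SAMPLE_HTML are hypothetical names, and the sample text stands in for the page source you would get from your HTTP client.

```python
import ast
import json
import re

# Hypothetical stand-in for the page source; on the real site this text would
# come from the downloaded HTML (for example, the body of an HTTP response).
SAMPLE_HTML = r"""
<script>
    var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x22318\x22,\x22player_name\x22\x3A\x22Pierre\x2DEmerick\x20Aubameyang\x22,\x22xGBuildup\x22\x3A\x220\x22\x7D\x5D');
</script>
"""


def extract_json_var(html, var_name):
    """Decode `var <var_name> = JSON.parse('...')` found anywhere in html."""
    # Capture the quoted string literal (quotes included) passed to JSON.parse().
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*JSON\.parse\(('.*?')\)"
    match = re.search(pattern, html)
    if match is None:
        raise ValueError(f"no JSON.parse assignment found for {var_name!r}")
    # literal_eval turns the \x escapes into real characters, giving JSON text.
    json_string = ast.literal_eval(match.group(1))
    return json.loads(json_string)


players = extract_json_var(SAMPLE_HTML, "playersData")
print(players[0]["player_name"])
```

If pandas is installed, pd.DataFrame(players) should then give one row per player with the JSON keys as columns. Every value arrives as a string, so numeric columns would likely need converting, for example with pd.to_numeric.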