I am scraping information from the following page: "http://www.mobygames.com/game/wheelman/view-moby-score". Here is my code:

import requests
from bs4 import BeautifulSoup

url_credit = "http://www.mobygames.com/game/wheelman/view-moby-score"
response = requests.get(url_credit, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find("table", class_="reviewList table table-striped table-condensed table-hover").select('tr[valign="top"]')
for row in table[1:]:
    print(row)
    # This line fails: select() returns a list of tags, which has no .get() method
    x = soup.select('td[class="left"]').get("colspan")

My desired output is something like this:

platform     total_votes rating_category score  total_score
PlayStation3 None        None            None   None
Windows      6           Acting          4.2    4.1
Windows      6           AI              3.7    4.1
Windows      6           Gameplay        4.0    4.1

The main problem is getting the platform name into the platform column for every corresponding row. How can I do that?

1 Answer

A row that starts a new platform has 3 columns, while the rating-category rows below it have 2. You can use that difference to detect when the platform changes and update the platform variable.

Rows for platforms without scores, like PlayStation 3, contain a column (<td> tag) with the attributes colspan="2" class="center". Use this to handle such cases.
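As a minimal sketch of these two checks in isolation, the snippet below classifies rows from a small hand-written HTML fragment. The fragment, the "Not enough votes" text, and the helper name classify_row are assumptions made for illustration; the real page's markup may differ. It uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the three row shapes described above;
# it is not copied from the real page.
SAMPLE = """
<table>
  <tr valign="top"><td>PlayStation 3</td><td colspan="2" class="center">Not enough votes</td></tr>
  <tr valign="top"><td>Windows</td><td>6</td><td>4.1</td></tr>
  <tr valign="top"><td class="left">Acting</td><td>4.2</td></tr>
</table>
"""


def classify_row(row):
    """Return a label for which of the three row shapes `row` has."""
    # A platform without scores has a <td colspan="2" class="center"> cell.
    if row.find("td", colspan="2", class_="center"):
        return "platform_without_scores"
    # A row that starts a new platform has three cells.
    if len(row.find_all("td")) == 3:
        return "platform_header"
    # Everything else is treated as a rating-category row (two cells).
    return "rating_category"


soup = BeautifulSoup(SAMPLE, "html.parser")
kinds = [classify_row(tr) for tr in soup.select('tr[valign="top"]')]
print(kinds)
```

The colspan check comes first because, in this fragment, the no-score row also has only two cells and would otherwise be mistaken for a rating-category row.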

Code:

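import requests
from bs4 import BeautifulSoup

# `headers` is assumed to be defined as in the question (its value is not shown there).
url_credit = "http://www.mobygames.com/game/wheelman/view-moby-score"
response = requests.get(url_credit, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find("table", class_="reviewList table table-striped table-condensed table-hover").select('tr[valign="top"]')

platform = ''
total_votes, total_score = None, None
for row in table[1:]:  # skip the first matched row, as in the question
    # Handle platforms without scores (like PlayStation 3): their row has a
    # <td colspan="2" class="center"> cell, so output the platform with None values.
    if row.find('td', colspan='2', class_='center'):
        platform = row.find('td').text
        total_score, total_votes = None, None
        print('{} | {} | {} | {} | {}'.format(platform, total_votes, None, None, total_score))
        continue

    cols = row.find_all('td')
    # Three columns: a new platform starts here; remember its name and totals.
    if len(cols) == 3:
        platform = cols[0].text
        total_votes = cols[1].text
        total_score = cols[2].text
        continue
    # Two columns: a rating category and its score for the current platform.
    print('{} | {} | {} | {} | {}'.format(platform, total_votes, cols[0].text, cols[1].text, total_score))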

Output:

PlayStation 3 | None | None | None | None
Windows | 6 |       Acting | 4.2 | 4.1
Windows | 6 |       AI | 3.7 | 4.1
Windows | 6 |       Gameplay | 4.0 | 4.1
Windows | 6 |       Graphics | 4.2 | 4.1
Windows | 6 |       Personal Slant | 4.3 | 4.1
Windows | 6 |       Sound / Music | 4.3 | 4.1
Windows | 6 |       Story / Presentation | 3.8 | 4.1
Xbox 360 | 5 |       Acting | 3.8 | 3.5
Xbox 360 | 5 |       AI | 3.2 | 3.5
Xbox 360 | 5 |       Gameplay | 3.4 | 3.5
Xbox 360 | 5 |       Graphics | 3.6 | 3.5
Xbox 360 | 5 |       Personal Slant | 3.6 | 3.5
Xbox 360 | 5 |       Sound / Music | 3.4 | 3.5
Xbox 360 | 5 |       Story / Presentation | 3.8 | 3.5

Note: Where the code calls print(), you would instead save those values to whatever list or DataFrame you are using. I'm only using print() to show how the platform variable changes as and when needed.
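One possible way to save the values instead of printing them is to append one dict per row to a list, as sketched below. The function name parse_scores, the dict keys (taken from the column names in the question's desired output), and the sample HTML are assumptions for illustration and are not from the real page. Unlike the code above, this sketch strips surrounding whitespace with get_text(strip=True) and uses the built-in "html.parser" so it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a header row, one platform without scores,
# and one platform with two rating categories.
SAMPLE = """
<table>
  <tr valign="top"><th>Platform</th><th>Votes</th><th>Total</th></tr>
  <tr valign="top"><td>PlayStation 3</td><td colspan="2" class="center">Not enough votes</td></tr>
  <tr valign="top"><td>Windows</td><td>6</td><td>4.1</td></tr>
  <tr valign="top"><td class="left">Acting</td><td>4.2</td></tr>
  <tr valign="top"><td class="left">AI</td><td>3.7</td></tr>
</table>
"""


def parse_scores(html):
    """Collect one dict per output row instead of printing it."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select('tr[valign="top"]')
    records = []
    platform = total_votes = total_score = None
    for row in rows[1:]:  # skip the header row, as in the answer's loop
        cols = [td.get_text(strip=True) for td in row.find_all("td")]
        if row.find("td", colspan="2", class_="center"):
            # Platform without scores: keep it with None values.
            platform, total_votes, total_score = cols[0], None, None
            records.append({"platform": platform, "total_votes": None,
                            "rating_category": None, "score": None,
                            "total_score": None})
            continue
        if len(cols) == 3:
            # A new platform starts: remember its name and totals.
            platform, total_votes, total_score = cols
            continue
        # Rating-category row for the current platform.
        records.append({"platform": platform, "total_votes": total_votes,
                        "rating_category": cols[0], "score": cols[1],
                        "total_score": total_score})
    return records


records = parse_scores(SAMPLE)
for record in records:
    print(record)
```

All values are kept as strings here. A list of dicts like this can be passed directly to pandas.DataFrame if a DataFrame is wanted, and calling the function once per page and extending a single list would cover scraping several pages.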

4 Comments

Thank you very much. I actually asked a similar question yesterday but could not apply the answer to the new data.
I need to scrape more than one page, and there might be many cases like PlayStation 3. Is there any way to keep None values for PlayStation 3?
Can you give an example link where such a case like playstation occurs? It'll be easier to generalize after considering multiple cases.
Have a look at the edit. If this doesn't work for any other page, please share that link.
