Web scraping in Python using BeautifulSoup - how to transpose results?

Question

I built the code below and am having trouble transposing the results. Effectively, I am looking for the following result:

#    Column headers: 'company name',  'Work/Life Balance',   'Salary/Benefits',  'Job Security/Advancement', 'Management', 'Culture'  
#    Row 1: 3M, 3.8, 3.9, 3.5, 3.6, 3.8
#    Row 2: Google, . . .

Currently what happens is as follows:

#    Column headers: 'Name', 'Rating', 'Category'
#    Row 1: 3M, 3.8, Work/Life Balance
#    Row 2: 3M, 3.9, Salary/Benefits
#    and so on . . .

My code thus far:

import requests
import pandas as pd
from bs4 import BeautifulSoup

number = []
category = []
name = []
company = ['3M', 'Google', 'Apple']
for company_name in company:
    try:
        url = 'https://ca.indeed.com/cmp/' + company_name
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        rating = soup.find(class_='cmp-ReviewAndRatingsStory-rating')
        rating = rating.find('tbody')
        rows = rating.find_all('tr')
    except (requests.RequestException, AttributeError):
        # Skip this company if the request fails or the ratings table is missing,
        # rather than reusing the rows scraped for the previous company.
        continue
    for row in rows:
        cells = row.find_all('td')
        number.append(cells[0].text)
        category.append(cells[2].text)
        name.append(company_name)

# Build the dataframe once, after all companies have been scraped.
cols = {'Name': name, 'Rating': number, 'Category': category}
df = pd.DataFrame(cols)
print(df)

What the code produces:

      Name Rating                  Category
0       3M    3.8         Work/Life Balance
1       3M    3.9           Salary/Benefits
2       3M    3.5  Job Security/Advancement
3       3M    3.6                Management
4       3M    3.8                   Culture
5   Google    4.2         Work/Life Balance
6   Google    4.0           Salary/Benefits
7   Google    3.6  Job Security/Advancement
8   Google    3.9                Management
9   Google    4.2                   Culture
10   Apple    3.8         Work/Life Balance
11   Apple    4.1           Salary/Benefits
12   Apple    3.7  Job Security/Advancement
13   Apple    3.7                Management
14   Apple    4.1                   Culture

The result can be replicated with the code below:

import pandas as pd
name = ['3M','3M','3M','3M','3M','Google','Google','Google','Google','Google','Apple','Apple','Apple','Apple','Apple']
number = ['3.8','3.9','3.5','3.6','3.8','4.2','4.0','3.6','3.9','4.2','3.8','4.1','3.7','3.7','4.1']
category = ['Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture','Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture','Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture']
cols = {'Name':name,'Rating':number,'Category':category}
df = pd.DataFrame(cols)
print(df)

I didn't downvote, but I can think of some reasons why somebody might have. For one thing, the question is fundamentally about transposing a dataframe, so it seems unnecessary to put requests and BeautifulSoup code in your MCVE. Just provide code that produces the dataframe without requiring the user to pull the data from the web. Second, the code you do have is improperly formatted. When I run it, I get IndentationError: unindent does not match any outer indentation level on the cols = line.
– Kevin, Commented Jul 19, 2019 at 19:48
I see. Thank you for the added information. I have edited my initial question for clarity and fixed the code. Could you try again and advise?
– g3lo, Commented Jul 19, 2019 at 19:52
Thanks for fixing the IndentationError. Unfortunately, ca.indeed.com is blocked by my company firewall, so unless you provide code that creates the dataframe without scraping it from the Internet, I can't investigate further.
– Kevin, Commented Jul 19, 2019 at 19:54
I added the result of the code to my initial post. I'm not sure how to write code that would reproduce such results. Are you able to assist with the result I provided above?
– g3lo, Commented Jul 19, 2019 at 20:00

Kevin · Accepted Answer · 2019-07-19 21:04:55Z

1

Here's one possible approach.

import pandas as pd
name = ['3M','3M','3M','3M','3M','Google','Google','Google','Google','Google','Apple','Apple','Apple','Apple','Apple']
number = ['3.8','3.9','3.5','3.6','3.8','4.2','4.0','3.6','3.9','4.2','3.8','4.1','3.7','3.7','4.1']
category = ['Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture','Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture','Work/Life Balance',' Salary/Benefits','Job Security/Advancement','Management','Culture']
cols = {'Name':name,'Rating':number,'Category':category}
df = pd.DataFrame(cols)
print(df)



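from collections import defaultdict

# Maps each company name to a dict of {category: rating}.
aggregated_data = defaultdict(dict)
for idx, row in df.iterrows():
    aggregated_data[row.Name][row.Category] = row.Rating

# The outer dict keys (company names) become columns,
# so transpose to get one row per company.
result = pd.DataFrame(aggregated_data).T
print(result)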

Result:

        Salary/Benefits Culture Job Security/Advancement Management Work/Life Balance
3M                  3.9     3.8                      3.5        3.6               3.8
Google              4.0     4.2                      3.6        3.9               4.2
Apple               4.1     4.1                      3.7        3.7               3.8

I don't think this is the "idiomatic" approach. Since it uses native Python data types and loops, it's probably considerably slower than a pure pandas solution. But if your data isn't that big, maybe that's OK.
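
For reference, a pure pandas version might look something like the sketch below. It uses DataFrame.pivot to turn each category into a column, then reindex to restore first-appearance order (pivot sorts the labels), and finally rename_axis plus reset_index to get a 'company name' column. The variable names wide and categories are made up for illustration, and the stray leading space in ' Salary/Benefits' from the sample data is dropped here on the assumption that it is a typo.

```python
import pandas as pd

# Same long-format sample data as in the question (ratings kept as strings).
# Assumption: the leading space in ' Salary/Benefits' is a typo, so it is dropped.
categories = ['Work/Life Balance', 'Salary/Benefits',
              'Job Security/Advancement', 'Management', 'Culture']
name = ['3M'] * 5 + ['Google'] * 5 + ['Apple'] * 5
number = ['3.8', '3.9', '3.5', '3.6', '3.8',
          '4.2', '4.0', '3.6', '3.9', '4.2',
          '3.8', '4.1', '3.7', '3.7', '4.1']
df = pd.DataFrame({'Name': name, 'Rating': number, 'Category': categories * 3})

# One row per Name, one column per distinct Category.
wide = df.pivot(index='Name', columns='Category', values='Rating')

# pivot() sorts row and column labels; put them back in first-appearance order.
wide = wide.reindex(index=df['Name'].unique(), columns=df['Category'].unique())

# Label the index 'company name', clear the columns label, and make it a regular column.
wide = wide.rename_axis(index='company name', columns=None).reset_index()
print(wide)
```

Note that pivot raises an error if the same Name/Category pair appears more than once; in that case pivot_table with an explicit aggregation function would be the usual alternative.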

Edit: I think the transpose (.T) in the last step of my first snippet is causing the column names to be put in a surprising order, so here's an approach that constructs the final dataframe from a list of dicts instead.

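from collections import defaultdict

# Maps each company name to a dict of {category: rating}.
data_by_name = defaultdict(dict)
for idx, row in df.iterrows():
    data_by_name[row.Name][row.Category] = row.Rating

# One dict per company: the name first, then its ratings unpacked as extra keys.
aggregated_rows = [{"company name": name, **ratings} for name, ratings in data_by_name.items()]
# Each dict becomes one row; the dict keys become the column headers.
result = pd.DataFrame(aggregated_rows)
print(result)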

Result:

  company name Work/Life Balance  Salary/Benefits Job Security/Advancement Management Culture
0           3M               3.8              3.9                      3.5        3.6     3.8
1       Google               4.2              4.0                      3.6        3.9     4.2
2        Apple               3.8              4.1                      3.7        3.7     4.1
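
If the columns come out in a different order on your setup (this can depend on the Python and pandas versions, since the order relies on dict key order being preserved), one option is to pass the desired order explicitly through the columns argument of the DataFrame constructor. The sketch below uses a small, hypothetical stand-in for aggregated_rows with deliberately shuffled keys; column_order is a made-up name.

```python
import pandas as pd

# Hypothetical stand-in for the aggregated_rows list built above,
# with the keys of the second dict in a different order on purpose.
aggregated_rows = [
    {'company name': '3M', 'Work/Life Balance': '3.8', 'Culture': '3.8'},
    {'Culture': '4.2', 'company name': 'Google', 'Work/Life Balance': '4.2'},
]

# Spell out the column order instead of relying on dict key order.
column_order = ['company name', 'Work/Life Balance', 'Culture']

result = pd.DataFrame(aggregated_rows, columns=column_order)
print(result)
```

With the full data, column_order would list 'company name' followed by all five category names exactly as they appear in the Category column.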

edited Jul 19, 2019 at 21:04

answered Jul 19, 2019 at 20:24


4 Comments

Kevin Over a year ago

Hmm, I notice that the leftmost column isn't labeled with "company name", and the other columns aren't in the order that they first appear in the original data... I'm not sure why that is. I'll try to fix that later.

g3lo Over a year ago

Never mind it worked, "Rating" is what I needed. I had "rating". I will mark it as solved, however, if you have the opportunity to ensure that the values and headers are correct, I would greatly appreciate it.

g3lo Over a year ago

Wonderful! It works perfectly. Is there a way you can provide comments to the code so that I can learn what each line of code does?

g3lo Over a year ago

I noticed the new code doesn't produce the results in the same order as you're showing; I get a different order after running it.
