I am using BeautifulSoup to scrape a table from which I only need one column. I have put the code in a function so that I can apply it to multiple pages more easily. Each call to the function appends one list to total_points, but when I convert this list of lists into a DataFrame, each list becomes a row instead of a column.

import pandas as pd
import requests
from bs4 import BeautifulSoup

total_points = []

def getTotalpoints(tag):
    url = f'https://www.procyclingstats.com/team/{tag}/analysis/start'
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")

    team_riders = soup.find_all("table", attrs={"class": "basic"})

    table = soup.findAll('table')[0]
    rows = table.findAll('tr')
    heading = table.find('tr')

    headings = []
    for item in heading.find_all("th"): # loop through all th elements
        # convert the th elements to text and strip "\n"
        item = (item.text).rstrip("\n")
        # append the clean column name to headings
        headings.append(item)
    headings_true = headings[4]
    # print(headings)

  
    points = []
    for row in rows[1:]:
        points.append(row.findAll('td')[4].text)

    total_points.append(points)
    
    return

getTotalpoints('astana-pro-team-2010')
getTotalpoints('astana-pro-team-2013')
getTotalpoints('astana-pro-team-2016')

print(total_points)

[['1372', '1076', '581', '579', '334', '288', '282', '222', '183', '146', '116', '106', '106', '102', '78', '77', '68', '54', '43', '41', '40', '38', '25', '11', '10', '5', '5'], ['2225', '838', '682', '538', '457', '456', '411', '410', '329', '286', '284', '237', '205', '196', '150', '114', '110', '109', '104', '72', '68', '67', '56', '46', '45', '28', '16', '10', '10'], ['1178', '849', '772', '701', '663', '572', '548', '530', '355', '267', '249', '247', '239', '200', '188', '175', '160', '133', '113', '109', '96', '75', '74', '68', '50', '40', '38', '37', '31', '5', '', '']]


df = pd.DataFrame(total_points)

print(df)

      0     1    2    3    4    5    6    7    8    9   ...  22  23  24  25  \
0  1372  1076  581  579  334  288  282  222  183  146  ...  25  11  10   5   
1  2225   838  682  538  457  456  411  410  329  286  ...  56  46  45  28   
2  1178   849  772  701  663  572  548  530  355  267  ...  74  68  50  40   

   26    27    28    29    30    31  
0   5  None  None  None  None  None  
1  16    10    10  None  None  None  
2  38    37    31     5       

  

How can I make every list its own column, with its values in the rows beneath it? I would like the result to look like this:

column 1 column 2 column 3
row 1    row 1      row 1
row 2    row 2      row 2
row 3    row 3      row 3
row 4    row 4      row 4
etc      etc        etc

So every list should be in its own column, rather than in its own row.

Thanks for your answers!

  • Hello, it would help if you could provide an example of what you want the result to look like. Commented Jan 28, 2021 at 14:20
  • Hi, I have updated my question with an example! I hope it makes sense. Commented Jan 28, 2021 at 14:31
  • Thank you for editing the question. The minimal expected output is great, but it doesn't match the input. If you could provide a matching minimal input, that would be ideal. Perhaps create the DataFrame from example data rather than fetching it. Commented Jan 28, 2021 at 14:40

1 Answer

If you know the column names, and the number of names matches the number of inner lists, you can do the following.

import pandas as pd

total_points = [
    [1, 2, 3, 4, 5],
    [4, 5, 6, 7, 8],
    [5, 6, 7, 8, 9],
]

col_names = ['col1', 'col2', 'col3']

df = pd.DataFrame(zip(*total_points), columns=col_names)
print(df)

Output

   col1  col2  col3
0     1     4     5
1     2     5     6
2     3     6     7
3     4     7     8
4     5     8     9

Here zip(*total_points) transposes the data: it groups the i-th elements of all inner lists into one tuple, so the DataFrame constructor treats each inner list as a column of the resulting dataframe.
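To make the transpose step concrete, here is a minimal sketch that looks only at what zip produces for the sample lists above, before anything is handed to pandas. The variable name rows is just an illustrative choice.

```python
total_points = [
    [1, 2, 3, 4, 5],
    [4, 5, 6, 7, 8],
    [5, 6, 7, 8, 9],
]

# zip(*total_points) unpacks the outer list and pairs up the i-th element
# of every inner list, so each tuple is one row of the final DataFrame.
rows = list(zip(*total_points))

print(rows[0])    # first row: one value taken from each inner list
print(len(rows))  # number of rows equals the length of the inner lists
```

Each tuple has one entry per inner list, which is why the number of column names has to match the number of inner lists.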

2 Comments

I have just one issue left. When I add the column names and zip the lists together, I lose some output. Originally my result had 31 rows, but after zipping, all the lists end at 27 rows, so some rows are missing. I think all the columns take the length of the first list, which is 27 rows long. Can I fix that?
@NielsO You can use zip_longest from the itertools module. Simply add from itertools import zip_longest and use zip_longest instead of zip.
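As a rough sketch of that suggestion, the example below uses shortened stand-ins for the scraped lists (only the first few values of each) and hypothetical column labels. zip_longest pads the shorter lists with None by default, so no rows are dropped.

```python
from itertools import zip_longest

import pandas as pd

# Shortened stand-ins for the scraped lists; note the unequal lengths.
total_points = [
    ['1372', '1076', '581'],
    ['2225', '838', '682', '538'],
    ['1178', '849', '772', '701', '663'],
]

# Hypothetical column labels, one per scraped team page.
col_names = ['astana_2010', 'astana_2013', 'astana_2016']

# zip_longest keeps going until the longest list is exhausted and
# fills the gaps in the shorter lists with None (the default fillvalue).
rows = list(zip_longest(*total_points))

df = pd.DataFrame(rows, columns=col_names)
print(df)
```

Plain zip stops at the shortest input, which is why the result was cut off at the length of the 27-item list. A different placeholder can be chosen through the fillvalue argument of zip_longest.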
