3

I am trying to scrape an HTML table using Python and Beautiful Soup. The HTML page contains many tables, and each table has many rows. I want each row to have a different name, and if a row has several columns, I want them kept separate.

My code looks like this:

page = get("https://www.4dpredict.com/mysingaporetoto.p3.html")
html = BeautifulSoup(page.content, 'html.parser')
result = defaultdict(list)
tables = html.find_all('table')
for table in tables:
    for row in table.find_all('tr')[0:15]:
        try:
            pass  # stuck here
        except ValueError:
            continue  # blank/empty row

Need some guidance on this.

4 Answers

1

If I understood your requirement correctly, the following script should do the trick:

import requests
from bs4 import BeautifulSoup

url = 'https://www.4dpredict.com/mysingaporetoto.p3.html'

res = requests.get(url).text
soup = BeautifulSoup(res, 'lxml')
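# Number every table row on the page, starting from 1
for num, row in enumerate(soup.select("table tr"), start=1):
    # Row number first, then the stripped text of each cell in the row
    data = [f'{num}'] + [item.get_text(strip=True) for item in row.select("td")]
    print(data)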

Partial output:

['1', 'SINGAPORE TOTO2018-08-23 (Thu) 3399']
['2', 'WINNING NUMBERS']
['3', '02', '03', '23', '30', '39', '41']
['4', 'ADDITIONAL']
['5', '19']
['6', 'Prize:$2,499,788']
['7', 'WINNING SHARES']
['8', 'Group', 'Share Amt', 'Winners']
['9', 'Group 1', '$1,249,894', '2']
['10', 'Group 2', '$', '-']
['11', 'Group 3', '$1,614', '124']
['12', 'Group 4', '$344', '318']
['13', 'Group 5', '$50', '6,876']
['14', 'Group 6', '$25', '9,092']

4 Comments

If you are using the latest version of Python, I doubt there is any error. Check the output the script produces.
How do I fix that?
Try replacing [f'{num}'] with ['%s' % num]. f-strings need Python 3.6 or later.
Then I failed to understand your requirement. Thanks.
1

Please check the code below and let me know if it doesn't work:

import requests
from bs4 import BeautifulSoup
import pprint
page = requests.get("https://www.4dpredict.com/mysingaporetoto.p3.html")
html = BeautifulSoup(page.content, 'html.parser')

tables = html.find_all('table')
table_data = dict()
for table_id, table in enumerate(tables):
    print('[!] Scraping Table -', table_id + 1)
    table_data['table_{}'.format(table_id+1)] = dict()
    table_info = table_data['table_{}'.format(table_id+1)]
    for row_id, row in enumerate(table.find_all('tr')):
        col = []
        for val in row.find_all('td'):
            val = val.text
            val = val.replace('\n', '').strip()
            if val:
                col.append(val)
        table_info['row_{}'.format(row_id+1)] = col
    pprint.pprint(table_info)
    print('+-+' * 20)

pprint.pprint(table_data)

Sample Output

[!] Scraping Table - 1
{'row_1': ['SINGAPORE TOTO2018-08-23 (Thu) 3399'],
 'row_10': ['Group 2', '$', '-'],
 'row_11': ['Group 3', '$1,614', '124'],
 'row_12': ['Group 4', '$344', '318'],
 'row_13': ['Group 5', '$50', '6,876'],
 'row_14': ['Group 6', '$25', '9,092'],
 'row_15': ['Group 7', '$10', '117,080'],
 'row_16': ['SHOW ANALYSISEVEN : ODD, 2 : 5SUM :138, AVERAGE :23 MIN :02, MAX '
            ':41, DIFF :39',
            'EVEN : ODD, 2 : 5',
            'SUM :138, AVERAGE :23',
            'MIN :02, MAX :41, DIFF :39'],
 'row_17': ['EVEN : ODD, 2 : 5'],
 'row_18': ['SUM :138, AVERAGE :23'],
 'row_19': ['MIN :02, MAX :41, DIFF :39'],
 'row_2': ['WINNING NUMBERS'],
 'row_3': ['02', '03', '23', '30', '39', '41'],
 'row_4': ['ADDITIONAL'],
 'row_5': ['19'],
 'row_6': ['Prize: $2,499,788'],
 'row_7': ['WINNING SHARES'],
 'row_8': ['Group', 'Share Amt', 'Winners'],
 'row_9': ['Group 1', '$1,249,894', '2']}
+-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-+
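
Note that the sample output lists row_10 before row_2. That happens because pprint sorts dictionary keys as strings by default, not because the rows were scraped out of order. The sketch below shows one way to print rows in scraping order instead, using the sort_dicts argument (available in Python 3.8 and later). The table_info contents here are hypothetical placeholders, not data from the page.

```python
import pprint

# Hypothetical stand-in for one scraped table; the keys mirror the
# row_N naming used in the answer above.
table_info = {'row_{}'.format(i): ['value {}'.format(i)] for i in range(1, 12)}

# Default behaviour: keys are sorted as strings, so row_10 precedes row_2.
sorted_text = pprint.pformat(table_info)

# sort_dicts=False (Python 3.8+) keeps insertion order, which is the
# order in which the rows were added while scraping.
ordered_text = pprint.pformat(table_info, sort_dicts=False)

print(ordered_text)
```

The dictionary itself already preserves insertion order in Python 3.7 and later, so only the printing step needs to change.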

0

I would suggest dropping BeautifulSoup (beautiful though it is) for this task and using pandas, which uses lxml or BeautifulSoup as its parsing back end. Reading HTML tables into DataFrames is bog standard with pandas (see pandas.read_html in the docs).
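
As a rough illustration of the pandas route, here is a minimal sketch using pandas.read_html on a small inline HTML string. The HTML below is a hypothetical stand-in for the real page, with values copied from the sample output earlier in this thread; it assumes pandas is installed together with a parser such as lxml. Whether the actual site's tables parse this cleanly is untested.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: one small table whose
# values are copied from the sample output shown earlier in this thread.
html = """
<table>
  <tr><th>Group</th><th>Share Amt</th><th>Winners</th></tr>
  <tr><td>Group 1</td><td>$1,249,894</td><td>2</td></tr>
  <tr><td>Group 3</td><td>$1,614</td><td>124</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))

# Give each table a distinct name; columns stay separate in the DataFrame.
for table_id, df in enumerate(tables, start=1):
    print('table_{}'.format(table_id))
    print(df)
```

For the real page, the same call could presumably be given the text fetched with requests, wrapped in StringIO in the same way.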

0

I suggest using requests.get() instead of the bare get() method.
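
For context, the two spellings differ only in how the function is imported. The sketch below assumes the requests library is installed and makes no network call; it only checks that both names point at the same function.

```python
import requests
from requests import get

# Both names are bound to the same function object, so get(url) and
# requests.get(url) behave identically once the matching import is present.
same_function = get is requests.get
print(same_function)
```

A bare get(...) call therefore works only if the script contains from requests import get; otherwise it raises a NameError.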

3 Comments

Could you please use some of the OP's code to enhance your answer, to make clear which line of code answers the OP's question?
The OP seems to have used the requests library, perhaps importing get from it with from requests import get. I still can't see how this one-line comment answers the question.
Thanks, SIM, for the suggestion. I am new to Python as well as Stack Overflow, and I am trying to learn and solve problems.
