I am receiving data in HTML format and have to store it. Using BeautifulSoup in Python I can extract specific information, but I have to pass a class name in the filter, and this table has no class name. I want a dict like this: {"Product":"Choclate, Honey, Shampoo", "Quantity":"3, 2, 1", "Price":"45, 32, 16"}

The sample HTML renders like this:

Product   Quantity  Price
Choclate  3         ₹ 45.00
Honey     2         ₹ 32.00
Shampoo   1         ₹ 16.00

The underlying markup:
<table align="center" cellspacing="0" cellpadding="6" width="95%" style="border:0;color:#000000;line-height:150%;text-align:left;font:300 14px/30px &#39;Helvetica Neue&#39;,Helvetica,Arial,sans-serif" border=".5px"><thead><tr style="background:#efefef"><th scope="col" width="50%" style="text-align:left;border:1px solid #eee">Product</th> <th scope="col" width="30%" style="text-align:right;border:1px solid #eee">Quantity</th> <th scope="col" width="30%" style="text-align:right;border:1px solid #eee">Price</th> </tr></thead><tbody><tr width="100%"><td width="50%" style="text-align:left;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0;word-wrap:break-word">Choclate<br><small></small></td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0">3</td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:1px solid #eee;border-top:0"><span>₹ 45.00<br><small></small></span></td> </tr><tr width="100%"><td width="50%" style="text-align:left;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0;word-wrap:break-word">Honey<br><small></small></td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0">2</td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:1px solid #eee;border-top:0"><span>₹ 32.00<br><small></small></span></td> </tr><tr width="100%"><td width="50%" style="text-align:left;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0;word-wrap:break-word">Shampoo<br><small></small></td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid 
#eee;border-bottom:1px solid #eee;border-right:0;border-top:0">1</td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:1px solid #eee;border-top:0"><span>₹ 16.00<br><small></small></span></td> </tr></tbody><tfoot><tr><td scope="col" style="text-align:left;vertical-align:middle;border-left:0;border-bottom:0;border-right:0;border-top:0;word-wrap:break-word"></td

1 Answer

You don't have to give a class name. If it is the only table, simply search for the table tag. Otherwise, look at the surrounding HTML elements and the whole path from the <body> element to that table for classes, identifiers, or anything else that singles out this particular table. If all of that fails, you may have to search for a header cell containing the word Product and work your way up to the <table> element from there.
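A minimal sketch of the simplest case, assuming the table really is the only one in the document. The HTML string is a trimmed-down stand-in for the question's markup (styling attributes omitted), `table_to_columns` is a hypothetical helper name, and the explicit 'html.parser' argument is an assumption about which parser to use:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the question's table (styling attributes omitted).
HTML = """
<table>
  <thead><tr><th>Product</th> <th>Quantity</th> <th>Price</th></tr></thead>
  <tbody>
    <tr><td>Choclate<br><small></small></td> <td>3</td> <td><span>₹ 45.00<br></span></td></tr>
    <tr><td>Honey<br><small></small></td> <td>2</td> <td><span>₹ 32.00<br></span></td></tr>
    <tr><td>Shampoo<br><small></small></td> <td>1</td> <td><span>₹ 16.00<br></span></td></tr>
  </tbody>
</table>
"""


def table_to_columns(html):
    """Return {header: 'v1, v2, ...'} for the first table in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')  # no class name needed: first table found
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    columns = {header: [] for header in headers}
    # Only the body rows carry data; header and footer rows are skipped.
    for row in table.tbody.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        for header, value in zip(headers, cells):
            columns[header].append(value)
    return {header: ', '.join(values) for header, values in columns.items()}


print(table_to_columns(HTML))
```

Here the price cells keep their currency symbol, since the sketch only concerns locating the table without a class name.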

As I don't know the surrounding HTML, I'll show the fallback solution, which searches for the header cell with a specific text value:

#!/usr/bin/env python
from __future__ import absolute_import, division, print_function
from pprint import pprint
from bs4 import BeautifulSoup


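def main():
    # Name the parser explicitly so the result does not depend on which
    # parsers happen to be installed.
    with open('test.html') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')

    # Locate the header row via the cell whose text is exactly 'Product'.
    header_row_node = soup.find('th', text='Product').parent
    headers = list(header_row_node.stripped_strings)
    header2values = dict((h, list()) for h in headers)
    # Collect the cells of each body row into per-column lists.
    for row_node in header_row_node.find_parent('table').tbody('tr'):
        product, quantity, price = row_node.stripped_strings
        price = price.split()[-1]  # Just take the number part.
        for header, value in zip(headers, [product, quantity, price]):
            header2values[header].append(value)

    # items() works on Python 2 and 3; iteritems() exists only on Python 2.
    result = dict((h, ', '.join(vs)) for h, vs in header2values.items())
    pprint(result)

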
if __name__ == '__main__':
    main()

For the given test data (which I slightly corrected/completed before saving it as test.html) this prints:

{u'Price': u'45.00, 32.00, 16.00',
 u'Product': u'Choclate, Honey, Shampoo',
 u'Quantity': u'3, 2, 1'}
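
The question asked for prices such as 45 rather than 45.00. A minimal sketch of one way to get there, assuming Python 3 and that every price cell looks like '₹ 45.00' (a currency symbol, whitespace, then a decimal number); `price_to_text` is a hypothetical helper, not part of the code above:

```python
from decimal import Decimal


def price_to_text(raw):
    """Drop the currency symbol and any all-zero fractional part."""
    amount = Decimal(raw.split()[-1])  # '₹ 45.00' -> Decimal('45.00')
    if amount == amount.to_integral_value():
        return str(int(amount))        # whole amounts lose the '.00'
    return str(amount)                 # fractional amounts stay as written


prices = ['₹ 45.00', '₹ 32.00', '₹ 16.00']
print(', '.join(price_to_text(p) for p in prices))
```

Decimal is used instead of float so that a price with a real fractional part, such as 45.50, comes back exactly as it was written.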