Extract information from html tags using beautiful soup python

Question

I have:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html,'html.parser')

containers = page_soup.findAll("div",{"class":"listing-results-wrapper"}) 

listing_price = []
listing_nobed = []

for c in containers:
    listing_price.append(c.findAll("a",{"class":"listing-results-price text-price"}))
    listing_nobed.append(c.findAll("h3",{"class":"listing-results-attr"}))

print(listing_price[0])
print('----------------------------')
print(listing_nobed[0])

results:

[<a class="listing-results-price text-price" href="/for-sale/details/50924268">




        £500,000







                <span class="price-modifier">Offers over</span>
</a>]
----------------------------
[<h3 class="listing-results-attr">
<span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span> <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span> <span class="num-icon num-reception" title="2 reception rooms"><span class="interface"></span>2</span>
</h3>]

I want:

Price   NoBeds NoBaths NoRec
500,000 3      1       2
xxx     x      x       NaN

Where xxx is the price, etc. Some of the values do not have a tag, so if that is the case, then show NaN or 0

I tried Python - Beautiful Soup - Remove Tags to to extract the (3,1,2) values to no avail.

To extract the price, I thought of using regex, but found many comments here do not recommend it.

I am still trying to understand html tags and data extractions, so any suggestions are greatly appreciated.

Are you looking for the .string attribute of the tag? rather than appending the entire tag it looks like you only inted to extract the text itself — G. Anderson
– G. Anderson, Commented Mar 28, 2019 at 16:21

Sohan Das · Accepted Answer · 2019-03-28 16:24:05Z

You can use next() to find any next elements and for cleaning text() strip()

from bs4 import BeautifulSoup as soup
import requests
my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'

req = requests.get(my_url)
page_soup = soup(req.content,'html.parser')

containers = page_soup.findAll("div",{"class":"listing-results-wrapper"}) 

for c in containers:
    a = c.find("a",{"class":"listing-results-price text-price"})
    b = c.find("h3",{"class":"listing-results-attr"})

    NoBedsx = b.find('span',{'class':'num-icon num-beds'})
    NoBathsx = b.find('span',{'class':'num-icon num-baths'})
    NoRecx = b.find('span',{'class':'num-icon num-reception'})

    if a:
        Price = a.next.strip().encode('utf-8')
    if NoBedsx:
        NoBeds = NoBedsx.next.next.encode('utf-8')
    if NoBathsx:
        NoBaths = NoBathsx.next.next.encode('utf-8')
    if NoRecx:
        NoRec = NoRecx.next.next.encode('utf-8')
    print('{} {} {} {}'.format(Price,NoBeds,NoBaths,NoRec))

Output:

Price  NoBeds NoBaths NoRec
£500,000 3 1 2
£337,500 4 2 1
£875,000 5 2 2
£695,000 4 1 2
£190,000 1 1 1
£670,000 4 2 1
£610,000 3 2 2
£675,000 4 2 1
£580,000 4 2 1
£850,000 5 2 1
£185,000 1 2 1
£760,000 5 2 1
£675,000 3 2 1
£142,000 1 2 1
£550,000 2 2 1
£817,000 4 2 1
£139,000 1 2 1
£625,000 3 1 2
£145,000 1 1 2
£725,000 4 1 2
£799,995 4 1 2
£575,000 3 1 2
£465,000 3 1 2
£725,000 4 2 2
£465,000 4 2 2

Collectives™ on Stack Overflow

Extract information from html tags using beautiful soup python

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related