I have:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html,'html.parser')

containers = page_soup.findAll("div",{"class":"listing-results-wrapper"}) 

listing_price = []
listing_nobed = []

for c in containers:
    listing_price.append(c.findAll("a",{"class":"listing-results-price text-price"}))
    listing_nobed.append(c.findAll("h3",{"class":"listing-results-attr"}))

print(listing_price[0])
print('----------------------------')
print(listing_nobed[0])

results:

[<a class="listing-results-price text-price" href="/for-sale/details/50924268">




        £500,000







                <span class="price-modifier">Offers over</span>
</a>]
----------------------------
[<h3 class="listing-results-attr">
<span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span> <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span> <span class="num-icon num-reception" title="2 reception rooms"><span class="interface"></span>2</span>
</h3>]

I want:

Price   NoBeds NoBaths NoRec
500,000 3      1       2
xxx     x      x       NaN

where xxx is the price, and so on. Some listings are missing the tag for one of these values; in that case, show NaN or 0.

I tried the approach from "Python - Beautiful Soup - Remove Tags" to extract the (3, 1, 2) values, to no avail.

To extract the price, I thought of using regex, but many comments here recommend against it.

I am still trying to understand HTML tags and data extraction, so any suggestions are greatly appreciated.

  • Are you looking for the .string attribute of the tag? Rather than appending the entire tag, it looks like you only intend to extract the text itself. Commented Mar 28, 2019 at 16:21
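
As a rough illustration of the comment above, the sketch below compares .string with stripped_strings and get_text() on markup copied from the question's printed results (whitespace shortened). It suggests that .string alone may not be enough here, because both tags contain more than one child; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Markup taken from the question's printed results, with the blank lines shortened
html = '''<a class="listing-results-price text-price" href="/for-sale/details/50924268">
        £500,000
        <span class="price-modifier">Offers over</span>
</a>
<span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span>'''

page = BeautifulSoup(html, 'html.parser')
price_tag = page.find('a', class_='text-price')
beds_tag = page.find('span', class_='num-beds')

# .string is None when a tag has more than one child (here: text plus a <span>)
price_string = price_tag.string

# stripped_strings yields every non-empty text fragment with whitespace removed
price_parts = list(price_tag.stripped_strings)

# get_text(strip=True) joins all descendant text; the inner "interface" span is empty
beds_text = beds_tag.get_text(strip=True)

print(price_string)
print(price_parts)
print(beds_text)
```

The first fragment of stripped_strings is the price and the second is the "Offers over" modifier, so the two can be kept apart without a regex.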

1 Answer

You can use the .next attribute (an alias for .next_element) to step to whatever node follows a tag, and strip() to clean up the text:

from bs4 import BeautifulSoup as soup
import requests
my_url='https://www.zoopla.co.uk/for-sale/property/london/west-wickham/?q=West%20Wickham%2C%20London&results_sort=newest_listings&search_source=home'

req = requests.get(my_url)
page_soup = soup(req.content,'html.parser')

containers = page_soup.findAll("div",{"class":"listing-results-wrapper"}) 

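for c in containers:
    a = c.find("a",{"class":"listing-results-price text-price"})
    b = c.find("h3",{"class":"listing-results-attr"})

    # Reset every field per listing so a missing tag shows NaN
    # instead of silently reusing the previous listing's value
    Price = NoBeds = NoBaths = NoRec = 'NaN'

    if a:
        # the first node inside <a> is the price text, padded with whitespace
        Price = a.next.strip()
    if b:
        NoBedsx = b.find('span',{'class':'num-icon num-beds'})
        NoBathsx = b.find('span',{'class':'num-icon num-baths'})
        NoRecx = b.find('span',{'class':'num-icon num-reception'})

        # .next is the empty <span class="interface">, .next.next is the number
        if NoBedsx:
            NoBeds = NoBedsx.next.next
        if NoBathsx:
            NoBaths = NoBathsx.next.next
        if NoRecx:
            NoRec = NoRecx.next.next

    # no .encode('utf-8') here: under Python 3 it would print b'...' bytes
    print('{} {} {} {}'.format(Price,NoBeds,NoBaths,NoRec))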

Output:

Price  NoBeds NoBaths NoRec
£500,000 3 1 2
£337,500 4 2 1
£875,000 5 2 2
£695,000 4 1 2
£190,000 1 1 1
£670,000 4 2 1
£610,000 3 2 2
£675,000 4 2 1
£580,000 4 2 1
£850,000 5 2 1
£185,000 1 2 1
£760,000 5 2 1
£675,000 3 2 1
£142,000 1 2 1
£550,000 2 2 1
£817,000 4 2 1
£139,000 1 2 1
£625,000 3 1 2
£145,000 1 1 2
£725,000 4 1 2
£799,995 4 1 2
£575,000 3 1 2
£465,000 3 1 2
£725,000 4 2 2
£465,000 4 2 2
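
For the table asked for in the question (price without the pound sign, NaN where a tag is missing), a minimal sketch might look like the following. It runs on an inline sample rather than the live page, and it assumes the class names shown in the question's output; the helper names text_or_nan and parse_listing, the SAMPLE string, and the href value are hypothetical and only there to make the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the question's markup; the reception span is
# left out on purpose to show the NaN fallback
SAMPLE = '''
<div class="listing-results-wrapper">
  <a class="listing-results-price text-price" href="/for-sale/details/1">
      £500,000 <span class="price-modifier">Offers over</span></a>
  <h3 class="listing-results-attr">
    <span class="num-icon num-beds" title="3 bedrooms"><span class="interface"></span>3</span>
    <span class="num-icon num-baths" title="1 bathroom"><span class="interface"></span>1</span>
  </h3>
</div>
'''

def text_or_nan(parent, css_class):
    """Return the stripped text of the first matching span, or 'NaN' if absent."""
    tag = parent.find('span', class_=css_class) if parent is not None else None
    return tag.get_text(strip=True) if tag is not None else 'NaN'

def parse_listing(container):
    """Build one table row (a dict of strings) from a listing container."""
    price_tag = container.find('a', class_='text-price')
    attrs = container.find('h3', class_='listing-results-attr')
    price = 'NaN'
    if price_tag is not None:
        # the first non-empty text fragment is the price; drop the pound sign
        price = next(price_tag.stripped_strings, 'NaN').lstrip('£')
    return {
        'Price': price,
        'NoBeds': text_or_nan(attrs, 'num-beds'),
        'NoBaths': text_or_nan(attrs, 'num-baths'),
        'NoRec': text_or_nan(attrs, 'num-reception'),
    }

page = BeautifulSoup(SAMPLE, 'html.parser')
rows = [parse_listing(c) for c in page.find_all('div', class_='listing-results-wrapper')]

print('Price NoBeds NoBaths NoRec')
for row in rows:
    print('{Price} {NoBeds} {NoBaths} {NoRec}'.format(**row))
```

Because each row is a plain dict, the list can be handed to a table library later if needed; swapping 'NaN' for 0 only requires changing the two fallback values.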

1 Comment

Love it! Thanks!!