
Using requests and urllib3 I grabbed the "incomplete" source code of https://www.immowelt.de/liste/berlin/ladenflaechen . The source is incomplete because it only contains 4 listed items instead of 20. Looking at the resulting source, the following hint (line 2191) suggests a loading/pagination problem. The full source code I managed to get can be inspected here: https://pastebin.com/FgTd5Z2Y

<div class="error alert js-ErrorGeneric t_center padding_top_30" id="js-ui-items_loading_error" style="display: none;">
                        Unbekannter Fehler, bitte laden Sie die Seite neu oder versuchen Sie es später erneut.
</div>

Translating the error text: "Unknown error, please reload the page or try again later."

After that error, the source code for navigating to the next page follows. Sadly there is a gap of 16 items between page 1 and page 2.

I looked deeper into the requests and urllib3 libraries for anything that would help, and tried a streamed request instead of a simple get (a sketch of that attempt follows the code below). Sadly it didn't help in any way.

import requests
import urllib3
from bs4 import BeautifulSoup

url = "https://www.immowelt.de/liste/berlin/ladenflaechen"

# using requests: fetch the page and parse it
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")

# using urllib3: fetch the same page via a PoolManager
http = urllib3.PoolManager()
r = http.request('GET', url)
rip = r.data.decode('utf-8')
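
The streamed variant mentioned above looked roughly like this (a sketch; stream=True only defers downloading the response body, it does not execute JavaScript or trigger the site's follow-up requests, so it returns the same incomplete HTML):

# sketch of the stream attempt: same URL, body download deferred
with requests.get(url, stream=True) as resp:
    streamed_text = resp.text  # still the same partial page as before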

I expected to get all items on the page, yet only got the first 4. The source code suggests that a simple request will not load the entire page the way a browser does.
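
A quick way to confirm this is to count the listing containers in the initial response (a sketch assuming the listings carry the class listitem, the same selector used in the answer below):

# count listing containers in the initial GET response
# (assumes the ".listitem" class, as in the answer's selector)
print(len(soup.select('.listitem')))  # shows only 4 instead of 20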

1 Answer


The page does a POST request for more results. You can do an initial request to get the total result count and a follow-up POST to fetch all results at once. Note that I prefer the requests library, and a Session object gives the efficiency of re-using the connection.

import requests, re
from bs4 import BeautifulSoup as bs

# regex to pull the total result count out of the page's inline JSON
p = re.compile(r'search_results":(.*?),')

with requests.Session() as s:
    # initial GET to read the total number of results
    r = s.get('https://www.immowelt.de/liste/berlin/ladenflaechen')
    num_results = p.findall(r.text)[0]
    # POST to the list endpoint, requesting all results in one page
    body = {'query': 'geoid=108110&etype=5', 'offset': 0, 'pageSize': num_results}
    r = s.post('https://www.immowelt.de/liste/getlistitems', data=body)
    soup = bs(r.content, 'lxml')
    print(len(soup.select('.listitem')))
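
If you then want to pull data out of each result, a minimal sketch is to iterate over the .listitem containers; the inner markup of each item is not documented here, so this just dumps each item's text and you would adapt the selectors to the fields you need:

# minimal sketch: dump the text of each listing container
for item in soup.select('.listitem'):
    print(item.get_text(" ", strip=True))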

4 Comments

Thanks a lot. This looks just right. I had just got to the point of realizing I needed a POST as well, but figuring out how to write it was going to be the next task. I have some trouble finding the data within the soup right now, but I will figure it out tomorrow. Thanks a lot for your help! It made my day. Greetings!
Hello QHarr, I'm again struggling to get my next crawler together and wanted to use your approach. Sadly I can't get the POST right in order to find all results. Can you explain how you found the body 'query' and tag? I want to crawl 'immobilienscout24.de/gewerbe-flaechen/de/berlin/berlin/…' now :O
If you look on my profile page there is a series of links; a couple of those cover using the network tab to find the XHR requests that dynamically feed a page.
Okay, I found some other interesting hints that led me to my current solution using their API :) Thanks a lot for your contributions and many thanks for sharing the knowledge.
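
For readers wondering how a POST body like the one above is found in the first place: as the comment says, watch the browser's network tab (filtered to XHR) while the page loads more items, then replicate that request with requests. A generic sketch of the pattern follows; the endpoint and form fields are placeholders, not values for any particular site:

import requests

# placeholders copied from a DevTools "Network" entry, not real values
endpoint = 'https://example.com/some/list/endpoint'      # the XHR's request URL
payload = {'query': '...', 'offset': 0, 'pageSize': 20}  # the XHR's form data

with requests.Session() as s:
    resp = s.post(endpoint, data=payload)
    print(resp.status_code, len(resp.text))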
