
Using requests and urllib3 I grabbed the "incomplete" source code of https://www.immowelt.de/liste/berlin/ladenflaechen . The source is incomplete because it only contains 4 listed items instead of 20. Looking at the resulting source, the following hint (line 2191) suggests a loading/pagination problem. The full source code I managed to get can be inspected here: https://pastebin.com/FgTd5Z2Y

<div class="error alert js-ErrorGeneric t_center padding_top_30" id="js-ui-items_loading_error" style="display: none;">
                        Unbekannter Fehler, bitte laden Sie die Seite neu oder versuchen Sie es später erneut.
</div>

Translating the error text: "Unknown error, please reload the page or try again later."

After that error, the source code for navigating to the next page follows. Sadly there is a gap of 16 items between page 1 and page 2.

I looked deeper into the requests and urllib3 libraries for anything that would help, and tried a streamed request instead of a simple get (a sketch of that attempt follows the code below). Sadly it didn't help in any way.

import requests
import urllib3
from bs4 import BeautifulSoup

url = "https://www.immowelt.de/liste/berlin/ladenflaechen"

# using requests: fetch the page and parse it
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, features="html.parser")

# using urllib3: fetch the same page via a PoolManager
http = urllib3.PoolManager()
r = http.request('GET', url)
rip = r.data.decode('utf-8')
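
The streamed variant mentioned above looked roughly like this (a sketch; stream=True only defers downloading the response body, it does not execute JavaScript or trigger the site's follow-up requests, so it returns the same incomplete HTML):

# sketch of the stream attempt: same URL, body download deferred
with requests.get(url, stream=True) as resp:
    streamed_text = resp.text  # still the same partial page as before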

I expected to get all items on the page, yet only got the first 4. The source code suggests that a simple request will not load the entire page the way a browser does.
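
A quick way to confirm this is to count the listing containers in the initial response (a sketch assuming the listings carry the class listitem, the same selector used in the answer below):

# count listing containers in the initial GET response
# (assumes the ".listitem" class, as in the answer's selector)
print(len(soup.select('.listitem')))  # shows only 4 instead of 20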

1 Answer


The page does a POST request for more results. You can do an initial request to get the total result count and a follow-up POST to fetch all results at once. Note that I prefer the requests library, and a Session object gives the efficiency of re-using the connection.

import requests, re
from bs4 import BeautifulSoup as bs

# regex to pull the total result count out of the page's inline JSON
p = re.compile(r'search_results":(.*?),')

with requests.Session() as s:
    # initial GET to read the total number of results
    r = s.get('https://www.immowelt.de/liste/berlin/ladenflaechen')
    num_results = p.findall(r.text)[0]
    # POST to the list endpoint, requesting all results in one page
    body = {'query': 'geoid=108110&etype=5', 'offset': 0, 'pageSize': num_results}
    r = s.post('https://www.immowelt.de/liste/getlistitems', data=body)
    soup = bs(r.content, 'lxml')
    print(len(soup.select('.listitem')))
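
If you then want to pull data out of each result, a minimal sketch is to iterate over the .listitem containers; the inner markup of each item is not documented here, so this just dumps each item's text and you would adapt the selectors to the fields you need:

# minimal sketch: dump the text of each listing container
for item in soup.select('.listitem'):
    print(item.get_text(" ", strip=True))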

4 Comments

Thanks a lot. This looks just right. I had just got to the point of realizing I needed a POST as well, but figuring out how to write it was going to be the next task. I have some trouble finding the data within the soup right now, but I will figure it out tomorrow. Thanks a lot for your help! It made my day. Greetings!
Hello QHarr, I'm again struggling to get my next crawler together and wanted to use your approach. Sadly I can't get the POST right in order to find all results. Can you explain how you found the body 'query' and tag? I want to crawl 'immobilienscout24.de/gewerbe-flaechen/de/berlin/berlin/…' now :O
If you look on my profile page there is a series of links; a couple of those cover using the network tab to find the XHR requests that dynamically feed a page.
Okay, I found some other interesting hints that led me to my current solution using their API :) Thanks a lot for your contributions and many thanks for sharing the knowledge.
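
For readers wondering how a POST body like the one above is found in the first place: as the comment says, watch the browser's network tab (filtered to XHR) while the page loads more items, then replicate that request with requests. A generic sketch of the pattern follows; the endpoint and form fields are placeholders, not values for any particular site:

import requests

# placeholders copied from a DevTools "Network" entry, not real values
endpoint = 'https://example.com/some/list/endpoint'      # the XHR's request URL
payload = {'query': '...', 'offset': 0, 'pageSize': 20}  # the XHR's form data

with requests.Session() as s:
    resp = s.post(endpoint, data=payload)
    print(resp.status_code, len(resp.text))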
