
I have a bit of code I am trying to build that takes a specific Tumblr blog and iteratively scans sequential post numbers, checking whether each page exists. If it does, the script prints the full URL to a .txt file; if not, it skips it.

I was able to determine that when a post does not exist, Tumblr renders a div with a specific class name containing the "post not found" message, which means I can check whether that div exists and, if it doesn't, save the URL.

The problem I am running into is that BeautifulSoup, Selenium, and Playwright do not want to work no matter what. I have tried doing delayed starts in case the page was still loading, but I am starting to run out of ideas. Anyone have any thoughts on this one?

The code is fairly basic: it checks for a status code of 200 to ensure the server processed the request and responded, then searches the page elements for the class name. The next steps depend on the outcome of that logic test.

**Edit:** I was asked to include more info about what it is searching for, so I am copying this response over from the comments to provide greater clarity.

If you pick any Tumblr post and increment the post number by 1, the result will normally not be a working link. E.g. tumblr.com/funny-text-posts/664730352044179456 works and tumblr.com/funny-text-posts/664730352044179457 does not.

You'll notice that you will get one of 5 specific text responses:

  • This post went to heaven.
  • This post is gone, gone, gone. But there's more, more, more, on funny-text-posts's Tumblr.
  • This post has ceased to exist.
  • You're too late. This post is no more.
  • This post isn't here anymore, but the Tumblr still is.

If you search for that text in the inspector, you will find the class name only on the page where the automated response is listed. And before anyone asks: yes, I have also tried searching for the text directly by adding all 5 responses to an array, which also failed.
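For reference, a plain-substring version of that text-array check might look like the sketch below (the function name is mine; the messages are the five listed above, with the blog-specific tail of the second one trimmed). Note that if Tumblr injects these messages with JavaScript after page load, they will never appear in the HTML that `requests` downloads, which would explain why this approach failed:

```python
# The five "post not found" messages from the question; the second is
# truncated to its stable prefix since the rest names the specific blog.
REMOVAL_MESSAGES = [
    "This post went to heaven.",
    "This post is gone, gone, gone.",
    "This post has ceased to exist.",
    "You're too late. This post is no more.",
    "This post isn't here anymore, but the Tumblr still is.",
]

def looks_removed(html: str) -> bool:
    """Return True if the raw HTML contains any of the removal messages."""
    return any(msg in html for msg in REMOVAL_MESSAGES)
```

This only inspects the server-rendered HTML string, so it sidesteps any class-name matching, but it is subject to the same limitation: it cannot see content added client-side.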

import requests
from bs4 import BeautifulSoup

def page_has_div_class(url, class_name):
    try:
        r = requests.get(url, timeout=10)
        if r.status_code != 200:
            return False  # Page not valid

        soup = BeautifulSoup(r.text, "html.parser")

        # Find ANY <div> whose full class attribute matches exactly
        for div in soup.find_all("div"):
            if div.get("class") and " ".join(div.get("class")) == class_name:
                return True

        return False

    except Exception as e:
        print("Error:", e)
        return False

url = "https://www.tumblr.com/BLOG_NAME_TO_SCAN/posts/1"
class_name = "XLWxA H4bQ8"

if page_has_div_class(url, class_name):
    print("Found the div class!")
else:
    print("Class not found on page.")
  • Have you tried debugging what's actually consumed by your parser? I suspect that when seeing an odd user agent/behavior, the server returns a captcha or other anti-scraping protection. Commented Nov 23 at 6:39
  • @Martheen Unfortunately there is nothing to debug. Since it is only 1 function it just passes. There isn't really a step-through option when there is only 1 step. As for the first part of your question, I don't know how to examine what exactly is being pulled by the parser. I am fairly new to this so if you want to point me toward some resources I would be very grateful. Commented Nov 23 at 6:44
  • In your example, the soup can be printed right away, or prettified first (crummy.com/software/BeautifulSoup/bs4/doc/#pretty-printing). Note that BeautifulSoup won't run any JS that might be used for scraper detection. Selenium & Playwright might work in such a case, assuming the JS code doesn't also try to find the telltales of a controlled browser. Commented Nov 23 at 7:09
  • Can you please share whether you're getting a "Class not found on page." or an error? And if you're getting an error can you share which error? Commented Nov 23 at 7:16
  • 1
    This code works for me: print(BeautifulSoup(requests.get("https://www.tumblr.com/funny-text-posts/664730352044179457").text, "lxml").select_one(".XLWxA.H4bQ8")) Commented Nov 23 at 17:35
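Following the debugging suggestion in the comments above, a minimal helper for inspecting what the HTTP layer actually hands to the parser might look like this (the helper name is mine; the `requests` usage shown in the comments is hypothetical):

```python
def summarize_response(status_code, final_url, text, n=1500):
    """Condense a fetched page into the three facts worth checking first:
    the status code, where any redirects ended up, and the start of the
    raw HTML -- i.e. exactly what BeautifulSoup would be parsing."""
    return {
        "status": status_code,
        "final_url": final_url,  # differs from the requested URL if redirected (e.g. to a captcha/login page)
        "head": text[:n],        # first n characters of the raw HTML
    }

# Hypothetical usage with the script's existing requests call:
# r = requests.get(url, timeout=10)
# print(summarize_response(r.status_code, r.url, r.text))
```

If `head` turns out to be a captcha page, a login wall, or a near-empty HTML shell that loads everything via JS, that would explain why the class lookup never matches.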
