URL Targeted web crawler

I am building a bit of code that takes a specific Tumblr blog and iterates over post numbers sequentially, checking whether each page exists. If a page exists, the script writes the full URL to a text file; if not, it skips it.

I was able to determine that when a post does not exist, Tumblr renders a div with a specific class name containing the "post not found" message. That means I can check whether that div exists and, if it does not, save the URL.

The problem I am running into is that BeautifulSoup, Selenium, and Playwright all fail to find the div no matter what I try. I have added delayed starts in case the page was still loading, but I am starting to run out of ideas. Does anyone have any thoughts on this one?

The code is fairly basic: it checks for a status code of 200 to ensure the server processed the request and responded, then searches the page elements for the class name. The next steps depend on the outcome of that check.

**Edit:** I was asked to include more information about what the script is searching for, so I am copying this response over from the comments for clarity.

If you pick any Tumblr post and increase the post number by 1, the result is normally not a working link. For example, tumblr.com/funny-text-posts/664730352044179456 works, while tumblr.com/funny-text-posts/664730352044179457 does not.

You'll notice that you get one of five specific text responses:

- This post went to heaven.
- This post is gone, gone, gone. But there's more, more, more, on funny-text-posts's Tumblr.
- This post has ceased to exist.
- You're too late. This post is no more.
- This post isn't here anymore, but the Tumblr still is.

If you search for that text in the inspector, you will find the class name only on pages showing the automated response. And before anyone asks: yes, I have also tried searching for the text directly by adding all five responses to a list, which also failed.
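As a debugging baseline, a plain-text check for those phrases (assuming they appear verbatim in the served HTML) might look like the sketch below. If this fails on a page fetched with requests, the tombstone markup is probably injected client-side by JavaScript, which would also explain why the class-name check never matches.

```python
# The five "post not found" messages quoted above, assumed to appear
# verbatim in the served HTML (they may instead be rendered client-side,
# in which case a plain HTTP fetch will never contain them).
TOMBSTONES = [
    "This post went to heaven.",
    "This post is gone, gone, gone.",
    "This post has ceased to exist.",
    "You're too late. This post is no more.",
    "This post isn't here anymore, but the Tumblr still is.",
]

def looks_deleted(html):
    """Return True if any tombstone phrase appears in the raw HTML."""
    return any(phrase in html for phrase in TOMBSTONES)
```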

import requests
from bs4 import BeautifulSoup

def page_has_div_class(url, class_name):
    try:
        r = requests.get(url, timeout=10)
        if r.status_code != 200:
            return False  # Page not valid

        soup = BeautifulSoup(r.text, "html.parser")

        # Find ANY <div> whose full class attribute matches exactly
        for div in soup.find_all("div"):
            if div.get("class") and " ".join(div.get("class")) == class_name:
                return True

        return False

    except Exception as e:
        print("Error:", e)
        return False

url = "https://www.tumblr.com/BLOG_NAME_TO_SCAN/posts/1"
class_name = "XLWxA H4bQ8"

if page_has_div_class(url, class_name):
    print("Found the div class!")
else:
    print("Class not found on page.")
