51,743 questions
0
votes
1
answer
35
views
BeautifulSoup - Extracting content blocks after specific subheadings within a larger section, ignoring document introduction
I am scraping the Dead by Daylight Fandom wiki (specifically TOME pages, e.g., https://deadbydaylight.fandom.com/wiki/Tome_1_-_Awakening) to extract memory logs.
The goal is to extract the Memory ...
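The usual BeautifulSoup pattern for this (a minimal sketch against hypothetical markup, since the wiki's actual structure isn't shown in the excerpt) is to find the subheading and walk its following siblings until the next heading:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a TOME page section
html = """
<h3>Memory 1</h3>
<p>First log entry.</p>
<p>Second log entry.</p>
<h3>Memory 2</h3>
<p>Another entry.</p>
"""

def logs_under(soup, heading_text):
    """Collect the text of elements between the given <h3> and the next <h3>."""
    heading = soup.find("h3", string=heading_text)
    parts = []
    for sib in heading.find_next_siblings():
        if sib.name == "h3":          # stop at the next subheading
            break
        parts.append(sib.get_text(strip=True))
    return parts

soup = BeautifulSoup(html, "html.parser")
print(logs_under(soup, "Memory 1"))  # ['First log entry.', 'Second log entry.']
```

Because the walk starts at the target subheading rather than the document root, the introduction section never enters the loop.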
-2
votes
0
answers
51
views
How to run the Firefox browser using launch_persistent_context in Playwright Python [closed]
from playwright.sync_api import sync_playwright
profile_path = r"C:\Users\kdutt\AppData\Roaming\Mozilla\Firefox\Profiles\p283dicx.default-release"
firefox_path = r"C:\Program Files\...
3
votes
1
answer
40
views
Nodriver does not take exception if element not found?
I am trying to search for elements on a webpage and have used various methods, including text and XPath. It seems that the timeout option does not work the way I expected, and no exception is raised ...
0
votes
2
answers
202
views
Beautiful Soup, children are clearly inside but can't get it
From the structure below I only want the value of the href attribute. But rec_block is returning the h5 element without its children, so basically <h5 class="series">Recommendations</h5>.
<...
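When a heading tag comes back without the links, the links are often siblings of the heading rather than its children. A sketch against hypothetical markup (the real page may differ) that pulls every href out of the surrounding block instead:

```python
from bs4 import BeautifulSoup

# Hypothetical structure: the <h5> is only a heading; the links live beside it
html = """
<div class="rec_block">
  <h5 class="series">Recommendations</h5>
  <ul>
    <li><a href="/title/1">One</a></li>
    <li><a href="/title/2">Two</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="rec_block")
# CSS selector scoped to the block: every anchor that carries an href
hrefs = [a["href"] for a in block.select("a[href]")]
print(hrefs)  # ['/title/1', '/title/2']
```

If the page really does nest the anchors under the `<h5>`, the same `select("a[href]")` call on the heading would find them, so scoping to the enclosing block is the safer default.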
0
votes
0
answers
46
views
UPS fuel surcharge history extracting [closed]
I previously extracted the US fuel surcharge history using this JSON endpoint:
https://www.ups.com/assets/resources/fuel-surcharge/us.json
But, it stopped updating data after 9/22/2025.
How can I ...
0
votes
0
answers
77
views
URL Targeted web crawler [closed]
I have a bit of code I am trying to build that takes a specific Tumblr page, iteratively scans post numbers sequentially, and checks whether a page exists. If it does, it will print that full URL to ...
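One way to sketch the sequential probe (using a hypothetical blog URL; Tumblr may rate-limit or reject HEAD requests, so treat this as a starting point, not a finished crawler):

```python
import urllib.request
from urllib.error import HTTPError, URLError

BASE = "https://example.tumblr.com/post/{}"   # hypothetical blog base URL

def candidate_urls(start, count):
    """Build the sequential post URLs to probe."""
    return [BASE.format(n) for n in range(start, start + count)]

def exists(url):
    """True if the page answers 200; HEAD keeps each probe cheap."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        return False

urls = candidate_urls(1000, 3)
print(urls[0])  # https://example.tumblr.com/post/1000

# Network probing, run separately:
# for url in urls:
#     if exists(url):
#         print(url)
```

Adding a small `time.sleep` between probes is usually worth it to avoid being blocked.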
2
votes
0
answers
99
views
How to stop/kill achieved Scrapy spider instance within RStudio
I'm making a tutorial on how to scrape with Scrapy. For that, I use Quarto/RStudio and the website https://quotes.toscrape.com/. For pedagogic purposes, I need to run a first crawl on the first page, ...
Advice
0
votes
4
replies
43
views
How to fetch a real-time news data feed
I want to know how I can get live (Indian) news feed data with minimal latency (30-40 s). I tried some RSS feeds, but they all provide the data with some latency, so what I ...
0
votes
0
answers
48
views
Camoufox browser window remains visible in WSL even when `headless` is set to `virtual`
Description
When headless is set to "virtual", the Camoufox browser window still appears on the screen in ...
1
vote
0
answers
86
views
Invoke-WebRequest URL encoding
I want to retrieve content from a web page. I tried the method above, but the error still occurs when the query string contains Chinese characters.
code
$json = Get-Content -Encoding utf8 -Path "./...
-4
votes
2
answers
75
views
How can I get BBFC ratings in python? [closed]
I am trying to write code to give me BBFC film ratings. I am using selenium to do this but would be happy with any solution that works reliably. After a lot of work I finally came up with this code:
#...
0
votes
1
answer
211
views
Fetch data from https://www.sofascore.com/?
This is my Python code, run on Ubuntu, to try to fetch and extract data from
https://www.sofascore.com/
I created this test code before using it on an E2 device in my plugin
# python3 -m venv venv
# source venv/...
0
votes
1
answer
72
views
Using HTTPkerberosauth with a javascript enabled web scraper
I'm working on integration tests for a web application that's running in a Docker container within our GitLab CI/CD pipeline. The application is a frontend that requires Kerberos/SPNEGO authentication ...
0
votes
1
answer
65
views
Scrapy handle status 202
I'm quite new to web scraping, and in particular in using Scrapy's spiders, pipelines...
I'm getting 202 statuses from some spider requests' responses, so the page content is not available yet
...
-1
votes
1
answer
47
views
How to loop an Apps Script / Cheerio web scraper over multiple urls? [closed]
I have this Apps Script / Cheerio function that successfully scrapes the data I want from the url. The site only displays 25 entries at this url. I can find additional entries on subsequent pages (by ...
1
vote
0
answers
40
views
Docsearch Typesense scraper only finds records on Docusaurus landing page
Problem
I’m using Docusaurus with Typesense and the docsearch-typesense-scraper to index my documentation site.
Everything runs fine — the sitemap is found, and the scraper produces records.
However, ...
0
votes
0
answers
149
views
How can I reliably scrape the Meta Ads Library for the latest ad launches?
I’m building a scraper to monitor the Meta (Facebook) Ads Library for new ads as soon as they start running.
From inspecting network requests, I see that the Ads Library web app uses a GraphQL ...
2
votes
1
answer
43
views
Cannot access 'iwe-autocomplete' element in html with selenium
Website photo with search box visible.
So, this is the website
https://sa.ucla.edu/ro/public/soc
There is a dropdown menu for selecting the subject area where I need to type the subject, and I will receive ...
0
votes
0
answers
126
views
Downloading Barchart.com table using Excel VBA
I'm trying to download the barchart data table from https://www.barchart.com/investing-ideas/ai-stocks
using Excel VBA in similar manner as the python script in Automatic file downloading on Barchart....
-1
votes
1
answer
67
views
Selenium script marks all search results as “not found” because details load only after clicking a link [closed]
I’m using Python + Selenium + ChromeDriver to check a list of titles (from a CSV file) against an online library catalog.
My script searches each title and tries to determine if a specific library has ...
3
votes
2
answers
155
views
Importing a table from a webpage as a dataframe in Python
I am trying to read in a specific table from the US Customs and Border Protection's Dashboard on Southwest Land Border Encounters as a dataframe.
The url is: https://www.cbp.gov/newsroom/stats/...
-2
votes
1
answer
117
views
Webscrape links to download files based on word in page HTML
I am webscraping WHO pages using the following code:
pacman::p_load(rvest,
httr,
stringr,
purrr)
download_first_pdf_from_handle <- function(handle_id) {
...
0
votes
0
answers
125
views
Unable to scrape product price from Shein ShareJump links in Laravel/Python
I’m working on a project in Laravel/Python where I want to fetch product information from Shein, but I’ve run into a major problem with ShareJump links.
Here’s an example link I’m working with:
http://...
1
vote
1
answer
126
views
Scraping archived content [closed]
I am a bit new to webscraping and trying to build a scraper to collect the title, text, and date from this archived page:
from selenium import webdriver
from selenium.webdriver.chrome.service import ...
1
vote
2
answers
88
views
Cannot click <a> element button with href="javascript:void(0)" with selenium
I'm using Selenium in Python and trying to click the "See all Properties" button to get to the next web page where all the properties will be listed and I can easily scrape the data.
Here's ...
0
votes
0
answers
281
views
Scraping Instagram Likes at Bulk
My goal is to find out if a given user has liked any post of another profile.
So the following question has to be answered: has user X liked any post on profile Y in the past 24 months?
For ...
-1
votes
2
answers
104
views
Selenium interaction with accordion list [closed]
I'm trying to scrape the data off this site.
The website shows charging stations; you can click each one to unfold the accordion and see the data per charger. I am trying to use this ...
3
votes
1
answer
157
views
How to clean inconsistent address strings in Python?
I'm working on a web scraping project in Python to collect data from a real estate website. I'm running into an issue with the addresses, as they are not always consistent.
I've already handled simple ...
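A minimal sketch of one normalization approach, assuming a small abbreviation table that would need extending to cover whatever variants the real site actually produces:

```python
import re

# Illustrative substitution table; the patterns and expansions are assumptions
ABBREV = {
    r"\bst\b\.?": "street",
    r"\bave\b\.?": "avenue",
    r"\brd\b\.?": "road",
}

def normalize(address):
    """Lowercase, collapse whitespace and separators, expand abbreviations."""
    addr = address.lower().strip()
    addr = re.sub(r"\s+", " ", addr)     # collapse runs of whitespace
    addr = re.sub(r"[,;]+", ",", addr)   # squash repeated separators
    for pattern, full in ABBREV.items():
        addr = re.sub(pattern, full, addr)
    return addr

print(normalize("123  Main St.,, Springfield"))  # 123 main street, springfield
```

Normalizing before comparison lets differently formatted strings for the same address collapse to one canonical form, which is usually the goal in deduplication.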
1
vote
0
answers
68
views
Make.com Text parser: Attributes.href is empty — how to filter <a> links by href (relative + absolute) before aggregating?
Body:
I’m building a Make.com scenario like this:
HTTP (fetch website HTML)
→ Text parser (extract elements)
→ Filter "only good links"
→ Array aggregator
→ further processing
Goal
I want ...
-1
votes
3
answers
245
views
Unable to scrape 2nd table from Fbref.com for players table
I would like to scrape the 2nd table in the page seen below from the link - https://fbref.com/en/comps/9/2023-2024/stats/2023-2024-Premier-League-Stats on google collab. But pd.read_html only gives me ...
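On Fbref, the secondary tables are typically wrapped in HTML comments, which is why pd.read_html on the raw page only sees the first one. A sketch of digging tables out of comments, shown on a minimal stand-in page rather than the live site:

```python
from bs4 import BeautifulSoup, Comment

# Minimal stand-in for an Fbref page: the second table sits inside a comment
html = """
<table id="first"><tr><td>visible</td></tr></table>
<div>
<!--
<table id="stats_standard"><tr><td>hidden player row</td></tr></table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
tables = []
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    # re-parse the comment's contents as HTML and collect any tables inside
    inner = BeautifulSoup(comment, "html.parser")
    tables.extend(inner.find_all("table"))

print([t["id"] for t in tables])  # ['stats_standard']
```

Each extracted table can then be handed to `pd.read_html(str(table))` to get a DataFrame.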
-1
votes
1
answer
113
views
How Do I Use Proxies with Puppeteer and a Local Chrome Instance?
I'm using Puppeteer and JS to write a web scraper. The site I'm scraping is pretty intense, so I need to use a local chrome instance and a residential proxy service to get it working. Here's my basic ...
2
votes
2
answers
189
views
Extracting html table and turn into tibble or data.frame in R
Using the following code:
library(rvest)
read_html("https://gainblers.com/mx/quinielas/progol-revancha/", encoding = "UTF-8")|>
html_elements(xpath= '//*[@id="...
1
vote
2
answers
266
views
Default text reappears after being overwritten with intended text using Selenium Python
I am trying to extract bus prices between 2 cities in Ontario, Canada. I am using Selenium/Python to do this:
The website is here and it has default cities and dates.
Here is my Python code:
from ...
1
vote
2
answers
111
views
Selenium select from dropdown menu
I'm a bit new to Selenium and am trying to build a webscraper that can select a dropdown menu and then select specific options from the menu. I've built the following code and it was working at one ...
3
votes
1
answer
61
views
Beautiful Soup; splitting a paragraph only by <br> where stripped_strings is not working
I'm rather new to using Beautiful Soup and I'm having some issues splitting some HTML correctly by only looking at HTML breaks and ignoring other HTML elements such as changes in font color, etc.
The ...
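One approach is to walk the paragraph's direct contents and start a new run at each <br>, keeping the text of every other inline tag. A sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph: inline formatting must be kept, only <br> splits
html = '<p>line one <font color="red">red bit</font><br>line two<br>line <b>three</b></p>'

def split_on_br(tag):
    """Group a tag's contents into text runs separated by <br> elements."""
    runs, current = [], []
    for node in tag.contents:
        if isinstance(node, str):        # plain text node
            current.append(str(node))
        elif node.name == "br":          # split point: close the current run
            runs.append("".join(current).strip())
            current = []
        else:                            # inline tag (font, b, ...): keep its text
            current.append(node.get_text())
    if current:
        runs.append("".join(current).strip())
    return runs

soup = BeautifulSoup(html, "html.parser")
print(split_on_br(soup.p))  # ['line one red bit', 'line two', 'line three']
```

Unlike `stripped_strings`, this only breaks at `<br>`, so text interrupted by inline styling stays in one piece.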
1
vote
1
answer
234
views
Trouble scraping dynamic lottery results table – inconsistent parsing
I’ve been trying to scrape lottery results from a website that shows draws. The data is presented in a results table, but I keep running into strange issues where sometimes the numbers are captured ...
2
votes
1
answer
201
views
Extract tables from website with dynamic content with R
I'm trying to extract tables from this site:
https://www.dnb.com/business-directory/company-information.beverage_manufacturing.br.html
As you can see, the complete table has 14,387 rows and each page ...
0
votes
0
answers
64
views
Disable assignment of window.location in Selenium
I'm trying to extract data from a website using Selenium. On random occasions, the page will do a client-side redirect with window.location. How can I disable this?
I've tried redefining the property ...
1
vote
1
answer
260
views
Firecrawl self-hosted crawler throws Connection violated security rules error
I set up a self-hosted Firecrawl instance and I want to crawl my internal intranet site (e.g. https://intranet.xxx.gov.tr/).
I can access the site directly both from the host machine and from inside ...
0
votes
1
answer
112
views
Python Selenium find nested element [closed]
on this page I want to parse few elements.
I would like to get the text in the circles and sometimes use an attribute value to click.
That code doesn't return anything. With this code I want to get all attribute ...
2
votes
1
answer
118
views
How to disable selenium logs AND run the browser in headless mode
This is my code as of now:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
options = webdriver....
0
votes
1
answer
216
views
Pytube consistently fails with HTTP Error 400: Bad Request also on latest version
I am trying to use pytube (v15.0.0) to fetch the titles of YouTube videos. However, for every video I try, my script fails with the same error: HTTP Error 400: Bad Request.
I have already updated ...
0
votes
0
answers
205
views
m3u8 HLS url Video Not Playing with hls.js and Art Player
I have a Node scraper which scrapes the HLS streaming URL using a Playwright browser, which gives the master playlist like:
https://example.com/master.m3u8
Then that master playlist does have a CORS ...
1
vote
2
answers
294
views
How to download protected PDF (ViewDocument) using Selenium or requests?
I'm trying to download a protected PDF from the New York State Courts NYSCEF website using Python. The URL looks like this:
https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=...
4
votes
2
answers
285
views
How to reliably download 1969 “Gazzetta Ufficiale” PDFs (Italian Official Gazette) with Python?
I’m trying to programmatically download the full “pubblicazione completa non certificata” PDFs of the Italian Gazzetta Ufficiale – Serie Generale for 1969 (for an academic article). The site has a ...
-2
votes
2
answers
153
views
R Web Scraping - Data is Incomplete (Yahoo Finance)
I am using the following code. It successfully targets the correct url and node text. However, the data that is returned is incomplete as some of the fields (like previous close and open) are blank or ...
0
votes
0
answers
69
views
Using ScrapingRobot API, how can I get google search results as structured JSON data?
How can I use ScrapingRobot’s API to scrape Google search results as structured JSON data (e.g., titles, URLs, snippets) instead of raw HTML?
The main page of the website shows three types of "...
0
votes
2
answers
171
views
Extracting The SGF Data From This Webpage
I would like to scrape the problems from these Go (board game) books, and convert them into SGFs, if they aren't in that format already. For now, I would be satisfied with only taking the problems ...
3
votes
2
answers
382
views
Can't scrape all the titles from the map on the webpage using the requests module
I'm trying to create a script in Python to scrape all available titles that show up when clicking on the black-colored area on the map in this website.
For example, when I click on a certain area on ...
1
vote
0
answers
47
views
Pyppeteer returns None or empty content when scraping Digikala product page
I'm trying to scrape a product page from Digikala using Pyppeteer because the site is heavily JavaScript-rendered.
Here is my render class:
import asyncio
from pyppeteer import launch
from pyppeteer....