
I am trying to get all the href links from https://search.yhd.com/c0-0-1003817/ (the ones that lead to the specific products), but although my code runs, it only gets 30 links. I don't know why this is happening. Could you help me, please?

I've been working with Selenium (Python 3.7), but before that I also tried to get the links with Beautiful Soup. That didn't work either.

from selenium import webdriver 
import time
import requests
import pandas as pd

def getListingLinks(link):
    # Open the driver
    driver = webdriver.Chrome()
    driver.get(link)
    time.sleep(3)

    # Save the links
    listing_links = []
    links = driver.find_elements_by_xpath('//a[@class="img"]')
    for link in links:
        listing_links.append(str(link.get_attribute('href')))
    driver.close()
    return listing_links

imported = getListingLinks("https://search.yhd.com/c0-0-1003817/")

I should get 60 links, but I am only managing to get 30 with my code.

  • You can get the HTML page source via the page_source attribute of Selenium's driver and then use BeautifulSoup's find_all function to collect all the anchor tags. That gives you every link on the page, and you can then filter the hrefs as needed. Commented Apr 18, 2019 at 5:08

1 Answer

At initial load, the page contains only 30 images/links; only when you scroll down does it load the remaining items, for a total of 60. You need to do the following:

def getListingLinks(link):
    # Open the driver
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(link)
    time.sleep(3)
    # scroll down: repeated to ensure it reaches the bottom and all items are loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

    # Save the links
    listing_links = []
    links = driver.find_elements_by_xpath('//a[@class="img"]')
    for link in links:
        listing_links.append(str(link.get_attribute('href')))
    driver.close()
    return listing_links

imported = getListingLinks("https://search.yhd.com/c0-0-1003817/")

print(len(imported))  ## Output:  60

2 Comments

It works perfectly, thank you! Just so I understand: why are these lines repeated? driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") time.sleep(3) I tried removing one of the repetitions and then it only gets 30 links, so I see they make the code work, but why?
It doesn't reach all the way to the bottom when it scrolls only once. You have to play with repeated scrolling and wait times to get it working on different sites. I added a comment in the code above.
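Instead of hard-coding the number of scrolls, the "scroll and wait" idea in this thread can be generalized: keep scrolling until document.body.scrollHeight stops growing. This is a sketch of that pattern, not the answerer's exact code; `scroll_to_bottom` and its `pause`/`max_rounds` parameters are my own names, and `driver` is any Selenium WebDriver:

```python
import time

def scroll_to_bottom(driver, pause=3, max_rounds=10):
    """Scroll down until the page height stops changing (or max_rounds is hit)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded, we are at the bottom
        last_height = new_height
```

You would call `scroll_to_bottom(driver)` right after `driver.get(link)` and before collecting the links; tune `pause` per site.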
