
I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results, despite having 100 search results loaded.

Ideally, I would like for it to scrape all search results.

Is there a reason for this?

from selenium import webdriver
import time
import re
import pandas as pd

PATH = r'C:\Program Files (x86)\chromedriver.exe'

l = list()
o = {}

target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"

driver = webdriver.Chrome(PATH)

driver.get(target_url)

email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)

time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx', index=False)
#print(emails)
driver.close()
  • Where are you specifying that the web page should load 100 results? Commented Feb 7, 2023 at 10:31
  • In the Google search results settings. Commented Feb 7, 2023 at 10:33
  • Where in your code are you specifying that? When I put that URL in it just returns the top 10. Commented Feb 7, 2023 at 10:40
  • Selenium loads its own empty browser, so your Google setting for 100 results needs to be in the code; the default is 10 results, which is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL. Commented Feb 7, 2023 at 10:46
  • Second result here: tldevtech.com/how-to-show-100-results-per-page-in-google-search Commented Feb 7, 2023 at 10:50

3 Answers


The code is working as expected: it scrapes 10 results, which is the default page size for Google Search. To collect more, use a locator method such as 'find_element' with By.XPATH or By.ID (the older 'find_element_by_xpath' was removed in Selenium 4.3) to find the Next button and click it.

Repeat this in a loop until enough results have been collected. For more details, see: Selenium locating elements.

For how to use the individual Selenium commands, a web search helps. I found a similar question that provides some references.


1 Comment

Where would I add this in my code?

Selenium loads its own empty browser, so your Google setting for 100 results needs to be in the code; the default is 10 results, which is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL.

If you need further information on query parameters to achieve this, it’s the second method described in How to Show 100 Results Per Page in Google Search.
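As a rough sketch of the query-parameter approach: the parameter usually cited for this is `num`, so this assumes Google still honors `num=100` (it may ignore it or cap the value). The helper name `with_result_count` is made up for illustration, and the shortened search URL stands in for the full one from the question.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def with_result_count(url, count=100):
    """Return url with a num=<count> query parameter added or replaced.

    Assumption: Google reads the results-per-page value from `num`.
    """
    parts = urlparse(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["num"] = [str(count)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


# Shortened version of the search URL from the question
search_url = with_result_count(
    "https://www.google.com/search?q=solicitors+wales+%27email%27+%40"
)
print(search_url)
```

The resulting `search_url` can then be passed to `driver.get(...)` in place of `target_url`. If Google does not honor the parameter, paging with the Next button is the fallback.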


Following up on Bijendra's answer, you could update the code as below:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd


# Raw string so the backslashes in the Windows path are not read as escapes
PATH = r'C:\Program Files (x86)\chromedriver.exe'

target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH))

driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
for i in range(2):
    # Collect the emails on the current results page
    html = driver.page_source
    emails.extend(re.findall(email_pattern, html))
    try:
        # "pnnext" is the ID of the Next link on the results page
        driver.find_element(By.ID, "pnnext").click()
    except NoSuchElementException:
        break  # no Next link, so this was the last page
    time.sleep(2)  # give the next page time to load

df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv', index=False)
driver.quit()

You could either change the range value passed in the for loop or entirely replace the for loop with a while loop, so instead of

for i in range(2):

You could do:

while len(emails) < 100:

After clicking the next button, make sure to wait for the next page to finish loading before extracting its emails and moving on to the following page.
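A `while len(emails) < 100:` loop never ends if fewer than 100 emails exist, so a page cap and an "is there a next page" check are worth adding. The sketch below shows one way to structure that loop without tying it to a browser. The function `collect_emails` and its two callables are hypothetical names, not part of Selenium, and the fake pages exist only to make the example runnable on its own.

```python
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")


def collect_emails(get_html, go_to_next_page, target=100, max_pages=20):
    """Gather unique emails page by page.

    Stops when `target` emails are found, when go_to_next_page()
    returns False, or after `max_pages` pages have been read.
    """
    found = []
    seen = set()
    for _ in range(max_pages):
        for email in EMAIL_PATTERN.findall(get_html()):
            if email not in seen:  # skip duplicates across pages
                seen.add(email)
                found.append(email)
        if len(found) >= target:
            break
        if not go_to_next_page():  # no further results page
            break
    return found


# Stand-in for a browser: three fake result pages
pages = ["a@x.org b@y.com", "b@y.com c@z.net", "d@w.io"]
state = {"index": 0}


def fake_html():
    return pages[state["index"]]


def fake_next():
    if state["index"] + 1 >= len(pages):
        return False
    state["index"] += 1
    return True


print(collect_emails(fake_html, fake_next, target=3))
```

With Selenium, `get_html` could be `lambda: driver.page_source`, and `go_to_next_page` a small function that clicks the `pnnext` element, waits for the new page, and returns False when the element is not found.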

Refer to the Selenium documentation for the details of each call used here.
