
I tried to run the code below.

However, I got the following message.

Did I miss some parameters?

What is the correct approach to getting Google search results with requests?

Thank you very much.

This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service (www.google.com/policies/terms/). The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.

This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help; a different computer using the same IP address may be responsible. Learn more: support.google.com/websearch/answer/86640

Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.

import requests
from bs4 import BeautifulSoup

headers_Get = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }


def google(q):
    s = requests.Session()
    q = '+'.join(q.split())              # join the search terms with '+'
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    r = s.get(url, headers=headers_Get)  # send the request with the headers above
    return r.text

result = google('"apple"')
  • Google serves that page to you precisely because they don't want you doing exactly what you're doing. (And, depending on where you live, intentionally circumventing their Terms of Service might carry legal consequences.) You might be able to change something about the requests you send to fool Google temporarily, but don't count on fooling them forever. You don't really want to get into an arms race with Google. Commented Feb 9, 2019 at 2:32
  • Google bans bots/crawlers for utilizing their search to prevent people from building alternative search engines using their resources. Their detection capabilities are very advanced, involving not only checking request headers, but also using sophisticated Javascript techniques that detect mouse/keyboard interactions, network traffic, and a variety of other things. You're unlikely to defeat Google's bot detection systems. ... But in general, to defeat bot detection on most sites, you can simply pass request headers (requests lets you do this) resembling those of a common browser like Firefox. Commented Feb 9, 2019 at 2:33
  • Possible duplicate of google search with python requests library Commented Feb 9, 2019 at 2:38
  • @J.Taylor: He's already sending a User-Agent header claiming to be Firefox. Commented Feb 9, 2019 at 2:41
  • This answer describes how you can create a custom search and API key to search the entire web stackoverflow.com/questions/37083058/… (a sketch of that approach follows these comments). Commented Feb 9, 2019 at 2:45
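
For reference, the Custom Search approach from that last comment might look roughly like the sketch below. It queries Google's Custom Search JSON API, which returns results as JSON, so there is no HTML to scrape and no bot detection to fight. This is a minimal sketch: it assumes you have already created a Programmable Search Engine and an API key, and the GOOGLE_API_KEY / GOOGLE_CSE_ID environment variable names are placeholders, not standard names.

import os
import requests

def custom_search(query):
    params = {
        'key': os.getenv('GOOGLE_API_KEY'),  # API key (placeholder env var name)
        'cx': os.getenv('GOOGLE_CSE_ID'),    # search engine ID (placeholder env var name)
        'q': query,
    }
    # the official JSON endpoint; no scraping or CAPTCHA involved
    response = requests.get('https://www.googleapis.com/customsearch/v1', params=params)
    response.raise_for_status()
    return response.json()

for item in custom_search('"apple"').get('items', []):
    print(item['title'], item['link'])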

2 Answers


I was using this for Google and it worked:

import requests
from urllib.request import Request, urlopen
import urllib
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}

def google(q):
    q = '+'.join(q.split())                    # join the search terms with '+'
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    request = Request(url, headers=headers)    # attach the browser-like headers
    page = urlopen(request)                    # fetch the results page
    soup = BeautifulSoup(page, 'html.parser')  # parse the HTML
    return str(soup)                           # return the page HTML
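
For example, a hypothetical call, printing the start of the returned HTML:

result = google('"apple"')
print(result[:500])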

1 Comment

Why do you import two libraries for making an HTTP request? Use either requests or the built-in urllib library.

It might be because your user-agent is somewhat "wrong". Check what your user-agent is; changing it to that of a real, up-to-date browser could help you get the full HTML output.

Also, you do not really need to create a Session() unless you want to persist certain parameters across requests or make several requests to the same host with the same parameters (a short sketch of that use case follows the code below).

import requests 
from bs4 import BeautifulSoup

headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
}

def google(q):
    # passing the query via params lets requests handle the URL encoding
    response = requests.get('https://www.google.com/search', params={'q': q}, headers=headers)
    return response.text

result = google('"apple"')
print(result)
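
To illustrate the Session() point above, here is a minimal sketch of when a session does help: headers set on it are applied to every request, and the underlying TCP connection is reused across requests to the same host.

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
})

# both requests carry the session's headers and reuse the same connection
first = session.get('https://www.google.com/search', params={'q': 'apple'})
second = session.get('https://www.google.com/search', params={'q': 'banana'})
print(first.status_code, second.status_code)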

Alternatively, you can get results quickly, without worrying about any of this, by using the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.

The difference is that you only have to think about the data you want to get, rather than figuring out how to bypass blocks (or all sorts of other things) and maintaining that over time.

Code to integrate (for example, to scrape each title and link from the first page of organic results):

import os
from serpapi import GoogleSearch

def serpapi_get_google_result():
    params = {
      "engine": "google",               # search engine to search from
      "q": "tesla",                     # query
      "hl": "en",                       # language
      "gl": "us",                       # country to search from
      "api_key": os.getenv("API_KEY"),  # https://serpapi.com/dashboard
    }

    search = GoogleSearch(params)
    results = search.get_dict()
  
    for result in results["organic_results"]:
        print(result['title'])
        print(result['link'])


serpapi_get_google_result()

Disclaimer: I work for SerpApi.

