
When I try to render a page using requests_html, the server responds with Access Denied. When I send the same request with plain requests, I get the HTML.

Why do I get access denied?

Code

from requests_html import HTMLSession
s = HTMLSession()

base_url = 'https://secure.louisvuitton.com/eng-gb/checkout/review'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'TE': 'Trailers',
}

r = s.get(base_url, headers=headers)
print(r)


r.html.render()
print(r.html.text)

Terminal

<Response [200]>
Access Denied
Access Denied
You don't have permission to access "http://secure.louisvuitton.com/eng-gb/checkout/review" on this server.
Reference #18.6fce7a5c.1597604631.1e8bfd7

1 Answer

It looks like this site doesn't like headless browsers and detects them from the User-Agent header. In my case, the header sent during rendering was:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/60.0.3112.113 Safari/537.36
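The only part of that string that gives the browser away is the HeadlessChrome token. As a minimal sketch, a regular-looking UA can be derived by replacing that token; strip_headless is a hypothetical helper name, and whether the site checks this token alone is an assumption.

```python
def strip_headless(user_agent):
    """Return the UA string with the HeadlessChrome token replaced by Chrome."""
    return user_agent.replace('HeadlessChrome', 'Chrome')


# The UA reported in the answer above.
headless_ua = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) HeadlessChrome/60.0.3112.113 Safari/537.36'
)

print(strip_headless(headless_ua))
```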

The requests_html module uses Pyppeteer under the hood to render JavaScript. Pyppeteer has an option to set the UA for a page, but I don't see a convenient way to override anything in requests_html to make this change: the page is created inside the _async_render function (a coroutine, to be precise).

You can try using Pyppeteer directly and then parse the resulting HTML with requests_html:

import asyncio
import traceback

from pyppeteer import launch
from requests_html import HTML

URL = 'https://secure.louisvuitton.com/eng-gb/checkout/review'
UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'


async def fetch(url, browser):
    page = await browser.newPage()
    await page.setUserAgent(UA)

    try:
        await page.goto(url, {'waitUntil': 'load'})
    except Exception:
        traceback.print_exc()
    else:
        return await page.content()
    finally:
        await page.close()


async def main():
    browser = await launch(headless=True, args=['--no-sandbox'])

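    try:
        doc = await fetch(URL, browser)
    finally:
        # Close the browser even if fetching fails.
        await browser.close()

    # fetch() returns None when navigation raised an exception.
    if doc is None:
        return

    html = HTML(html=doc)
    print(html.links)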


if __name__ == '__main__':
    asyncio.run(main())

3 Comments

Amazing, I had thought the UA would be the issue here, thank you! I have a whole script written using HTMLSession() requests, and I want to use _async_render for the last request in the project only. Do I need to rewrite the entire script around _async_render to utilise this for the last request?
How would I set cookies for the requests?
You can use Page.setCookie to set cookies.
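A rough sketch of how that could look, assuming the cookies come from an existing session's cookie jar. The helper names to_pyppeteer_cookies and fetch_with_cookies are hypothetical; setUserAgent, setCookie, goto and content are the Pyppeteer page methods already mentioned or used above. The browser argument is expected to be the object returned by pyppeteer's launch().

```python
import asyncio


def to_pyppeteer_cookies(jar):
    """Convert cookie objects (with name/value/domain/path attributes)
    into the dicts that Page.setCookie accepts."""
    return [
        {'name': c.name, 'value': c.value, 'domain': c.domain, 'path': c.path}
        for c in jar
    ]


async def fetch_with_cookies(browser, url, jar, user_agent):
    """Open a page, apply the UA and cookies, then return the rendered HTML."""
    page = await browser.newPage()
    try:
        await page.setUserAgent(user_agent)
        cookies = to_pyppeteer_cookies(jar)
        if cookies:
            # Cookies are set before navigation so the first request carries them.
            await page.setCookie(*cookies)
        await page.goto(url, {'waitUntil': 'load'})
        return await page.content()
    finally:
        await page.close()
```

The jar argument can be the cookies attribute of the HTMLSession used for the earlier requests (iterating it yields cookie objects with these attributes), so the rest of the script can stay as it is and only the final request needs to go through Pyppeteer.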
