
I want to scrape Audible websites using Python and Beautiful Soup. There is some data that I cannot access unless I log into my Audible account (Audible is a subsidiary of Amazon.com). So far I have been unsuccessful. I just want to log in using Python and scrape the HTML.

I have tried various code, such as this: How to login to Amazon using BeautifulSoup. One would think that simply substituting my credentials into that code would work.

1 Answer

Unfortunately this can no longer be simply automated in Python. This is as far as I could get with Audible AU. The POST request requires a bunch of headers, most of which can be extracted, except for metadata1 (more on that at the bottom):

"""load packages"""
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlsplit, parse_qs

"""define URL where login form is located"""
site = "https://www.audible.com.au/signin"

"""initiate session"""
session = requests.Session()

"""define session headers"""
session.headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9,cs;q=0.8",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "metadata1": "",
}

"""get login page"""
resp = session.get(site)
html = resp.text

"""extract clientContext from the login page"""
query = urlsplit(resp.url).query
params = parse_qs(query)
clientContext = params["clientContext"][0]
new_login_url = "https://www.amazon.com.au/ap/signin/" + clientContext

"""get BeautifulSoup object of the html of the login page"""
soup = BeautifulSoup(html, "lxml")

"""scrape login page to get all the needed inputs required for login"""
data = {}
form = soup.find("form", {"name": "signIn"})
for field in form.find_all("input"):
    try:
        data[field["name"]] = field["value"]
    except KeyError:
        pass  # skip inputs without a name or value attribute

"""add username and password to the data for post request"""
data[u"email"] = "EMAIL"
data[u"password"] = "PASSWORD"

"""display: redirect URL, appActionToken, appAction, siteState, openid.return_to, prevRID, workflowState, create, email, password"""
print(new_login_url, data)

"""submit post request with username / password and other needed info"""
post_resp = session.post(new_login_url, data=data, allow_redirects=True)
post_soup = BeautifulSoup(post_resp.content, "lxml")

"""check the captcha"""
warning = post_soup.find("div", id="auth-warning-message-box")
if warning:
    print("Warning:", warning)
else:
    print(post_soup)

session.close()

Add your e-mail address and password in place of the EMAIL and PASSWORD placeholders in the data dict. Also log in with your browser and inspect the traffic to see what metadata1 is on your computer, then add it to the session headers. If you're lucky and do not get detected as a bot, you will get in; otherwise you will get a captcha image.
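If the bot check does fire, the captcha image lives somewhere in the returned page; here is a minimal sketch for pulling its URL out of post_soup. The src-based lookup is an assumption about how the captcha page is structured, not something confirmed above, so verify it against the actual HTML:

"""sketch: look for a captcha image in the post-login response
the selector below is an assumption - inspect the real page to confirm it"""
captcha_img = post_soup.find("img", src=lambda s: s and "captcha" in s.lower())
if captcha_img:
    print("Captcha image served at:", captcha_img["src"])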

metadata1 is a massive base64 payload consisting of data collected by your browser that uniquely identifies you and differentiates you from bots (mouse clicks, typing delays, page scripts, browser information & compatibility & extensions, Flash version, user agent, script performance, hardware - GPU, local storage, canvas size, etc.).
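Out of curiosity you can base64-decode a captured blob to get a feel for its size; a rough sketch, assuming you paste a real metadata1 value into the hypothetical placeholder below (the decoded bytes are obfuscated, so do not expect readable JSON):

import base64
import binascii

"""hypothetical placeholder - paste the blob captured from your browser"""
captured_metadata1 = "PASTE_CAPTURED_METADATA1_HERE"

try:
    decoded = base64.b64decode(captured_metadata1)
    print("decoded payload size:", len(decoded), "bytes")
except binascii.Error:
    print("not plain base64 - the blob may be wrapped or encrypted")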

1 Comment

I couldn't find the metadata in the network console. Also, I recently enabled Time-based One-Time Passwords, so I ran into that rather than a captcha. Still, your script got me closer than anything else.
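If Time-based One-Time Passwords are the blocker rather than the captcha, the code itself can be generated in Python with the pyotp library; a minimal sketch, assuming you saved the base32 secret when enabling 2FA. The OTP form's field names are not confirmed here, so you would need to scrape that form the same way as the signIn form above:

import pyotp

"""hypothetical: the base32 secret shown when you enabled TOTP on the account"""
totp = pyotp.TOTP("YOUR_BASE32_SECRET")

"""submit this like the signIn form - take the field names from the actual OTP page"""
print("current one-time code:", totp.now())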
