I am trying to log in to my school's intranet so I can scrape the table listing the homework I have to complete. I searched the web for solutions but couldn't find any. I will not provide the login credentials for obvious reasons, but I will provide the HTML data. Any help is appreciated, thanks.

My code so far:

import requests

while True:
    Post_Login_URL = 'http://parents.netherhall.org/'
    Request_URL = 'https://parents.netherhall.org/parents/students/?admissionno=011161&page=homework'
    username = input('What is your username? ')
    password = input('What is your password? ')
    payload = {
        'username': username,
        'password': password
    }
    with requests.Session() as session:
        post = session.post(Post_Login_URL, data=payload)
        r = session.get(Request_URL)
        print(r.text)

The response I get:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML dir=ltr><HEAD><TITLE>The page cannot be displayed</TITLE>
<STYLE id=L_defaultr_1>A:link {
    FONT: 8pt/11pt verdana; COLOR: #ff0000
}
A:visited {
    FONT: 8pt/11pt verdana; COLOR: #4e4e4e
}
</STYLE>

<META content=NOINDEX name=ROBOTS>
<META http-equiv=Content-Type content="text-html; charset=UTF-8">

<META content="MSHTML 5.50.4522.1800" name=GENERATOR></HEAD>
<BODY bgColor=#ffffff>
<TABLE cellSpacing=5 cellPadding=3 width=410>
  <TBODY>
  <TR>
    <TD id=L_defaultr_0 valign=middle align=left width=360>
      <H1 id=L_defaultr_2 style="FONT: 13pt/15pt verdana; COLOR: #000000"><ID id=L_defaultr_3><!--Problem-->The page cannot be displayed
</ID></H1></TD></TR>
  <TR>
    <TD width=400 colSpan=2><FONT id=L_defaultr_4
      style="FONT: 8pt/11pt verdana; COLOR: #000000"><ID id=L_defaultr_5><B>Explanation: </B>There is a problem with the page you are trying to reach and it cannot be displayed.</ID></FONT></TD></TR>
  <TR>
    <TD width=400 colSpan=2><FONT id=L_defaultr_6 
      style="FONT: 8pt/11pt verdana; COLOR: #000000">
      <HR color=#c0c0c0 noShade>

      <P id=L_defaultr_7><B>Try the following:</B></P>
      <UL>
        <LI id=L_defaultr_8><B>Refresh page:</B> Search for the page again by clicking the Refresh button. The timeout may have occurred due to Internet congestion.
<LI id=L_defaultr_9><B>Check spelling:</B> Check that you typed the Web page address correctly. The address may have been mistyped.
<LI id=L_defaultr_10><B>Access from a link:</B> If there is a link to the page you are looking for, try accessing the page from that link.

      </UL>
      <HR color=#c0c0c0 noShade>

      <P id=L_defaultr_11>Technical Information (for support personnel)</P>
      <UL>
        <LI id=L_defaultr_12>Error Code: 401 Unauthorized. The server requires authorization to fulfill the request. Access to the Web server is denied. Contact the server administrator. (12209)

        </UL></FONT></TD></TR></TBODY></TABLE></BODY></HTML>
  • How could I do that? Also, after I log in, the browser seems to auto-suggest this URL: https://parents.netherhall.org/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=39 but when I tried it, it didn't work. Commented Sep 5, 2018 at 16:56
  • You should start by looking at the login page to see how it works. For example, are you able to log in with a simple browser that has JavaScript disabled? If the page requires JavaScript in order to log in, you will need a more sophisticated approach than what you've shown. Even if JS is not required, the login page seems to mention other keys that you don't set in your code, such as flags and forcedownlevel. If you don't send all the expected keys, the server may reject your request because it looks like your user agent is not filling out the form properly. Commented Sep 5, 2018 at 19:12
  • @TudorPopescu I don't know if you must use requests, but selenium is very easy to use for logging into websites because it automates a browser. You can always run it headless as well. Commented Sep 5, 2018 at 20:12
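
The comment above about extra keys such as flags and forcedownlevel can be sketched as follows: read every input field from the login page and start the payload from those values, so nothing the form expects is left out. This is only a sketch using the standard library's html.parser. The sample HTML, the field values, and the helper names (LoginFormParser, build_payload) are assumptions made for illustration; the real form may look different.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the name/value pairs of every <input> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        name = attributes.get('name')
        if name:
            # Inputs without a value attribute default to an empty string.
            self.fields[name] = attributes.get('value') or ''


def build_payload(login_html, username, password):
    """Start from the form's own fields, then fill in the credentials."""
    parser = LoginFormParser()
    parser.feed(login_html)
    payload = dict(parser.fields)
    payload['username'] = username
    payload['password'] = password
    return payload


# Hypothetical login page; the real form may use different names and values.
SAMPLE_LOGIN_HTML = """
<form method="post">
  <input type="hidden" name="flags" value="0">
  <input type="hidden" name="forcedownlevel" value="0">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

payload = build_payload(SAMPLE_LOGIN_HTML, 'alice', 'secret')
print(payload)
```

In practice the HTML would come from a GET of the login page made with the same session that later sends the POST, so any cookies set along the way are kept.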

1 Answer

You have to set the request headers because, by default, requests sends a User-Agent like 'python-requests/<version>', which a server can tell apart from a real browser.

To do it, open your browser's developer tools: in Chrome, press F12 (or Ctrl+Shift+I); in Firefox, press Ctrl+Shift+E. Then go to the Network tab. Now log in to the website, and on the left (or below) a row will appear for each request made to parents.netherhall.org. Click on the login request and copy its headers.

Then implement them like so:

from requests import Session

# Headers copied from the browser's login request.
headers = {
    'header_name': 'header_value',  # and so on
}

Post_Login_URL = 'http://parents.netherhall.org/'
Request_URL = 'https://parents.netherhall.org/parents/students/?admissionno=011161&page=homework'
username = input('What is your username? ')
password = input('What is your password? ')
payload = {
    'username': username,
    'password': password
}
with Session() as session:
    # Log in; the session keeps any cookies the server sets.
    post = session.post(Post_Login_URL, headers=headers, data=payload)
    # .ok only means the status code is below 400.
    print('Logged in successfully:', post.ok)
    # Fetch the homework page with the same session.
    r = session.get(Request_URL, headers=headers)
    print(r.text)  # Page source.
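
The headers copied from the Network tab usually arrive as raw "Name: value" lines rather than as a Python dict. A small helper can convert them into the headers dict used above. This is a minimal sketch: the function name parse_raw_headers and the sample text are assumptions for illustration, and dropping Content-Length and Cookie is a design choice (requests computes the length itself, and the session manages its own cookies), not something the server is known to require.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(':'):
            continue
        name, sep, value = line.partition(':')
        if not sep:
            continue
        # Let requests compute these itself; stale values can break the POST.
        if name.strip().lower() in ('content-length', 'cookie'):
            continue
        headers[name.strip()] = value.strip()
    return headers


# Hypothetical text pasted from the Network tab.
RAW = """
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0
Accept: text/html,application/xhtml+xml
Referer: https://parents.netherhall.org/
Content-Length: 42
"""

headers = parse_raw_headers(RAW)
print(headers)
```

The resulting dict can be passed as headers=headers to session.post and session.get.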
6 Comments

Hi, when I logged in, around 20 files showed up that are not images. How do I go about copying the data from the headers?
Do you mean when you logged in using the browser? If so, those are not files; they are the requests you made. You should see something like: Status: 200 | Method: POST | File: CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=42 | Domain: parents.netherhall.org | Origin: document | Type: html. Click on it and copy the headers on the left. If you don't find it, send me an email with all the requests.
Alternatively, you could try changing only the User-Agent header, but I don't know if it would work. Initialize the headers dict like so: headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/61.0'}
I looked into the subject and found that selenium is much easier to use. I did get round to trying your method (thanks :)) and it did work, but I found selenium more flexible because I can navigate around the website more easily.
Yes, selenium is very useful for some web-testing tasks; however, requests is much faster because it doesn't drive a browser, it just sends requests directly. If you need help with the selenium module, just tell me :)