
I wish to extract all forms from a given website using Python3 and BeautifulSoup.

Here is an example that does this, but fails to pick up all forms:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.qantas.com/au/en.html'
data = urlopen(url)
parser = BeautifulSoup(data, 'html.parser')
forms = parser.find_all('form')
for form in forms:
    print(form)
    print('\n\n')

If you run the code and visit the URL, you will notice that the "Book a trip" form is not scraped by the parser.

The above code only picks up three forms, whereas Chrome's Developer Tools > Elements panel shows 13 <form> elements. However, if I view the page source (Ctrl+U in Chrome), the source only contains the three forms that BeautifulSoup scraped.

How can I scrape all forms?

    Not sure what is going on here, but if you go to View Source for the page, it shows only three forms there, which is exactly what you're getting. Could it be that the other forms are generated from a server request after the page is loaded? Commented Mar 27, 2017 at 0:52

2 Answers


It seems that the web page uses JavaScript to load its content. Try viewing the page in your browser with JavaScript disabled.

Check if your form is still there. If not, check whether there is an XHR request in the browser's developer tools that fetches the form. If there is not, you should consider using Selenium with the PhantomJS headless browser, or abandon scraping this site.

The headless browser will allow you to get the content of the dynamically created web page and feed that content to BeautifulSoup.
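As a rough sketch of that flow (an illustration only, assuming a recent Selenium and a headless Chrome driver on the PATH; the answer below does the same thing with PhantomJS), the rendered page source can be handed to BeautifulSoup like this:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.qantas.com/au/en.html'

# run the browser without a visible window; JavaScript still executes
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    driver.get(url)
    # page_source holds the DOM after JavaScript has run,
    # unlike the raw HTML returned by urlopen()
    rendered_html = driver.page_source
finally:
    driver.quit()

parser = BeautifulSoup(rendered_html, 'html.parser')
print(len(parser.find_all('form')))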


With the help of PhantomJS (http://phantomjs.org/download.html) and Selenium you can do this.

Steps:

1. On a terminal or cmd, run: pip install selenium
2. Download PhantomJS and unzip it, then put "phantomjs.exe" on the Python path, for example C:\Python27 on Windows

Then use this code; it will give you the desired result:

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.qantas.com/au/en.html'

# PhantomJS runs the page's JavaScript before we read the HTML
driver = webdriver.PhantomJS()
driver.get(url)

# page_source is the DOM after JavaScript has executed
data = driver.page_source
parser = BeautifulSoup(data, 'html.parser')

forms = parser.find_all('form')
for form in forms:
    print(form)
    print('\n\n')

driver.quit()

It will print all 13 forms.

Note: due to the word limit, I am not able to include the output in the answer.
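If printing the whole tags makes the output too long, a compact summary of each form can be printed instead; this is only a sketch built on standard BeautifulSoup accessors (get() and find_all()), reusing the forms list from the code above:

for i, form in enumerate(forms, start=1):
    # action/method may be absent on some forms, in which case get() returns None
    action = form.get('action')
    method = form.get('method')
    # names of the input fields contained in this form
    inputs = [inp.get('name') for inp in form.find_all('input')]
    print('Form {}: action={} method={} inputs={}'.format(i, action, method, inputs))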
