
I wish to extract all forms from a given website using Python3 and BeautifulSoup.

Here is an example that does this, but fails to pick up all forms:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.qantas.com/au/en.html'
data = urlopen(url)
parser = BeautifulSoup(data, 'html.parser')
forms = parser.find_all('form')
for form in forms:
    print(form)
    print('\n\n')

If you run the code and visit the URL, you will notice that the "Book a trip" form is not scraped by the parser.

The above code only picks up three forms, whereas Chrome's Developer Tools > Elements panel shows 13 <form> elements. However, if I view the page source (Ctrl+U in Chrome), the source only contains the three forms that BeautifulSoup scraped.

How can I scrape all forms?

    Not sure what is going on here, but if you go to View Source for the page, it shows only three forms there, which is exactly what you're getting. Could it be that the other forms are generated from a server request after the page is loaded? Commented Mar 27, 2017 at 0:52

2 Answers


It seems that the web page uses JavaScript to load its content. Try viewing the page in your browser with JavaScript disabled.

Check if your form is still there. If not, check whether there is an XHR request in the browser's developer tools that fetches the form. If there is not, you should consider using Selenium with the PhantomJS headless browser, or abandon scraping this site.

The headless browser will allow you to get the content of the dynamically created web page and feed that content to BeautifulSoup.
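As a rough sketch of that flow (an illustration only, assuming a recent Selenium and a headless Chrome driver on the PATH; the answer below does the same thing with PhantomJS), the rendered page source can be handed to BeautifulSoup like this:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://www.qantas.com/au/en.html'

# run the browser without a visible window; JavaScript still executes
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    driver.get(url)
    # page_source holds the DOM after JavaScript has run,
    # unlike the raw HTML returned by urlopen()
    rendered_html = driver.page_source
finally:
    driver.quit()

parser = BeautifulSoup(rendered_html, 'html.parser')
print(len(parser.find_all('form')))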


With the help of PhantomJS (http://phantomjs.org/download.html) and Selenium you can do this.

Steps:

1. On a terminal or cmd, run: pip install selenium
2. Download PhantomJS and unzip it, then put "phantomjs.exe" on the Python path, for example C:\Python27 on Windows

Then use this code; it will give you the desired result:

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.qantas.com/au/en.html'

# PhantomJS runs the page's JavaScript before we read the HTML
driver = webdriver.PhantomJS()
driver.get(url)

# page_source is the DOM after JavaScript has executed
data = driver.page_source
parser = BeautifulSoup(data, 'html.parser')

forms = parser.find_all('form')
for form in forms:
    print(form)
    print('\n\n')

driver.quit()

It will print all 13 forms.

Note: due to the word limit, I am not able to include the output in the answer.
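If printing the whole tags makes the output too long, a compact summary of each form can be printed instead; this is only a sketch built on standard BeautifulSoup accessors (get() and find_all()), reusing the forms list from the code above:

for i, form in enumerate(forms, start=1):
    # action/method may be absent on some forms, in which case get() returns None
    action = form.get('action')
    method = form.get('method')
    # names of the input fields contained in this form
    inputs = [inp.get('name') for inp in form.find_all('input')]
    print('Form {}: action={} method={} inputs={}'.format(i, action, method, inputs))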
