Reading multiple urls does not work in Python

Question

I want to scrape a few URLs. This is what I do:

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

url_2021_int = ["https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html","https://www.ecb.europa.eu/press/inter/date/2020/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2019/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2018/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2017/html/index_include.en.html"]

for url in url_2021_int:
    req_int = requests.get(url)
    
soup_int = BeautifulSoup(req_int.text)
titles_int = soup_int.select(".title a")
titles_int=[data.text for data in titles_int]

However, I get data only for the last url (2017).

What am I doing wrong?

Thanks!

req_int in req_int = requests.get(url) is overwritten on each iteration of the loop.
– Wiktor Stribiżew, Commented Apr 13, 2021 at 14:56
Why not process everything inside the loop you already have? Or just create a list: req_ints = [requests.get(url) for url in url_2021_int]
– Wiktor Stribiżew, Commented Apr 13, 2021 at 15:21

Wiktor Stribiżew · Accepted Answer · 2021-04-13 15:23:29Z

When you use req_int = requests.get(url) in the loop, the req_int variable is overwritten on each iteration, so only the response for the last URL is left once the loop ends.

If you want to store the requests.get(url) results in a list variable, you can use:

req_ints = [requests.get(url) for url in url_2021_int]
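The difference between the two patterns can be seen without touching the network. The following is a minimal sketch in which `fake_get` is a hypothetical stand-in for `requests.get` and the short URL strings are placeholders, not the real ECB addresses:

```python
def fake_get(url):
    # Hypothetical stand-in for requests.get: returns a string, not a Response.
    return f"response for {url}"


urls = ["page/2021", "page/2020", "page/2019"]  # placeholder URLs

# Pattern from the question: the same name is rebound on every pass,
# so after the loop it refers only to the result for the last URL.
for url in urls:
    last_only = fake_get(url)

# List comprehension: one result per URL is kept.
all_results = [fake_get(url) for url in urls]

print(last_only)
print(len(all_results))
```

With the real `requests.get`, each element of `req_ints` is a `Response` object, so `BeautifulSoup` has to be given the `.text` of each element in turn rather than the list itself.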

However, it seems logical to process the data in the same loop:

titles_int = []
for url in url_2021_int:
    req_int = requests.get(url)
    soup_int = BeautifulSoup(req_int.text, "html.parser")
    # extend the running list so titles from earlier URLs are not overwritten
    titles_int.extend(data.text for data in soup_int.select(".title a"))

Note that "html.parser" is passed as the second argument to the BeautifulSoup call. The documents being parsed are HTML, and naming the parser explicitly avoids leaving the choice to whichever parser BeautifulSoup happens to find installed.
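If the titles should stay grouped by the page they came from, one option is a dictionary keyed by URL. The sketch below is only one way to arrange this: the function names `fetch_html`, `extract_titles` and `titles_by_url`, the 10-second timeout, and the `raise_for_status()` check are all assumptions added for illustration, not part of the original code. The `.title a` selector is the one from the question.

```python
def fetch_html(url, timeout=10):
    # Imported here so the collecting logic below can be tried
    # without requests installed (an illustrative choice).
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text


def extract_titles(html):
    # Imported here for the same reason as above.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [link.get_text(strip=True) for link in soup.select(".title a")]


def titles_by_url(urls, fetch=fetch_html, extract=extract_titles):
    """Return a dict mapping each URL to the list of titles found on that page."""
    return {url: extract(fetch(url)) for url in urls}
```

Calling `titles_by_url(url_2021_int)` would then give one list of titles per year page. The `fetch` and `extract` parameters exist so the function can be tried with saved HTML or simple stand-ins before pointing it at the live site.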

2 Comments

Rollo99 Over a year ago

Thank you for your answer. The first snippet works, but an error pops up when I then try to apply BeautifulSoup(), while with the second one titles_int is empty.

Wiktor Stribiżew Over a year ago

@Rollo99 I helped to fix the immediate error. The rest is up to you to debug and fix. I just ran the code as is, and it seems to output values. There are quite a lot of items in each titles_int.
