I want to scrape a few URLs. This is what I do:

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

url_2021_int = ["https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html","https://www.ecb.europa.eu/press/inter/date/2020/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2019/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2018/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2017/html/index_include.en.html"]

for url in url_2021_int:
    req_int = requests.get(url)
    
soup_int = BeautifulSoup(req_int.text)
titles_int = soup_int.select(".title a")
titles_int=[data.text for data in titles_int]


However, I only get data for the last URL (2017).

What am I doing wrong?

Thanks!

4 Comments
  • req_int in req_int = requests.get(url) is overwritten on each iteration of the loop. Commented Apr 13, 2021 at 14:56
  • You missed the indentation on the last three lines. Commented Apr 13, 2021 at 14:57
  • @WiktorStribiżew how do you store the output then? Commented Apr 13, 2021 at 15:16
  • Why not process everything in the loop you already have? Or just create a list: req_ints = [requests.get(url) for url in url_2021_int] Commented Apr 13, 2021 at 15:21

1 Answer

When you use req_int = requests.get(url) in the loop, the req_int variable is overwritten on each iteration, so only the response for the last URL is left once the loop finishes.

If you want to store the requests.get(url) results in a list, you can use:

req_ints = [requests.get(url) for url in url_2021_int]
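
Note that req_ints is a list of response objects, so it cannot be passed to BeautifulSoup in one go; each response has to be parsed on its own. A minimal sketch, assuming the same ".title a" selector as in the question. The helper name titles_from_html and the sample HTML strings are hypothetical and stand in for the .text of real responses:

```python
from bs4 import BeautifulSoup


def titles_from_html(html):
    """Return the text of every link matched by the ".title a" selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.text for a in soup.select(".title a")]


# Hypothetical stand-in for the .text of two responses in req_ints.
pages = [
    '<div class="title"><a href="#">First interview</a></div>',
    '<div class="title"><a href="#">Second interview</a></div>',
]

# Parse each page separately and flatten the results into one list.
titles_int = [title for html in pages for title in titles_from_html(html)]
print(titles_int)
```

With real responses, the comprehension would iterate over req_ints instead and pass req.text to the helper.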

However, it seems logical to process the data in the same loop:

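all_titles_int = []  # titles collected across all URLs
for url in url_2021_int:
    req_int = requests.get(url)
    soup_int = BeautifulSoup(req_int.text, "html.parser")
    titles_int = soup_int.select(".title a")
    titles_int = [data.text for data in titles_int]
    # Save this page's titles before the next iteration overwrites titles_int
    all_titles_int.extend(titles_int)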

Note that you can pass "html.parser" as the second argument to the BeautifulSoup call to select the parser explicitly, since the documents you are parsing are HTML documents.
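
If it matters which titles came from which page, the results can also be kept per URL rather than in one variable. This is only a sketch: fetch_html, scrape_titles and the fetch parameter are hypothetical names introduced here, and the 10-second timeout is an arbitrary assumption.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def scrape_titles(urls, fetch=fetch_html):
    """Map each URL to the list of ".title a" link texts found on that page."""
    titles_by_url = {}
    for url in urls:
        soup = BeautifulSoup(fetch(url), "html.parser")
        titles_by_url[url] = [a.text for a in soup.select(".title a")]
    return titles_by_url
```

Calling scrape_titles(url_2021_int) would then return one list of titles per year page. The fetch parameter exists only so the parsing can be tried against saved HTML without touching the network.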

2 Comments

Thank you for your answer. The first snippet works, but it then throws an error when I try to apply BeautifulSoup(), while with the second one titles_int is an empty object.
@Rollo99 I helped to fix the immediate error. The rest is up to you to debug and fix. I just ran the code as is, and it seems to output values. There are quite a lot of items in each titles_int.
