Scrape html links Python

Question

Hello everyone I'm trying to get all href links with python by using this :

import requests
from bs4 import BeautifulSoup

url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

#Collecting links on rappel.gouv
def get_url(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract(soup):
    results = soup.find_all('div', {'class' : 'product-content'})
    for item in results:
        item.find('a', {'class' : 'product-link'}).text.replace('','').strip()
        links = url + item.find('a', {'class' : 'product-link'})['href']

    return links

soup = get_url(url)
print(extract(soup))

I'm supposed to get 10 htmls links as following :

https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4572/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4573/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4575/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4569/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4565/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4568/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4570/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4567/Interne
https://rappel.conso.gouv.fr/fiche-rappel/4558/Interne

it actually works when I write print into the code as following :

def extract(soup):
    results = soup.find_all('div', {'class' : 'product-content'})
    for item in results:
        item.find('a', {'class' : 'product-link'}).text.replace('','').strip()
        links = url + item.find('a', {'class' : 'product-link'})['href']
        print(links)

    return

but I'm supposed with all the links I get from this request put them into a loop so I'll get data from each of those 10 pages and store them in a database (so it means there are lines code to write after def extract(soup)to come.

I have tried to understand with many tutorials, I get ever one html or a none

Alecx · Accepted Answer · 2021-11-26 21:46:13Z

1

You just need to build a list of links, in your code the variable links only resets each time in the loop. Try this:

def extract(soup):
    results = soup.find_all('div', {'class' : 'product-content'})
    links = []
    for item in results:
        item.find('a', {'class' : 'product-link'}).text.replace('','').strip()
        links.append(url + item.find('a', {'class' : 'product-link'})['href'])


    return links

To print each link in main code after functions:

soup = get_url(url)
linklist = extract(soup)
for url in linklist:
    print(url)

edited Nov 26, 2021 at 21:46

answered Nov 26, 2021 at 21:19

Alecx

791 silver badge4 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Oxykore Over a year ago

Thank you :) but I did that too, I get a result as following : ['https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne', ... 'https://rappel.conso.gouv.fr/fiche-rappel/4558/Interne'] but I was wondering... let's say I name this output url_data = extract(soup), I'm going to implement url_data like this request.get(url_data) for then I use bs4, to extract data for each pages, do you think that it will work ? cause i'm afraid of this such of errors requests.exceptions.InvalidSchema: No connection adapters were found for "['rappel.conso.gouv.fr']"

Alecx Over a year ago

You can access a link in you list by an index: soup = get_url(url) linklist = extract(soup) print(linklist[0]) print(linklist[1]) Surely you can iterate over this list in a loop. for url in linklist: print(url)

Oxykore Over a year ago

Thank you very much !! it very much appreciated, thanks to everyone else as well :)

Alecx Over a year ago

One more thing: if you need to keep the starting url in a variable url it's better to set some different name of the variable in the last loop :)

Marcelo Wizenberg · Accepted Answer · 2021-11-26 21:12:12Z

0

Your links variable is being rewritten inside the for loop.

You can create an empty list before the loop, then append the URL on each iteration.

import requests
from bs4 import BeautifulSoup

url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

#Collecting links on rappel.gouv
def get_url(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract(soup):
    results = soup.find_all('div', {'class' : 'product-content'})
    links = []
    for item in results:
        item.find('a', {'class' : 'product-link'}).text.replace('','').strip()
        links.append(url + item.find('a', {'class' : 'product-link'})['href'])

    return links

soup = get_url(url)
print(extract(soup))

answered Nov 26, 2021 at 21:12

Marcelo Wizenberg

1211 bronze badge

1 Comment

Oxykore Over a year ago

yes I did that too, I get a result as following : ['https://rappel.conso.gouv.fr/fiche-rappel/4571/Interne', ... 'https://rappel.conso.gouv.fr/fiche-rappel/4558/Interne'] but my question then is... let's say I name this output url_data = extract(soup), I'm going to implement url_data like this request.get(url_data) for then I use bs4, to extract data for each pages, do you think that it will work ? cause i'm afraid of this such of errors requests.exceptions.InvalidSchema: No connection adapters were found for "['https://rappel.conso.gouv.fr']"

HedgeHog · Accepted Answer · 2021-11-26 21:53:12Z

To use the links from the page to iterate over each products detail page collect the links in a list and return it from the funtion.

Try to name your functions more like what they are returning get_url() is more get_soup(),...

Example

import requests
from bs4 import BeautifulSoup

url = 'https://rappel.conso.gouv.fr'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

def get_soup(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_product_urls(url):
    links = [url+x['href'] for x in get_soup(url).select('a.product-link')]
    return links

def extract_product_details(url):
    soup = get_soup(url)
    items = {}

    for x in soup.select('.product-desc li'):
        content = x.get_text('|', strip=True).split('|')
        items[content[0]]=content[1]

    return items

data = []

for link in extract_product_urls(url):
    data.append(extract_product_details(link))

data

Output

[{'Réf. Fiche\xa0:': '2021-11-0273',
  '№ de Version\xa0:': '1',
  'Origine de la fiche\xa0:': 'PLACE DU MARCHE PLACE DU MARCHE',
  'Nature juridique du rappel\xa0:': 'Volontaire',
  'Catégorie de produit': 'Alimentation',
  'Sous-catégorie de produit': 'Lait et produits laitiers',
  'Nom de la marque du produit': 'Toupargel',
  'Noms des modèles ou références': 'BATONNETS GEANTS VANILLE AMANDES',
  'Identification des produits': 'GTIN',
  'Conditionnements': '292G',
  'Date début/Fin de commercialisation': 'Du\r\n                            11/07/2019\r\n                            au\r\n                            18/09/2021',
  'Température de conservation': 'Produit à conserver au congélateur',
  'Marque de salubrité': 'EMB 35360C',
  'Zone géographique de vente': 'France entière',
  'Distributeurs': 'PLACE DU MARCHE',
  'Motif du rappel': 'Nous tenons à vous informer, que suite à une alerte européenne concernant la présence potentielle d’oxyde d’éthylène à une teneur supérieure à la limite autorisée, et comme un grand nombre d’acteurs de la distribution, nous devons procéder au rappel',
  'Risques encourus par le consommateur': 'Autres contaminants chimiques',
  'Conduite à tenir par le consommateur': 'Ne plus consommer',
  'Numéro de contact': '0805805910',
  'Modalités de compensation': 'Remboursement',
  'Date de fin de la procédure de rappel': 'samedi 26 février 2022'},
 {'Réf. Fiche\xa0:': '2021-11-0274',
  '№ de Version\xa0:': '1',
  'Origine de la fiche\xa0:': 'PLACE DU MARCHE PLACE DU MARCHE',
  'Nature juridique du rappel\xa0:': 'Volontaire',
  'Catégorie de produit': 'Alimentation',
  'Sous-catégorie de produit': 'Lait et produits laitiers',
  'Nom de la marque du produit': 'Toupargel',
  'Noms des modèles ou références': 'CREME GLACEE NOUGAT',
  'Identification des produits': 'GTIN',
  'Conditionnements': '469G',
  'Date début/Fin de commercialisation': 'Du\r\n                            28/06/2019\r\n                            au\r\n                            10/10/2021',
  'Température de conservation': 'Produit à conserver au congélateur',
  'Marque de salubrité': 'EMB 35360C',
  'Zone géographique de vente': 'France entière',
  'Distributeurs': 'PLACE DU MARCHE',
  'Motif du rappel': 'Nous tenons à vous informer, que suite à une alerte européenne concernant la présence potentielle d’oxyde d’éthylène à une teneur supérieure à la limite autorisée, et comme un grand nombre d’acteurs de la distribution, nous devons procéder au rappel',
  'Risques encourus par le consommateur': 'Autres contaminants chimiques',
  'Conduite à tenir par le consommateur': 'Ne plus consommer',
  'Numéro de contact': '0805805910',
  'Modalités de compensation': 'Remboursement',
  'Date de fin de la procédure de rappel': 'samedi 26 février 2022'},...]

Collectives™ on Stack Overflow

Scrape html links Python

3 Answers 3

4 Comments

1 Comment

Example

Output

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

1 Comment

Example

Output

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related