I want to scrape information from several pages of the same site, societe.com, and I have several questions.

First of all, here is the code I have managed to write so far. I admit I am a bit of a novice.

I only put in 2 URLs and a few fields to check that the loop works; I can add more once everything works.

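import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

tableau = []  # one dict per company is appended here
urls = ["https://www.societe.com/societe/decathlon-france-500569405.html","https://www.societe.com/societe/go-sport-312193899.html"]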
for url in urls:
    response = requests.get(url, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
    soup = BeautifulSoup(response.text, "html.parser")
    numrcs = soup.find("td", class_="numdisplay")
    nomcommercial = soup.find("td", class_="break-word")
    print(nomcommercial.text)
    print(numrcs.text.strip())
    numsiret = soup.select('div[id^=siret_number]')
    for div in numsiret:
        print(div.text.strip())
    formejuri = soup.select('div[id^=catjur-histo-description]')
    for div in formejuri:
        print(div.text.strip())
    infosend = {
        'numrcs': numrcs,
        'nomcommercial':nomcommercial,
        'numsiret':numsiret,
        'formejuri':formejuri
    }
    tableau.append(infosend)
print(tableau)

my_infos = ['Numéro RCS',  'Numéro Siret ','Forme Juridique']

my_columns = [
    np.tile(np.array(my_infos), len(nomcommercial))
]

df = pd.DataFrame( tableau,index=nomcommercial, columns=my_columns)
df

When I run the loop, the right information is printed, for example:

DECATHLON FRANCE
Lille Metropole B 500569405
50056940503239
SASU Société par actions simplifiée à associé unique

I would like to put all this information in a table, but I can't get it to work: only the last company appears and the data makes no sense. I tried to follow a tutorial, without success.

If you can help me, I would be really happy.

1 Answer

To collect the data for each company into a table, you can use the following example:

import requests
import pandas as pd
from bs4 import BeautifulSoup


urls = [
    "https://www.societe.com/societe/decathlon-france-500569405.html",
    "https://www.societe.com/societe/go-sport-312193899.html",
]

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
}

data = []  # one row (list of fields) per company
for url in urls:
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    # Company name
    title = soup.select_one("#identite_deno").get_text(strip=True)
    # RCS number: the cell right after the one containing "Numéro RCS"
    rcs = soup.select_one('td:-soup-contains("Numéro RCS") + td').get_text(
        strip=True
    )
    siret_number = soup.select_one("#siret_number").get_text(strip=True)
    # Legal form
    form = soup.select_one("#catjur-histo-description").get_text(strip=True)

    data.append([title, url, rcs, siret_number, form])


# Build the table once, after the loop, from all collected rows
df = pd.DataFrame(
    data,
    columns=["Title", "URL", "Numéro RCS", "Numéro Siret", "Forme Juridique"],
)
# to_markdown() requires the optional `tabulate` package
print(df.to_markdown())

Prints:

|    | Title | URL | Numéro RCS | Numéro Siret | Forme Juridique |
|---:|:------|:----|:-----------|:-------------|:----------------|
|  0 | DECATHLON FRANCE (DECATHLON DIRECTION GENERALE FRANCE) | https://www.societe.com/societe/decathlon-france-500569405.html | Lille Metropole B 500569405 | 50056940503239 | SASU Société par actions simplifiée à associé unique |
|  1 | GO SPORT | https://www.societe.com/societe/go-sport-312193899.html | Grenoble B 312193899 | 31219389900191 | Société par actions simplifiée |
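
As for why the original attempt produced a table that made no sense: a likely cause is that `infosend` stored BeautifulSoup tag objects rather than their text, and that `index=nomcommercial` and `columns=my_columns` were built from the last iteration only and did not match the dict keys. A minimal sketch of an alternative layout, assuming you want the company name as the row index: build one dict of plain strings per company and let pandas derive the columns from the keys. The values below are copied from the output shown above instead of being scraped, so the snippet runs offline.

```python
import pandas as pd

# Values copied from the printed output above; in a real run each dict
# would be filled inside the scraping loop with .get_text(strip=True).
rows = [
    {
        "Title": "DECATHLON FRANCE",
        "Numéro RCS": "Lille Metropole B 500569405",
        "Numéro Siret": "50056940503239",
        "Forme Juridique": "SASU Société par actions simplifiée à associé unique",
    },
    {
        "Title": "GO SPORT",
        "Numéro RCS": "Grenoble B 312193899",
        "Numéro Siret": "31219389900191",
        "Forme Juridique": "Société par actions simplifiée",
    },
]

# The dict keys become the column names, so no `columns=` argument is needed.
# set_index moves the company name out of the columns and into the row index.
df = pd.DataFrame(rows).set_index("Title")
print(df)
```

With this layout a single company can be looked up by name, for example `df.loc["GO SPORT"]`.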