Scrape table and save in excel - Save URL text and href in separate columns [python, beautifulsoup]

Question

I used an answer from this post and have the following code working.

import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get("https://www.test.com/")
# parse the HTML returned by requests
soup = BeautifulSoup(res.text, 'html.parser')

# header cells become the column names
columns = [i.get_text(strip=True) for i in soup.table.find_all("th")]
data = []

# collect the text of every cell, row by row
for tr in soup.table.find("tbody").find_all("tr"):
    data.append([td.get_text(strip=True) for td in tr.find_all("td")])

# build the frame once the rows are collected
df = pd.DataFrame(data, columns=columns)
# add new column for the url
df.insert(loc=1, column='url', value="url")

df.to_excel("data.xlsx", index=False)

I am trying to modify it so that, for cells containing a link, the link text and the URL are saved in two separate columns of my dataframe:

import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get("https://www.test.com/")
# parse the HTML returned by requests
soup = BeautifulSoup(res.text, 'html.parser')

columns = [i.get_text(strip=True) for i in soup.table.find_all("th")]
data = []

for tr in soup.table.find("tbody").find_all("tr"):
    for td in tr.find_all("td"):
        # check td for url, if there, save the href and text into separate columns
        pass

df = pd.DataFrame(data, columns=columns)
# add new column for the url
df.insert(loc=1, column='url', value="url")

df.to_excel("data.xlsx", index=False)

In the td for loop I want to check the td for a URL (an <a> tag) and, if it's there, save both the text ("Google Search") and the href ("http://www.google.com") from the current td. If it is not there, save the current td text to the appropriate dataframe column.

Example table that I am trying to parse and save to Excel:

<table class="test-table">
<thead>
    <tr>
        <th>Site</th>
        <th>Name</th>
        <th>Addr</th>
        <th>City</th>
        <th>World</th>
    </tr>
</thead>
 <tbody>
   <tr>
    <td><a href="http://www.google.com">Google Search</a></td>
    <td>John Doe</td>
    <td>555 Smart St</td>
    <td>North Pole</td>
    <td>Earth</td>
   </tr>
 </tbody>
</table>

Desired output Excel columns:

| Site name | URL | Name | Address | City | World |
| --- | --- | --- | --- | --- | --- |
| Google Search | http://www.google.com | John Doe | 555 Smart St | North Pole | Earth |

In the td loop, I need help with:

1. Checking the current td for the presence of an <a> tag with an href
2. If it's a URL, splitting its parts and saving the text and the href to separate dataframe columns
3. If it's not a URL, continuing to save the text to the dataframe

Thanks in advance!

HedgeHog · Accepted Answer · 2022-08-19 22:34:44Z

You could iterate over the rows, build a dict for each one, append it to a list, and create your dataframe from that list:

data = []

for row in soup.select('table tr:has(td)'):
    d = dict(zip([h.text for h in soup.select('table tr th')],list(row.stripped_strings)))
    d.update({'Url':row.td.a.get('href') if row.td.a else None})
    data.append(d)
pd.DataFrame(data).to_excel('file.xlsx', index=False)
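
Note that stripped_strings yields one string per non-empty text node, so an empty cell would shift the values that follow it into the wrong columns. If that matters for your data, the per-cell check described in the question can be written explicitly. A minimal sketch, assuming a hypothetical helper named split_cell and a tiny made-up table; neither comes from the original page:

```python
from bs4 import BeautifulSoup


def split_cell(td):
    """Return (text, href) for one cell; href is None when the cell has no link."""
    # href=True only matches <a> tags that actually carry an href attribute
    link = td.find("a", href=True)
    text = td.get_text(strip=True)
    return text, (link["href"] if link else None)


# Made-up sample row: one linked cell, one plain cell, one empty cell
html = """
<table>
  <tr>
    <td><a href="http://www.google.com">Google Search</a></td>
    <td>John Doe</td>
    <td></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = [split_cell(td) for td in soup.find_all("td")]
print(cells)
# [('Google Search', 'http://www.google.com'), ('John Doe', None), ('', None)]
```

Because every td produces exactly one pair, the values stay aligned with the headers even when a cell is empty.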

Example

from bs4 import BeautifulSoup
import pandas as pd
html = '''
<table class="test-table">
<thead>
    <tr>
        <th>Site</th>
        <th>Name</th>
        <th>Addr</th>
        <th>City</th>
        <th>World</th>
    </tr>
</thead>
 <tbody>
   <tr>
    <td><a href="http://www.google.com">Google Search</a></td>
    <td>John Doe</td>
    <td>555 Smart St</td>
    <td>North Pole</td>
    <td>Earth</td>
   </tr>
   <tr>
    <td>NOPE</td>
    <td>John Doe</td>
    <td>555 Smart St</td>
    <td>South Pole</td>
    <td>Earth</td>
   </tr>
 </tbody>
</table>
'''

# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

data = []

for row in soup.select('table tr:has(td)'):
    d = dict(zip([h.text for h in soup.select('table tr th')],list(row.stripped_strings)))
    d.update({'Url':row.td.a.get('href') if row.td.a else None})
    data.append(d)
pd.DataFrame(data).to_excel('file.xlsx', index=False)

Output

Site	Name	Addr	City	World	Url
Google Search	John Doe	555 Smart St	North Pole	Earth	http://www.google.com
NOPE	John Doe	555 Smart St	South Pole	Earth
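
The output above puts Url in the last column, while the desired layout in the question has it directly after the site name. One way to get that order is to reorder the columns before writing the file. This is only a sketch under a few assumptions: links appear only in the first cell of a row, the new column is called URL (taken from the question's desired header), and the to_excel call is left commented out because it needs an Excel writer such as openpyxl installed:

```python
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table class="test-table">
<thead>
  <tr><th>Site</th><th>Name</th><th>Addr</th><th>City</th><th>World</th></tr>
</thead>
<tbody>
  <tr>
    <td><a href="http://www.google.com">Google Search</a></td>
    <td>John Doe</td><td>555 Smart St</td><td>North Pole</td><td>Earth</td>
  </tr>
  <tr>
    <td>NOPE</td>
    <td>John Doe</td><td>555 Smart St</td><td>South Pole</td><td>Earth</td>
  </tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.test-table")

# Read the headers once instead of once per row
headers = [th.get_text(strip=True) for th in table.select("thead th")]

records = []
for tr in table.select("tbody tr"):
    cells = tr.find_all("td")
    # One value per cell, so empty cells cannot shift the columns
    record = {h: td.get_text(strip=True) for h, td in zip(headers, cells)}
    # Assumption: only the first cell of a row may carry a link
    link = cells[0].find("a", href=True)
    record["URL"] = link["href"] if link else None
    records.append(record)

df = pd.DataFrame(records)

# Place URL directly after the first column
df = df[[headers[0], "URL"] + headers[1:]]

print(df.columns.tolist())
# ['Site', 'URL', 'Name', 'Addr', 'City', 'World']

# df.to_excel("data.xlsx", index=False)
```

Rows without a link keep None in the URL column, which pandas writes as an empty cell.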
