I used an answer from this post and have the following code working.

import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get("https://www.test.com/")
# parse the HTML returned by requests
soup = BeautifulSoup(res.text, 'html.parser')

columns = [i.get_text(strip=True) for i in soup.table.find_all("th")]
data = []

# collect the cell text of every body row before building the dataframe
for tr in soup.table.find("tbody").find_all("tr"):
    data.append([td.get_text(strip=True) for td in tr.find_all("td")])

df = pd.DataFrame(data,columns=columns)
#add new column for the url
df.insert(loc=1,column='url',value="url")

df.to_excel("data.xlsx", index=False)

I am trying to modify it so that the link text and the URL are saved into two different columns of my dataframe:

import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get("https://www.test.com/")
# parse the HTML returned by requests
soup = BeautifulSoup(res.text, 'html.parser')

columns = [i.get_text(strip=True) for i in soup.table.find_all("th")]
data = []

df = pd.DataFrame(data,columns=columns)
#add new column for the url
df.insert(loc=1,column='url',value="url")

for tr in soup.table.find("tbody").find_all("tr"):
    for td in tr.find_all("td"):
        # check td for url, if there, save the href and text into separate columns
        pass


df.to_excel("data.xlsx", index=False)

In the td for loop I want to check the td for a URL (an <a> tag) and, if it is there, save both the text ("Google Search") and the href ("http://www.google.com") of the current td. If it is not there, save the current td text to the appropriate dataframe column.

Example table that I am trying to parse and save to Excel:

<table class="test-table">
<thead>
    <tr>
        <th>Site</th>
        <th>Name</th>
        <th>Addr</th>
        <th>City</th>
        <th>World</th>
    </tr>
</thead>
 <tbody>
   <tr>
    <td><a href="http://www.google.com">Google Search</a></td>
    <td>John Doe</td>
    <td>555 Smart St</td>
    <td>North Pole</td>
    <td>Earth</td>
   </tr>
 </tbody>
</table>

Desired output Excel columns:

| Site name | URL | Name | address | city | World |
| -------- | -------------- | -------------- | -------------- | -------------- | -------------- |
| Google Search | https://www.google.com | John Doe | 555 Smart St | North Pole | Earth |

In the td loop, I need help with:

  1. Checking the current td for the presence of an <a> tag with an href
  2. If it is a URL, splitting its parts and saving the text and the href to separate dataframe columns
  3. If it is not a URL, continuing to save the text to the dataframe

Thanks in advance!

1 Answer

You could iterate over the rows, build a dict for each one, append it to a list, and create your dataframe from that list:

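# read the header names once instead of on every row
headers = [th.get_text(strip=True) for th in soup.select('table tr th')]
data = []

for row in soup.select('table tr:has(td)'):
    # pair each header with the text of its own cell
    d = dict(zip(headers, [td.get_text(strip=True) for td in row.find_all('td')]))
    # href of the link in the first cell, or None when that cell has no <a>
    d.update({'Url':row.td.a.get('href') if row.td.a else None})
    data.append(d)

pd.DataFrame(data).to_excel('file.xlsx', index=False)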
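
If the `:has()` selector is not available in your environment (Beautiful Soup relies on the soupsieve package for CSS selectors, from version 4.7.0 onward), the same rows can be picked with plain `find_all` calls. This is only a minimal sketch; the HTML is a shortened copy of the question's table, and the names `body_rows` and `first_cell_links` are made up for illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table (assumed input for this sketch)
html = """
<table class="test-table">
<thead><tr><th>Site</th><th>Name</th></tr></thead>
<tbody>
<tr><td><a href="http://www.google.com">Google Search</a></td><td>John Doe</td></tr>
<tr><td>NOPE</td><td>John Doe</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only rows that contain at least one <td>, which skips the header row
body_rows = [tr for tr in soup.find_all("tr") if tr.find("td")]

# Link target of the first cell in each row, or None when it has no <a>
first_cell_links = [
    tr.td.a.get("href") if tr.td.a else None
    for tr in body_rows
]
print(first_cell_links)
```

Here `tr.td` is the first `<td>` in the row and `tr.td.a` is the first `<a>` inside it, so the conditional is what distinguishes a linked cell from a plain-text one.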

Example

from bs4 import BeautifulSoup
import pandas as pd
html = '''
<table class="test-table">
<thead>
    <tr>
        <th>Site</th>
        <th>Name</th>
        <th>Addr</th>
        <th>City</th>
        <th>World</th>
    </tr>
</thead>
 <tbody>
   <tr>
    <td><a href="http://www.google.com">Google Search</a></td>
    <td>John Doe</td>
    <td>555 Smart St</td>
    <td>North Pole</td>
    <td>Earth</td>
   </tr>
   <tr>
    <td>NOPE</td>
    <td>John Doe</td>
    <td>555 Smart St</td>
    <td>South Pole</td>
    <td>Earth</td>
   </tr>
 </tbody>
</table>
'''

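# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

# read the header names once instead of on every row
headers = [th.get_text(strip=True) for th in soup.select('table tr th')]
data = []

for row in soup.select('table tr:has(td)'):
    # pair each header with the text of its own cell
    d = dict(zip(headers, [td.get_text(strip=True) for td in row.find_all('td')]))
    # href of the link in the first cell, or None when that cell has no <a>
    d.update({'Url':row.td.a.get('href') if row.td.a else None})
    data.append(d)

pd.DataFrame(data).to_excel('file.xlsx', index=False)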

Output

| Site | Name | Addr | City | World | Url |
| --- | --- | --- | --- | --- | --- |
| Google Search | John Doe | 555 Smart St | North Pole | Earth | http://www.google.com |
| NOPE | John Doe | 555 Smart St | South Pole | Earth | |
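
If you would rather have the URL sit right next to the column it came from, as in the question's desired layout, and check every cell rather than only the first one, something along these lines might work. It is a sketch under assumptions, not a definitive implementation: the function name `table_to_frame` and the `" URL"` column suffix are invented here, and it assumes each row has one `<td>` per `<th>`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Same sample table as in the example (assumed input for this sketch)
html = """
<table class="test-table">
<thead>
<tr><th>Site</th><th>Name</th><th>Addr</th><th>City</th><th>World</th></tr>
</thead>
<tbody>
<tr><td><a href="http://www.google.com">Google Search</a></td>
<td>John Doe</td><td>555 Smart St</td><td>North Pole</td><td>Earth</td></tr>
<tr><td>NOPE</td>
<td>John Doe</td><td>555 Smart St</td><td>South Pole</td><td>Earth</td></tr>
</tbody>
</table>
"""


def table_to_frame(html, url_suffix=" URL"):
    """Hypothetical helper: one record per row, plus a '<header> URL'
    column for every column in which some cell holds a link."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]

    records = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header row has no <td>
        record = {}
        for header, td in zip(headers, cells):
            # Step 3: always keep the visible text of the cell
            record[header] = td.get_text(strip=True)
            # Steps 1 and 2: if the cell holds a link, also keep its href
            link = td.find("a", href=True)
            if link is not None:
                record[header + url_suffix] = link["href"]
        records.append(record)

    df = pd.DataFrame(records)
    # Put each URL column directly after the column it belongs to
    ordered = []
    for header in headers:
        ordered.append(header)
        if header + url_suffix in df.columns:
            ordered.append(header + url_suffix)
    return df.reindex(columns=ordered)


df = table_to_frame(html)
print(df)
# df.to_excel("data.xlsx", index=False)  # needs an Excel writer such as openpyxl
```

Rows without a link simply get a missing value in the URL column, which pandas writes to Excel as an empty cell.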