I used an answer from this post and have the following code working:
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get("https://www.test.com/")
# parse the HTML returned by requests
soup = BeautifulSoup(res.text, 'html.parser')
columns = [i.get_text(strip=True) for i in soup.table.find_all("th")]
data = []
for tr in soup.table.find("tbody").find_all("tr"):
    data.append([td.get_text(strip=True) for td in tr.find_all("td")])
# build the DataFrame once all rows have been collected
df = pd.DataFrame(data, columns=columns)
# add new column for the url
df.insert(loc=1, column='url', value="url")
df.to_excel("data.xlsx", index=False)
I am trying to modify it so that each link's text and URL are saved into two separate columns of my DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get("https://www.test.com/")
# parse the HTML returned by requests
soup = BeautifulSoup(res.text, 'html.parser')
columns = [i.get_text(strip=True) for i in soup.table.find_all("th")]
data = []
for tr in soup.table.find("tbody").find_all("tr"):
    for td in tr.find_all("td"):
        # check td for url; if there, save the href and text into separate columns
        pass  # this is the part I need help with
df = pd.DataFrame(data, columns=columns)
# add new column for the url
df.insert(loc=1, column='url', value="url")
df.to_excel("data.xlsx", index=False)
Inside the td loop, I want to check each td for a link (an `<a>` tag). If one is present, I want to save both its text ("Google Search") and its href ("http://www.google.com") from the current td into separate columns. If there is no link, the td text should be saved to the appropriate DataFrame column as before.
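The per-cell check described above might look roughly like the sketch below. The helper name `split_cell` and the inline sample HTML are assumptions made for illustration only; the sketch also assumes each cell holds at most one link.

```python
from bs4 import BeautifulSoup

# Hypothetical sample cells, modelled on the example table in this post.
html = """
<table>
  <tr>
    <td><a href="http://www.google.com">Google Search</a></td>
    <td>John Doe</td>
  </tr>
</table>
"""


def split_cell(td):
    """Return (text, href) for a cell; href is None when the cell has no link."""
    link = td.find("a", href=True)  # first <a> that actually carries an href
    if link is not None:
        return link.get_text(strip=True), link["href"]
    return td.get_text(strip=True), None


soup = BeautifulSoup(html, "html.parser")
cells = [split_cell(td) for td in soup.find_all("td")]
print(cells)
```

Returning a pair for every cell keeps the calling loop uniform: the caller can decide whether a `None` href means "skip the URL column" or "leave it empty".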
Example table that I am trying to parse and save to Excel:
<table class="test-table">
  <thead>
    <tr>
      <th>Site</th>
      <th>Name</th>
      <th>Addr</th>
      <th>City</th>
      <th>World</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="http://www.google.com">Google Search</a></td>
      <td>John Doe</td>
      <td>555 Smart St</td>
      <td>North Pole</td>
      <td>Earth</td>
    </tr>
  </tbody>
</table>
Desired output Excel columns:

| Site name | URL | Name | Address | City | World |
| --- | --- | --- | --- | --- | --- |
| Google Search | http://www.google.com | John Doe | 555 Smart St | North Pole | Earth |
In the td loop, I need help with:
- checking the current td for the presence of an `<a>` tag with an href
- if it contains a URL, splitting it into its parts and saving the text and the href to separate DataFrame columns
- if it does not, continuing to save the td text to the DataFrame
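One possible way to put these steps together is sketched below. It is not a definitive solution: it assumes at most one link per row, stores the href in a column named "URL" (a name chosen here to match the desired output), and parses an inline copy of the example table instead of fetching a page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Inline copy of the example table so the sketch runs without a network request.
html = """
<table class="test-table">
  <thead>
    <tr><th>Site</th><th>Name</th><th>Addr</th><th>City</th><th>World</th></tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="http://www.google.com">Google Search</a></td>
      <td>John Doe</td><td>555 Smart St</td><td>North Pole</td><td>Earth</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="test-table")
columns = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find("tbody").find_all("tr"):
    record = {}
    for name, td in zip(columns, tr.find_all("td")):
        # Every cell contributes its visible text to its own column.
        record[name] = td.get_text(strip=True)
        link = td.find("a", href=True)
        if link is not None:
            # Assumption: one link per row, stored in a column named "URL".
            record["URL"] = link["href"]
    rows.append(record)

df = pd.DataFrame(rows)
# Place URL directly after the first column, as in the desired output.
ordered = [columns[0], "URL"] + columns[1:]
df = df.reindex(columns=ordered)
print(df.to_dict("records"))
# df.to_excel("data.xlsx", index=False)  # needs an Excel writer such as openpyxl
```

Collecting each row as a dict and building the DataFrame once at the end avoids inserting into the DataFrame inside the loop. `reindex` also creates an empty URL column when no row contains a link, so the output layout stays the same either way.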
Thanks in advance!