I am extracting a table from my customer's website and need to parse the HTML into a pandas DataFrame. In addition to the cell text, I want to store all the hrefs from the table in the DataFrame. The HTML has the following structure:

<table>
    <tr>
         <th>Col_1</th>
         <th>Col_2</th>
         <th>Col_3</th>
         <th>Col_4</th>
         <th>Col_5</th>
         <th>Col_6</th>
         <th>Col_7</th>
         <th>Col_8</th>
         <th>Col_9</th>
    </tr>
    <tr>
         <td>Office</td>
         <td>Office2</td>
         <td>Customer</td>
         <td></td>
         <td><a href="test12345_163">New Doc</a><br><a href="test12345_163">my_work.jpg</a></td>
         <td><a href="test12345_163">Person_2</a><br><a href="test12345_163">Person_3</a><br><a href="test12345_163">Person 3</a></td>
         <td><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a></td>
         <td>STATUS</td>
         <td>9030303</td>
    </tr>
</table>

I have this code:

soup = BeautifulSoup(page.content, "html.parser")

html_table = soup.find('table')

df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]

I am trying to create a column that holds all the links for each row (grouped together when a row has more than one). But when I run this code I get:

Length of values (1102) does not match length of index (435)

What am I doing wrong?

Thanks!

2 Answers

You don't need read_html here. You can build the DataFrame directly from the <a> tags in the table:

# Collect the text and href of every <a> tag inside the table
html_table = soup.find('table')
hyperlinks = html_table.find_all("a")
l = []
for a in hyperlinks:
    l.append([a.text, a.get("href")])
df = pd.DataFrame(l, columns=["Names", "Links"])

Update:

# here we get the headers from the first row:
html_table = soup.find('table')
trs = html_table.find_all("tr")
headers = [th.text for th in trs[0].find_all("th")]
# here we get the contents: for every data row, each cell becomes
# the list of hrefs found in that cell (an empty list if it has no links)
rows = []
for tr in trs[1:]:
    rows.append([[a.get("href") for a in td.find_all("a")] for td in tr.find_all("td")])
# one DataFrame row per table row, with all headers as columns:
df = pd.DataFrame(rows, columns=headers)
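
For context on the original error: `html_table.find_all('a')` returns one entry per `<a>` tag in the whole table, while the DataFrame has one row per `<tr>`, so the two lengths only agree by coincidence. The following is a minimal sketch of that difference, assuming a hypothetical two-row table (the column names and hrefs are made up for illustration, not taken from the real page):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical table: the first data row has three links, the second has none.
html = """
<table>
  <tr><th>Name</th><th>Docs</th></tr>
  <tr><td>Office</td><td><a href="a_1">A</a><br><a href="a_2">B</a><br><a href="a_3">C</a></td></tr>
  <tr><td>Customer</td><td></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
data_rows = table.find_all("tr")[1:]  # skip the header row

# Flat search: one entry per <a> in the whole table, regardless of row boundaries.
flat_links = [a.get("href") for a in table.find_all("a")]

# Per-row search: one list per <tr>, so the length equals the number of data rows.
links_per_row = [[a.get("href") for a in tr.find_all("a")] for tr in data_rows]

df = pd.DataFrame({"Name": [tr.find("td").get_text(strip=True) for tr in data_rows]})
df["Link"] = links_per_row  # lengths match, so the assignment succeeds

print(len(flat_links), len(links_per_row), len(df))
```

Here the flat list has 3 entries but the frame has only 2 rows, which is the same kind of mismatch as 1102 versus 435 in the question. Grouping the links per `<tr>` keeps the new column aligned with the index.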

3 Comments

Thanks @yasharov, but with that code I only get 2 columns; there are 9 in total.
If you want the href values spread across all 9 columns, I will rewrite the code.
In the updated code, if you want the link text instead, just replace a.get("href") with a.text.

For each row, you could first collect all of its hrefs with a list comprehension, then gather the text of every td into a list, and finally extend that list with a one-item nested list holding the hrefs, so that all the links of a row end up together in a single Links column:

from bs4 import BeautifulSoup as bs
import pandas as pd

html = '''<table>
    <tr>
         <th>Col_1</th>
         <th>Col_2</th>
         <th>Col_3</th>
         <th>Col_4</th>
         <th>Col_5</th>
         <th>Col_6</th>
         <th>Col_7</th>
         <th>Col_8</th>
         <th>Col_9</th>
    </tr>
    <tr>
         <td>Office</td>
         <td>Office2</td>
         <td>Customer</td>
         <td></td>
         <td><a href="test12345_163">New Doc</a><br><a href="test12345_163">my_work.jpg</a></td>
         <td><a href="test12345_163">Person_2</a><br><a href="test12345_163">Person_3</a><br><a href="test12345_163">Person 3</a></td>
         <td><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a></td>
         <td>STATUS</td>
         <td>9030303</td>
    </tr>
</table>'''
soup = bs(html, 'lxml')
results = []
headers = [i.text for i in soup.select('table th')]
headers.append('Links')

for _row in soup.select('table tr')[1:]:
    row = []
    links = [i['href'] for i in _row.select('a')]
    for _td in _row.select('td'):
        row.append(_td.text)
    row.extend([links])
    results.append(row)

df = pd.DataFrame(results, columns = headers)
df
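
If a list inside a cell is awkward to work with later, the Links column can be reshaped after the fact. The following is a minimal sketch of two options, assuming a hypothetical frame shaped like the result above (the values are made up for illustration): joining each list into one string, or using `DataFrame.explode` to get one row per link.

```python
import pandas as pd

# Hypothetical frame shaped like the result above: one list of hrefs per row.
df = pd.DataFrame({
    "Col_1": ["Office", "Customer"],
    "Links": [["test_1", "test_2"], []],
})

# Option 1: collapse each list into a single comma-separated string.
df["Links_joined"] = df["Links"].apply(lambda hrefs: ", ".join(hrefs))

# Option 2: one output row per link; a row whose list is empty is kept with NaN.
long_df = df[["Col_1", "Links"]].explode("Links", ignore_index=True)

print(df)
print(long_df)
```

The joined string is convenient for exporting to CSV, while the exploded form is easier to filter or merge on individual links.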
