HTML table to pandas table: Info inside html tags

Question

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:

<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>

When I convert this to pandas using pd.read_html(tbl) the output is like this:

    0    1          2
 0  265  JonesBlue  29
 1  266  Smith      34

I need to keep the information in the <A HREF ... > tag, since the unique identifier is stored in the link. That is, the table should look like this:

    0    1        2
 0  265  jones03  29
 1  266  smith01  34

I'm fine with various other outputs (for example, jones03 Jones would be even more helpful) but the unique ID is critical.

Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.

Is there a simple way of accessing this information?

The enhancement requests covering this for pandas.read_html are here: github.com/pandas-dev/pandas/issues/14608 and github.com/pandas-dev/pandas/issues/13141
– Gabe, Commented Mar 16, 2022 at 23:51

unutbu · Accepted Answer · 2015-08-02 13:26:49Z

10

Since this parsing job requires extracting both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.

Using lxml, you could extract the attribute values with XPath:

from io import StringIO

import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''

table = LH.fromstring(content)
# pandas 2.1+ deprecates passing literal HTML strings, so wrap the markup in StringIO
for df in pd.read_html(StringIO(content)):
    # one href per row, in document order
    df['refname'] = table.xpath('//tr/td/a/@href')
    # keep the file name without its extension
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
    print(df)

yields

     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01

The above may be useful since it requires only a few extra lines of code to add the refname column.

But both LH.fromstring and pd.read_html parse the HTML, so the work is done twice. Efficiency could be improved by dropping pd.read_html and parsing the table once with LH.fromstring:

table = LH.fromstring(content)
# extract the text from `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')] 
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
print(df)

yields

    id        name  val  refname
0  265   JonesBlue   29  jones03
1  266       Smith   34  smith01
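
If other cells also contain links, the unscoped //tr/td/a/@href returns more hrefs than there are rows, and the column assignment fails. One possible workaround, sketched below, is to scope the XPath to a single column inside each row. The extra link in the third column and the column names are made up for illustration, and the regex assumes every link ends in .shtml, as in the question's sample:

```python
import lxml.html as LH
import pandas as pd

# Hypothetical variant of the sample table: the third column also holds a link.
content = """
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td><a href="/ages/29.shtml">29</a></td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

table = LH.fromstring(content)
rows = []
for tr in table.xpath(".//tr"):
    cells = tr.xpath("td")
    # Look for links only inside the second cell of this row.
    hrefs = cells[1].xpath("a/@href")
    rows.append({
        "id": int(cells[0].text_content()),
        "name": cells[1].text_content().strip(),
        "val": int(cells[2].text_content()),
        "href": str(hrefs[0]) if hrefs else None,
    })

df = pd.DataFrame(rows)
# Assumption: the uid is the file name in front of ".shtml".
df["refname"] = df["href"].str.extract(r"([^./]+)\.shtml$", expand=False)
print(df)
```

Working row by row also keeps the hrefs aligned with their rows when some rows have no link at all.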

edited Aug 2, 2015 at 13:26

answered Aug 2, 2015 at 12:44

unutbu


3 Comments

iayork Over a year ago

Thanks. This exact approach doesn't work in my case, because other cells also have href tags that get picked up by the xpath; but given that I have to perform the extra step no matter what, I pulled the UID out using a regex and then populated the new columns with that.

unutbu Over a year ago

Glad you solved the problem! Be careful parsing HTML with regex though; it may work in many cases, but it is hard to make robust.

iayork Over a year ago

Understood. In this case I'm not really parsing the html, just looking for the text in the full URL that indicates the uid. It's more fragile than I prefer but these tables should have a consistent structure that makes it relatively safe.

k-nut · Accepted Answer · 2015-08-02 12:10:58Z

6

You could simply parse the table manually like this:

from bs4 import BeautifulSoup  # BeautifulSoup 4; `import BeautifulSoup` was version 3
import pandas as pd

TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

table = BeautifulSoup(TABLE, "html.parser")
records = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    record = []
    record.append(tds[0].text)        # plain text of the first cell
    record.append(tds[1].a["href"])   # href of the link in the second cell
    record.append(tds[2].text)
    records.append(record)

df = pd.DataFrame(data=records)
df

which gives you

     0                 1   2
0  265  /j/jones03.shtml  29
1  266  /s/smith01.shtml  34
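
For a wide table, the per-column bookkeeping can be avoided by treating every cell the same way. The sketch below is one way to do that, not the only one: the helper name cell_value is made up, and it assumes the uid is the file name of the first link in a cell, without its extension.

```python
from bs4 import BeautifulSoup
import pandas as pd

TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""


def cell_value(td):
    """Return 'uid text' for a cell containing a link, plain text otherwise."""
    text = td.get_text(strip=True)
    link = td.find("a", href=True)
    if link is None:
        return text
    # Assumption: the uid is the last path component without its extension.
    uid = link["href"].rsplit("/", 1)[-1].split(".", 1)[0]
    return f"{uid} {text}"


soup = BeautifulSoup(TABLE, "html.parser")
records = [
    [cell_value(td) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]
df = pd.DataFrame(records)
print(df)
```

Every cell goes through the same helper, so the number of columns does not matter. All values come back as strings, so numeric columns still need an explicit astype(int) afterwards.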

answered Aug 2, 2015 at 12:10

k-nut


1 Comment

iayork Over a year ago

Thanks for the suggestion. The table is fairly large and there are many cells in each row, so I'd rather avoid manual lifting if possible (and this is hard to generalize), but will fall back to this if there's no simpler solution.

Giulio Genovese · Accepted Answer · 2016-07-24 21:23:20Z

4

You could use regular expressions to modify the text first and remove the html tags:

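import re
from io import StringIO

import pandas as pd

tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
# replace each <a href="URL">text</a> with "URL text"
tbl = re.sub(r'<a.*?href="(.*?)">(.*?)</a>', r'\1 \2', tbl)
# pandas 2.1+ deprecates passing literal HTML strings, so wrap the markup in StringIO
pd.read_html(StringIO(tbl))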

which gives you

[     0                           1   2
 0  265  /j/jones03.shtml JonesBlue  29
 1  266      /s/smith01.shtml Smith  34]
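
If only the uid is wanted rather than the full path, the substitution pattern can capture just the file name. This is a sketch that assumes every relevant link ends in .shtml, as in the question's sample; the names pattern and cleaned are made up, and like any regex over HTML it is fragile if the markup varies:

```python
import re

tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

# Group 1: file name in front of ".shtml"; group 2: the link text.
pattern = re.compile(
    r'<a\s[^>]*href="(?:[^"]*/)?([^/."]+)\.shtml"[^>]*>(.*?)</a>',
    re.DOTALL,
)
# Rewrite each link as "uid text" and drop the <a> tag itself.
cleaned = pattern.sub(r"\1 \2", tbl)
print(cleaned)
```

The cleaned markup can then be handed to pd.read_html (wrapped in io.StringIO on recent pandas versions) exactly as in the snippet above.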

answered Jul 24, 2016 at 21:23

Giulio Genovese



Gabe · Accepted Answer · 2022-08-24 19:36:06Z

2

This is available now in pandas 1.5.0+ using the extract_links parameter of pd.read_html.

extract_links - possible options: {None, "all", "header", "body", "footer"}

Table elements in the specified section(s) with <a> tags will have their href extracted.

Documentation

Example

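from io import StringIO

import pandas as pd

html_table = """
<table>
<tr>
  <th>GitHub</th>
</tr>
<tr>
  <td><a href="https://github.com/pandas-dev/pandas">pandas</a></td>
</tr>
</table>
"""

# with extract_links="all", header and body cells come back as (text, href) tuples
df = pd.read_html(
  StringIO(html_table),
  extract_links="all"
)[0]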
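
Applied to the table from the question, the (text, href) tuples can be split into separate columns afterwards. This is a sketch assuming pandas 1.5.0+ and an installed HTML parser such as lxml. The column names are made up, and the regex assumes the uid is the file name in front of .shtml:

```python
from io import StringIO

import pandas as pd

html = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

# With extract_links="body", every body cell is a (text, href) tuple;
# href is None for cells without a link.
links = pd.read_html(StringIO(html), extract_links="body")[0]

df = pd.DataFrame({
    "id": links[0].map(lambda cell: cell[0]).astype(int),
    "name": links[1].map(lambda cell: cell[0]),
    "href": links[1].map(lambda cell: cell[1]),
    "val": links[2].map(lambda cell: cell[0]).astype(int),
})
# Assumption: the uid is the file name in front of ".shtml".
df["uid"] = df["href"].str.extract(r"([^./]+)\.shtml$", expand=False)
print(df)
```

Only the first link in a cell is captured, so cells with several links would still need one of the manual approaches.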

edited Aug 24, 2022 at 19:36

answered Aug 24, 2022 at 14:23

Gabe

