Python Regex and Pandas

Question

I have some HTML text that I later want to convert into a pandas DataFrame.

The text looks like this:

<tr>
   <td -some attributes- >Val1</td>
   <td -some attributes- >Val2</td>
   <td -some attributes- >Val3</td>
</tr>
<tr>
   <td -some attributes- >Val4</td>
   <td -some attributes- >Val5</td>
   <td -some attributes- >Val6</td>
</tr>

and I have the regex <td.*>(.*)</td>, but it doesn't catch just the values; it catches almost all of the text.

After I catch all the values, I put them in a DataFrame.

So why doesn't this regex catch the values as it should?

I'd recommend BeautifulSoup instead of regex: pypi.python.org/pypi/beautifulsoup4. Also, show the actual code you tried to use.
– depperm, Commented May 10, 2017 at 14:20
It could be that you look at each row one at a time and that a value spans multiple rows, or something completely different. I second the previous comment: use BeautifulSoup to parse HTML.
– JohanL, Commented May 10, 2017 at 14:23
Your regex <td.*>(.*)</td> is greedy (see the documentation: docs.python.org/3.6/library/re.html), so it captures more than necessary.
– Laurent LAPORTE, Commented May 10, 2017 at 14:37
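
To illustrate the greediness point from the comment above, here is a small sketch. The single-line `row` string is a made-up input (an assumption, not the asker's actual HTML); it puts two cells on one line, which is where the greedy pattern visibly goes wrong.

```python
import re

# Hypothetical input: two cells on a single line (not the asker's real HTML).
row = '<td class="a">Val1</td><td class="b">Val2</td>'

# Greedy: the first ".*" runs as far as it can, so the tag part swallows
# the whole first cell and only the last value is captured.
greedy = re.findall(r"<td.*>(.*)</td>", row)

# Lazy: ".*?" stops at the first place the rest of the pattern can match.
lazy = re.findall(r"<td.*?>(.*?)</td>", row)

print(greedy)  # ['Val2']
print(lazy)    # ['Val1', 'Val2']
```

With one cell per line and no re.DOTALL flag, "." cannot cross line breaks, so the greedy pattern may appear to work. The problem shows up once several cells share a line or "." is allowed to match newlines.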

Dinu Duke · Accepted Answer · 2017-05-10 17:00:17Z

1

You could try this instead of a regex (just a suggestion):

import pandas as pd
movies_table = pd.read_html("xxx.y.com")
movies = movies_table[0]  # select the correct table from the list of tables

This worked for me.
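
For readers who have the HTML as a string rather than a URL, a minimal sketch of the same read_html approach might look like the following. It is an illustration, not the answerer's own sample. The `html` string is hypothetical: it wraps the question's rows in <table> tags (read_html looks for tables) and uses a made-up class attribute in place of "-some attributes-". It also assumes an HTML parser backend such as lxml is installed alongside pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML: the rows from the question wrapped in <table> tags,
# with a made-up class attribute standing in for "-some attributes-".
html = """
<table>
  <tr>
    <td class="c">Val1</td>
    <td class="c">Val2</td>
    <td class="c">Val3</td>
  </tr>
  <tr>
    <td class="c">Val4</td>
    <td class="c">Val5</td>
    <td class="c">Val6</td>
  </tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
# Wrapping the string in StringIO passes it as a file-like object.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

Because the rows contain only <td> cells and no header cells, the columns should simply be numbered 0, 1, 2.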

edited May 10, 2017 at 17:00

answered May 10, 2017 at 14:34

Dinu Duke

Laurent LAPORTE · Accepted Answer · 2017-05-10 14:35:24Z

0

If you (really) want to use a regex, you can do it as follows:

content = """\
<tr>
   <td -some attributes- >Val1</td>
   <td -some attributes- >Val2</td>
   <td -some attributes- >Val3</td>
</tr>
<tr>
   <td -some attributes- >Val4</td>
   <td -some attributes- >Val5</td>
   <td -some attributes- >Val6</td>
</tr>"""

import re

td_regex = re.compile(r"<td[^>]+>"  # <td> tag
                      r"((?:(?!</td>).)+)")  # <td> content

print(td_regex.findall(content))

You'll get:

['Val1', 'Val2', 'Val3', 'Val4', 'Val5', 'Val6']
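
Since the goal in the question is a DataFrame, a flat list of values still has to be split back into rows. One possible way, sketched here under the assumption that each <tr> block holds one row and that cells are not nested, is to match the rows first and then the cells inside each row. The `tr_regex` and `td_regex` patterns below are an illustrative variation, not the exact pattern from this answer.

```python
import re

import pandas as pd

content = """\
<tr>
   <td -some attributes- >Val1</td>
   <td -some attributes- >Val2</td>
   <td -some attributes- >Val3</td>
</tr>
<tr>
   <td -some attributes- >Val4</td>
   <td -some attributes- >Val5</td>
   <td -some attributes- >Val6</td>
</tr>"""

# Lazily match each <tr>...</tr> block; DOTALL lets "." span newlines.
tr_regex = re.compile(r"<tr[^>]*>(.*?)</tr>", re.DOTALL)
# Lazily match the text of each cell inside a row.
td_regex = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)

# One inner list of cell values per table row.
rows = [td_regex.findall(row_html) for row_html in tr_regex.findall(content)]

df = pd.DataFrame(rows)
print(df)
```

This only holds up for simple, regular markup like the sample; for real-world HTML a proper parser is the safer choice.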

answered May 10, 2017 at 14:35

Laurent LAPORTE
