
I have some HTML text that I later want to convert into a pandas DataFrame.

The text looks like this:

<tr>
   <td -some attributes- >Val1</td>
   <td -some attributes- >Val2</td>
   <td -some attributes- >Val3</td>
</tr>
<tr>
   <td -some attributes- >Val4</td>
   <td -some attributes- >Val5</td>
   <td -some attributes- >Val6</td>
</tr>

I am using the regex <td.*>(.*)</td>, but it doesn't catch all the values; instead it catches almost all of the text.

Once I have caught all the values, I put them in a DataFrame.

So why doesn't this regex catch the values as it should?

  • I'd recommend beautifulsoup instead of regex: pypi.python.org/pypi/beautifulsoup4. Also, show the actual code you tried to use. Commented May 10, 2017 at 14:20
  • It could be that you look at each row, one at a time and that a value spans multiple rows, or something completely different. I second the previous comment. Use beautifulsoup to parse html. Commented May 10, 2017 at 14:23
  • Give some example of tags that it does not catch. Commented May 10, 2017 at 14:27
  • Your regex <td.*>(.*)</td> is greedy (see the documentation: docs.python.org/3.6/library/re.html), so it captures more than necessary. Commented May 10, 2017 at 14:37
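
The greediness mentioned in the last comment is easiest to see when several cells share one line. Below is a minimal sketch, assuming a hypothetical one-line variant of the question's HTML with made-up attributes. When every <td> sits on its own line, `.` does not cross the newline, so the original pattern can appear to work.

```python
import re

pattern = r"<td.*>(.*)</td>"  # the pattern from the question

# Hypothetical input: all cells on a single line, attributes invented for illustration.
one_line = '<tr><td a="1">Val1</td><td a="2">Val2</td><td a="3">Val3</td></tr>'

# The first greedy ".*" runs on to the last "<td ...>" of the line,
# so only the final cell ends up in the capture group.
greedy = re.findall(pattern, one_line)
print(greedy)  # ['Val3']

# The whole match spans from the first <td ...> to the last </td>.
whole = re.search(pattern, one_line).group(0)
print(whole)

# Lazy quantifiers stop at the first ">" and the first "</td>".
lazy = re.findall(r"<td.*?>(.*?)</td>", one_line)
print(lazy)  # ['Val1', 'Val2', 'Val3']

# With one cell per line, "." stops at the newline and the pattern copes.
per_line = re.findall(pattern, '<td a="1">Val1</td>\n<td a="2">Val2</td>')
print(per_line)  # ['Val1', 'Val2']
```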

2 Answers

You can try this instead of a regex (just a suggestion):

import pandas as pd
movies_table = pd.read_html("xxx.y.com")
movies = movies_table[0]  # select the correct table from the returned list of tables

This worked for me. A sample is linked below.

Reading table data directly as a DataFrame
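
If the HTML is already in a string rather than behind a URL, the same "parse it instead of using a regex" idea can be sketched with the standard library alone. This swaps in Python's html.parser for pd.read_html; the class name CellCollector and the class="a" attributes are made up for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (e.g. indentation) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = """
<tr>
   <td class="a">Val1</td>
   <td class="a">Val2</td>
   <td class="a">Val3</td>
</tr>
<tr>
   <td class="a">Val4</td>
   <td class="a">Val5</td>
   <td class="a">Val6</td>
</tr>
"""

parser = CellCollector()
parser.feed(html)
parser.close()
print(parser.rows)  # [['Val1', 'Val2', 'Val3'], ['Val4', 'Val5', 'Val6']]
```

The resulting list of rows can then be passed to pd.DataFrame(parser.rows), assuming pandas is installed.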


If you (really) want to use a regex, you can do as follows:

content = """\
<tr>
   <td -some attributes- >Val1</td>
   <td -some attributes- >Val2</td>
   <td -some attributes- >Val3</td>
</tr>
<tr>
   <td -some attributes- >Val4</td>
   <td -some attributes- >Val5</td>
   <td -some attributes- >Val6</td>
</tr>"""

import re

td_regex = re.compile(r"<td[^>]+>"  # <td> tag
                      r"((?:(?!</td>).)+)")  # <td> content

print(td_regex.findall(content))

You'll get:

['Val1', 'Val2', 'Val3', 'Val4', 'Val5', 'Val6']
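
Since the goal is a DataFrame, it may help to keep the values grouped by row rather than in one flat list. Here is a rough sketch of one way to do that with two lazy patterns; the names row_regex and cell_regex are made up for illustration, and like any regex approach it assumes simple, well-formed markup without nested tables.

```python
import re

content = """\
<tr>
   <td -some attributes- >Val1</td>
   <td -some attributes- >Val2</td>
   <td -some attributes- >Val3</td>
</tr>
<tr>
   <td -some attributes- >Val4</td>
   <td -some attributes- >Val5</td>
   <td -some attributes- >Val6</td>
</tr>"""

# re.DOTALL lets "." cross newlines, so a row may span several lines.
row_regex = re.compile(r"<tr[^>]*>(.*?)</tr>", re.DOTALL)
cell_regex = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)

# One inner list per <tr>, one string per <td>.
rows = [cell_regex.findall(row) for row in row_regex.findall(content)]
print(rows)  # [['Val1', 'Val2', 'Val3'], ['Val4', 'Val5', 'Val6']]
```

Assuming pandas is installed, pd.DataFrame(rows) should then give a frame with two rows and three columns.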
