Python HTML Regex

Question

Expected output for the txt file below: PlayerA 29.2 PlayerB 32.2

I have a txt file filled with HTML that looks like the sample below. I'm trying to use a Python 2.6 regular expression to collect all the player names and ratings.

The first player name appears on line 4, and its rating (29.2) appears on line 16. The next player name appears on line 22 and its rating on line 35, and so on.

fileout = open('C:\Python26\hotcold.txt')
read_file = fileout.readlines()
source = str(read_file)

expression = re.findall(r"(LS=113>.+?", source)
print expression

I was trying to write an expression that would find all the names and ratings, but it isn't working.

<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
 4
</b>
,
<b>
 8
 </b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>

<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
 4
</b>
,
<b>
 8
 </b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>

Yeah I used BS to get the html, but I don't know how to pick just those specific parts of the text file.
– user3496483, Commented Jul 4, 2015 at 18:37
correct output using above html should be: PlayerA 29.2 PlayerB 32.2
– user3496483, Commented Jul 4, 2015 at 21:34

Incognos · Accepted Answer · 2015-07-04 21:50:24Z

2

I would recommend using Beautiful Soup to parse the HTML and get the values you are after.

Use the following code:

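from bs4 import BeautifulSoup

with open('sample.html', 'r') as html_doc:
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Each player sits in a <tr class="stats"> row.
    for row in soup.find_all('tr', 'stats'):
        # Take the <td> cells that follow the row tag in document order.
        row_tds = row.find_all_next('td')
        name = row_tds[0].find('a').string   # first cell: player link text
        rating = row_tds[2].string           # third cell: rating
        # .string can be None, so guard before calling strip().
        print('{0} {1}'.format(
            name.strip() if name else 'None',
            rating.strip() if rating else 'None'))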

output:

$ python testparse.py
PlayerA 29.2
PlayerB 32.2

Works.
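The question itself asked for a regular expression. What follows is only a minimal sketch of that route, not part of the original answer. It assumes every row has exactly the layout of the sample in the question: a link holding the name, followed later by a cell that holds nothing but a number. The names SAMPLE, ROW and extract_ratings are made up for this illustration, and the code is written for Python 3 rather than the asker's 2.6.

```python
import re

# Sample copied from the question; the <tr> tags are never closed there either.
SAMPLE = """<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
 4
</b>
,
<b>
 8
 </b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>

<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
 4
</b>
,
<b>
 8
 </b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>
"""

# Assumption: the name is the text of the first link in a row, and the rating
# is the first later <td> whose whole content is a number such as 29.2.
ROW = re.compile(
    r'<a href="[^"]*">\s*(.*?)\s*</a>'        # player name inside the link
    r'.*?'                                    # lazily skip the cells in between
    r'<td[^>]*>\s*(\d+(?:\.\d+)?)\s*</td>',   # a cell containing only a number
    re.DOTALL,                                # let "." cross line breaks
)


def extract_ratings(html):
    """Return a list of (name, rating) string pairs found in html."""
    return ROW.findall(html)


for name, rating in extract_ratings(SAMPLE):
    print(name, rating)
```

On the sample this prints PlayerA 29.2 and PlayerB 32.2. The pattern is brittle: if a row had no numeric cell, the lazy skip would run on into the next row and pair the name with the wrong rating, which is one reason a parser is the safer choice.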

edited Jul 4, 2015 at 21:50

answered Jul 4, 2015 at 18:48


6 Comments

user3496483 Over a year ago

The html I'm using is in the code section of my question. it's only html saved in a text file.

user3496483 Over a year ago

AttributeError: 'NoneType' object has no attribute 'strip'

Incognos Over a year ago

added a check for a None string (you should have mentioned that it was a possibility)

user3496483 Over a year ago

That is so awesome, I have all the names now I just need the corresponding stat beside each name ie playernamea 28.9 playernameb 32.4

user3496483 Over a year ago

correct output using above html should be: PlayerA 29.2 PlayerB 32.2


Anzel · Accepted Answer · 2015-07-04 20:23:25Z

0

Alternatively, I would suggest using a proper html parser instead of relying on regex -- although BeautifulSoup is actually a very good and easy-to-use library.

Is your sample missing the closing </tr> tags after the <td> cells?

Edit: using OP sample as source

Anyhow, using lxml.html with a simple XPath should hopefully get what you expect:

In [1]: import lxml.html

# sample.html is the same as in OP sample
In [2]: tree = lxml.html.parse("sample.html")

In [3]: root = tree.getroot()

In [4]: players = root.xpath('.//td[@class="stats"]/a/text()')

In [5]: stats = root.xpath('//td[@class="stats" and normalize-space(text())]/text()')

In [6]: print players, stats
['\nPlayerA\n', '\nPlayerB\n'] ['\n29.2\n', '\n32.2\n']

In [7]: for player, stat in zip(players, stats):
   ...:     print player.strip(), stat.strip()
   ...:
PlayerA 29.2
PlayerB 32.2
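Note that the two XPath queries above build independent lists, so zip pairs names and ratings purely by position. A hedged sketch of pairing them per row instead follows. It swaps in the standard library's html.parser rather than lxml, so it is not the answer's method. StatsRowParser and player_ratings are hypothetical names, SAMPLE is a condensed copy of the question's markup, and the code targets Python 3.

```python
from html.parser import HTMLParser

# Condensed copy of the question's sample; the <tr> tags are left unclosed.
SAMPLE = """
<tr class="stats">
<td class="stats"><a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerA
</a></td>
<td class="stats"><b> 4 </b>, <b> 8 </b></td>
<td class="stats">
29.2
</td>
<tr class="stats">
<td class="stats"><a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerB
</a></td>
<td class="stats"><b> 4 </b>, <b> 8 </b></td>
<td class="stats">
32.2
</td>
"""


class StatsRowParser(HTMLParser):
    """Collect the text of each <td>, grouped by <tr class="stats"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell texts per row
        self.in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and ("class", "stats") in attrs:
            # A new row starts here even if the previous <tr> was never closed.
            self.rows.append([])
            self.in_cell = False
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def player_ratings(html):
    """Return (name, rating) pairs; rows with fewer than three cells are skipped."""
    parser = StatsRowParser()
    parser.feed(html)
    parser.close()
    # Assumption: the name is in the first cell and the rating in the third.
    return [(cells[0].strip(), cells[2].strip())
            for cells in parser.rows if len(cells) >= 3]


for name, rating in player_ratings(SAMPLE):
    print(name, rating)
```

Because each pair comes from a single row, a row with a missing rating is dropped instead of shifting every later rating onto the wrong player.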

edited Jul 4, 2015 at 20:23

answered Jul 4, 2015 at 19:42


6 Comments

user3496483 Over a year ago

Problem is the html just looks like what I have above, its all in a .txt file

Anzel Over a year ago

@user3496483, the lxml.html parser is more tolerant of broken HTML, so in your use case it is still possible to parse the file and grab the results as expected; you just need to strip() the text afterward.

user3496483 Over a year ago

Traceback (most recent call last):
  File "C:\Python26\hotcoldparser.py", line 28, in <module>
    lxml.html.parse(mike_file)
  File "C:\Python26\Lib\site-packages\lxml\html\__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)

Anzel Over a year ago

@user3496483, I'm not familiar with the Windows environment, but perhaps you can try import lxml.html as html, then change the following lines to use html.parse(...)

user3496483 Over a year ago

only problem is this gives me the first int found, not the correct location.
