The correct output for the txt file below should be: PlayerA 29.2 PlayerB 32.2

I have a txt file filled with HTML that looks like the sample below, and I'm trying to use a Python 2.6 regular expression to collect all the player names and ratings.

The first player name appears on line 4, and its rating (29.2) appears on line 16.

The next player name appears on line 22, its rating on line 35, and so on.

fileout = open('C:\Python26\hotcold.txt')
read_file = fileout.readlines()
source = str(read_file)

expression = re.findall(r"(LS=113>.+?", source)
print expression

I was trying to write an expression that would find all the names and ratings, but it isn't working.

<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
 4
</b>
,
<b>
 8
 </b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>

<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
 4
</b>
,
<b>
 8
 </b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>
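
For reference, the attempt above fails for reasons that can be checked against the sample: the pattern `r"(LS=113>.+?"` opens a group it never closes, the sample has a `"` between `113` and `>`, and `str(readlines())` produces the repr of a list rather than the file text. A minimal regex-only sketch follows. It assumes every player link is followed by a rating cell that contains nothing but a decimal number; `SAMPLE`, `PATTERN` and `extract_ratings` are illustrative names, not part of the original code.

```python
import re

# Shortened copy of the sample from the question. In practice the text
# would come from f.read(), not str(f.readlines()).
SAMPLE = """
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
 4
</b>
,
<b>
 8
 </b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>

<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
 4
</b>
,
<b>
 8
 </b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>
"""

PATTERN = re.compile(
    r'LS=\d+">\s*(?P<name>[^<]+?)\s*</a>'         # link text is the player name
    r'.*?'                                         # skip the cells in between
    r'<td[^>]*>\s*(?P<rating>\d+\.\d+)\s*</td>',   # first cell holding only a decimal
    re.DOTALL,                                     # let "." cross line breaks
)


def extract_ratings(html):
    """Return a list of (name, rating) string pairs found in html."""
    return [(m.group('name'), m.group('rating')) for m in PATTERN.finditer(html)]


for name, rating in extract_ratings(SAMPLE):
    print('%s %s' % (name, rating))
```

If a row had no rating cell, the lazy `.*?` would run on into the next row and pair the name with the wrong number, which is one reason the answers below suggest an HTML parser instead.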
  • Consider using BeautifulSoup? Commented Jul 4, 2015 at 18:34
  • Yeah, I used BS to get the HTML, but I don't know how to pick just those specific parts of the text file. Commented Jul 4, 2015 at 18:37
  • How did you use BeautifulSoup to get the HTML? Commented Jul 4, 2015 at 19:43
  • Sorry, I used soup to prettify and find_all on tr, class, a. Commented Jul 4, 2015 at 21:34
  • correct output using above html should be: PlayerA 29.2 PlayerB 32.2 Commented Jul 4, 2015 at 21:34

2 Answers

I would recommend using Beautiful Soup to parse the HTML and get the values you are after.

Use the following code:

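from bs4 import BeautifulSoup

with open('sample.html', 'r') as html_doc:
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Each player starts at a <tr class="stats"> tag.
    for row in soup.find_all('tr', 'stats'):
        # find_all_next collects every <td> after this row's opening tag;
        # cell 0 holds the name link and cell 2 holds the rating.
        row_tds = row.find_all_next('td')
        name = row_tds[0].find('a').string
        rating = row_tds[2].string
        # .string can be None, so guard before calling strip().
        print('{0} {1}'.format(
            name.strip() if name else 'None',
            rating.strip() if rating else 'None'))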

output:

$ python testparse.py
PlayerA 29.2
PlayerB 32.2

Works.


6 Comments

The HTML I'm using is in the code section of my question. It's only HTML saved in a text file.
AttributeError: 'NoneType' object has no attribute 'strip'
Added a check for a None string (you should have mentioned that it was a possibility).
That is so awesome, I have all the names now. I just need the corresponding stat beside each name, i.e. playernamea 28.9 playernameb 32.4
correct output using above html should be: PlayerA 29.2 PlayerB 32.2

Alternatively, I would suggest using a proper HTML parser instead of relying on regex. BeautifulSoup is a very good and easy-to-use library, and lxml is another option.

Is your sample missing the closing </tr> tags after each row's <td> cells?

Edit: using OP sample as source

In any case, here is lxml.html with a simple XPath, which hopefully gets what you expected:

In [1]: import lxml.html

# sample.html is the same as in OP sample
In [2]: tree = lxml.html.parse("sample.html")

In [3]: root = tree.getroot()

In [4]: players = root.xpath('.//td[@class="stats"]/a/text()')

In [5]: stats = root.xpath('//td[@class="stats" and normalize-space(text())]/text()')

In [6]: print players, stats
['\nPlayerA\n', '\nPlayerB\n'] ['\n29.2\n', '\n32.2\n']

In [7]: for player, stat in zip(players, stats):
   ...:     print player.strip(), stat.strip()
   ...:
PlayerA 29.2
PlayerB 32.2
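
If installing a third-party parser is not an option, the same idea can be sketched with only the standard library, pairing each name with its rating by cell position within a row instead of zipping two separate lists. This is a sketch under assumptions: the name is always in the first cell and the rating in the third, as in the sample, and `StatsRowParser` and `SAMPLE` are illustrative names. The import shown is for Python 3; on Python 2 the class lives in the `HTMLParser` module.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# Shortened copy of the sample; the <tr> tags are left unclosed on purpose.
SAMPLE = """
<tr class="stats">
<td class="stats"><a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerA
</a></td>
<td class="stats"><b> 4 </b>,<b> 8 </b></td>
<td class="stats">
29.2
</td>
<tr class="stats">
<td class="stats"><a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerB
</a></td>
<td class="stats"><b> 4 </b>,<b> 8 </b></td>
<td class="stats">
32.2
</td>
"""


class StatsRowParser(HTMLParser):
    """Collect (name, rating) from each <tr class="stats"> by cell position."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []      # finished (name, rating) pairs
        self.cells = None   # text of each <td> in the current row
        self.in_td = False  # True while inside a <td>

    def _flush(self):
        # Keep the current row only if it has the three expected cells.
        if self.cells is not None and len(self.cells) >= 3:
            self.rows.append((self.cells[0].strip(), self.cells[2].strip()))
        self.cells = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr' and ('class', 'stats') in attrs:
            self._flush()   # a new row starts, so the previous one is done
            self.cells = []
        elif tag == 'td' and self.cells is not None:
            self.cells.append('')
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> or <b> is added to the cell too.
        if self.in_td and self.cells:
            self.cells[-1] += data

    def close(self):
        HTMLParser.close(self)
        self._flush()       # store the last row, which has no following <tr>


parser = StatsRowParser()
parser.feed(SAMPLE)
parser.close()
for name, rating in parser.rows:
    print('%s %s' % (name, rating))
```

Because each rating is read from the same row as its name, a row with a missing or empty cell cannot shift the pairing for the rows after it.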

6 Comments

The problem is that the HTML just looks like what I have above; it's all in a .txt file.
@user3496483, the lxml.html parser is more tolerant of broken HTML, so in your use case it is still possible to parse the file and grab the results as expected. You just need to strip() the text afterward.
Traceback (most recent call last): File "C:\Python26\hotcoldparser.py", line 28, in <module> lxml.html.parse(mike_file) File "C:\Python26\Lib\site-packages\lxml\html_init_.py", line 692, in parse return etree.parse(filename_or_url, parser, base_url=base_url, **kw) File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187) File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485) File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
@user3496483, I'm not familiar with the Windows environment, but perhaps you can try import lxml.html as html, then change the following lines to use html.parse(...)
The only problem is that this gives me the first int found, not the one in the correct location.