I am trying to extract an integer from data gathered by another BeautifulSoup script I wrote. The data I get is always one of the following three:

<div id="counts"> 500 hits </div>
<div id="counts">3 hits </div>
<div id="counts"> hits </div>

The number of hits varies; sometimes it is attached to the ">" and sometimes not, and other times the integer isn't there at all. So I wrote this script to return ONLY the number from the data (or tell me there is no number). It seems clunky and slow, and I feel like there should be a faster way to do it. (In this code example, I included 'search' as one of the 3 possible outcomes of the bs scrape.)

keywords = ['hits']
results = []
search = '<div id="hits"> 3 hits </div>'

num_check = False
store_next = False
words = search.split()

def is_number(results, num_check):
    while num_check <= 0:
        try:
            float(results[0])
            num_check = True
        except ValueError:
            results[0] = ''.join(filter(lambda x: x.isdigit(), results[0]))
            if results[0] == '':
                num_check = 2
    if num_check <= 1:
        print(results[0])

for word in reversed(words):
    if store_next:
        results.append(word)
        store_next = False
    elif word in keywords:
        store_next = True

is_number(results, num_check)

EDIT: sometimes (rarely) the <div></div> contains more info, such as a ping speed (0.22 seconds), which is why I can't search the entire clause for integers.

  • Not really an answer, but fyi, ''.join(filter(lambda x: x.isdigit(), results[0])) can be rewritten to simply filter(str.isdigit, results[0]). Commented Jan 31, 2014 at 20:32
  • It seems better to have your other script generate just the text of each tag instead of the repr of the whole Tag, no? Commented Jan 31, 2014 at 20:39
  • That doesn't seem to work. I get a TypeError: float() argument must be a string or a number on line 12 after it filters. If I try print(filter(str.isdigit, '<div id="hits">3')) I get <filter object at 0x00000000032BA160> printed. Commented Jan 31, 2014 at 20:44
  • ideone.com/v35d5Q will show you how to get a string back from filter in Python 3. In Python 2 it just stays a string. Commented Jan 31, 2014 at 20:48
  • @Gronk, sorry for misinforming you. I'm on Python 2 here. The lambda was unnecessary, but the join apparently was not. Commented Jan 31, 2014 at 20:54

1 Answer

ummm maybe

import re
search = '<div id="hits"> 3 hits </div>'
re.findall(r"\d+",search)

or for floats

re.findall(r"\d+\.?\d*",search)

if you know there's not going to be more than one number at a time you could do

re.search(r"(\d+)",search).group(0)

here is some timing info

>>> timeit.timeit("re.search(\"(\d+)\",'<div id=\"hits\"> 3 hits </div>').group(0)","import re",number=1000)
0.0031895773144583472
>>> timeit.timeit("filter(str.isdigit, '<div id=\"hits\"> 3 hits </div>')",number=1000)
0.0049939576031476918
>>>
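
One caveat with the re.search call above: it returns None when the text contains no digits (the third case in the question), so calling .group() on the result directly would raise an AttributeError. A minimal sketch of a guard, where first_int is a hypothetical helper name and not something from the original post:

```python
import re

def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    # re.search returns None when nothing matches, so check before calling .group()
    return int(match.group(0)) if match else None

print(first_int('<div id="counts"> 500 hits </div>'))
print(first_int('<div id="counts"> hits </div>'))
```

Returning None rather than printing keeps the "no number" case easy for the caller to test.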
4 Comments

Sorry, I should have added in my post that occasionally there is ping info like '0.22 seconds' contained in the <div></div>. I have edited the post to add that info. I had not thought of this, though, and it would work most of the time.
How about re.search("(\d+) hits",search_text)? That should only match the pattern shown.
Do you mean re.search("(\d+) hits",search)? If search = '<div id="hits">3,153 hits 476.12 seconds </div>', the re.search call returns <_sre.SRE_Match object at 0x00000000032FA918>
Sorry, more info: if I do s = re.search("(\d+) hits",search) and then return s.group(), it only gives 153 hits, which means it seems to be stopping at the comma.
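
The comma problem in the last comment comes from \d+ matching digits only, so the match starts after the comma. One possible way around it, as a sketch rather than a tested solution for every page: allow commas inside the number, anchor the match to the word "hits" so the ping time is ignored, and strip the commas before converting. HITS_RE and hit_count are hypothetical names, and the sketch assumes a comma is the only thousands separator that appears.

```python
import re

# A digit followed by any mix of digits and commas, then optional
# whitespace and the word "hits". Assumes "," is the only separator used.
HITS_RE = re.compile(r"(\d[\d,]*)\s*hits")

def hit_count(text):
    """Return the hit count in text as an int, or None if no count is present."""
    match = HITS_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting, e.g. "3,153" -> 3153
    return int(match.group(1).replace(",", ""))

samples = [
    '<div id="hits">3,153 hits 476.12 seconds </div>',
    '<div id="counts"> 500 hits </div>',
    '<div id="counts"> hits </div>',
]
for sample in samples:
    print(hit_count(sample))
```

Using group(1) returns only the captured number, whereas group() with no argument returns the whole match including " hits".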
