
I am currently trying to scrape my internet provider's data usage. I looked for an API of some sort, but they don't have one, so I am resorting to scraping the HTML, which looks like this:

</tr><tr class="top-border"><td>17&nbsp;&nbsp;Monday</td><td class='text-right'><span class='mb'>2,991.69&nbsp;MB</span><span class='gb'>2.92&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>1,232.04&nbsp;MB</span><span class='gb'>1.20&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>4,223.73&nbsp;MB</span><span class='gb'>4.12&nbsp;GB</span></td>         <td>
            <div class="progress"><div class="bar bar-success" style="width: 51%;"></div></div>         </td>

        </tr><tr><td>18&nbsp;&nbsp;Tuesday</td><td class='text-right'><span class='mb'>3,589.42&nbsp;MB</span><span class='gb'>3.51&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>1,199.58&nbsp;MB</span><span class='gb'>1.17&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>4,789.00&nbsp;MB</span><span class='gb'>4.68&nbsp;GB</span></td>           <td>
            <div class="progress"><div class="bar bar-success" style="width: 57%;"></div></div>         </td>

etc.

I tried to use Python's re.search, but I can only get a bit of info out of it, e.g.:

search = re.search("class='gb'>(.*)&nbsp;GB</span>",raw_info)
for i in range(0,100):
    try:
        print(search.group(i))
    except:
        break

output:

class='gb'>6.88&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>
1,295.90&nbsp;MB</span><span class='gb'>1.27&nbsp;GB</span></td></td><td class='
text-right'><span class='mb'>8,340.12&nbsp;MB</span><span class='gb'>8.14&nbsp;G
B</span>
6.88&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>1,295.90&nb
sp;MB</span><span class='gb'>1.27&nbsp;GB</span></td></td><td class='text-right'
><span class='mb'>8,340.12&nbsp;MB</span><span class='gb'>8.14

I learned I can't use groups like that to print out all of the numbers.

TL;DR: I need to extract all the numbers given in GB and print them like this:

2.92,1.20,4.12

3.51,1.17,4.68

  • Word of advice: never use regex on HTML. See this answer. Commented Oct 28, 2016 at 20:46

1 Answer


You might want to try BeautifulSoup, a very flexible library that can do exactly what you are looking for.

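from bs4 import BeautifulSoup

html = scraped  # the raw HTML string you downloaded
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
spans = soup.find_all('span', attrs={'class': 'gb'})  # every <span class='gb'>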

You will then have a list of all the span tags that have the gb class. Extracting the numbers, converting them to floats if you need to do arithmetic, and printing them in whatever format you want is then fairly simple.
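
To get one comma-separated line per day, as asked in the question, one option is to walk the table row by row instead of collecting every span at once. The sketch below assumes each day is a single tr element, as in the snippet from the question. SAMPLE and gb_rows are made-up names for illustration, and SAMPLE is a trimmed copy of the question's HTML rather than the real page. The values are kept as strings so that 1.20 is not printed as 1.2.

```python
from bs4 import BeautifulSoup

# Trimmed sample in the same shape as the HTML from the question (illustrative only).
SAMPLE = """
<table>
<tr class="top-border"><td>17&nbsp;&nbsp;Monday</td>
<td class='text-right'><span class='mb'>2,991.69&nbsp;MB</span><span class='gb'>2.92&nbsp;GB</span></td>
<td class='text-right'><span class='mb'>1,232.04&nbsp;MB</span><span class='gb'>1.20&nbsp;GB</span></td>
<td class='text-right'><span class='mb'>4,223.73&nbsp;MB</span><span class='gb'>4.12&nbsp;GB</span></td></tr>
<tr><td>18&nbsp;&nbsp;Tuesday</td>
<td class='text-right'><span class='mb'>3,589.42&nbsp;MB</span><span class='gb'>3.51&nbsp;GB</span></td>
<td class='text-right'><span class='mb'>1,199.58&nbsp;MB</span><span class='gb'>1.17&nbsp;GB</span></td>
<td class='text-right'><span class='mb'>4,789.00&nbsp;MB</span><span class='gb'>4.68&nbsp;GB</span></td></tr>
</table>
"""


def gb_rows(html):
    """Return one list of GB values (as strings) for each table row that has any."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        values = []
        for span in tr.find_all("span", class_="gb"):
            # &nbsp; is decoded to "\xa0", so the text looks like "2.92\xa0GB";
            # normalise it to a plain space and keep only the number part.
            text = span.get_text().replace("\xa0", " ")
            values.append(text.split()[0])
        if values:  # skip header or spacer rows without any gb spans
            rows.append(values)
    return rows


for values in gb_rows(SAMPLE):
    print(",".join(values))
# prints:
# 2.92,1.20,4.12
# 3.51,1.17,4.68
```

If you need the numbers for arithmetic rather than display, float(value) works on these strings as long as they contain no thousands separators.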
