I'd like to create a script that grabs two values from this awful HTML published on a city website:

558.35

and

66.0

These are water reservoir details and change weekly.

I'm unsure what the best tool for this is. Would grep work?

Thanks for your suggestions and ideas!

<table>
    <tbody>
        <tr>
            <td>&nbsp;Currently:</td>
            <td>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 558.35</td>
        </tr>
        <tr>
            <td>&nbsp;Percent of capacity:</td>
            <td>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;66.0%</td>
        </tr>
    </tbody>
</table>
  • If you use PHP then you could use DOMDocument. Commented Dec 23, 2015 at 3:38
  • Is this a skill you hope to improve on? Then learn python-scrapy, BeautifulSoup and others. Python has a healthy ecosystem for web scraping, but as HTML gets more baroque, you'll have to keep that skill up to date for it to be meaningful. If you're just looking to grab those 2 values and won't be doing anything else for years, then post a bounty for an xmllint or xmlstarlet solution. If it's really this simple, you might also find an awk solution, but once the data proves more complex than what you've indicated here, all bets are off ;-) Good luck. Commented Dec 23, 2015 at 3:43
  • Thanks, these are solutions I'll explore! Commented Dec 23, 2015 at 3:49
  • Regular expressions are the worst tool for parsing or scraping HTML. You may be interested in this link. Commented Dec 23, 2015 at 3:56

1 Answer

If you want to use a regex, you can use sed:

sed -nr 's#^[ ]*<td>.*;[ ]?([0-9]+[.][0-9]+)[%]?</td>[ ]*$#\1#p' my_html_file
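
To make it easier to see what that pattern captures, here is a rough Python equivalent of the same regex. It is only a sketch: the names CELL_PATTERN and grab_values are made up for illustration, and the sample lines are shortened copies of the cells from the question. Like the sed version, it assumes each <td> sits on its own line, which is exactly the kind of assumption that breaks when the page layout changes.

```python
import re

# Same idea as the sed pattern: a line holding one <td> whose content ends in
# a decimal number, optionally followed by a percent sign.
CELL_PATTERN = re.compile(r"^ *<td>.*; ?([0-9]+\.[0-9]+)%?</td> *$")


def grab_values(lines):
    """Return the captured number (as a string) from every matching line."""
    values = []
    for line in lines:
        match = CELL_PATTERN.match(line.rstrip("\n"))
        if match:
            values.append(match.group(1))
    return values


# Shortened copies of the table cells from the question (fewer &nbsp; runs).
sample = [
    "            <td>&nbsp;Currently:</td>",
    "            <td>&nbsp; &nbsp; 558.35</td>",
    "            <td>&nbsp;Percent of capacity:</td>",
    "            <td>&nbsp; &nbsp; &nbsp;66.0%</td>",
]
print(grab_values(sample))  # ['558.35', '66.0']
```

The label cells are skipped because nothing numeric follows their last semicolon.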

An HTML parser, such as Python's BeautifulSoup module, or a JavaScript approach is a safer choice.
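
As a rough illustration of the parser route, the sketch below uses html.parser from Python's standard library rather than BeautifulSoup, so it runs without installing anything. The names CellCollector, extract_numbers, and SAMPLE_HTML are hypothetical, and the sample markup is a shortened copy of the table in the question. A real script would read the downloaded page instead of a hard-coded string.

```python
import re
from html.parser import HTMLParser

# Shortened copy of the table from the question (fewer &nbsp; runs).
SAMPLE_HTML = """
<table>
    <tbody>
        <tr>
            <td>&nbsp;Currently:</td>
            <td>&nbsp; &nbsp; 558.35</td>
        </tr>
        <tr>
            <td>&nbsp;Percent of capacity:</td>
            <td>&nbsp; &nbsp; &nbsp;66.0%</td>
        </tr>
    </tbody>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text content of every <td> element."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; for us.
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_td:
            self.cells[-1] += data


def extract_numbers(html):
    """Return the decimal numbers found in table cells, in document order."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    numbers = []
    for cell in parser.cells:
        match = re.search(r"\d+\.\d+", cell)
        if match:
            numbers.append(float(match.group()))
    return numbers


print(extract_numbers(SAMPLE_HTML))  # [558.35, 66.0]
```

Because the parser works on elements rather than lines, this keeps working if the page puts several cells on one line or changes its indentation.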

EDIT:

Here is a snippet using JavaScript. The result is logged to the console, and an alert box pops up to show it.

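// Look the cells up once instead of on every loop iteration.
var cells = document.getElementsByTagName('td');
var values = "";
// Start at index 1 to skip the "Currently:" label; the other label is stripped by the regex.
for (var i = 1; i < cells.length; ++i) {
    values += " " + cells[i].innerHTML.replace(/&nbsp;|Percent of capacity:|[ %]/g, "");
}
alert(values);
console.log(values);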
<table>
    <tbody>
        <tr>
            <td>&nbsp;Currently:</td>
            <td>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 558.35</td>
        </tr>
        <tr>
            <td>&nbsp;Percent of capacity:</td>
            <td>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;66.0%</td>
        </tr>
    </tbody>
</table>
