Bash Script Parse HTML file

Question

I'm using a shell script to get the tracking information for a FedEx package. When I run the script, I pass in the tracking number (a dummy number I found on the internet) and fetch the page with curl:

#$1=797843158299
curl -A Mozilla/5.0 -b cookies -s "https://www.fedex.com/fedextrack/WTRK/index.html?action=track&action=track&action=track&tracknumbers=$1=1490" > log.txt

The curl command outputs the page's HTML, and the information I need is inside this element:

<!--TRACKING CONTENT MAIN-->
<div id="container" class="tracking_main_container"></div>

The delivery information I need to parse out is inside that div.
I am fairly new to scripting and have tried some "| sed" suggestions I found online, but couldn't get anything to work.

I can see the html output of curl. What exactly should be the output/result of your script?
– michas, Commented Dec 30, 2014 at 17:35
Probably the most robust approach is to use php's DOM parser. Though page scraping is always flaky.
– arkascha, Commented Dec 30, 2014 at 17:36
Sorry? The tracking_main_container div is empty. Parsing its contents would give you an empty string. When the page is run in a browser, it's JavaScript that populates that div, and you're absolutely not going to be able to execute javascript from native bash without third-party tools.
– Charles Duffy, Commented Dec 30, 2014 at 17:41
...now, if you want some suggestions re: such 3rd-party tools, I'd suggest using PhantomJS -- which will mean doing your scripting in JavaScript rather than bash.
– Charles Duffy, Commented Dec 30, 2014 at 17:43

Gilles Quénot · Accepted Answer · 2014-12-30 20:16:52Z


This is not possible with curl or wget alone, because the final page is rendered by JavaScript. It is possible with other tools that can execute JavaScript, such as spynner (Python) or PhantomJS.
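To see why the static download is not enough, here is a minimal sketch (not FedEx-specific code) that feeds the markup quoted in the question to Python's standard html.parser and collects whatever text sits inside the container div. The names ContainerText and container_text are made up for illustration, and Python 3 is assumed.

```python
from html.parser import HTMLParser

# Static markup as quoted in the question: what curl receives
# before any JavaScript has run.
STATIC_HTML = """
<!--TRACKING CONTENT MAIN-->
<div id="container" class="tracking_main_container"></div>
"""


class ContainerText(HTMLParser):
    """Collect the text found inside <div id="container"> (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 = outside the target div, 1+ = div nesting inside it
        self.chunks = []  # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside the container
        elif dict(attrs).get("id") == "container":
            self.depth = 1   # entered the container div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def container_text(html):
    """Return the stripped text content of the container div."""
    parser = ContainerText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


print(repr(container_text(STATIC_HTML)))  # prints ''
```

The div is present but empty, so no amount of sed or HTML parsing on the curl output can recover the delivery information.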

The following is a full working spynner example that checks whether the status is delivered:

#!/usr/bin/python

import spynner
from lxml import etree

useragent = "Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"

# Load the page in spynner's browser so that the page's JavaScript runs
browser = spynner.Browser(user_agent=useragent)
browser.create_webview(False)
browser.load("https://www.fedex.com/fedextrack/WTRK/index.html?action=track&action=track&action=track&tracknumbers=797843158299")
browser.wait_load()

# Parse the rendered HTML, not the static source that curl would return
tree = etree.HTML(browser.html)

try:
    # xpath() returns an empty list when nothing matches, so [0] raises IndexError
    print(tree.xpath('//div[@class="statusChevron_key_status bogus"]')[0].text)
except IndexError:
    print("Undelivered")

OUTPUT

Delivered
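
If you want to test the extraction step without launching a browser, a rough stdlib-only sketch might look like the following. It assumes you have already saved the rendered HTML to a string and that you are on Python 3. The sample markup, the StatusFinder class and the delivery_status function are hypothetical; only the class string comes from the XPath above. It mimics the script's behaviour of taking the leading text of the first matching div and falling back to "Undelivered".

```python
from html.parser import HTMLParser

# Class attribute value taken verbatim from the XPath in the script above.
STATUS_CLASS = "statusChevron_key_status bogus"


class StatusFinder(HTMLParser):
    """Capture the leading text of the first div whose class equals STATUS_CLASS."""

    def __init__(self):
        super().__init__()
        self.status = None      # stays None until a matching div is seen
        self.capturing = False  # True while reading the matching div's leading text

    def handle_starttag(self, tag, attrs):
        if self.capturing:
            # A child element starts: stop, similar to lxml's .text
            self.capturing = False
        elif (tag == "div" and self.status is None
                and dict(attrs).get("class") == STATUS_CLASS):
            self.capturing = True
            self.status = ""

    def handle_endtag(self, tag):
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.status += data


def delivery_status(rendered_html):
    """Return the status text, or "Undelivered" if no non-empty status div exists."""
    finder = StatusFinder()
    finder.feed(rendered_html)
    finder.close()
    # Assumption: an empty status div is treated the same as a missing one.
    return (finder.status or "").strip() or "Undelivered"


# Hypothetical rendered markup, for illustration only.
sample = '<div class="statusChevron_key_status bogus">Delivered</div>'
print(delivery_status(sample))
```

Keeping the lookup in a plain function like this makes it easy to re-check the class name against a saved copy of the page whenever the site's markup changes.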

edited Dec 30, 2014 at 20:16

answered Dec 30, 2014 at 17:43


3 Comments

Charles Duffy Over a year ago

If you're going to make an answer with the same information I already provided in a comment, you might as well try to make it a reasonably informative answer -- providing specific suggestions of non-bash tools that could provide JavaScript execution, f'rinstance. I've already suggested PhantomJS on that count.

Gilles Quénot Over a year ago

I just edited the answer to add spynner at the same time as your comment.

Charles Duffy Over a year ago

Bravo! Much more useful now. (I've been trying to do the same with Phantom, but was hitting a TypeError in FedEx's minified TrackingRootView JS.)
