
From what I gather, it is generally considered a bad idea to parse HTML in Bash. But a person never learns to ride a bike without also falling a few times in the process.

And so, using Bash, I'm trying to extract some data from an HTML webpage. The relevant pieces I am trying to obtain are data-nick="someguy99", which is a username, and then the message "Hello. This is the data I wish to obtain." displayed on the line directly underneath.

<body>
 <div id="main">
  <div class="content">
   <div class="block">
    <div class="section">
     <div class="chat-holder">             
      <div class="chat-box">  
       <div class="chat-list">
        <div id="0" class="text" style="color: rgb(73, 73, 73);">
         <span class="username messagelabel" data-nick="someguy99">someguy99:</span>  
         "Hello. This is the data I wish to obtain."

Using wget I have not been able to traverse past "chat-list". I have tried piping the output to other programs, e.g. wget -O - http://website.url | lynx -source -dump, but nothing is working; I always get the same output. For instance:

wget --quiet -F -O - http://website.url/example | \
lynx -dump -source -stdin | grep 'chat-list'

and the result...

        var img = $('.chat-list img[title="' + slug + '"]');

This is not the same as the output seen in the document tree when using a web browser. And replacing grep 'chat-list' with grep 'data-nick' returns no matches at all.

What am I doing wrong? How do I parse deeper to obtain the data I seek?

My brain feels a bit fried right now, so if I left out any relevant information just let me know and I'll provide more details.

  • Mac OS X 10.11.5
  • GNU bash 4.3.42

Thank you.

  • You say "always the same output" but you neglect to mention what that output might be. "It's not working" is never an adequate problem statement. Commented May 18, 2016 at 21:01
  • Although actually if that is the real data (what on earth is = $0 doing there?) then it's evident that html parsers will have trouble with the missing double quotes in the class attributes (section and text, in your extract). That would make it tricky to read the page with a browser, too. Can you get the owner of the page to fix their markup? Commented May 18, 2016 at 21:16
  • You're right and my apologies for that. I have edited the question to show the output I am getting. And that = $0 was not supposed to be there. My mistake. The missing quotes were just typing errors on my part. Commented May 19, 2016 at 2:33
  • Could you please clarify where the text in your question comes from? Is it the actual text returned by the webserver, or did you get it by inspecting the DOM from an actual web page? There is a big difference between using a modern browser's "inspect" function, and using "view source". Commented May 19, 2016 at 4:30
  • ... if you did want to grep the source of the web page, you wouldn't need lynx in there; you could just pipe the wget directly into grep. So my guess is that you're trying to get at a DOM which has been assembled by javascript running in your browser. That's quite a different problem from parsing. Commented May 19, 2016 at 4:33

2 Answers


Sadly, what you see in Safari's Web Inspector is not the text of the HTML page. It is the result of the browser interpreting the page, possibly including execution of embedded Javascript programs and data read from other pages. In addition, the Web Inspector shows you a fully nested tree structure, even though the original HTML may have been missing close tags and even some start tags: a classic example of this is that you will always see <tbody> elements inside <table> elements, even though the HTML page contains not a single element with the tbody tag.

So it is not really surprising that wget and wget | lynx -source show you the same data, and that piping that through grep does not find the line you see in the Web Inspector. That line simply does not exist in the source of the webpage; it is the result of Web Inspector interpreting the internal representation of an assembled page object.
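You can reproduce the effect locally without any network access. The file below is a stand-in I made up from the one line your grep did find: the markup ships an empty chat-list container plus the script that fills it in later, so grepping the raw bytes finds the script text but never the data the script produces.

```shell
# Stand-in for the raw page source (contents are illustrative, not the
# real site): the chat list is empty markup plus the script that fills it.
cat > /tmp/source.html <<'EOF'
<div class="chat-list"></div>
<script>
var img = $('.chat-list img[title="' + slug + '"]');
</script>
EOF

grep 'chat-list' /tmp/source.html      # matches only the literal bytes sent
grep 'data-nick' /tmp/source.html \
  || echo "no data-nick in the raw source"
```

The second grep fails for exactly the reason described above: data-nick only ever exists in the browser's assembled DOM, never in the source.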

As far as I know, none of the common text-mode browsers implement Javascript, although there is some experimental support. Furthermore (again, as far as I know for common text-mode browsers), there is no support for dumping the DOM (the "Document Object Model"; that is, the actual object tree shown by the Web Inspector). Text-mode browsers tend to give you the option of -dump to show the rendered output as text or -source to show the original HTML file.

In my opinion, the best way of handling client-generated pages -- that is, pages which are assembled during page loading by the local web browser -- is to use a headless browser such as PhantomJS (there are others listed in the Wikipedia article, but I only have experience with PhantomJS). Alternatively, you could try a browser automation tool such as Selenium which will let you script your browser. Or, on Mac OS X, you might be able to use Applescript to script the Safari browser. (I don't have a Mac any more, but the Safari Applescript dictionary shows that you can open a URL and do javascript to execute javascript within that page.)
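To make that concrete, here is a minimal PhantomJS sketch (PhantomJS must be installed separately; the URL is the placeholder from the question and the span[data-nick] selector is an assumption based on the fragment shown). The function passed to page.evaluate runs inside the loaded page, after its scripts have built the DOM, which is exactly what wget cannot see:

```shell
# Write a minimal PhantomJS script (hypothetical URL and selector).
cat > scrape.js <<'EOF'
var page = require('webpage').create();
page.open('http://website.url/example', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    // This callback runs inside the page, against the assembled DOM.
    var text = page.evaluate(function () {
        var span = document.querySelector('span[data-nick]');
        return span ? span.parentNode.textContent : null;
    });
    console.log(text);
    phantom.exit();
});
EOF

# phantomjs scrape.js   # prints the assembled chat text, if any
```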

Unfortunately, none of these techniques are well-documented (IMHO) and what documentation exists tends to focus on unit-testing web pages (which is a very important use case, but not necessarily related to data scraping). I found PhantomJS to be surprisingly annoying to get started with until I figured out that any syntax error in the javascript you try to execute inside the webpage causes PhantomJS to simply hang, without creating any error message. So it's vital that you use some other javascript interpreter such as Node to syntax check your scripts before trying them in PhantomJS.

Inside a javascript program running in a webpage, you can usually use JQuery to navigate, which makes finding content based on attribute values (as in your question) really easy. For cases in which the page does not already import JQuery, PhantomJS provides a mechanism which injects JQuery into the page for you, but I've never had to use that.

Good luck with your project.


5 Comments

I guess that explains why the tree in Chrome looks different from the one in Safari. Well I've got a lot of reading to do it seems. I actually have JQuery installed though I understand very little if any of it to be honest. I know nothing of Javascript at this point but I have a feeling that is about to change. In the last paragraph of your answer - are you saying that JQuery can be used to find values defined by the user even if those values are not specifically mentioned in the API? I don't even know if that question makes sense. Thank you for giving such a thorough and detailed answer.
@user556068: I fear a complete answer won't fit in an SO post, never mind a comment. Maybe this is the book project I've been looking for :) The best way to think of it is that Javascript programs run inside of web pages, since every web page is effectively a sandbox and has limited interaction with the outside world (aside from showing you the results). So it makes little sense to "have JQuery" on your local machine. If a webpage needs JQuery, it will get it from the web. In a script running in a webpage, JQuery makes it easy to search. For example...
... to find a <span> with an attribute data-nick whose value is someguy99, you can just use $("span[data-nick='someguy99']"). However, the tricky bit is getting the text following that element, because it is not wrapped in anything. You could get the text of the span's parent element by appending .parent().text(), but that would include the text inside the span itself. Maybe that's good enough :) Once you have the text, you need to print it out. You can't use console.log() because that won't work inside a webpage. PhantomJS has a console proxy you can use, though...
.... which involves installing a console log handler in the outer phantomjs script, so once you have that set up, you could just do console.log($("span[data-nick='someguy99']").parent().text()). Because of the way jquery works, that will apply to every matching span in the document.... I think I'll stop there. As I said, good luck.
Yes I think a book would be a good idea. I was assuming (wrongly it seems) that JQuery was the same as the jq program I download from homebrew not too long ago.. it appears most of my assumptions to this point have been wrong. Thank you for taking the time to talk to me. You have given me a great deal of invaluable information. If you really do write a book I'll be first in line when it comes out. Until then I think part of my days will be spent in the corner of the room curled up in the fetal position. Thank you again.

I took your HTML fragment and wrote it to a tmp file. I then constructed a regex based on your requirements using Rubular.com, then I ran grep -P over it and the result was close:

#> grep -Pzo 'data-nick="[^>](.+|\n)[^"|\n]+"' /tmp/test.html
data-nick="someguy99"

However, what you need is some way to cover multiple lines, and I thought the |\n would do that, but not quite - sorry! I'm using Ubuntu 14.04 and switched grep into PCRE (non-POSIX) mode, so you might want to specify your OS and Bash version, as I believe there are different versions of grep on different systems.
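If the fragment really did arrive as static text, one way around the multi-line problem is to avoid it entirely: match the nick up to the closing quote, and let sed print the line after the match. The test file below is just a reconstruction of the two relevant lines from the question:

```shell
# Reconstruct the relevant two lines of the fragment in a test file.
cat > /tmp/test.html <<'EOF'
<span class="username messagelabel" data-nick="someguy99">someguy99:</span>
"Hello. This is the data I wish to obtain."
EOF

# Stop the match at the closing quote instead of letting it span lines:
nick=$(grep -o 'data-nick="[^"]*"' /tmp/test.html | cut -d'"' -f2)

# The message is the bare line after the span; sed can print "the line
# following a match" without any multi-line regex:
msg=$(sed -n '/data-nick=/{n;p;}' /tmp/test.html)

echo "$nick"
echo "$msg"
```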

1 Comment

Thanks for this. I believe this would work if my data had been what I thought it was. Turns out it was not. But thank you for the effort.
