Here is a basic HTML table:

<table>
  <thead>
    <td class="foo">bar</td>
  </thead>
  <tbody>
    <td>rows</td>
    …
  </tbody>
</table>

Suppose there are several such tables in the source file. Is there an option of hxextract, a CSS3 selector I could use with hxselect, or some other tool that would let me extract one particular table, based either on the content of its thead or on a class, if one exists? Or am I stuck with not-so-simple awk scripting (or maybe Perl, as I found before submitting)?

Update: For content-based extraction, Perl's HTML::TableExtract does the trick:

#!/usr/bin/env perl

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use HTML::TableExtract;

# Extract tables based on header content; slice_columns => 0 is helpful if colspan causes issues
my $te = HTML::TableExtract->new( headers => ['Multi'], slice_columns => 0 );
$te->parse_file('mywebpage.html');

# Loop over all matching tables
foreach my $ts ($te->tables)
{
  # Print table identification (depth, count)
  print "Table (", join(',', $ts->coords), "):\n";

  # Print table content, one row per line; cells may be undefined, so print those as empty
  foreach my $row ($ts->rows)
  {
    print join(':', map { $_ // '' } @$row), "\n";
  }
}

However, in some cases a simple lynx -dump mywebpage.html coupled with awk or whatever can be just as efficient.

  • Did you try the parent selector? $('foo').parent().parent() // this will give you the table that has the foo class in the td. Commented Sep 22, 2014 at 8:45
  • I'm afraid it doesn't work with hxselect or hxextract. But anyway, the syntax you suggest wouldn't work, so are you thinking about another (command-line) tool? Commented Sep 22, 2014 at 8:56
  • He's thinking about jQuery, a JavaScript library. You'll have to forgive folks around here for mistakenly assuming any question involving HTML must somehow involve a Web browser and therefore JavaScript, and that jQuery must be in use - it seems to happen all the time for some reason... Commented Sep 22, 2014 at 11:18
  • Well, I guess technically I could use JS from CLI Commented Sep 22, 2014 at 15:54
  • Oh yeah... node.js. Why not? ;) Commented Sep 22, 2014 at 15:55

1 Answer

This would require a parent selector or a relational selector, which does not yet exist (and by the time it does, hxselect may not implement it, since it does not even fully implement the current standard as of this writing). hxextract appears to retrieve an element only by its type and/or class name, so the best it could do is td.foo, which would return only the td, not its thead or table.

If you are processing this HTML from the command line, you will need a script.
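
As one possible shape for such a script, here is a minimal sketch in Python using only the standard library's html.parser. It is not something hxselect or hxextract provides. The names TableGrabber and extract_tables are hypothetical, and the SAMPLE markup is made up to mirror the table in the question. It keeps every top-level table that contains a td/th with a given class, or whose thead contains a given piece of text. A match inside a nested table counts for the outer one, and HTML comments inside a table are dropped from the output.

```python
from html import escape
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the markup of each top-level <table> that matches a criterion."""

    def __init__(self, cell_class=None, header_text=None):
        super().__init__(convert_charrefs=True)
        self.cell_class = cell_class    # class to look for on a td/th
        self.header_text = header_text  # text to look for inside thead
        self.depth = 0                  # current <table> nesting depth
        self.in_thead = False
        self.buf = []                   # markup of the table being read
        self.matched = False
        self.tables = []                # markup of all matching tables

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.buf, self.matched = [], False
        if not self.depth:
            return
        self.buf.append(self.get_starttag_text())
        if tag == "thead":
            self.in_thead = True
        classes = (dict(attrs).get("class") or "").split()
        if self.cell_class and tag in ("td", "th") and self.cell_class in classes:
            self.matched = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: keep them verbatim, no end tag.
        if self.depth:
            self.buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        self.buf.append(f"</{tag}>")
        if tag == "thead":
            self.in_thead = False
        elif tag == "table":
            self.depth -= 1
            if self.depth == 0 and self.matched:
                self.tables.append("".join(self.buf))

    def handle_data(self, data):
        if not self.depth:
            return
        # Data arrives unescaped, so re-escape it before storing.
        self.buf.append(escape(data, quote=False))
        if self.in_thead and self.header_text and self.header_text in data:
            self.matched = True


def extract_tables(html, cell_class=None, header_text=None):
    """Return the markup of every top-level table matching either criterion."""
    parser = TableGrabber(cell_class, header_text)
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up input mirroring the question's table, plus a second one.
SAMPLE = """
<table><thead><td class="foo">bar</td></thead><tbody><td>rows</td></tbody></table>
<table><thead><td>baz</td></thead><tbody><td>other</td></tbody></table>
"""

if __name__ == "__main__":
    for table in extract_tables(SAMPLE, cell_class="foo"):
        print(table)
```

Calling extract_tables(SAMPLE, header_text="baz") instead selects on the thead content. The parser only tokenizes the input and does not repair it, so badly broken markup may need a more forgiving HTML library.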
