Parse HTML table without IDs or CSS selectors in Node.js

Question

This data comes from an old system and the output has to be taken as is; we cannot add IDs or CSS classes to it. Most of the online examples of HTML parsing in Node.js rely on tables, rows, and cells that carry an ID or CSS class, and so far I haven't found anything that helps with the page below. As far as I know, that includes the examples for JSDOM.

I would like to extract each row into a [fileName, link, size, dateTime] tuple, run queries over those tuples (for example, the latest timestamp in a group), and then pull out the matching filename and link. I was thinking of using YQL for the queries. The alternating row attributes (every other tr has a bgcolor) also make this a bit challenging. I'm new to Node.js, so some of the terminology might be wrong. Any help will be appreciated.

Thanks.

<html>
<body>
    <table width="100%" cellspacing="0" cellpadding="5" align="center">
        <tr> 
        <td align="left"><font size="+1"><strong>Filename</strong></font></td>
        <td align="center"><font size="+1"><strong>Size</strong></font></td>
        <td align="right"><font size="+1"><strong>Last Modified</strong></font></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file1.csv</tt></a></td>
        <td align="right"><tt>86.6 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file2.csv</tt></a></td>
        <td align="right"><tt>20.7 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file1.xml</tt></a></td>
        <td align="right"><tt>266.5 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file2.xml</tt></a></td>
        <td align="right"><tt>27.2 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
    </table>
</body>
</html>

Answer (thanks @Enragedmrt):

    res.on('data', function(chunk) {

        // Parse the received HTML (assumes the whole page arrives in this one chunk)
        var $ = cheerio.load(chunk.toString());
        var rows = [];
        $('tr').each(function(i, tr){

            var children = $(this).children();
            var fileItem = children.eq(0);                   // first td: file name
            var linkItem = children.eq(0).children().eq(0);  // the a element inside it
            var lastModifiedItem = children.eq(2);           // third td: timestamp

            var row = {
                "Filename": fileItem.text().trim(),
                "Link": linkItem.attr("href"),  // undefined for the header row
                "LastModified": lastModifiedItem.text().trim()
            };
            rows.push(row);
            console.log(row);
        });
    });
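
With rows in that shape, the "latest timestamp in the group" query from the question can be done in plain JavaScript rather than YQL. This is only a minimal sketch: it assumes a group means a file extension and that the `LastModified` strings are accepted by `Date.parse`. The helper name `latestByExtension` and the sample rows (with one timestamp altered for illustration) are not from the original.

```javascript
// Hypothetical helper: pick the most recently modified row per file extension.
function latestByExtension(rows) {
    var latest = {};
    rows.forEach(function (row) {
        var time = Date.parse(row.LastModified);
        // Rows without a link or a parsable date (such as the header row) are skipped.
        if (!row.Link || isNaN(time)) { return; }
        var dot = row.Filename.lastIndexOf('.');
        var ext = dot === -1 ? '' : row.Filename.slice(dot + 1).toLowerCase();
        if (!latest[ext] || time > Date.parse(latest[ext].LastModified)) {
            latest[ext] = row;
        }
    });
    return latest;
}

// Sample rows shaped like the objects built above (illustrative data).
var sampleRows = [
    { Filename: 'Filename', Link: undefined, LastModified: 'Last Modified' },
    { Filename: 'file1.csv', Link: '/path_to_file.csv', LastModified: 'Fri, 21 Mar 2014 21:00:19 GMT' },
    { Filename: 'file2.csv', Link: '/path_to_file.csv', LastModified: 'Sat, 22 Mar 2014 08:15:00 GMT' },
    { Filename: 'file1.xml', Link: '/path_to_file.xml', LastModified: 'Fri, 21 Mar 2014 21:00:19 GMT' }
];

var newest = latestByExtension(sampleRows);
console.log(newest.csv.Filename, newest.csv.Link);
```

Grouping into a plain object keeps the lookup simple; a different grouping key (for example a filename prefix) would only change how `ext` is computed.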

Grant Amos · Accepted Answer · 2014-03-21 21:41:02Z

8

I would suggest using Cheerio over JSDOM, as it's significantly faster and more lightweight. That said, you'll need to loop over the 'tr' elements with .each() and then read the 'td' elements inside each one. Here's a rough example (my Node.js/Cheerio is rusty, but if you dig around in the jQuery documentation you can find some decent examples):

var data = [];
$('tr').each(function(i, tr){
    var children = $(this).children();
    // Use .eq(n) here: children[n] is a raw node and has no .text() method
    var row = {
        "Filename": children.eq(0).text(),
        "Size": children.eq(1).text(),
        "Last Modified": children.eq(2).text()
    };
    data.push(row);
});
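
The cell text collected by a loop like this is still raw strings, while the question asks for [fileName, link, size, dateTime] tuples that can be queried. A rough sketch of the conversion follows. It assumes sizes always look like "86.6 kb" and that "kb" means 1024 bytes, neither of which the source confirms. `parseSize` and `toTuple` are hypothetical helper names.

```javascript
// Hypothetical helper: turn a size string such as "86.6 kb" into bytes.
// Assumes binary units (1 kb = 1024 bytes); returns NaN when the text does not match.
function parseSize(text) {
    var match = /^([\d.]+)\s*(b|kb|mb|gb)$/i.exec(text.trim());
    if (!match) { return NaN; }
    var factors = { b: 1, kb: 1024, mb: 1024 * 1024, gb: 1024 * 1024 * 1024 };
    return Math.round(parseFloat(match[1]) * factors[match[2].toLowerCase()]);
}

// Hypothetical helper: build a [fileName, link, size, dateTime] tuple from cell text.
function toTuple(fileName, link, sizeText, modifiedText) {
    return [fileName.trim(), link, parseSize(sizeText), new Date(modifiedText.trim())];
}

// The first cell in the sample page starts with two non-breaking spaces and a newline.
var tuple = toTuple('\u00a0\u00a0\n file1.csv', '/path_to_file.csv',
                    '86.6 kb', 'Fri, 21 Mar 2014 21:00:19 GMT');
console.log(tuple);
```

Storing the size as a number and the timestamp as a `Date` means the tuples can be sorted and compared directly.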

1 Comment

Shawn Over a year ago

This was perfect - only change was that I had to use the .eq(N) to get the Nth child. The [] notation doesn't seem to work with cheerio. And yep - indeed a lot faster than jsdom when parsing the real dataset. Thanks Enragedmrt!

GolezTrol · Accepted Answer · 2014-03-21 21:39:06Z

I don't know JSDom, but it sounds like it can parse an HTML document into a DOM (Document Object Model). From there it should be quite possible to loop through the nodes and recognise them by tag name, attributes or position in the document, even if they don't have ids.

Googling for 5 seconds, please hold on...

JSDom's documentation on GitHub seems to confirm this. It shows jQuery-like selectors, like window.$("a.the-link").text(). So instead of relying on a class, you can use selectors such as td, th, or probably even td[align="left"]. With selectors like that, and convenience methods like .first and .each to traverse multiple results (such as every row), you should be able to parse the document just fine, although it will of course be a bit more cumbersome than having a class name on every kind of cell.

I still don't think I'm a JSDom expert, but reading their project's main page for a couple of minutes already shows all the answers to your questions, and much more.

Joey · Accepted Answer · 2014-03-21 21:47:17Z

0

JSFiddle

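// Runs against a DOM "document" (for example in the browser)
var rawData = [];
var rows = document.getElementsByTagName('tr');
// Start at 1 to skip the header row
for (var cnt = 1; cnt < rows.length; cnt++) {
    // Every data value in the sample page is wrapped in a tt element
    var cells = rows[cnt].getElementsByTagName('tt');
    var row = [];
    for (var count = 0; count < cells.length; count++) {
        // textContent instead of innerText, which not every DOM implementation provides
        row.push(cells[count].textContent.trim());
    }
    rawData.push(row);
}

console.log(rawData);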

August Jaime · Accepted Answer · 2016-02-03 11:36:30Z

An additional way, using cheerio-tableparser:

var cheerio = require('cheerio'),
    cheerioTableparser = require('cheerio-tableparser');

res.on('data', function(chunk) {

    var $ = cheerio.load(chunk.toString());
    cheerioTableparser($);  // adds .parsetable() to cheerio selections
    var rows = [];
    // The result is indexed by column first, then row; cells are kept as HTML here
    var array = $("table").parsetable(false, false, false);
    array[0].forEach(function(d, i) {

        // Wrap the cell HTML so it can be queried with cheerio again
        var firstColumnHTMLCell = $("<div>" + array[0][i] + "</div>");
        var fileItem = firstColumnHTMLCell.text().trim();
        var linkItem = firstColumnHTMLCell.find("a").attr("href");
        var lastModifiedItem = $("<div>" + array[2][i] + "</div>").text();

        var row = {
            "Filename": fileItem,
            "Link": linkItem,
            "LastModified": lastModifiedItem
        };

        rows.push(row);
        console.log(row);
    });
});
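
One caveat with parsing inside `res.on('data', ...)`: Node delivers a response body in chunks, so a larger page may arrive split across several `data` events, and each chunk would then be parsed as an incomplete document. A minimal sketch of buffering until `end` first, using only Node's standard library, might look like this. `collectBody` is a hypothetical helper name, and the `EventEmitter` below merely stands in for a real HTTP response.

```javascript
var EventEmitter = require('events');

// Hypothetical helper: gather every chunk, then hand over the complete body once.
function collectBody(res, callback) {
    var chunks = [];
    res.on('data', function (chunk) { chunks.push(Buffer.from(chunk)); });
    res.on('end', function () {
        callback(null, Buffer.concat(chunks).toString('utf8'));
    });
    res.on('error', function (err) { callback(err); });
}

// Stand-in for an HTTP response whose body arrives in two pieces.
var fakeResponse = new EventEmitter();
var received = null;
collectBody(fakeResponse, function (err, html) {
    if (err) { throw err; }
    // The HTML is complete at this point and could be passed to cheerio.load(html).
    received = html;
    console.log(received);
});
fakeResponse.emit('data', Buffer.from('<table><tr><td>file1'));
fakeResponse.emit('data', Buffer.from('.csv</td></tr></table>'));
fakeResponse.emit('end');
```

With a real request, the same helper would be called with the response object passed to the `http.get` callback.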
