This data comes from an old system and the output is as is; we cannot add CSS classes or IDs. Most of the online examples of HTML parsing in Node.js involve tables, rows, or data with some ID or CSS class, and so far I haven't found anything that helps parse the page below. This includes the examples for JSDOM (AFAIK).

What I would like is to extract each row into a [fileName, link, size, dateTime] tuple, on which I can then run queries such as "what was the latest timestamp in the group?", and then extract the filename and link. I was thinking of using YQL. The alternating table row attributes also make it a bit challenging. I'm new to Node.js, so some of the terminology might be wrong. Any help will be appreciated.

Thanks.

<html>
<body>
    <table width="100%" cellspacing="0" cellpadding="5" align="center">
        <tr> 
        <td align="left"><font size="+1"><strong>Filename</strong></font></td>
        <td align="center"><font size="+1"><strong>Size</strong></font></td>
        <td align="right"><font size="+1"><strong>Last Modified</strong></font></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file1.csv</tt></a></td>
        <td align="right"><tt>86.6 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file2.csv</tt></a></td>
        <td align="right"><tt>20.7 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file1.xml</tt></a></td>
        <td align="right"><tt>266.5 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file2.xml</tt></a></td>
        <td align="right"><tt>27.2 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
    </table>
</body>
</html>

Answer (thanks @Enragedmrt):

    var cheerio = require('cheerio');

    // The body may arrive in several chunks, so buffer it and parse on 'end'.
    var body = '';
    res.on('data', function(chunk) {
        body += chunk;
    });

    res.on('end', function() {
        var $ = cheerio.load(body);
        var rows = [];
        $('tr').each(function(i, tr) {
            var children = $(this).children();
            var fileItem = children.eq(0);
            var linkItem = fileItem.children().eq(0);
            var lastModifiedItem = children.eq(2);

            // The header row has no <a>, so its "Link" is undefined.
            var row = {
                "Filename": fileItem.text().trim(),
                "Link": linkItem.attr("href"),
                "LastModified": lastModifiedItem.text().trim()
            };
            rows.push(row);
            console.log(row);
        });
    });
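
The "latest timestamp in the group" query from the question can be run on rows of this shape in plain JavaScript. The sketch below is only an illustration: `latestByExtension` is a hypothetical helper, and grouping by file extension is an assumption about what "group" means here. It relies on `Date.parse`, which accepts the `Fri, 21 Mar 2014 21:00:19 GMT` format used by the page.

```javascript
// Hypothetical helper (not part of cheerio): group rows by file extension
// and keep the row with the newest LastModified value in each group.
function latestByExtension(rows) {
    var latest = Object.create(null); // extension -> { time, row }
    rows.forEach(function (row) {
        var time = Date.parse(row.LastModified);
        // Skip the header row (no link) and rows whose date cannot be parsed.
        if (!row.Link || isNaN(time)) {
            return;
        }
        var dot = row.Filename.lastIndexOf('.');
        var ext = dot === -1 ? '' : row.Filename.slice(dot + 1).toLowerCase();
        // On equal timestamps the first row seen is kept.
        if (!latest[ext] || time > latest[ext].time) {
            latest[ext] = { time: time, row: row };
        }
    });

    var result = {};
    Object.keys(latest).forEach(function (ext) {
        result[ext] = latest[ext].row;
    });
    return result;
}

// Rows copied from the sample page, including the header row that the
// 'tr' loop also produces.
var sampleRows = [
    { Filename: 'Filename', Link: undefined, LastModified: 'Last Modified' },
    { Filename: 'file1.csv', Link: '/path_to_file.csv', LastModified: 'Fri, 21 Mar 2014 21:00:19 GMT' },
    { Filename: 'file1.xml', Link: '/path_to_file.xml', LastModified: 'Fri, 21 Mar 2014 21:00:19 GMT' }
];
console.log(latestByExtension(sampleRows));
```

The result is an object keyed by extension, so the filename and link of the newest CSV file would be `result.csv.Filename` and `result.csv.Link`.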

4 Answers

I would suggest using Cheerio over JSDOM, as it's significantly faster and more lightweight. That said, you'll need a for-each loop that grabs the 'tr' elements and then their 'td' children. Here's a rough example (my Node.js/Cheerio is rusty, but if you dig around in the jQuery documentation you can find some decent examples):

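// Assumes $ was created with cheerio.load(html).
var data = [];
$('tr').each(function(i, tr){
    var children = $(this).children();
    // Use .eq(N) rather than children[N] (see the comment below):
    // children[N] is a raw node without a .text() method.
    var row = {
        "Filename": children.eq(0).text().trim(),
        "Size": children.eq(1).text().trim(),
        "Last Modified": children.eq(2).text().trim()
    };
    data.push(row);
});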
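
The "Size" values come out as text such as `86.6 kb`, which does not sort or compare numerically. A minimal sketch of one way to convert them follows; `parseSize` and `UNIT_FACTORS` are hypothetical names, and the factor of 1024 per step is an assumption, since the old system may count 1000 bytes per kb instead.

```javascript
// Hypothetical unit table. Assumes binary multiples (1 kb = 1024 bytes);
// adjust if the source system uses decimal units.
var UNIT_FACTORS = {
    b: 1,
    kb: 1024,
    mb: 1024 * 1024,
    gb: 1024 * 1024 * 1024
};

// Convert a size string such as "86.6 kb" to a whole number of bytes.
// Returns NaN for anything that does not look like "<number> <unit>",
// which includes the header cell text "Size".
function parseSize(text) {
    var match = /^\s*([\d.]+)\s*([a-z]+)\s*$/i.exec(text);
    if (!match) {
        return NaN;
    }
    var value = parseFloat(match[1]);
    var unit = match[2].toLowerCase();
    if (isNaN(value) || !Object.prototype.hasOwnProperty.call(UNIT_FACTORS, unit)) {
        return NaN;
    }
    return Math.round(value * UNIT_FACTORS[unit]);
}

console.log(parseSize('86.6 kb'));
console.log(parseSize('Size'));
```

Rows whose size parses to `NaN` can then be filtered out before sorting, which also drops the header row.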

1 Comment

This was perfect. The only change was that I had to use .eq(N) to get the Nth child; the [] notation doesn't seem to work with cheerio. And yep, it is indeed a lot faster than jsdom when parsing the real dataset. Thanks Enragedmrt!

I don't know JSDom, but it sounds like it can parse an HTML document into a DOM (Document Object Model). From there it should be entirely possible to loop through the nodes and recognise them by tag name, attributes, or position in the document, even if they don't have IDs.

Googling for 5 seconds, please hold on...

JSDom's documentation on GitHub seems to confirm this. It shows jQuery-like selectors, such as window.$("a.the-link").text(). So instead of relying on a class, you can use selectors like td, th, or probably even td[align="left"]. Using selectors like these, and convenient methods like .first() and .each() to traverse multiple results (such as every row), you should be able to parse the document just fine, although it will of course be a bit more cumbersome than having convenient class names for every kind of cell.

I still don't think I'm a JSDom expert, but reading their project's main page for a couple of minutes already shows all the answers to your questions, and much more.


JSFiddle

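// Runs against a DOM document, for example in the browser.
var rawData = [];
var rows = document.getElementsByTagName('tr');
// Start at 1 to skip the header row.
for (var cnt = 1; cnt < rows.length; cnt++) {
    // Each value sits in a <tt>: filename, size, last modified.
    var cells = rows[cnt].getElementsByTagName('tt');
    var row = [];
    for (var count = 0; count < cells.length; count++) {
        // textContent also works outside browsers, where innerText may be missing.
        row.push(cells[count].textContent.trim());
    }
    rawData.push(row);
}

console.log(rawData);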


Additional way

var cheerio = require('cheerio'),
    cheerioTableparser = require('cheerio-tableparser');

// Buffer the whole response first; one 'data' event may hold only part of the page.
var body = '';
res.on('data', function(chunk) {
    body += chunk;
});

res.on('end', function() {
    var $ = cheerio.load(body);
    cheerioTableparser($);
    var rows = [];
    // parsetable returns the table column by column; with the third
    // argument false, each cell is an HTML string.
    var columns = $("table").parsetable(false, false, false);
    columns[0].forEach(function(d, i) {
        var firstColumnHTMLCell = $("<div>" + columns[0][i] + "</div>");
        var fileItem = firstColumnHTMLCell.text().trim();
        var linkItem = firstColumnHTMLCell.find("a").attr("href");
        var lastModifiedItem = $("<div>" + columns[2][i] + "</div>").text().trim();

        var row = {
            "Filename": fileItem,
            "Link": linkItem,
            "LastModified": lastModifiedItem
        };

        rows.push(row);
        console.log(row);
    });
});
