This data comes from an old system and the output is as-is; we cannot add CSS selectors or IDs. Most of the online examples of HTML parsing in node.js involve tables, rows, and cells that carry an ID or CSS class, and so far I haven't found anything that helps with parsing the page below. That includes the examples for JSDOM (AFAIK).
I would like to extract each row into a [fileName, link, size, dateTime] tuple, then run queries on those tuples, such as finding the latest timestamp in a group, and pull out the matching filename and link. I was thinking of using YQL. The alternating table row attributes also make this a bit challenging. I'm new to node.js, so some of the terminology might be wrong. Any help will be appreciated.
Thanks.
<html>
<body>
<table width="100%" cellspacing="0" cellpadding="5" align="center">
<tr>
<td align="left"><font size="+1"><strong>Filename</strong></font></td>
<td align="center"><font size="+1"><strong>Size</strong></font></td>
<td align="right"><font size="+1"><strong>Last Modified</strong></font></td>
</tr>
<tr>
<td align="left">
<a href="/path_to_file.csv"><tt>file1.csv</tt></a></td>
<td align="right"><tt>86.6 kb</tt></td>
<td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
</tr>
<tr bgcolor="#eeeeee">
<td align="left">
<a href="/path_to_file.csv"><tt>file2.csv</tt></a></td>
<td align="right"><tt>20.7 kb</tt></td>
<td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
</tr>
<tr>
<td align="left">
<a href="/path_to_file.xml"><tt>file1.xml</tt></a></td>
<td align="right"><tt>266.5 kb</tt></td>
<td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
</tr>
<tr bgcolor="#eeeeee">
<td align="left">
<a href="/path_to_file.xml"><tt>file2.xml</tt></a></td>
<td align="right"><tt>27.2 kb</tt></td>
<td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
</tr>
</table>
</body>
</html>
Answer (thanks @Enragedmrt):
var cheerio = require('cheerio');

// Buffer the whole response first: a single 'data' chunk may hold only part of the HTML.
var body = '';
res.setEncoding('utf8');
res.on('data', function (chunk) {
  body += chunk;
});
res.on('end', function () {
  var $ = cheerio.load(body);
  var rows = [];
  // slice(1) skips the header row, which has no link, size, or date.
  $('tr').slice(1).each(function (i, tr) {
    var cells = $(tr).children('td');
    var row = {
      "Filename": cells.eq(0).text().trim(),
      "Link": cells.eq(0).find('a').attr('href'),
      "Size": cells.eq(1).text().trim(),
      "LastModified": cells.eq(2).text().trim()
    };
    rows.push(row);
    console.log(row);
  });
});
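The question also asks how to find the latest timestamp in a group once the rows are extracted. Below is a minimal sketch of one way to do that in plain JavaScript, without YQL. It assumes that "group" means the file extension, and that the LastModified strings are in a format `Date.parse` accepts (the HTTP-style dates in the page above are). The names `extensionOf` and `latestPerGroup` are hypothetical helpers, and the sample rows are made up (with differing timestamps) purely so the result is visible; they are not real output.

```javascript
// Hypothetical sample rows, shaped like the objects built in the answer above.
// The timestamps are invented so that each group has a clear "latest" entry.
const rows = [
  { Filename: 'file1.csv', Link: '/path_to_file.csv', LastModified: 'Fri, 21 Mar 2014 21:00:19 GMT' },
  { Filename: 'file2.csv', Link: '/path_to_file.csv', LastModified: 'Fri, 21 Mar 2014 22:30:00 GMT' },
  { Filename: 'file1.xml', Link: '/path_to_file.xml', LastModified: 'Fri, 21 Mar 2014 21:00:19 GMT' },
  { Filename: 'file2.xml', Link: '/path_to_file.xml', LastModified: 'Fri, 21 Mar 2014 20:15:00 GMT' }
];

// Group key: the lower-cased file extension (an assumption about what "group" means).
function extensionOf(filename) {
  const dot = filename.lastIndexOf('.');
  return dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
}

// Returns a Map from group key to { time, row } for the most recently modified row.
function latestPerGroup(items, keyFn) {
  const latest = new Map();
  for (const row of items) {
    const time = Date.parse(row.LastModified);
    if (Number.isNaN(time)) continue; // skip rows whose date cannot be parsed
    const key = keyFn(row);
    const current = latest.get(key);
    if (!current || time > current.time) {
      latest.set(key, { time: time, row: row });
    }
  }
  return latest;
}

const newest = latestPerGroup(rows, function (row) {
  return extensionOf(row.Filename);
});

// Print the filename and link of the newest file in each group.
newest.forEach(function (entry, ext) {
  console.log(ext, entry.row.Filename, entry.row.Link);
});
```

Passing the key function in separately keeps the grouping rule easy to swap, for example to group by a filename prefix instead of the extension.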