Scrape certain links from a page with javascript

Question

Here is a sample block of code I need to scrape:

<p>This paragraph contains <a href="http://twitter.com/chsweb" data-placement="below" rel="twipsy" target="_blank" data-original-title="Twitter">links to Twitter folks</a>, and <a href="http://twitter.com/blogcycle" data-placement="below" rel="twipsy" target="_blank" data-original-title="Twitter">more links to other Twitter folks</a>, but it also contains <a href="http://www.someOtherWebsiteHere.com">non-Twitter links too</a>.  How can I list only the Twitter links below?</p>

This script generates a list of every URL on the page:

<script>
var allLinks = document.links;
for (var i=0; i<allLinks.length; i++) {
  document.write(allLinks[i].href+"<BR/>");
}
</script>

How do I modify the script so that it only lists URLs that contain a certain domain, e.g. twitter.com/?

Here is a demo page: http://chsweb.me/OucTum

Beware of document.write when you loop over DOM node collections: the loop will never get past the first node.
– David Hellsing, Commented Aug 29, 2012 at 13:41

Fabrizio Calderan · Accepted Answer · 2012-08-29 13:50:04Z

1

In modern browsers you can retrieve all the desired links with

var twitter_links = document.querySelectorAll('a[href*="twitter.com"]');

Using .querySelectorAll() carries a small speed penalty, but you probably won't notice any significant difference, and the code is shorter and easier to read than a for loop with a regular expression.
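The selector returns a NodeList of elements, not a list of URLs. A minimal sketch of turning it into an array of href strings follows; `collectLinks` is a hypothetical helper name, and it assumes the text being searched for contains no double quotes, since it is pasted straight into the selector.

```javascript
// Hypothetical helper: returns the href of every anchor whose href
// attribute contains `needle`. `root` is any object that offers
// querySelectorAll, normally `document`.
function collectLinks(root, needle) {
  var selector = 'a[href*="' + needle + '"]';
  var nodes = root.querySelectorAll(selector);
  // A NodeList is array-like but not an array, so borrow Array.prototype.map.
  return Array.prototype.map.call(nodes, function (a) {
    return a.href;
  });
}

// Only runs in a browser, where `document` exists.
if (typeof document !== "undefined") {
  console.log(collectLinks(document, "twitter.com").join("\n"));
}
```

Passing the root in as a parameter keeps the helper usable on a single container element as well as on the whole document.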

edited Aug 29, 2012 at 13:50

answered Aug 29, 2012 at 13:42

Fabrizio Calderan

2 Comments

chsweb Over a year ago

Works beautifully! Here is the working demo: chsweb.me/NCeU6L Thank you Fabrizio.

chsweb Over a year ago

Also, as a follow up, this is only used locally for me to organize links to Twitter folks who commented on a presentation I did. I am using it to thank them and send goodies; it is not a real life use case, but great that it can be done if needed.

BenM · Accepted Answer · 2012-08-29 13:34:19Z

0

The following will place all Twitter links in the twitter_links array:

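var twitter_links = [],
    links = document.getElementsByTagName('a');
// Use an index-based loop: for...in would also visit non-index keys such as "length".
for (var i = 0; i < links.length; i++)
{
    // Escape the dot so it matches a literal "." rather than any character.
    if (/twitter\.com/i.test(links[i].href))
    {
        twitter_links.push(links[i]);
    }
}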

Here's a jsFiddle for you > http://jsfiddle.net/Pv8DH/
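A note on looping over a collection like the one getElementsByTagName returns: for...in walks every enumerable key, not only the numeric indices, which is why an index-based loop is the safer choice. The sketch below illustrates this with a plain object standing in for the collection. `fakeCollection` is made up for the example, and a real HTMLCollection can additionally expose keys such as item and namedItem.

```javascript
// A stand-in for an HTMLCollection: numeric indices plus a "length" property.
var fakeCollection = {
  0: "http://twitter.com/chsweb",
  1: "http://www.someOtherWebsiteHere.com/",
  length: 2
};

// for...in visits every enumerable key, including "length".
var forInKeys = [];
for (var key in fakeCollection) {
  forInKeys.push(key);
}

// An index-based loop visits only the items themselves.
var indexedItems = [];
for (var i = 0; i < fakeCollection.length; i++) {
  indexedItems.push(fakeCollection[i]);
}

console.log(forInKeys);           // contains "length" alongside "0" and "1"
console.log(indexedItems.length); // 2
```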

answered Aug 29, 2012 at 13:34

BenM

1 Comment

chsweb Over a year ago

Confirmed to work - puts links into an alert where they can be copied if needed. Thank you.

David Hellsing · Accepted Answer · 2012-08-29 13:40:38Z

0

Anchor elements expose the same URL properties as window.location (hostname, pathname, and so on), so you can use them to extract different parts of the href. For example:

var link = allLinks[i];
if ( /twitter\.com/.test( link.hostname ) ) {
    document.write(link.href+"<BR/>");
}
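The regular expression above accepts any hostname that contains "twitter.com", so a host such as nottwitter.com would pass as well. If that matters, a stricter check can be sketched as below; `isOnDomain` is a hypothetical helper, and it assumes the standard URL constructor is available (modern browsers and Node.js).

```javascript
// Hypothetical helper: true when `href` points at `domain` itself
// or at one of its subdomains (for example www.twitter.com).
function isOnDomain(href, domain) {
  var host;
  try {
    host = new URL(href).hostname.toLowerCase();
  } catch (e) {
    return false; // not an absolute URL, so there is no hostname to compare
  }
  domain = domain.toLowerCase();
  // Accept an exact match, or a host that ends with "." + domain.
  return host === domain ||
         host.slice(-(domain.length + 1)) === "." + domain;
}

console.log(isOnDomain("http://twitter.com/chsweb", "twitter.com"));  // true
console.log(isOnDomain("http://nottwitter.com/page", "twitter.com")); // false
```

Inside the loop, the check would become `isOnDomain(link.href, "twitter.com")`.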

Another issue with your code: document.links is a live collection that reflects the links in the current document. If document.write runs after the page has finished loading, the first call clears the document, which empties the collection, so the loop never gets past the first link. Collect the URLs in an array first and write them out once:

var links = [];
for (var i=0; i<allLinks.length; i++) {
    var link = allLinks[i];
    if ( /twitter\.com/.test( link.hostname ) ) {
        links.push(link.href);
    }
}

document.write(links.join('<br>'));

Demo: http://jsfiddle.net/3xub6/
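Once the URLs are in an array, document.write can be avoided altogether by building the markup as a string and assigning it to an existing element. This is only a sketch: `escapeHtml` and `toHtmlList` are hypothetical helpers, and the element id "output" in the usage comment is an assumption about the page.

```javascript
// Hypothetical helper: escape characters that are special in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: turn an array of URLs into a <ul> of clickable links.
function toHtmlList(urls) {
  var items = urls.map(function (url) {
    var safe = escapeHtml(url);
    return '<li><a href="' + safe + '">' + safe + "</a></li>";
  });
  return "<ul>" + items.join("") + "</ul>";
}

// Browser usage, assuming an element with id "output" exists on the page:
// document.getElementById("output").innerHTML = toHtmlList(links);
```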

edited Aug 29, 2012 at 13:40

answered Aug 29, 2012 at 13:31

David Hellsing

1 Comment

chsweb Over a year ago

OK, confirmed, this works too. Thank you! There are many similar posts on SO, I hope this helps those folks too. chsweb.me/O2Wtd8

chsweb · Accepted Answer · 2012-08-29 16:30:22Z

0

ORIGINAL: Not working on demo page (Sample 6)

<script>
if (allLinks[i].href.match("twitter\.com"))
{
     document.write(allLinks[i].href+"<BR/>");
}
</script>

REVISED: Is working on demo page (Sample 7)

<script>
var allLinks = document.links;
for (var i=0; i<allLinks.length; i++) {
      if (allLinks[i].href.match("twitter.com")) {
            document.write(allLinks[i].href+"<BR/>");
      }
}
</script>
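One caveat about the revised script: String.prototype.match converts a string argument into a regular expression, so the "." in "twitter.com" matches any character. For a page with a handful of known links this rarely matters, but a plain substring test avoids the surprise. A small sketch follows; `containsText` is a hypothetical helper and the lookalike URL is made up for the example.

```javascript
// match() turns the string into a RegExp, so "." matches any character.
var lookalike = "http://twitterXcom.example/";
var loose = lookalike.match("twitter.com") !== null;

// Hypothetical helper: plain substring test, no pattern interpretation.
function containsText(href, text) {
  return href.indexOf(text) !== -1;
}

var strict = containsText(lookalike, "twitter.com");

console.log(loose);  // true: the pattern matched "twitterXcom"
console.log(strict); // false: the literal text "twitter.com" is absent
```

In the loop, the condition would read `containsText(allLinks[i].href, "twitter.com")`.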

edited Aug 29, 2012 at 16:30

chsweb

answered Aug 29, 2012 at 13:32

PhonicUK

3 Comments

chsweb Over a year ago

Hmmm, does not seem to work, here is a demo page that uses the script above for reference: chsweb.me/PO2v4o

PhonicUK Over a year ago

Make a jsfiddler out of it instead of trying to host it from Dropbox.

chsweb Over a year ago

Here is another take on your approach which does work on the demo page, here: chsweb.me/NWHuWi
