How to parse and return a list of links to separate strings[] or strings?

Question

I have an HTML div formatted like this:

<div class="latest-media-images">
    <div class="hdr-article">LATEST IMAGES</div>
        <a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg1" src="http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
                <a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg2" src="http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
                <a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg3" src="http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
                </div>

I've been trying to think of different ways to do this.

I want to parse each URL into a separate string.

I was thinking of somehow parsing them into a list and then selecting each one by passing a position?

(If anyone wants to answer this, please feel free to.)

Or I could do something such as navigating to the div class:

Element latest_images = doc.select("div.latest-media-images");
Elements links = latest_images.getElementsByTag("img");

for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

I was thinking of this but haven't tried it out yet. I will when I get the chance.

But how would I parse each one into a separate string, or into a whole list, using this code (if it's correct)?

Feel free to leave suggestions or answers =) or let me know if the code I have above will do the trick.

Thanks, coder-For-Life22

Thanks for your response. And yes, I would like to get each of the URLs. If that's not possible, maybe just the hrefs (if it's easier that way), as long as I can get at least 3 URLs.
– coder_For_Life22, Commented Sep 19, 2011 at 19:20

Andrei LED · Accepted Answer · 2011-09-20 05:19:43Z

2

Here is a code sample that extracts all img URLs from your HTML using a regex:

    // I used your HTML with extra whitespace and line breaks added to test some edge cases.
    final String HTML
            = "<div class=\"latest-media-images\">\n"
            + "<div class=\"hdr-article\">LATEST IMAGES</div>\n"
            + "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg1\" \n "
            + "src=\"http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
            + "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg2\" src=  \n"
            + "\"http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
            + "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg3\" src "
            + "=    \t \n  "
            + "\"http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
            + "</div>";

    Pattern pattern = Pattern.compile ("<img[^>]*?src\\s*?=\\s*?\\\"([^\\\"]*?)\\\"");
    Matcher matcher = pattern.matcher (HTML);

    List<String> imgUrls = new ArrayList<String> ();
    while (matcher.find ())
    {
        imgUrls.add (matcher.group (1));
    }

    for (String imgUrl : imgUrls) System.out.println (imgUrl);

The output is the same as the one Sahil Muthoo posted:

http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
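Since the question asks for separate strings or a String[], here is a minimal sketch of how the collected list could be used that way. The class name ImgUrlList, the method extractImgUrls, and the example.com markup are made up for illustration, and the pattern is a slightly simplified form of the one above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgUrlList {
    // Same idea as the pattern above: capture the double-quoted src value of each img tag.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*?src\\s*=\\s*\"([^\"]*)\"");

    /** Collects every img src found in the given HTML, in document order. */
    public static List<String> extractImgUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened markup standing in for the question's HTML.
        String html = "<a href=\"page.html\"><img id=\"t1\" src=\"http://example.com/a.jpg\"></a>"
                + "<a href=\"page.html\"><img id=\"t2\" src=\"http://example.com/b.jpg\"></a>";

        List<String> urls = extractImgUrls(html);

        // Pick a single URL by its 0-based position in the list.
        String first = urls.get(0);

        // Or copy the whole list into a String[].
        String[] asArray = urls.toArray(new String[0]);

        System.out.println(first);
        System.out.println(asArray.length);
    }
}
```

Keeping the URLs in a List and reading them with get(index) is usually enough; the array copy is only needed if some other API insists on a String[].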

If by "using a link to get the HTML first" you mean that you have a URL, then the only change is that instead of using a hard-coded String you'll need to load the HTML first. For example, you can use Java's built-in URL class:

new URL ("http://some_address").openConnection ().getInputStream ();
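The line above only gives you an InputStream, so it still has to be read into a String before the regex can run on it. A rough sketch of one way to do that follows; the class name HtmlLoader and its methods are hypothetical, UTF-8 is an assumption about the page encoding, and "http://some_address" is the same placeholder as above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HtmlLoader {
    /** Reads the whole stream into a String, assuming the content is UTF-8. */
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    /** Opens the address and returns the response body as a String. */
    public static String fetch(String address) throws IOException {
        return readAll(new URL(address).openConnection().getInputStream());
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; replace it with the real page URL.
        String html = fetch("http://some_address");
        System.out.println(html.length());
    }
}
```

The returned String can then be passed to pattern.matcher(...) in place of the hard-coded HTML constant.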

edited Sep 20, 2011 at 5:19

answered Sep 19, 2011 at 20:11

Andrei LED


4 Comments

coder_For_Life22 Over a year ago

Can you show the output, and could you load in the HTML I posted in my question?

coder_For_Life22 Over a year ago

Also, I will be using a link to get the HTML first. I won't be using the HTML directly. I don't know if this will change things.

Andrei LED Over a year ago

Added some more to the answer.

coder_For_Life22 Over a year ago

Nice, I like this. Unfortunately I can't mark both answers as right. I wish I could.

Sahil Muthoo · Accepted Answer · 2011-09-19 19:32:48Z

1

Elements thumbs = doc.select("div.latest-media-images img.latestMediaThumb");
List<String> thumbLinks = new ArrayList<String>(); 
for(Element thumb : thumbs) {
    thumbLinks.add(thumb.attr("src"));
}
for(String thumb : thumbLinks) {
    System.out.println(thumb);
}

Output

http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg

answered Sep 19, 2011 at 19:32

Sahil Muthoo


1 Comment

Sahil Muthoo Over a year ago

No problem. Glad to have helped :)

Andrei LED · Accepted Answer · 2011-09-19 19:36:30Z

-1

You can parse the HTML into a DOM tree, extract all "img" nodes using XPath or a CSS selector, and then iterate over them to fill an array of links. Your code doesn't quite do that, though: the loop is written for "a" nodes (it reads href), while the code before it extracts "img" nodes.
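As a rough illustration of the DOM and XPath route using only the JDK's built-in XML classes: this sketch assumes the markup is well-formed XML (for example with self-closed img tags), which the HTML in the question is not, so real pages would need a forgiving HTML parser first. The class name XPathImgUrls and the sample markup are made up for the example:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathImgUrls {
    /** Parses well-formed markup into a DOM tree and returns the src of every img in the div. */
    public static List<String> extractImgUrls(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Select every img element nested anywhere inside the latest-media-images div.
        NodeList imgs = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//div[@class='latest-media-images']//img", doc, XPathConstants.NODESET);

        List<String> urls = new ArrayList<>();
        for (int i = 0; i < imgs.getLength(); i++) {
            urls.add(((Element) imgs.item(i)).getAttribute("src"));
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical well-formed stand-in for the question's markup.
        String xhtml = "<div class=\"latest-media-images\">"
                + "<a href=\"p.html\"><img src=\"http://example.com/a.jpg\"/></a>"
                + "<a href=\"p.html\"><img src=\"http://example.com/b.jpg\"/></a>"
                + "</div>";
        for (String url : extractImgUrls(xhtml)) {
            System.out.println(url);
        }
    }
}
```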

There's also another way: you can extract the required data using a regex, which should have better performance and lower memory cost.

answered Sep 19, 2011 at 19:36

Andrei LED


2 Comments

Sahil Muthoo Over a year ago

Real-world HTML cannot be handled with regular expressions. A forgiving/correcting HTML parser like Jsoup is needed.

Andrei LED Over a year ago

For this particular simple task (extracting URLs from img tags) it would be enough and, as I already said, faster.
