0

I have an HTML div formatted like this:

<div class="latest-media-images">
    <div class="hdr-article">LATEST IMAGES</div>
    <a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg1" src="http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
    <a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg2" src="http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
    <a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg3" src="http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
</div>

I've been trying to think of different ways to do this.

I want to parse each URL into a separate string.

I was thinking of somehow parsing them into a list and then selecting each one by its position.

(If anyone wants to answer this, please feel free to.)

Or I could do something such as navigating to the div class:

Element latest_images = doc.select("div.latest-media-images");
Elements links = latest_images.getElementsByTag("img");

for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

I was thinking of this but haven't tried it out yet. I will when I get the chance.

But how would I parse each one into a separate string, or all of them into a list, using this code (if it's correct)?

Feel free to leave suggestions or answers =) or let me know if the code above will do the trick.

Thanks, coder-For-Life22

3
  • Do you want the <img> URLs as well? Commented Sep 19, 2011 at 19:18
  • Thanks for your response. And yes, I would like to get each of the URLs. If that's not possible, maybe just the hrefs (if it's easier that way), as long as I can get at least 3 URLs. Commented Sep 19, 2011 at 19:20
  • Actually, the <img> URLs are what I need! Yes. Commented Sep 19, 2011 at 19:21

3 Answers

2

Here is a code sample that extracts all the img URLs from your HTML using a regular expression:

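import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// I used your HTML with some irregular whitespace added to test some fringe cases.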
    final String HTML
            = "<div class=\"latest-media-images\">\n"
            + "<div class=\"hdr-article\">LATEST IMAGES</div>\n"
            + "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg1\" \n "
            + "src=\"http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
            + "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg2\" src=  \n"
            + "\"http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
            + "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg3\" src "
            + "=    \t \n  "
            + "\"http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
            + "</div>";

    // Capture the double-quoted value of the src attribute inside each <img> tag.
    Pattern pattern = Pattern.compile("<img[^>]*?src\\s*=\\s*\"([^\"]*)\"");
    Matcher matcher = pattern.matcher(HTML);

    List<String> imgUrls = new ArrayList<String>();
    while (matcher.find()) {
        imgUrls.add(matcher.group(1));
    }

    for (String imgUrl : imgUrls) {
        System.out.println(imgUrl);
    }

The output is the same as Sahil Muthoo posted:

http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
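
For the "select each one by passing a position" part of the question, the same idea can be wrapped in a method that returns the list, after which each URL is available by index. This is only a minimal sketch: the class name ImgSrcExtractor, the method name extract, and the shortened sample HTML are made up for illustration, and it assumes every src value is double-quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {
    // Capture the double-quoted src value of each <img> tag (assumes double quotes).
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*?src\\s*=\\s*\"([^\"]*)\"");

    /** Returns every captured src value, in the order it appears in the HTML. */
    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened HTML; the real page has longer URLs.
        String html = "<a href=\"page.html\"><img id=\"a\" src=\"first.jpg\"></a>"
                + "<a href=\"page.html\"><img id=\"b\" src = \"second.jpg\"></a>";
        List<String> urls = extract(html);
        String first = urls.get(0);   // pick one URL by its position in the list
        String second = urls.get(1);
        System.out.println(first);
        System.out.println(second);
    }
}
```

Check urls.size() before calling get(i) if the number of images on the page is not guaranteed.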

If by "using a link to get the HTML first" you mean that you have a URL, then the only change is that instead of using a hard-coded String you'll need to load the HTML first. For example, you can use Java's built-in URL class:

new URL ("http://some_address").openConnection ().getInputStream ();
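
That call only gives you an InputStream, so the content still has to be read into a String before the regex can run on it. One possible way to do that is sketched below. The class name HtmlLoader and the methods readAll and fetch are hypothetical helpers, and decoding as UTF-8 is an assumption; the real page may declare a different charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HtmlLoader {
    /** Reads the whole stream into a String, decoding it as UTF-8 (an assumption). */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Downloads the page at the given address and returns its content as a String. */
    public static String fetch(String address) throws IOException {
        try (InputStream in = new URL(address).openConnection().getInputStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstrates readAll on an in-memory stream so it runs without network access.
        byte[] bytes = "<img src=\"a.jpg\">".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)));
    }
}
```

The String returned by fetch can then be passed to pattern.matcher in place of the hard-coded HTML constant.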

4 Comments

Can you show the output, and could you load in the HTML I posted in my question?
Also, I will be using a link to get the HTML first. I won't be using the HTML directly. I don't know if this will change things.
Added some more to the answer.
Nice, I like this. Unfortunately I can't mark both answers as right. I wish I could.
1
Elements thumbs = doc.select("div.latest-media-images img.latestMediaThumb");
List<String> thumbLinks = new ArrayList<String>(); 
for(Element thumb : thumbs) {
    thumbLinks.add(thumb.attr("src"));
}
for(String thumb : thumbLinks) {
    System.out.println(thumb);
}

Output

http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg

1 Comment

No problem. Glad to have helped :)
-1

Obviously you can parse the HTML into a DOM tree and extract all "img" nodes using XPath or a CSS selector, and then iterate through them to fill an array of links. Your code doesn't exactly do the trick, though: the loop is written to work with "a" nodes (it reads href), while the code before it extracts img nodes.

There's also another way: you can extract the required data using a regular expression, which should have better performance and lower memory cost.

2 Comments

Real world html cannot be handled with regular expressions. A forgiving/correcting html parser like Jsoup is needed.
For this particular simple task (extracting URLs from img tags) it would be enough and, as I already said, faster.
