Parsing string based on different delimiters

Question

    <a href="http://www.google.com">Google</a><br/>

I'm trying to extract the link http://www.google.com as well as the text Google.

Why are you trying to parse it yourself? There are many great libraries out there, such as Jsoup, that can take care of it for you.
– stevevls, Commented Nov 21, 2013 at 1:15
Did your professor insist that you use regular expressions to parse this HTML?
– Dawood ibn Kareem, Commented Nov 21, 2013 at 1:36

Adarsh · Accepted Answer · 2013-11-21 01:28:39Z

1

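This should do the job for input shaped exactly like the example: splitting on the double quote isolates the URL, and the link text is whatever sits between the following > and the next <.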

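    String url = "<a href=\"http://www.google.com\">Google</a><br/>";
    // Splitting on the double quote gives: [<a href=] [http://www.google.com] [>Google</a><br/>]
    String[] separate = url.split("\"");
    String URL = separate[1];
    // Drop the leading '>' and keep everything before the next '<'.
    String text = separate[2].substring(1).split("<")[0];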
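
To make the intermediate values easier to check, here is a small self-contained sketch of the same split-based approach. The class name SplitDemo and the method extract are made up for illustration, and the code assumes the input has exactly the shape shown in the question (double-quoted href, no other attributes before it).

```java
class SplitDemo {
    // Hypothetical helper: returns {url, text} for input shaped exactly like
    // <a href="URL">TEXT</a>... (double quotes, href is the only attribute).
    static String[] extract(String html) {
        // Splitting on the double quote yields three pieces:
        // [0] <a href=    [1] the URL    [2] >TEXT</a><br/>
        String[] parts = html.split("\"");
        String url = parts[1];
        // Drop the leading '>' and keep everything before the next '<'.
        String text = parts[2].substring(1).split("<")[0];
        return new String[] { url, text };
    }

    public static void main(String[] args) {
        String[] result = extract("<a href=\"http://www.google.com\">Google</a><br/>");
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```

Anything that deviates from that shape (single quotes, an extra attribute before href, no quotes at all) will give wrong results or throw an ArrayIndexOutOfBoundsException, so this only suits input you control.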

akaya · Accepted Answer · 2013-11-21 01:30:03Z

0

You can extract both with a simple regex: group 1 captures the href value and group 2 captures the link text. Try this.

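    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "<a href=\"http://www.google.com\">Google</a><br/>";
    // Group 1: the quoted href value; group 2: the text up to the closing tag.
    Pattern pattern = Pattern.compile("<a\\s+href=\"([^\"]*)\">([^<]*)</a>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
        System.out.println(matcher.group(1)); // http://www.google.com
        System.out.println(matcher.group(2)); // Google
    }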
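
If the input may contain several links, or anchors with extra attributes or single quotes, a looser pattern in a find() loop is one option. The following is only a sketch: the class name AnchorRegexDemo, the method extractAll, and the second sample URL are assumptions for illustration, and a regex like this still does not cope with nested tags or arbitrary real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorRegexDemo {
    // Looser variant of the pattern above: case-insensitive, allows other
    // attributes around href, and accepts single or double quotes.
    // Group 1: the quote character, group 2: the URL, group 3: the link text.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1[^>]*>([^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns one {url, text} pair per anchor found in the input.
    static List<String[]> extractAll(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(2), m.group(3).trim() });
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.google.com\">Google</a><br/>"
                + "<A class='x' HREF='http://example.com/a?b=1' target=\"_blank\"> Example </A>";
        for (String[] link : extractAll(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```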

Engine Bai · Accepted Answer · 2013-11-21 01:39:10Z

0

I use the following helper method in my web crawler to pull the href value out of a line of HTML, and it has worked well for me.

Here is the code:

    public static String filterHref( String hrefLine )
    {
        // No href attribute at all: nothing to extract.
        if ( !hrefLine.toLowerCase().contains( "href" ) )
            return "";
        // Split case-insensitively so that HREF="..." does not slip past the
        // check above and then fail on hrefSplit[ 1 ].
        String[] hrefSplit = hrefLine.split( "(?i)href", 2 ); // split href="..." alt="...">...<...>

        String link = hrefSplit[ 1 ].split( "\\s+" )[ 0 ]; // get href attribute and value
        if ( link.contains( ">" ) )
            link = link.substring( 0, link.indexOf( ">" ) ); // cut off the rest of the tag
        link = link.replaceFirst( "=", "" ); // drop the '=' after href
        link = link.replace( "\"", "" ).replace( "'", "" ).trim(); // strip the quotes
        return link;
    }
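
The helper above returns only the link, while the question also asks for the link text. A companion along the same lines might look like the sketch below. The names AnchorTextDemo and filterText are hypothetical, and the code assumes lowercase tag names and at most one anchor per line.

```java
class AnchorTextDemo {
    // Hypothetical companion helper: returns the text between the end of the
    // first tag that starts with "<a" and the following "</a", or "" if the
    // line has no complete anchor. Assumes lowercase tag names.
    static String filterText(String hrefLine) {
        int tagStart = hrefLine.indexOf("<a");
        if (tagStart < 0)
            return "";
        int tagEnd = hrefLine.indexOf('>', tagStart); // end of the opening tag
        if (tagEnd < 0)
            return "";
        int close = hrefLine.indexOf("</a", tagEnd); // start of the closing tag
        if (close < 0)
            return "";
        return hrefLine.substring(tagEnd + 1, close).trim();
    }

    public static void main(String[] args) {
        System.out.println(filterText("<a href=\"http://www.google.com\">Google</a><br/>"));
    }
}
```

Because it matches any tag beginning with "<a", a line containing a tag such as abbr before the anchor would confuse it. For anything beyond simple, known input, an HTML parser is the safer route.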
