<a href="http://www.google.com">Google</a><br/>
I'm trying to extract the link http://www.google.com as well as the text Google.
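You can extract both with a simple regex, provided the anchor has exactly this shape (a single double-quoted href attribute and plain text inside the tag). Try this: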
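import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<a href=\"http://www.google.com\">Google</a><br/>";
// group(1) captures the href value, group(2) the text between the tags
Pattern pattern = Pattern.compile("<a\\s+href=\"([^\"]*)\">([^<]*)</a>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // http://www.google.com
    System.out.println(matcher.group(2)); // Google
}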
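The pattern above stops at the first match and expects `href` to be the only attribute. If the input may hold several anchors, extra attributes, or single quotes, a looser variant could look like the sketch below. The class name `LinkExtractor`, the method `extractLinks`, and the second sample URL are made up for illustration, and the pattern is only an approximation: regexes cannot handle arbitrary HTML, so treat it as a starting point rather than a parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Assumed pattern (not from the original answer): allows other attributes
    // around href, either quote style, and any letter case.
    // group(1) = quote character, group(2) = href value, group(3) = inner content.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one {url, text} pair for every anchor found in html. */
    public static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) { // keep searching until no anchor is left
            links.add(new String[] { m.group(2).trim(), m.group(3).trim() });
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.google.com\">Google</a><br/>"
                    + "<A class='x' HREF='http://example.com'>Example</A>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
        // prints:
        // http://www.google.com -> Google
        // http://example.com -> Example
    }
}
```

Note that group(3) is the raw content between the tags, so nested markup such as `<b>Google</b>` comes back unchanged.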
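I use the following helper in my web crawler to pull the href value out of a line of HTML, and it has worked well for me. Note that it returns only the link, not the anchor text.
Here is the code: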
public static String filterHref( String hrefLine )
{
    // Split case-insensitively so that HREF=... and Href=... are handled too;
    // hrefSplit[ 1 ] then holds ="..." alt="...">...<...>
    String[] hrefSplit = hrefLine.split( "(?i)href", 2 );
    if ( hrefSplit.length < 2 )
        return ""; // no href attribute in this line
    // Drop the '=' (and any spaces around it), then keep the first token
    String link = hrefSplit[ 1 ].replaceFirst( "^\\s*=\\s*", "" );
    link = link.split( "\\s+" )[ 0 ];
    // Cut off the closing '>' of the tag and whatever follows it
    if ( link.contains( ">" ) )
        link = link.substring( 0, link.indexOf( ">" ) );
    // Strip the surrounding quotes
    return link.replace( "\"", "" ).replace( "'", "" ).trim();
}
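Since the question also asks for the link text, the same string-scanning style can be extended to return it. The following is only a sketch: the class `AnchorText` and the method `filterText` are hypothetical names, not part of the crawler code above, and it assumes the line contains at most one anchor whose attributes contain no `>` character.

```java
public class AnchorText {
    /** Case-insensitive indexOf; returns -1 if needle does not occur at or after from. */
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the text between the first <a ...> and the following </a>, or "" if absent. */
    public static String filterText(String hrefLine) {
        // Simplification: "<a" also matches tags such as <abbr>.
        int open = indexOfIgnoreCase(hrefLine, "<a", 0);
        if (open < 0) {
            return "";
        }
        int start = hrefLine.indexOf('>', open); // end of the opening tag
        if (start < 0) {
            return "";
        }
        int end = indexOfIgnoreCase(hrefLine, "</a", start); // start of the closing tag
        if (end < 0) {
            return "";
        }
        return hrefLine.substring(start + 1, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(filterText("<a href=\"http://www.google.com\">Google</a><br/>"));
        // prints: Google
    }
}
```

For anything beyond simple, well-formed lines, a real HTML parser is the safer choice.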