I am trying to parse arbitrary web pages with jsoup (a generic parser that should work on any page). The relevant code is:

Elements title = doc.select("title");
Elements metades = doc.select("meta[name=description]");

As you can see, I want to extract the title tag.

It works fine on almost every website, for example hinddroid.com, but it is unable to parse the title from google.com and youtube.com. I think this is because there is no whitespace between the tags; most big websites strip whitespace from their HTML to save bandwidth. Please suggest a fix. I want to be able to parse the HTML of any website.

Full code:

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import java.sql.*;
import java.util.regex.*;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class post_link extends HttpServlet
{
    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException
    {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();

        try
        {
            //out.println("<link rel=\"stylesheet\" type=\"text/css\" href=\"style.css\" /><script src=\"http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.6.3.min.js\"></script><script src=\"jquery-social.js\"></script>");
            String linktopro = "http://"+request.getParameter("link_topro");
            //String linktopro = "http://hinddroid.com";
            Document doc = Jsoup.connect(linktopro).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6").timeout(3000).get();
            Elements png = doc.select("img[src]");
            Elements title = doc.select("title:first-child");
            //Elements title = doc.title();
            Elements metades = doc.select("meta[name=description]");
            Pattern p1 = Pattern.compile("http://.*|.com*?.(com)");

            out.println("<script> var myCars=new Array(");

            for(Element pngs : png)
            {
                Matcher m1 = p1.matcher(pngs.attr("src"));
                boolean url = m1.matches();
                String baseurl = "";
                //out.println(url+"");
                if(url)
                { baseurl = ""; }
                else
                { baseurl = linktopro; }

                out.println("\""+baseurl+""+pngs.attr("src")+"\",");
            }
            out.println("\"\"");
            out.println(");</script>");

            String outlink = "<div class=\"linkembox\">"+
                "<div class=\"linkembox-img\">"+
                "<img src=\"http://hinddroid.com/img/logo.gif\" width=\"150\" height=\"120\" />"+
                "<br/><div id=\"linkimg-left\"><</div><div id=\"linkimg-right\">></div>"+
                "</div>"+
                "<div class=\"linkembox-text\">"+
                "<div class=\"h\">"+title.html()+"</div><br/>"+
                "<div class=\"h1\">"+metades.attr("content")+"</div>"+
                "</div>"+
                "</div>";
            out.println(outlink);
            out.print("<script> left(myCars); </script>");
        }
        catch(Exception ex)
        {
            out.print(ex);
        }
        finally
        {
            out.close();
        }
    }
}
  • Jsoup should work on any well-formed document. It shouldn't fail to parse titles from google and youtube. Paste your full code so that I can help you. Commented Feb 28, 2013 at 8:47
  • after getting the page, doc.title() should work fine for getting the title of the page. Commented Feb 28, 2013 at 8:55
  • Dear deadlock, please review the code. shoshi, I am going to try your solution. Commented Feb 28, 2013 at 9:23

1 Answer

I ran the selectors and they work fine. No problem at all!

public static void main( String[] args ) throws IOException
{
    Document doc = Jsoup.connect("http://facebook.com").get();

    System.out.println("Title: " + doc.title());
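    // Elements.attr() returns "" when nothing matches; first() would return null and the chained call would throw
    System.out.println("Meta Description: " + doc.select("meta[name=description]").attr("content"));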

}

With google.com you can get only the <title>, not <meta name=description ...>, because that tag is not in the HTML source.
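
One possible explanation for the missing title, not confirmed anywhere in this thread: the full code in the question uses the selector title:first-child rather than plain title. In CSS, :first-child matches an element only when it is the first child of its parent, so a page whose <head> starts with a <meta> tag would yield no match, whether or not there is whitespace between the tags. The sketch below assumes a made-up minified page (MINIFIED_HTML) and a hypothetical class name and helper methods; only Jsoup.parse, select and title() are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleSelectorDemo {

    // Hypothetical minified page: no whitespace between tags, and <meta> comes before <title>.
    static final String MINIFIED_HTML =
        "<html><head><meta charset=\"utf-8\"><title>Example Title</title></head>"
        + "<body><p>Hello</p></body></html>";

    // Counts matches for the selector used in the question's full code.
    // It matches <title> only when <title> is the first child of its parent.
    static int firstChildMatches(String html) {
        return Jsoup.parse(html).select("title:first-child").size();
    }

    // Reads the title through Document.title(), which does not depend on element position.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        System.out.println("title:first-child matches: " + firstChildMatches(MINIFIED_HTML));
        System.out.println("select(\"title\") text: " + Jsoup.parse(MINIFIED_HTML).select("title").text());
        System.out.println("doc.title(): " + titleOf(MINIFIED_HTML));
    }
}
```

If this is what is happening on google.com and youtube.com, switching to doc.title() or doc.select("title") should be enough. Note that doc.title() returns a String, not Elements.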
