I am building a Java application that uses XPath to extract the values inside the table tags.

Please suggest an efficient way to get all 200 values from the page. My code works fine for the 100 rows within the first dataTable, but I have no way to get to the second dataTable.

I extract the values using the Java class below.

The expected output:

http://a.com/   data for a  526735  Z
http://b.com/   data for b  522273  Z
.
.
.
.

http://c.com/   data for c  578335  Z  
http://d.com/   data for d  513445  Z

<table>
<tbody>
 <tr>
 <td style="padding-right">
 <table class="dataTable">
  <tbody>
   <tr>
    <td><a HREF="http://a.com/" target="_parent">data for a</a></td>
    <td class="numericalColumn">526735</td>
    <td class="numericalColumn">Z</td></tr>
   <tr>
    <td><a HREF="http://b.com/" target="_parent">data for b</a></td>
    <td class="numericalColumn">522273</td>
    <td class="numericalColumn">B</td></tr>
.
.
.100 <tr> here
.
  </tbody>
 </table>
</td>
<td style="padding-right">
 <table class="dataTable">
  <tbody>
   <tr>
   <td><a HREF="http://c.com/" target="_parent">data for c</a></td>
   <td class="numericalColumn">526735</td>
   <td class="numericalColumn">Z</td></tr>
  <tr>
   <td><a HREF="http://d.com/" target="_parent">data for d</a></td>
   <td class="numericalColumn">522273</td>
   <td class="numericalColumn">B</td></tr>
.
.
.100 rows here
.
  </tbody>
 </table>      
</td>
</tr>
</tbody>
</table>

This is the class used to get the data.

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class CompaniesGetter {
public static void main(String[] args) throws Exception{
    String name,link,scripcode,group,s,key;
    int a=1;
    int count=1;
    URL oracle = new URL("http://money.rediff.com/companies");
    URLConnection yc = oracle.openConnection();
    InputStream is = yc.getInputStream();   // one connection is enough; the extra openStream() call was redundant
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    Document tidyDOM = tidy.parseDOM(is, null);
    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();
    Map<String,String> mLink=new HashMap<String,String>();
    Map<String,String> mCode=new HashMap<String,String>();
    Map<String,String> mGroup=new HashMap<String,String>();
    ArrayList<String> aName=new ArrayList<String>();
    //for(int j=0;j<2;j++)
    for(int i =1;i<=200;i++)
    {if(i==101)   // rows 101-200 belong to the second dataTable
    {
        a=2;      // table index; note that it is never used in the XPaths below
    }
        link = "//table[@class='dataTable']/tbody/tr["+i+"]/td/a/@href";
        name = "//table[@class='dataTable']/tbody/tr["+i+"]/td/a";
        scripcode = "//table[@class='dataTable']/tbody/tr["+i+"]/td[2]";
        group = "//table[@class='dataTable']/tbody/tr["+i+"]/td[3]";
        String linkValue = (String)xPath.evaluate(link, tidyDOM, XPathConstants.STRING);
        String nameValue = (String)xPath.evaluate(name, tidyDOM, XPathConstants.STRING);
        String scripValue = (String)xPath.evaluate(scripcode, tidyDOM, XPathConstants.STRING);
        String groupValue = (String)xPath.evaluate(group, tidyDOM, XPathConstants.STRING);
        aName.add(nameValue);
        mLink.put(nameValue, linkValue);
        mCode.put(nameValue, scripValue);
        mGroup.put(nameValue,groupValue);
    }
    Iterator<String> itr=aName.iterator();
    while (itr.hasNext()){
        key=itr.next();
        System.out.println("::"+(count++)+" "+key + "  "+mLink.get(key)+"   "+mCode.get(key)+"   "+mGroup.get(key)+" ::");
    }

}

}

1 Answer

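Just a tip: do you actually use the variable "a" in the XPaths?

link = "//table[@class='dataTable']/tbody/tr["+i+"]/td/a/@href";

should be

link = "(//table[@class='dataTable'])[" + a + "]/tbody/tr["+i+"]/td/a/@href";

Note the parentheses around the table selection, and that i then has to restart at 1 for the second table.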
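
Where the positional predicate sits makes a difference on markup shaped like the question's, because each dataTable is nested in its own <td>. The check below is only a sketch: it parses a small hand-written, well-formed sample with the JDK's own parser instead of JTidy, and the class name PredicateDemo and the SAMPLE string are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class PredicateDemo {
    // Hypothetical stand-in for the page: two dataTables, each in its own <td>.
    static final String SAMPLE =
        "<table><tbody><tr>"
        + "<td><table class='dataTable'><tbody><tr><td>a</td></tr></tbody></table></td>"
        + "<td><table class='dataTable'><tbody><tr><td>c</td></tr></tbody></table></td>"
        + "</tr></tbody></table>";

    // Number of nodes the expression selects in SAMPLE.
    static int count(String expr) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xPath.evaluate(expr, dom, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // [2] is applied per parent: no <td> holds a second dataTable.
        System.out.println(count("//table[@class='dataTable'][2]"));   // prints 0
        // Parentheses collect all matches first, then [2] picks the second one.
        System.out.println(count("(//table[@class='dataTable'])[2]")); // prints 1
    }
}
```

So on this layout the unparenthesized form selects nothing for index 2, while the parenthesized form reaches the second table.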

4 Comments

Duh!! It didn't strike me. Thanks a lot. And what do you say about the code? Can I optimize it in some way?
Actually, yes. I think you should use NodeLists instead of stepping through the rows one index at a time. The reasons: 1. As written, every loop iteration evaluates your XPaths against the whole DOM. 2. The number of table rows may differ (the number of input rows may grow dynamically).
I tried using NodeLists in the first place, but being new to XPath and JAXP, everything is simply going over my head. It would be helpful if you could elaborate on your solution.
The count of the table rows is constant. But the main problem lies in selecting a row, getting its child values, and repeating the procedure for n rows.
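
The NodeList approach suggested in the comments could look roughly like the following. It is a sketch under stated assumptions, not the original poster's final code: the class RowExtractor, its methods, and the SAMPLE string are hypothetical, and the sample is parsed with the JDK's DocumentBuilder rather than JTidy so that the example runs without extra libraries. One query fetches every row of every dataTable, and the cell values are then read with paths relative to each row.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class RowExtractor {
    // Hypothetical, cleaned-up miniature of the page: two tables, two rows each.
    static final String SAMPLE =
        "<table><tbody><tr>"
        + "<td><table class='dataTable'><tbody>"
        + "<tr><td><a href='http://a.com/'>data for a</a></td><td>526735</td><td>Z</td></tr>"
        + "<tr><td><a href='http://b.com/'>data for b</a></td><td>522273</td><td>B</td></tr>"
        + "</tbody></table></td>"
        + "<td><table class='dataTable'><tbody>"
        + "<tr><td><a href='http://c.com/'>data for c</a></td><td>526735</td><td>Z</td></tr>"
        + "<tr><td><a href='http://d.com/'>data for d</a></td><td>522273</td><td>B</td></tr>"
        + "</tbody></table></td>"
        + "</tr></tbody></table>";

    // Parses well-formed markup with the JDK parser (stand-in for tidy.parseDOM).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns one {link, name, code, group} array per row of every dataTable.
    static List<String[]> extractRows(Document dom) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        // A single query selects the rows of all dataTables, however many there are.
        NodeList rows = (NodeList) xPath.evaluate(
            "//table[@class='dataTable']/tbody/tr", dom, XPathConstants.NODESET);
        List<String[]> result = new ArrayList<String[]>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            // Paths without a leading slash are evaluated relative to the current row.
            result.add(new String[] {
                xPath.evaluate("td[1]/a/@href", row).trim(),
                xPath.evaluate("td[1]/a", row).trim(),
                xPath.evaluate("td[2]", row).trim(),
                xPath.evaluate("td[3]", row).trim()
            });
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String[] row : extractRows(parse(SAMPLE))) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

In the question's program, the Document returned by tidy.parseDOM(is, null) would be passed to extractRows in place of the parsed sample. That the relative paths behave the same on JTidy's DOM is an assumption, not something this sketch verifies. Because the loop runs over whatever the query returns, no row count or table index has to be hard-coded.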
