
I am trying to extract the text from a string of HTML-style tags that contain content.

For example:

<CalaisSimpleOutputFormat>
  <Country count="13" relevance="0.771" normalized="China">China</Country>
  <Country count="4" relevance="0.598">Taiwan</Country>
  <City count="3" relevance="0.491" normalized="Beijing,China">Beijing</City>
  <NaturalFeature count="3" relevance="0.415">Yellow river</NaturalFeature>
  <Organization count="2" relevance="0.491">Communist Party</Organization>
  <Region count="2" relevance="0.258">Central Asia</Region>
  <Region count="2" relevance="0.315">East Asia</Region>
  <City count="1" relevance="0.304" normalized="Shanghai,China">Shanghai</City>
  <City count="1" relevance="0.304" normalized="Chongqing,China">Chongqing</City>
  <City count="1" relevance="0.101" normalized="Taipei,Taiwan">Taipei</City>
  <City count="1" relevance="0.304" normalized="Tianjin,China">Tianjin</City>
  <Continent count="1" relevance="0.053">Asia</Continent>
  <Country count="1" relevance="0.101" normalized="Japan">Japan</Country>
  <Country count="1" relevance="0.304" normalized="Macau">Macau</Country>
  <MedicalCondition count="1" relevance="0.160">hereditary monarchies</MedicalCondition>
  <NaturalFeature count="1" relevance="0.254">Himalaya</NaturalFeature>
  <NaturalFeature count="1" relevance="0.274">Gobi desert</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">Yellow sea</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">Pacific Ocean</NaturalFeature>
  <NaturalFeature count="1" relevance="0.291">Great Lakes</NaturalFeature>
  <NaturalFeature count="1" relevance="0.231">Yangtze river</NaturalFeature>
  <NaturalFeature count="1" relevance="0.274">Taklamakan desert</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">South China sea</NaturalFeature>
  <NaturalFeature count="1" relevance="0.231">Tibetan Plateau</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">Bohai sea</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">East sea</NaturalFeature>
  <NaturalFeature count="1" relevance="0.254">Tian Shan mountain ranges</NaturalFeature>
  <Organization count="1" relevance="0.062">G-20</Organization>
  <Organization count="1" relevance="0.073">U.N. Security Council</Organization>
  <Organization count="1" relevance="0.062">APEC</Organization>
  <Organization count="1" relevance="0.062">BRICS</Organization>
  <Organization count="1" relevance="0.062">BCIM</Organization>
  <Organization count="1" relevance="0.073">United Nations</Organization>
  <Organization count="1" relevance="0.062">Shanghai Cooperation Organisation</Organization>
  <Organization count="1" relevance="0.062">World Trade Organization</Organization>
  <Organization count="1" relevance="0.105">ROC government</Organization>
  <Position count="1" relevance="0.073">permanent member</Position>
  <Region count="1" relevance="0.208">East China</Region>
  <Region count="1" relevance="0.208">South China</Region>
  <Region count="1" relevance="0.254">South Asia</Region>
  <Region count="1" relevance="0.184">North China</Region>
  <Topics>
     <Topic Taxonomy="Calais" Score="0.558">Politics</Topic>
     <Topic Taxonomy="Calais" Score="0.534">War_Conflict</Topic>
  </Topics>
</CalaisSimpleOutputFormat>

My code successfully extracts the text from those tags, but the output is one run-together string:

ChinaChongqingShanghaiTaipeiTianjin................

I am wondering if there is a way to extract the text one item at a time, or to split it with spaces, so that I can store the items in a list. For example:

China
Chongqing
Shanghai
Taipei
......

I have tried code like:

Document doc = Jsoup.parse(html);
for (Element a : doc.select("CalaisSimpleOutputFormat")) {
    System.out.println(a.text());
}

and

for (Node child : XX.childNodes()) {
    if (child instanceof TextNode) {
        System.out.println(((TextNode) child).text());
    }
}

and

Document doc = Jsoup.parse(html);
Element start = doc.select("CalaisSimpleOutputFormat").first();
String text = start.text();

None of them works... Any suggestions?

  • Which of the snippets you tried gives you the output "ShanghaiHimalayaHimalayaHimalaya"? Commented Mar 21, 2014 at 13:08
  • The last one. I've updated my post. Commented Mar 21, 2014 at 13:24

1 Answer

This program saves the data you need into an ArrayList object:

package com.loknath.lab;

/*
 * @Author Loknath
 */

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        Test test = new Test();
        String file = "OCtest.txt";
        try {
            list = test.entityExtractionByFile(file);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(list);
    }

    public List<String> entityExtractionByFile(String fileLocation)
            throws IOException {
        List<String> list = new ArrayList<>();
        // Read the whole file into a string. This stands in for the original
        // FileToString helper, whose source was not shown; change it if you
        // read the input some other way.
        String content = new String(
                Files.readAllBytes(Paths.get(fileLocation)), StandardCharsets.UTF_8);

        Document doc = Jsoup.parse(content);

        // The root element; each of its direct children holds one entity.
        Element element = doc.select("CalaisSimpleOutputFormat").first();

        for (Element elem : element.children()) {
            // ownText() returns only this element's own text,
            // not the text of its descendants.
            list.add(elem.ownText());
            System.out.println(elem.ownText());
        }
        return list;
    }
}

Output:

China
Taiwan
Beijing
...

For the whole source code, [click here...]
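For comparison, the same per-element idea can be sketched without Jsoup, since the sample input is well-formed XML. This is only a sketch under that assumption: it swaps in the JDK's built-in DOM parser in place of Jsoup, and the class name EntityTexts and the method extract are made up for illustration. It collects the direct text of each child of the root element and skips children, such as Topics, that have no text of their own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EntityTexts {

    // Hypothetical helper (not from the original answer): returns the own
    // text of every direct child element of the root, one list entry each.
    public static List<String> extract(String xml) {
        Document doc;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Assumption: the input is well-formed XML; anything else is rejected.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        List<String> texts = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between elements
            }
            // Gather only the text nodes directly under this element,
            // ignoring text that belongs to nested elements.
            StringBuilder own = new StringBuilder();
            NodeList parts = child.getChildNodes();
            for (int j = 0; j < parts.getLength(); j++) {
                if (parts.item(j).getNodeType() == Node.TEXT_NODE) {
                    own.append(parts.item(j).getNodeValue());
                }
            }
            String text = own.toString().trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<CalaisSimpleOutputFormat>"
                + "<Country count=\"13\" relevance=\"0.771\" normalized=\"China\">China</Country>"
                + "<Country count=\"4\" relevance=\"0.598\">Taiwan</Country>"
                + "<Topics><Topic Taxonomy=\"Calais\" Score=\"0.558\">Politics</Topic></Topics>"
                + "</CalaisSimpleOutputFormat>";
        for (String s : extract(xml)) {
            System.out.println(s);
        }
    }
}
```

Because each entity ends up as its own list entry, there is nothing to split afterwards, and multi-word names such as "Yellow river" stay intact.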


8 Comments

Could you be more specific?
So your program gives the output "ChinaChongqingShanghaiTaipeiTianjin................", and you want "China Chongqing Shanghai ..." instead. Am I right?
Yes, and then store each word in a list for further use.
In this code I have created an ArrayList object. Your "System.out.println(a.text());" line gives you the output China Chongqing Shanghai Taipei ...... If you replace that line with list.add(a.text()); then all of these names will be added to the list object, and you can use them later.
Emm... The answer is no. I tried the code. a.text() returns all of the words as one word. My question is how to split it into separate words... Sorry about the confusion...