Extract text from HTML tag one by one with Jsoup

Question

I am trying to extract the text content of each tag from a string of HTML-like markup.

For example:

<CalaisSimpleOutputFormat>
  <Country count="13" relevance="0.771" normalized="China">China</Country>
  <Country count="4" relevance="0.598">Taiwan</Country>
  <City count="3" relevance="0.491" normalized="Beijing,China">Beijing</City>
  <NaturalFeature count="3" relevance="0.415">Yellow river</NaturalFeature>
  <Organization count="2" relevance="0.491">Communist Party</Organization>
  <Region count="2" relevance="0.258">Central Asia</Region>
  <Region count="2" relevance="0.315">East Asia</Region>
  <City count="1" relevance="0.304" normalized="Shanghai,China">Shanghai</City>
  <City count="1" relevance="0.304" normalized="Chongqing,China">Chongqing</City>
  <City count="1" relevance="0.101" normalized="Taipei,Taiwan">Taipei</City>
  <City count="1" relevance="0.304" normalized="Tianjin,China">Tianjin</City>
  <Continent count="1" relevance="0.053">Asia</Continent>
  <Country count="1" relevance="0.101" normalized="Japan">Japan</Country>
  <Country count="1" relevance="0.304" normalized="Macau">Macau</Country>
  <MedicalCondition count="1" relevance="0.160">hereditary monarchies</MedicalCondition>
  <NaturalFeature count="1" relevance="0.254">Himalaya</NaturalFeature>
  <NaturalFeature count="1" relevance="0.274">Gobi desert</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">Yellow sea</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">Pacific Ocean</NaturalFeature>
  <NaturalFeature count="1" relevance="0.291">Great Lakes</NaturalFeature>
  <NaturalFeature count="1" relevance="0.231">Yangtze river</NaturalFeature>
  <NaturalFeature count="1" relevance="0.274">Taklamakan desert</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">South China sea</NaturalFeature>
  <NaturalFeature count="1" relevance="0.231">Tibetan Plateau</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">Bohai sea</NaturalFeature>
  <NaturalFeature count="1" relevance="0.208">East sea</NaturalFeature>
  <NaturalFeature count="1" relevance="0.254">Tian Shan mountain ranges</NaturalFeature>
  <Organization count="1" relevance="0.062">G-20</Organization>
  <Organization count="1" relevance="0.073">U.N. Security Council</Organization>
  <Organization count="1" relevance="0.062">APEC</Organization>
  <Organization count="1" relevance="0.062">BRICS</Organization>
  <Organization count="1" relevance="0.062">BCIM</Organization>
  <Organization count="1" relevance="0.073">United Nations</Organization>
  <Organization count="1" relevance="0.062">Shanghai Cooperation Organisation</Organization>
  <Organization count="1" relevance="0.062">World Trade Organization</Organization>
  <Organization count="1" relevance="0.105">ROC government</Organization>
  <Position count="1" relevance="0.073">permanent member</Position>
  <Region count="1" relevance="0.208">East China</Region>
  <Region count="1" relevance="0.208">South China</Region>
  <Region count="1" relevance="0.254">South Asia</Region>
  <Region count="1" relevance="0.184">North China</Region>
  <Topics>
     <Topic Taxonomy="Calais" Score="0.558">Politics</Topic>
     <Topic Taxonomy="Calais" Score="0.534">War_Conflict</Topic>
  </Topics>
</CalaisSimpleOutputFormat>

My code extracts the text from those tags successfully, but the output is one concatenated string:

ChinaChongqingShanghaiTaipeiTianjin................

Is there a way to extract the text of each tag one by one, or to split it with spaces, so that I can store the items in a list? For example:

China
Chongqing
Shanghai
Taipei
......

I have tried code like:

Document doc = Jsoup.parse(html);
for (Element a : doc.select("CalaisSimpleOutputFormat")) {
    System.out.println(a.text());
}

and

for (Node child : XX.childNodes()) {
    if (child instanceof TextNode) {
        System.out.println(((TextNode) child).text());
    }
}

and

Document doc = Jsoup.parse(html);
Element start = doc.select("CalaisSimpleOutputFormat").first();
String text = start.text();

None of these work. Any suggestions?

Which of the code snippets you tried gives the output "ShanghaiHimalayaHimalayaHimalaya"?
– loknath, commented Mar 21, 2014 at 13:08

loknath · Accepted Answer · 2014-03-21 16:50:09Z


This program saves the required data to an ArrayList:

package com.loknath.lab;

/*
 * @author Loknath
 */

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        Test test = new Test();
        String file = "OCtest.txt";
        try {
            list = test.entityExtractionByFile(file);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(list);
    }

    public List<String> entityExtractionByFile(String fileLocation) throws IOException {
        // Read the whole file into a string. This stands in for the FileToString
        // helper used originally, which is not shown in this post.
        String content = new String(
                Files.readAllBytes(Paths.get(fileLocation)), StandardCharsets.UTF_8);

        Document doc = Jsoup.parse(content);
        Element root = doc.select("CalaisSimpleOutputFormat").first();

        List<String> list = new ArrayList<>();
        // Visit each direct child tag and take only its own text, so every
        // entity becomes a separate list entry instead of one joined string.
        // A container tag with no direct text (such as <Topics>) adds an empty string.
        for (Element elem : root.children()) {
            list.add(elem.ownText());
            System.out.println(elem.ownText());
        }
        return list;
    }
}

Output:

China
Taiwan
Beijing
...

For the whole source code, [click here...]
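The key step in the answer is to walk the direct children of the root tag and take each child's own text, rather than calling text() on the root, which joins everything. Because the sample input is well-formed XML, the same idea can also be sketched with the DOM parser that ships with the JDK. This is an alternative to Jsoup, not the answer's method: the class name ChildTextExtractor and the method extract are hypothetical names chosen for illustration, and the sketch assumes the input parses as XML. Unlike the answer's code, it skips tags that have no direct text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; the name is illustrative and not from the original post.
final class ChildTextExtractor {

    // Returns the direct text of each child element of the root, one entry per tag.
    static List<String> extract(String xml) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }

        List<String> result = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip the whitespace text nodes between tags
            }
            String own = ownText(child);
            if (!own.isEmpty()) {
                result.add(own); // container tags such as <Topics> have no text of their own
            }
        }
        return result;
    }

    // Concatenates only the text nodes directly under the element, ignoring nested tags.
    private static String ownText(Node element) {
        StringBuilder sb = new StringBuilder();
        NodeList nodes = element.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String xml = "<CalaisSimpleOutputFormat>"
                + "<Country count=\"13\" relevance=\"0.771\" normalized=\"China\">China</Country>"
                + "<NaturalFeature count=\"3\" relevance=\"0.415\">Yellow river</NaturalFeature>"
                + "<Topics><Topic Taxonomy=\"Calais\" Score=\"0.558\">Politics</Topic></Topics>"
                + "</CalaisSimpleOutputFormat>";
        for (String name : extract(xml)) {
            System.out.println(name);
        }
    }
}
```

Run on the shortened sample in main, this prints "China" and "Yellow river" on separate lines. Collecting per tag also keeps multi-word names such as "Yellow river" intact, which splitting the joined string on spaces would not.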

edited Mar 21, 2014 at 16:50

answered Mar 21, 2014 at 13:18

loknath


8 Comments

Xingsheng Guo Over a year ago

Could you be more specific?

loknath Over a year ago

So your program gives the output "ChinaChongqingShanghaiTaipeiTianjin................", right? And you want "China Chongqing Shanghai ...." instead, am I right?

Xingsheng Guo Over a year ago

Yes, and then store each word in a list for further use.

loknath Over a year ago

Then, in this code I have created an ArrayList object. Your "System.out.println(a.text());" line gives the output China Chongqing Shanghai Taipei ......; if you replace that line with list.add(a.text()); then all these names will be added to the list object and you can use them later.

Xingsheng Guo Over a year ago

Emm... the answer is no. I tried the code. a.text() returns all of the words as one word. My question is how to split them one by one... Sorry about the confusion...
