SAX Parser - Extract string within tags

Question

I need to extract the text inside each "p" element, without the XML markup, using a SAX parser:

    <title>1. Introduction</title>
    <p>The Lorem ipsum 
           <xref ref-type="bibr" rid="B1">
                1
           </xref>. 
           Lorem ipsum 23.
     </p>
     <p>The L domain recruits an ATP-requiring cellular factor for this 
           scission event, the only known energy-dependent step in assembly 
           <xref ref-type="bibr" rid="B2">
                2
           </xref>. 
           Domain is used here to denote the amino 
           acid sequence that constitutes the biological function.
     </p>

Is this possible using endElement()? When I use it, I get only the text after the closing "/xref" tag.

Here is the code:

    public void endElement(String s, String s1, String element) throws SAXException {
        if (element.equals(Finals.PARAGRAPH)) {
            Paragraph paragraph = new Paragraph();
            paragraph.setContext(tmpValue);
            System.out.println("Contesto: " + tmpValue);
            listP.add(paragraph);
        }
    }

    @Override
    public void characters(char[] ac, int i, int j) throws SAXException {
        tmpValue = new String(ac, i, j);
    }

This is what I expect: a list listP containing the two paragraphs:

1) Lorem ipsum 1 Lorem ipsum 23.
2) The L domain recruits an ATP-requiring cellular factor for this 
       scission event, the only known energy-dependent step in assembly 2 
       Domain is used here to denote the amino 
       acid sequence that constitutes the biological function.

Because, obviously, endElement is invoked on ... ending elements. You are interested in the character data between the tags, so you should find the appropriate handler for it. And you should present your current attempt using your actual code.
– BartoszKP, Commented Jan 5, 2014 at 18:22
I need this result: "The L domain recruits an ATP-requiring cellular factor for this scission event, the only known energy-dependent step in assembly 2. Domain is used here to denote the amino acid sequence that constitutes the biological function." But I get only: "Domain is used here to denote the amino acid sequence that constitutes the biological function."
– user3162945, Commented Jan 5, 2014 at 21:08

keshlam · Accepted Answer · 2014-01-05 18:24:49Z

2

I'm not sure what you mean by "is it possible using endElement", but it's certainly possible. You'd need to write your SAX application so it:

(1) ignores all startElement/endElement events between the ones for the <p>aragraph -- simple state tracking, or perhaps you can simply say that you aren't interested in elements other than paragraphs and make your element event handlers be no-ops for anything you don't care about.

(2) accumulates separately-delivered characters() events until the endElement for the <p>aragraph. But you need to do this anyway, because SAX always reserves the right to deliver contiguous text as several characters() calls, for reasons having to do with parser buffer management.
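The two points above can be sketched as a small handler. This is only an illustration under stated assumptions: the class name ParagraphTextHandler, the helper extract(), and the whitespace normalization are hypothetical and not part of the asker's code, and the element name "p" is hard-coded instead of using Finals.PARAGRAPH. It also assumes the default (non-namespace-aware) parser, so the element name arrives in qName, and that paragraphs are not nested.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each <p>, including text
// that sits inside child elements such as <xref>.
class ParagraphTextHandler extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inParagraph = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("p")) {
            inParagraph = true;
            buffer.setLength(0); // start a fresh paragraph
        }
        // Any other start tag (e.g. <xref>) is deliberately ignored.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may split contiguous text into several calls, so append.
        if (inParagraph) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("p")) {
            // Assumption: collapsing runs of whitespace is acceptable here.
            paragraphs.add(buffer.toString().trim().replaceAll("\\s+", " "));
            inParagraph = false;
        }
        // Other end tags (e.g. </xref>) do not touch the buffer.
    }

    List<String> getParagraphs() {
        return paragraphs;
    }

    // Convenience helper (hypothetical): parse an XML string and return the paragraph texts.
    static List<String> extract(String xml) throws Exception {
        ParagraphTextHandler handler = new ParagraphTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getParagraphs();
    }
}
```

Because text outside a paragraph is never appended, the title is skipped without any extra handling.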

answered Jan 5, 2014 at 18:24


BartoszKP · Accepted Answer · 2014-01-06 14:10:29Z

0

There are many possible solutions. With SAX parsers you usually add a few boolean flags to track the parser's state. In this simple example it is enough to accumulate the text instead of overwriting it, by changing this:

tmpValue = new String(ac, i, j);

to this:

if (tmpValue.equals(""))
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);

or:

if (tmpValue == null)
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);

Which version to use depends on how you initialize the tmpValue variable, to "" or to null (and you should initialize it explicitly if you are not doing so already).

To gather the contents of all paragraphs, reset tmpValue once a paragraph has been stored:

public void endElement(String s, String s1, String element) throws SAXException {

    if (element.equals(Finals.PARAGRAPH)) {
        Paragraph paragraph = new Paragraph();
        paragraph.setContext(tmpValue);
        System.out.println("Contesto: " + tmpValue);
        listP.add(paragraph);
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}

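To omit the title text, also reset tmpValue when a paragraph starts:

@Override
public void startElement(
    String uri,
    String localName,
    String qName,
    Attributes atts) throws SAXException {

    // Compare qName, as endElement does: with a parser that is not
    // namespace-aware, localName may be empty and would never match.
    if (qName.equals(Finals.PARAGRAPH)) {
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}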

edited Jan 6, 2014 at 14:10

answered Jan 5, 2014 at 21:14


10 Comments

user3162945 Over a year ago

I get a NullPointerException when I add that solution.

BartoszKP Over a year ago

@user3162945 Please be specific. I've provided two solutions. Also, do you have tmpValue initialized, as I suggested?

user3162945 Over a year ago

I forgot to initialize tmpValue. Now it works, but I don't get the full string, only the part after /xref.

BartoszKP Over a year ago

@user3162945 I misunderstood your initial requirement. Check this edit.

user3162945 Over a year ago

It works, but it takes everything in the XML. I need only the part in the p tag. Check my edit.


Gavin · Accepted Answer · 2014-01-28 12:00:13Z

0

Use a stack: push in startElement events and pop in endElement events.

Or, if that doesn't work for you, push every event onto the stack and then, after endDocument, pop the elements one by one, storing the data from </p> back to <p> in reverse.
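One possible reading of the first suggestion, sketched under assumptions: each open element gets its own text buffer on a stack, and when an element closes, its text is folded into its parent's buffer, so the content of <xref> ends up inside the enclosing <p>. The class name StackTextHandler, the helper extract(), and the whitespace normalization are hypothetical; the element name "p" is hard-coded and a non-namespace-aware parser is assumed.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stack-based handler: one text buffer per open element.
class StackTextHandler extends DefaultHandler {
    private final Deque<StringBuilder> stack = new ArrayDeque<>();
    private final List<String> paragraphs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(new StringBuilder()); // push on every start tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!stack.isEmpty()) {
            stack.peek().append(ch, start, length); // text belongs to the innermost open element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        StringBuilder text = stack.pop(); // pop on every end tag
        if (qName.equals("p")) {
            // Assumption: collapsing runs of whitespace is acceptable here.
            paragraphs.add(text.toString().trim().replaceAll("\\s+", " "));
        }
        if (!stack.isEmpty()) {
            stack.peek().append(text); // fold child text into the parent element
        }
    }

    // Convenience helper (hypothetical): parse an XML string and return the paragraph texts.
    static List<String> extract(String xml) throws Exception {
        StackTextHandler handler = new StackTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paragraphs;
    }
}
```

Compared with a single boolean flag, the stack costs a little more bookkeeping but keeps working if the elements of interest can be nested.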

answered Jan 28, 2014 at 12:00
