0

This is my problem: I need to extract the text between the <p> tags, without the XML markup, using a SAX parser.

    <title>1. Introduction</title>
    <p>The Lorem ipsum 
           <xref ref-type="bibr" rid="B1">
                1
           </xref>. 
           Lorem ipsum 23.
     </p>
     <p>The L domain recruits an ATP-requiring cellular factor for this 
           scission event, the only known energy-dependent step in assembly 
           <xref ref-type="bibr" rid="B2">
                2
           </xref>. 
           Domain is used here to denote the amino 
           acid sequence that constitutes the biological function.
     </p>

Is it possible using endElement()? When I use it, I get only the part after the closing </xref> tag.

Here is the code:

    public void endElement(String s, String s1, String element) throws SAXException {
        if (element.equals(Finals.PARAGRAPH)) {
            Paragraph paragraph = new Paragraph();
            paragraph.setContext(tmpValue);
            System.out.println("Contesto: " + tmpValue);
            listP.add(paragraph);
        }
    }

    @Override
    public void characters(char[] ac, int i, int j) throws SAXException {
        tmpValue = new String(ac, i, j);
    }

This is what I expect: a list listP containing the two paragraphs:

1) Lorem ipsum 1 Lorem ipsum 23.
2) The L domain recruits an ATP-requiring cellular factor for this 
       scission event, the only known energy-dependent step in assembly 2 
       Domain is used here to denote the amino 
       acid sequence that constitutes the biological function.
3
  • Because obviously endElement is invoked on ... ending elements. You are interested in a section called CDATA. You should find the appropriate handler for this. And you should present your current attempt using your actual code. Commented Jan 5, 2014 at 18:22
  • Seems you're doing fine. Where's the problem? Commented Jan 5, 2014 at 20:53
  • I need this result: "The L domain recruits an ATP-requiring cellular factor for this scission event, the only known energy-dependent step in assembly 2. Domain is used here to denote the amino acid sequence that constitutes the biological function." But I get only: "Domain is used here to denote the amino acid sequence that constitutes the biological function." Commented Jan 5, 2014 at 21:08

3 Answers

2

I'm not sure what you mean by "is it possible using endElement", but it's certainly possible. You need to write your SAX application so that it:

(1) ignores all startElement/endElement events that arrive between the start and end of the <p> (paragraph) element. Simple state tracking is enough, or you can decide that you aren't interested in elements other than paragraphs and make your element event handlers no-ops for anything you don't care about.

(2) accumulates the separately delivered characters() events until the endElement for the <p> element. You need to do this anyway, because SAX always reserves the right to deliver contiguous text as several characters() calls, for reasons having to do with parser buffer management.
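The two points above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual handler: the class name ParagraphTextHandler, the insideParagraph flag, and the parse helper are hypothetical, the element name "p" is hard-coded instead of Finals.PARAGRAPH, and whitespace is collapsed as an extra assumption about the desired output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <p>, including text inside nested elements.
class ParagraphTextHandler extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean insideParagraph = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            buffer.setLength(0);      // start a fresh paragraph
            insideParagraph = true;
        }
        // Any other element (e.g. <xref>) is deliberately ignored.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideParagraph) {
            buffer.append(ch, start, length); // accumulate, never overwrite
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName)) {
            // Assumption: runs of whitespace are collapsed to single spaces.
            paragraphs.add(buffer.toString().replaceAll("\\s+", " ").trim());
            insideParagraph = false;
        }
    }

    // Convenience helper (hypothetical) that parses an XML string and returns the paragraphs.
    static List<String> parse(String xml) throws Exception {
        ParagraphTextHandler handler = new ParagraphTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paragraphs;
    }
}
```

Because text outside a <p> is never appended, the <title> content is skipped without any special handling.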

0

There are many possible solutions. With SAX parsers you usually add a few boolean flags to track particular states while parsing. In this simple example it is enough to change this:

tmpValue = new String(ac, i, j);

to this:

if (tmpValue.equals(""))
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);

or:

if (tmpValue == null)
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);

Which version to use depends on how you initialize the tmpValue variable (and you should initialize it, if you are not doing so already).

To gather the contents of all paragraphs, reset tmpValue after each paragraph has been stored in endElement:

public void endElement(String s, String s1, String element) throws SAXException {

    if (element.equals(Finals.PARAGRAPH)) {
        Paragraph paragraph = new Paragraph();
        paragraph.setContext(tmpValue);
        System.out.println("Contesto: " + tmpValue);
        listP.add(paragraph);
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}

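To omit the title text, reset tmpValue whenever a paragraph starts:

@Override
public void startElement(
    String uri,
    String localName,
    String qName,
    Attributes atts) {

    // Compare qName, as endElement does; localName can be empty
    // when the parser is not namespace-aware.
    if (qName.equals(Finals.PARAGRAPH)) {
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}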

10 Comments

I get a NullPointerException when I add that solution.
@user3162945 Please be specific. I've provided two solutions. Also, do you have tmpValue initialized, as I suggested?
I forgot to initialize tmpValue. Now it works, but I don't get the full string, only the part after /xref.
@user3162945 I misunderstood your initial requirement. Check this edit.
It works, but it takes everything in the XML. I need only the part inside the p tag. Check my edit.
0

Use a stack: push in startElement events and pop in endElement events.

If that doesn't work for you, push every event onto the stack and, after endDocument, pop the entries one by one. The data then comes out in reverse, from </p> back to <p>, so store it in reverse order.
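One possible reading of the push/pop idea is sketched below. It keeps one text buffer per open element and folds each child's text into its parent when the child closes. The class name StackTextHandler and the parse helper are hypothetical, the element name "p" is hard-coded, and the whitespace collapsing is an assumption about the desired output.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one StringBuilder per open element, kept on a stack.
class StackTextHandler extends DefaultHandler {
    private final Deque<StringBuilder> stack = new ArrayDeque<>();
    private final List<String> paragraphs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(new StringBuilder()); // push on every start tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!stack.isEmpty()) {
            stack.peek().append(ch, start, length); // text belongs to the innermost open element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        StringBuilder text = stack.pop(); // pop on every end tag
        if ("p".equals(qName)) {
            // Assumption: runs of whitespace are collapsed to single spaces.
            paragraphs.add(text.toString().replaceAll("\\s+", " ").trim());
        } else if (!stack.isEmpty()) {
            stack.peek().append(text); // fold child text (e.g. <xref>) into the parent
        }
    }

    // Convenience helper (hypothetical) that parses an XML string and returns the paragraphs.
    static List<String> parse(String xml) throws Exception {
        StackTextHandler handler = new StackTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paragraphs;
    }
}
```

Compared with a single boolean flag, the stack costs a little more memory but also copes with deeper nesting inside a paragraph.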
