
I'm parsing an XML file with a SAX parser in Java. My problem is that I have some rows like this:

<name xml:lang="en">Particulates, < 2.5 um</name>

I won't post all the code, but when the tag is name I set the name on my object.

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (isElementaryExchange && isName) {
            String name = new String(ch, start, length);
            this.currentElementaryFlowBase.setName(name);
        }
    }

The problem is that the result is name=" 2.5 um"; I think the "<" broke something. Is there a way to parse that row correctly? Thanks.


EDIT: Solved with a StringBuilder: append in the characters method and set the result only at the end of the element!
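
A minimal sketch of that fix, not the asker's actual code. SAX parsers are allowed to deliver the text of one element in several characters() calls (commonly split at entity references), so setting the name on every call keeps only the last chunk. The class name `NameHandler` and the helper `parseName` are hypothetical, and the sample input assumes the `<` is stored in the file as `&lt;`, since a conforming parser rejects a literal `<` in character data.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of each <name> element.
class NameHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inName = false;
    private String name;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("name".equals(qName)) {
            inName = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element: append, never overwrite.
        if (inName) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            name = text.toString(); // the complete text is only known here
            inName = false;
        }
    }

    String getName() {
        return name;
    }

    // Hypothetical helper: parses an XML string and returns the last <name> text.
    static String parseName(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        NameHandler handler = new NameHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getName();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<flow><name xml:lang=\"en\">Particulates, &lt; 2.5 um</name></flow>";
        System.out.println(parseName(xml)); // prints: Particulates, < 2.5 um
    }
}
```

In the handler from the question, the equivalent change would be to append in characters() and call setName(...) once from the end-of-element callback.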

4
  • 1
    I downvoted because there was no research: lmgtfy.app/?q=excaping+XML+special+cars+ Commented Jun 23, 2021 at 16:46
  • Sorry, but let me explain: I cannot modify the XML files to add escape characters because I have more than 17 million files and I'm not authorized to modify them, so I need to solve the issue with the SAX parser (I cannot change the parser). Commented Jun 24, 2021 at 7:02
  • "I cannot modify xml files" -- Your files are not well formed, and therefore no properly implemented XML tool will process them. Commented Jun 24, 2021 at 17:02
  • Solved with a StringBuilder: append in the characters method and set the result only at the end of the element. I understand your point, but I'm not the boss, and if the boss asks me to solve a problem, I need to solve it. I asked whether there is a way, not for the best practice, because I have 20 million of these XML files, and the choice is between failing a big project and trying to solve the issue. And the solution is quite simple, so why not? Commented Jun 27, 2021 at 15:35

1 Answer

1

The "less than" char < is not escaped, so the XML is not well-formed.
See Section 2.4 of the W3C XML specification:

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

Or, in the spec's grammar notation:

CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

So you have to escape the < (e.g. as &lt;) to get well-formed XML. Otherwise your input file is not XML at all, and any follow-up problems have to be taken up with its creator.
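
To illustrate the point, here is a small sketch that runs both variants of the line through the JDK's default SAX parser. The class name `WellFormedCheck` and the helper `isWellFormed` are hypothetical. It assumes the default parser reports the unescaped `<` as a fatal error, which is what a conforming parser is required to do.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class: checks whether a string parses as well-formed XML.
class WellFormedCheck {
    static boolean isWellFormed(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, e.g. a literal '<' in text.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String escaped = "<name xml:lang=\"en\">Particulates, &lt; 2.5 um</name>";
        String raw = "<name xml:lang=\"en\">Particulates, < 2.5 um</name>";
        System.out.println(isWellFormed(escaped)); // prints: true
        System.out.println(isWellFormed(raw));     // prints: false
    }
}
```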


3 Comments

Yes, but the problem is that I have more than 17 million XML files to parse, and I cannot modify these files, so I need to solve the issue with the parser.
So you mean that you have 17 million erroneous XML files? That's an interesting task. Writing a non-standard parser for this is... well... I don't envy you at all. I'm out.
Solved with a StringBuilder: append in the characters method and set the result only at the end of the element.
