Parsing XML getting comment(s) and date value(s) only

Question

Hey all, I am trying to read an XML file and gather only the elements whose value is a date formatted like YYYY-MM-DD.

Here is an online example: https://repl.it/repls/MedicalIgnorantEfficiency

Here is an example of my xml to parse:

<?xml version="1.0" encoding="UTF-8"?>
<ncc:Message xmlns:ncc="http://blank/1.0.6" 
xmlns:cs="http://blank/1.0.0" 
xmlns:jx="http://blank/1.0.0"
xmlns:jm="http://blank/1.0.0"
xmlns:n-p="http://blank/1.0.0"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://blank/1.0.6/person person.xsd">
    <ncc:DataSection>
        <ncc:PersonResponse>
            <!-- Message -->
            <cs:CText cs:type="No">NO WANT</cs:CText>
            <jm:CaseID>
                <!-- OEA -->
                <jm:ID>ABC123</jm:ID>
            </jm:CaseID>
            <jx:PersonName>
                <!-- NAM -->
                <jx:GivenName>Arugula</jx:GivenName>
                <jx:MiddleName>Pibb</jx:MiddleName>
                <jx:SurName>Atari</jx:SurName>
            </jx:PersonName>
            <!-- DOB -->
            <ncc:PersonBirthDateText>1948-05-11</ncc:PersonBirthDateText>
            <jx:PersonDetails>
                <!-- SXC -->
                <jx:PersonSSN>
                    <jx:ID/>
                </jx:PersonSSN>
            </jx:PersonDetails>
            <n-p:Activity>
                <!--DOZ-->
                <jx:ActivityDate>1996-04-04</jx:ActivityDate>
                <jx:HomeAgency xsi:type="cs:Organization">
                    <!-- ART -->
                    <jx:Organization>
                        <jx:ID>ZR5981034</jx:ID>
                    </jx:Organization>
                </jx:HomeAgency>
            </n-p:Activity>
            <jx:PersonName>
                <!-- DOB Newest -->
                <ncc:BirthDateText>1993-05-12</ncc:BirthDateText>
                <ncc:BirthDateText>1993-05-13</ncc:BirthDateText>
                <ncc:BirthDateText>1993-05-14</ncc:BirthDateText>
                <jx:IDDetails xsi:type="cs:IDDetails">
                    <!-- SMC Checker -->
                    <jx:SSNID>
                        <jx:ID/>
                    </jx:SSNID>
                </jx:IDDetails>
            </jx:PersonName>
        </ncc:PersonResponse>
    </ncc:DataSection>
</ncc:Message>

I want to get the date value(s) and the comment above each of those date values. For the example XML above, the output would look something like this:

Comment: < !-- DOB --> (ncc:DataSection/ncc:PersonResponse)

Date: 1948-05-11 (ncc:DataSection/ncc:PersonResponse/ncc:PersonBirthDateText)

.

Comment: < !-- DOZ --> (ncc:DataSection/ncc:PersonResponse/n-p:Activity)

Date: 1996-04-04 (ncc:DataSection/ncc:PersonResponse/n-p:Activity/jx:ActivityDate)

.

Comment: < !-- DOB Newest --> (ncc:DataSection/ncc:PersonResponse/jx:PersonName)

Date:

  1993-05-12 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)
  1993-05-13 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)
  1993-05-14 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)

The code I am trying to do this with is:

public static void xpathNodes() throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
    File file = new File(base_);
    XPath xPath = XPathFactory.newInstance().newXPath();
    //String expression = "//*[not(*)]";
    String expression = "([0-9]{4})-([0-9]{2})-([0-9]{2})";
    DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = builderFactory.newDocumentBuilder();
    Document document = builder.parse(file);
    document.getDocumentElement().normalize();
    NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(document, XPathConstants.NODESET);

    for (int i = 0; i < nodeList.getLength(); i++) {
        System.out.println(getXPath(nodeList.item(i)));
    }
}

private static String getXPath(Node node) {
    Node parent = node.getParentNode();

    if (parent == null) {
        return node.getNodeName();
    }

    return getXPath(parent) + "/" + node.getNodeName();
}

public static void main(String[] args) throws Exception {
    xpathNodes();
}

I know the regex ([0-9]{4})-([0-9]{2})-([0-9]{2}) works: I have used it in Notepad++, where it finds the dates within the opened XML file just fine.

I am currently getting the error:

Exception in thread "main" javax.xml.transform.TransformerException: A location path was expected, but the following token was encountered: [

This doesn't even take the comments into consideration yet.

Any help would be great!

Regular expressions have been part of XPath since version 2.0. Also, it would be better to have a well-formed input sample.
– Alejandro, Commented Jan 14, 2020 at 15:49
In addition to what @Alejandro commented, realize that even with XPath 2.0, a regex by itself is not an XPath expression, yet your code is calling xPath.compile(expression) as if it were. See matches() within a predicate if you're using an XPath 2.0 processor, or adopt two stages of XPath + Java regex processing if you're limited to XPath 1.0.
– kjhughes, Commented Jan 14, 2020 at 16:28
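
A minimal sketch of the two-stage approach mentioned in the comment above, assuming only the JDK's built-in XPath 1.0 engine. The class name TwoStageDateFinder and its method names are made up for illustration. XPath selects every leaf element, then java.util.regex decides which of them hold a date, and a small DOM walk looks for a comment directly in front of each match:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question or answers.
final class TwoStageDateFinder {
    // Same pattern as in the question, without the capture groups.
    private static final Pattern DATE = Pattern.compile("[0-9]{4}-[0-9]{2}-[0-9]{2}");

    /** Returns one entry per date element, prefixed with its comment when one directly precedes it. */
    static List<String> findDates(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Stage 1: plain XPath 1.0 selects every element without element children.
            NodeList leaves = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[not(*)]", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < leaves.getLength(); i++) {
                Node leaf = leaves.item(i);
                String text = leaf.getTextContent().trim();
                // Stage 2: the Java regex must match the whole text of the leaf.
                if (DATE.matcher(text).matches()) {
                    String comment = precedingComment(leaf);
                    result.add(comment == null ? text : comment + ": " + text);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the XML", e);
        }
    }

    /** Returns the trimmed comment that directly precedes the node, or null if there is none. */
    private static String precedingComment(Node node) {
        for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.COMMENT_NODE) {
                return s.getNodeValue().trim();
            }
            // Skip whitespace-only text; any other node means no adjacent comment.
            if (s.getNodeType() != Node.TEXT_NODE || !s.getNodeValue().trim().isEmpty()) {
                return null;
            }
        }
        return null;
    }
}
```

Under these assumptions, only the first date in a run of sibling date elements gets a comment prefix, because the later ones are preceded by another element rather than by a comment.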

Alejandro · Accepted Answer · 2020-01-14 20:47:45Z

1

For an XPath 1.0 expression without RegEx you might well use:

//*[string-length()=10]
   [number(substring(.,1,4))=substring(.,1,4)]
   [substring(.,5,1)='-']
   [number(substring(.,6,2))=substring(.,6,2)]
   [substring(.,8,1)='-']
   [number(substring(.,9,2))=substring(.,9,2)]
|
//*[string-length()=10]
   [number(substring(.,1,4))=substring(.,1,4)]
   [substring(.,5,1)='-']
   [number(substring(.,6,2))=substring(.,6,2)]
   [substring(.,8,1)='-']
   [number(substring(.,9,2))=substring(.,9,2)]
   /preceding-sibling::node()[normalize-space()][1][self::comment()]

Do note: part of the expression is duplicated because you wanted to select both element and comment nodes. The expression uses the well-known idiom for number testing. Finally, because there is no guarantee about the parser's setting for whitespace-only text nodes, the normalize-space() function is applied before the positional predicate.

Test in here

Edit: enforcing string length.
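
As a sketch of how the expression above might be used from Java, assuming the JDK's built-in javax.xml.xpath engine and a DOM document: the class name XPath10DateFinder and the methods parse, find and path are hypothetical names chosen for this example. The repeated predicate chain is built once as a string and reused for both halves of the union.

```java
import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class wrapping the XPath 1.0 expression from this answer.
final class XPath10DateFinder {
    // Predicate chain: string value shaped like NNNN-NN-NN.
    private static final String IS_DATE =
            "[string-length()=10]"
            + "[number(substring(.,1,4))=substring(.,1,4)]"
            + "[substring(.,5,1)='-']"
            + "[number(substring(.,6,2))=substring(.,6,2)]"
            + "[substring(.,8,1)='-']"
            + "[number(substring(.,9,2))=substring(.,9,2)]";

    // Union of the date elements and the comment directly preceding each of them.
    private static final String EXPRESSION =
            "//*" + IS_DATE
            + " | //*" + IS_DATE
            + "/preceding-sibling::node()[normalize-space()][1][self::comment()]";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not parseable as XML", e);
        }
    }

    /** Returns one "Comment: ..." or "Date: ..." line per selected node. */
    static List<String> find(Document doc) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node n = nodes.item(i);
                if (n.getNodeType() == Node.COMMENT_NODE) {
                    // For a comment, report the path of the element that contains it.
                    lines.add("Comment: <!--" + n.getNodeValue() + "--> ("
                            + path(n.getParentNode()) + ")");
                } else {
                    lines.add("Date: " + n.getTextContent() + " (" + path(n) + ")");
                }
            }
            return lines;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Builds a slash-separated path of node names, starting at the root element.
    private static String path(Node node) {
        Node parent = node.getParentNode();
        if (parent == null || parent.getNodeType() == Node.DOCUMENT_NODE) {
            return node.getNodeName();
        }
        return path(parent) + "/" + node.getNodeName();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to the XML file (base_ in the question).
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new File(args[0]));
        find(doc).forEach(System.out::println);
    }
}
```

Two caveats on this sketch: the printed paths include the root element, unlike the sample output in the question, and the listing of each comment before its date relies on the engine returning the node set in document order, which XPath 1.0 itself does not guarantee.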


1 Comment

StealthRT Over a year ago

Thanks for the reply @Alejandro, but from the test I'm not seeing how I use that within my Java code?

Michael Kay · Accepted Answer · 2020-01-14 20:26:09Z

1

You have supplied a regular expression to an API that expects an XPath expression.
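
A small sketch of that mismatch, assuming the JDK's built-in XPath engine (the class name XPathCompileCheck and the method compiles are hypothetical): the regex from the question is rejected by the XPath parser when it is compiled, while a genuine XPath 1.0 expression such as the commented-out one in the question is accepted.

```java
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

// Hypothetical helper that only checks whether a string parses as XPath.
final class XPathCompileCheck {
    /** Returns true if the JDK's XPath engine accepts the string as an expression. */
    static boolean compiles(String expression) {
        try {
            XPathFactory.newInstance().newXPath().compile(expression);
            return true;
        } catch (XPathExpressionException e) {
            // The parser rejected the string, as it does for the regex in the question.
            return false;
        }
    }

    public static void main(String[] args) {
        // The regex from the question, passed where an XPath expression is expected.
        System.out.println(compiles("([0-9]{4})-([0-9]{2})-([0-9]{2})"));
        // A real XPath 1.0 expression selecting elements without element children.
        System.out.println(compiles("//*[not(*)]"));
    }
}
```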

You can use regular expressions with XPath but you will need a processor that supports XPath 2.0 or later (for example Saxon). The XPath processor that comes with the JDK still only supports the ancient XPath 1.0 standard, which has no regex support.

You can't supply a regex directly to xpath.compile(), but you can supply an XPath expression of the form //*[matches(., '--my regex--')].

If you do decide to go down the Saxon route, I would recommend using Saxon's internal tree model rather than DOM, as this executes XPath typically five to ten times faster than DOM.

