Hey all, I am trying to read an XML file and collect only the elements whose value is a date formatted like YYYY-MM-DD.

Here is an online example: https://repl.it/repls/MedicalIgnorantEfficiency

Here is an example of my xml to parse:

<?xml version="1.0" encoding="UTF-8"?>
<ncc:Message xmlns:ncc="http://blank/1.0.6" 
xmlns:cs="http://blank/1.0.0" 
xmlns:jx="http://blank/1.0.0"
xmlns:jm="http://blank/1.0.0"
xmlns:n-p="http://blank/1.0.0"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://blank/1.0.6/person person.xsd">
    <ncc:DataSection>
        <ncc:PersonResponse>
            <!-- Message -->
            <cs:CText cs:type="No">NO WANT</cs:CText>
            <jm:CaseID>
                <!-- OEA -->
                <jm:ID>ABC123</jm:ID>
            </jm:CaseID>
            <jx:PersonName>
                <!-- NAM -->
                <jx:GivenName>Arugula</jx:GivenName>
                <jx:MiddleName>Pibb</jx:MiddleName>
                <jx:SurName>Atari</jx:SurName>
            </jx:PersonName>
            <!-- DOB -->
            <ncc:PersonBirthDateText>1948-05-11</ncc:PersonBirthDateText>
            <jx:PersonDetails>
                <!-- SXC -->
                <jx:PersonSSN>
                    <jx:ID/>
                </jx:PersonSSN>
            </jx:PersonDetails>
            <n-p:Activity>
                <!--DOZ-->
                <jx:ActivityDate>1996-04-04</jx:ActivityDate>
                <jx:HomeAgency xsi:type="cs:Organization">
                    <!-- ART -->
                    <jx:Organization>
                        <jx:ID>ZR5981034</jx:ID>
                    </jx:Organization>
                </jx:HomeAgency>
            </n-p:Activity>
            <jx:PersonName>
                <!-- DOB Newest -->
                <ncc:BirthDateText>1993-05-12</ncc:BirthDateText>
                <ncc:BirthDateText>1993-05-13</ncc:BirthDateText>
                <ncc:BirthDateText>1993-05-14</ncc:BirthDateText>
                <jx:IDDetails xsi:type="cs:IDDetails">
                    <!-- SMC Checker -->
                    <jx:SSNID>
                        <jx:ID/>
                    </jx:SSNID>
                </jx:IDDetails>
            </jx:PersonName>
        </ncc:PersonResponse>
    </ncc:DataSection>
</ncc:Message>

I want to get the date value(s) and the comment above each of those date values. For the example XML above, the output would look something like this:

Comment: < !-- DOB --> (ncc:DataSection/ncc:PersonResponse)

Date: 1948-05-11 (ncc:DataSection/ncc:PersonResponse/ncc:PersonBirthDateText)

.

Comment: < !-- DOZ --> (ncc:DataSection/ncc:PersonResponse/n-p:Activity)

Date: 1996-04-04 (ncc:DataSection/ncc:PersonResponse/n-p:Activity/jx:ActivityDate)

.

Comment: < !-- DOB Newest --> (ncc:DataSection/ncc:PersonResponse/jx:PersonName)

Date:

  1993-05-12 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)
  1993-05-13 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)
  1993-05-14 (ncc:DataSection/ncc:PersonResponse/jx:PersonName/ncc:BirthDateText)

The code I am trying to do this with is:

public static void xpathNodes() throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
    File file = new File(base_);
    XPath xPath = XPathFactory.newInstance().newXPath();
    //String expression = "//*[not(*)]";
    String expression = "([0-9]{4})-([0-9]{2})-([0-9]{2})";
    DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = builderFactory.newDocumentBuilder();
    Document document = builder.parse(file);
    document.getDocumentElement().normalize();
    NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(document, XPathConstants.NODESET);

    for (int i = 0; i < nodeList.getLength(); i++) {
        System.out.println(getXPath(nodeList.item(i)));
    }
}

private static String getXPath(Node node) {
    Node parent = node.getParentNode();

    if (parent == null) {
        return node.getNodeName();
    }

    return getXPath(parent) + "/" + node.getNodeName();
}

public static void main(String[] args) throws Exception {
    xpathNodes();
}

I know the regex (([0-9]{4})-([0-9]{2})-([0-9]{2})) works, as I have used it in Notepad++ and it finds the dates in the opened XML file just fine.

I am currently getting the error:

Exception in thread "main" javax.xml.transform.TransformerException: A location path was expected, but the following token was encountered: [

This doesn't even take the comments into consideration yet.

Any help would be great!

  • Regular expressions have been part of XPath since version 2.0. Also, it would be better to have a well-formed input sample. Commented Jan 14, 2020 at 15:49
  • @Alejandro i'll see what i can do Commented Jan 14, 2020 at 15:49
  • In addition to what @Alejandro commented, realize that even with XPath 2.0, a regex by itself is not an XPath expression, yet your code calls xPath.compile(expression) as if it were. See matches() within a predicate if you're using an XPath 2.0 processor, or adopt two stages of XPath + Java regex processing if you're limited to XPath 1.0. Commented Jan 14, 2020 at 16:28
  • @Alejandro repl.it/repls/MedicalIgnorantEfficiency Commented Jan 14, 2020 at 17:50

2 Answers

1

For an XPath 1.0 expression without RegEx you might well use:

//*[string-length()=10]
   [number(substring(.,1,4))=substring(.,1,4)]
   [substring(.,5,1)='-']
   [number(substring(.,6,2))=substring(.,6,2)]
   [substring(.,8,1)='-']
   [number(substring(.,9,2))=substring(.,9,2)]
|
//*[string-length()=10]
   [number(substring(.,1,4))=substring(.,1,4)]
   [substring(.,5,1)='-']
   [number(substring(.,6,2))=substring(.,6,2)]
   [substring(.,8,1)='-']
   [number(substring(.,9,2))=substring(.,9,2)]
   /preceding-sibling::node()[normalize-space()][1][self::comment()]

Do note: part of the expression is duplicated because you wanted to select both the elements and the comment nodes. The expression uses the well-known idiom for number testing, number(x) = x, which is false for non-numeric strings because NaN never equals anything. Finally, because there is no guarantee about how the parser handles whitespace-only text nodes, the normalize-space() function is applied before the positional predicate.
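To use this from Java, the expression can be passed as-is to the JDK's built-in XPath 1.0 processor. The following is only a minimal sketch: the class name XPath1DateFinder, the helper names findDates and path, and the trimmed-down SAMPLE document are made up for illustration and are not from the question. It also assumes the JDK returns the union in document order, so each comment shows up just before its date.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XPath1DateFinder {
    // Predicates from the expression above: a 10-character value shaped like NNNN-NN-NN.
    static final String DATE_ELEMENTS = "//*[string-length()=10]"
            + "[number(substring(.,1,4))=substring(.,1,4)]"
            + "[substring(.,5,1)='-']"
            + "[number(substring(.,6,2))=substring(.,6,2)]"
            + "[substring(.,8,1)='-']"
            + "[number(substring(.,9,2))=substring(.,9,2)]";

    // Union of the date elements and the comment immediately preceding each one.
    static final String EXPRESSION = DATE_ELEMENTS + " | " + DATE_ELEMENTS
            + "/preceding-sibling::node()[normalize-space()][1][self::comment()]";

    // Hypothetical, shortened version of the XML from the question.
    static final String SAMPLE =
            "<ncc:Message xmlns:ncc=\"http://blank/1.0.6\" xmlns:jx=\"http://blank/1.0.0\">\n"
            + "  <ncc:PersonResponse>\n"
            + "    <!-- DOB -->\n"
            + "    <ncc:PersonBirthDateText>1948-05-11</ncc:PersonBirthDateText>\n"
            + "    <jx:PersonName>\n"
            + "      <!-- DOB Newest -->\n"
            + "      <ncc:BirthDateText>1993-05-12</ncc:BirthDateText>\n"
            + "      <ncc:BirthDateText>1993-05-13</ncc:BirthDateText>\n"
            + "      <jx:ID>ABC123</jx:ID>\n"
            + "    </jx:PersonName>\n"
            + "  </ncc:PersonResponse>\n"
            + "</ncc:Message>";

    static List<String> findDates(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                if (node.getNodeType() == Node.COMMENT_NODE) {
                    // Report a comment together with the path of its parent element.
                    lines.add("Comment: " + node.getNodeValue().trim()
                            + " (" + path(node.getParentNode()) + ")");
                } else {
                    lines.add("Date: " + node.getTextContent() + " (" + path(node) + ")");
                }
            }
            return lines;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalStateException("Could not parse or query the XML", e);
        }
    }

    // Builds a slash-separated path of element names, starting at the root element.
    static String path(Node node) {
        Node parent = node.getParentNode();
        if (parent == null || parent.getNodeType() == Node.DOCUMENT_NODE) {
            return node.getNodeName();
        }
        return path(parent) + "/" + node.getNodeName();
    }

    public static void main(String[] args) {
        findDates(SAMPLE).forEach(System.out::println);
    }
}
```

Because the expression only uses * and node tests, it does not matter here whether the parser is namespace-aware. To read from a file as in the question, parse(file) can replace the InputSource.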

Test in here

Edit: enforcing string length.

1 Comment

Thanks for the reply @Alejandro, but from the test I'm not seeing how I would use that within my Java code.
1

You have supplied a regular expression to an API that expects an XPath expression.

You can use regular expressions with XPath but you will need a processor that supports XPath 2.0 or later (for example Saxon). The XPath processor that comes with the JDK still only supports the ancient XPath 1.0 standard, which has no regex support.

You can't supply a regex directly to xpath.compile(), but you can supply an XPath expression of the form //*[matches(., '--my regex--')].

If you do decide to go down the Saxon route, I would recommend using Saxon's internal tree model rather than DOM, as this executes XPath typically five to ten times faster than DOM.
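If switching processors is not an option, the two-stage approach mentioned in the comments also works with the JDK alone: let XPath 1.0 select the leaf elements (the //*[not(*)] expression that is commented out in the question), then apply the regex in plain Java. This is a rough sketch under that assumption. The names TwoStageDateFinder, findDates and precedingComment and the shortened SAMPLE document are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TwoStageDateFinder {
    // The regex from the question, without the capture groups.
    static final Pattern DATE = Pattern.compile("[0-9]{4}-[0-9]{2}-[0-9]{2}");

    // Hypothetical, shortened version of the XML from the question.
    static final String SAMPLE =
            "<ncc:Message xmlns:ncc=\"http://blank/1.0.6\" xmlns:jx=\"http://blank/1.0.0\">\n"
            + "  <!-- DOB -->\n"
            + "  <ncc:PersonBirthDateText>1948-05-11</ncc:PersonBirthDateText>\n"
            + "  <jx:PersonName>\n"
            + "    <!-- DOB Newest -->\n"
            + "    <ncc:BirthDateText>1993-05-12</ncc:BirthDateText>\n"
            + "    <ncc:BirthDateText>1993-05-13</ncc:BirthDateText>\n"
            + "    <jx:ID>ABC123</jx:ID>\n"
            + "  </jx:PersonName>\n"
            + "</ncc:Message>";

    static List<String> findDates(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Stage 1: XPath 1.0 selects every element that has no child elements.
            NodeList leaves = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[not(*)]", doc, XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < leaves.getLength(); i++) {
                Node leaf = leaves.item(i);
                String text = leaf.getTextContent().trim();
                // Stage 2: Java regex; matches() requires the whole value to be a date.
                if (!DATE.matcher(text).matches()) {
                    continue;
                }
                Node comment = precedingComment(leaf);
                if (comment != null) {
                    lines.add("Comment: " + comment.getNodeValue().trim());
                }
                lines.add("Date: " + text);
            }
            return lines;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalStateException("Could not parse or query the XML", e);
        }
    }

    // Returns the comment directly before the node, skipping whitespace-only text, or null.
    static Node precedingComment(Node node) {
        Node prev = node.getPreviousSibling();
        while (prev != null && prev.getNodeType() == Node.TEXT_NODE
                && prev.getNodeValue().trim().isEmpty()) {
            prev = prev.getPreviousSibling();
        }
        return (prev != null && prev.getNodeType() == Node.COMMENT_NODE) ? prev : null;
    }

    public static void main(String[] args) {
        findDates(SAMPLE).forEach(System.out::println);
    }
}
```

A date that directly follows another date has no comment of its own, so the repeated BirthDateText values are listed under the one comment that precedes the first of them.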
