I have a text file in which each line is an entire US patent XML document. I am trying to parse it to extract certain fields, such as the patent number. I haven't used XPath before, so I'm borrowing some code by Ravi Thapliyal from "Parse XML Simple String using Java XPath". However, the initial <!DOCTYPE> declaration apparently causes the parser to look on disk for the DTD file it references.

Here is my first attempt at code:

// Read the entire file into a list, one patent document per line
ArrayList<String> doc = new ArrayList<>();
while (input.hasNext()) {
    doc.add(input.nextLine().trim());
}

int index = 0;
while (index < doc.size()) {
    String xml = doc.get(index);
    XPathFactory xpathFactory = XPathFactory.newInstance();
    XPath xPath = xpathFactory.newXPath();
    InputSource source = new InputSource(new StringReader(xml));

    // db is presumably a DocumentBuilder created earlier (not shown)
    db.setEntityResolver(new EntityResolver() {
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, java.io.IOException {
            if (systemId.contains("us-patent-grant-v40-2004-12-02.dtd")) {
                return new InputSource(new StringReader(""));
            } else {
                return null;
            }
        }
    });

    String orgName = "";
    try {
        orgName = (String) xPath.evaluate("/agents/adressbook/orgname", source, XPathConstants.STRING);
    } catch (Exception e) {
        e.printStackTrace();
    }

    System.out.println("Document #" + index + " Company: " + orgName);
    index++; // advance to the next line; without this the loop never ends
} // end while loop that goes through each line (patent document) in file

Each line in the input file begins with the following DOCTYPE declaration: <!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v40-2004-12-02.dtd" [ ]>

The line that causes the problem (91) is:

orgName = (String) xPath.evaluate("/agents/adressbook/orgname", 
       source,XPathConstants.STRING);

And the stacktrace is:

java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:131)
    at java.io.FileInputStream.<init>(FileInputStream.java:87)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:616)
Document #0 Company: 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1293)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1260)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:466)
    at Parser.main(Parser.java:102)
--------------- linked to ------------------
javax.xml.xpath.XPathExpressionException: java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:473)
    at Parser.main(Parser.java:102)
Caused by: java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:131)
    at java.io.FileInputStream.<init>(FileInputStream.java:87)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:616)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1293)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1260)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:466)

Can someone help me figure out what I should be doing to parse a document in a string?

2 Answers

Try setting parser features, or supply an empty EntityResolver.

Features are implementation-specific, so you first need to find out which parser implementation you are using. See:

Make DocumentBuilder.parse ignore DTD references
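
One likely reason the resolver in the question has no effect: as the stack trace shows, XPath.evaluate(String, InputSource, QName) parses the input itself (XPathImpl.evaluate calls DocumentBuilderImpl.parse), so a resolver set on a separate DocumentBuilder is never consulted. Below is a minimal sketch of the alternative: parse the string with your own DocumentBuilder, then run XPath against the resulting Document. The class and method names, the sample XML in main, and the //agents//addressbook/orgname path are assumptions for illustration, not taken from the real patent schema. The load-external-dtd feature is Xerces-specific; the stack trace suggests that is the parser in use here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PatentOrgNameExtractor {

    /** Builds a DocumentBuilder that should never try to load the external DTD. */
    public static DocumentBuilder newDtdIgnoringBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        // Xerces-specific feature: do not load the external DTD when not validating.
        factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

        DocumentBuilder builder = factory.newDocumentBuilder();
        // Belt and braces: answer every external entity request with empty content.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder;
    }

    /** Parses one XML string with the given builder, then evaluates XPath on the DOM. */
    public static String extractOrgName(DocumentBuilder builder, XPath xPath, String xml)
            throws Exception {
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        // Hypothetical path: adjust it to the actual structure of the patent documents.
        return xPath.evaluate("//agents//addressbook/orgname", document);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample standing in for one line of the input file.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<!DOCTYPE us-patent-grant SYSTEM \"us-patent-grant-v40-2004-12-02.dtd\" [ ]>"
                + "<us-patent-grant><agents><agent><addressbook>"
                + "<orgname>Acme LLP</orgname>"
                + "</addressbook></agent></agents></us-patent-grant>";

        // Create the builder and XPath once and reuse them for every line.
        DocumentBuilder builder = newDtdIgnoringBuilder();
        XPath xPath = XPathFactory.newInstance().newXPath();
        System.out.println("Company: " + extractOrgName(builder, xPath, xml));
    }
}
```

Either the feature or the empty resolver should be enough on its own; both are shown so you can keep whichever your parser accepts. Note that skipping the DTD also means any entities it defines will not be available.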

Comments

I've tried that and still get the same error. I've altered the original question to show the new code and the stacktrace as requested. Thanks.
Have you tried the builder.setEntityResolver code from the link?
Yes, I just tried that and am getting the exact same stack trace.
Okay, put in the stack trace and code using setEntityResolver. I'm running out of time, so I might have to try another approach entirely. Thanks, guys.

Have you tried supplying the DTD file it's trying to reference, e.g. download it from us-patent-application-v40-2004-12-02.dtd?

You can try putting this file in the current directory of the parsing process, which is where the stack trace shows the parser looking, or in the same folder as the XML (try both since you're in a hurry).
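
If you would rather keep the DTD somewhere other than the working directory, an EntityResolver can redirect the lookup to a local copy. This is only a sketch under assumptions: the class name, method name and command-line arguments are made up for illustration, and the real DTD may itself reference further entity files that would need to sit next to it.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LocalDtdParser {

    /** Parses an XML string, resolving its DOCTYPE reference to a local DTD file. */
    public static Document parseWithLocalDtd(String xml, File dtdFile) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            // The parser passes the system ID expanded to a full URI,
            // so match on the file name at its end.
            if (systemId != null && systemId.endsWith(dtdFile.getName())) {
                return new InputSource(dtdFile.toURI().toString());
            }
            return null; // fall back to the parser's default resolution
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: args[0] = path to the local DTD copy, args[1] = one XML line.
        Document document = parseWithLocalDtd(args[1], new File(args[0]));
        System.out.println("Root element: " + document.getDocumentElement().getNodeName());
    }
}
```

As with any XPath work on the result, evaluate against the returned Document rather than passing an InputSource to XPath.evaluate, otherwise the resolver is bypassed.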
