This may be beyond the capabilities of the Java VM because of the size of the files involved (50-100 MB XML files).

Right now I receive a set of XML files as zips. The zips are all decompressed, and then every XML file in the directory is processed one at a time using SAX.

To save time and space (since the compression is about 1:10), I was wondering whether there is a way to pass a ZipEntry that is an XML file to a SAX handler.

I've seen it done using DocumentBuilder and other XML parsing methods, but for performance (and especially memory) reasons I'm sticking with SAX.

Currently I am using SAX in the following way:

        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();

        MyHandler handler = new MyHandler();

        for( String curFile : xmlFiles )
        {
            System.out.println( "\n\n\t>>>>> open " + curFile + " <<<<<\n");
            // parse(File, handler) avoids hand-building a "file://" URI, which breaks on Windows paths
            saxParser.parse( new File( dirToProcess + curFile ), handler );
        }

2 Answers

9

You can parse an XML document using an InputStream as the source. So you can open a ZipFile, get the InputStream of the entry you want, and then parse it. See the ZipFile.getInputStream(ZipEntry) method.

---- Edit ----

Here is some code to guide you:

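for( String curFile : xmlFiles )
{
        // curFile now names a zip archive rather than an extracted XML file
        try (ZipFile zip = new ZipFile(new File( dirToProcess + curFile))) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()){
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) continue; // nothing to parse
                // the entry is decompressed on the fly as the parser reads it
                try (InputStream xmlStream = zip.getInputStream(entry)) {
                    saxParser.parse( xmlStream, handler );
                }
            }
        } // try-with-resources closes the archive even if parsing fails
}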
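
For anyone who wants something they can run as-is, here is a minimal self-contained sketch of the same idea. The class name ZipFileSaxDemo, the CountingHandler, and the throwaway archive built by sampleZip() are all made up for illustration (they stand in for your own handler and zips), and the ".xml" name filter is an assumption about how your archives are laid out.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ZipFileSaxDemo {

    /** Hypothetical handler: counts start tags instead of doing real work. */
    static class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Parses each *.xml entry straight from the archive; returns entry name -> element count. */
    static Map<String, Integer> countElements(File zipPath) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // assumption: XML entries can be recognised by their file extension
                if (entry.isDirectory() || !entry.getName().endsWith(".xml")) continue;
                CountingHandler handler = new CountingHandler();
                try (InputStream xml = zip.getInputStream(entry)) {
                    parser.parse(xml, handler); // decompressed on the fly, never written to disk
                }
                counts.put(entry.getName(), handler.elements);
            }
        }
        return counts;
    }

    /** Writes a small throwaway archive so the sketch runs without external files. */
    static File sampleZip() throws IOException {
        File f = Files.createTempFile("sax-demo", ".zip").toFile();
        f.deleteOnExit();
        try (OutputStream os = Files.newOutputStream(f.toPath());
             ZipOutputStream zos = new ZipOutputStream(os)) {
            zos.putNextEntry(new ZipEntry("a.xml"));
            zos.write("<root><item/><item/></root>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("readme.txt"));
            zos.write("not xml".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return f;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements(sampleZip())); // prints {a.xml=3}
    }
}
```

Memory use should stay close to what you see with plain files, since SAX only ever holds the parser's read buffer and the zip inflater's window, not the whole document.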

2 Comments

See the sample of my current implementation that I added above; I'm not sure how to use streams with the SAX parse call.
Appears to be working, though it will take a good 30 minutes to run. These are very large files.
1
  • ZipInputStream.read() reads a number of bytes from the current ZipEntry, decompresses them, and returns the decompressed bytes.
  • Use any of the methods here to create a connected in/out stream pair, such as a PipedInputStream attached to a PipedOutputStream.
  • Give the input end of that pair to your parser as its InputStream.
  • Start writing the unzipped data to the output end (the OutputStream).
  • You are now reading chunks of data from the zip file, unzipping them, and passing them to the parser.

PS:

  1. If the zip file contains multiple files, see this: extracting contents of ZipFile entries when read from byte[] (Java). You'll have to add a check so that you know when you reach the end of an entry.
  2. I don't know much about the SAX parser, but I assume it can parse the file in this manner (when given the data in chunks).
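
To illustrate PS item 1, here is a rough sketch of walking several entries with ZipInputStream when the zip arrives as a stream or byte[] rather than a file. getNextEntry() supplies the end-of-entry check: read() returns -1 at the end of the current entry, so the parser sees a normal end of stream. The class name ZipStreamSax, the NonClosingStream wrapper, and the in-memory sample archive are made up for illustration. The wrapper is there on the assumption that the parser implementation may close the stream it is handed, which would otherwise close the whole ZipInputStream after the first entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ZipStreamSax {

    /** Shields the underlying ZipInputStream from close() calls made by the parser. */
    static class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) { super(in); }

        @Override
        public void close() { /* intentionally left open; the caller closes the zip */ }
    }

    /** Parses every entry in turn and records "entryName:elementName" for each start tag. */
    static List<String> parseAll(InputStream zipped) throws Exception {
        List<String> seen = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            ZipEntry entry;
            // getNextEntry() positions the stream at the next entry, or returns null at the end
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                final String name = entry.getName();
                parser.parse(new NonClosingStream(zin), new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes atts) {
                        seen.add(name + ":" + qName);
                    }
                });
            }
        }
        return seen;
    }

    /** Builds a small two-entry zip in memory so the sketch runs without external files. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("a.xml"));
            zos.write("<root><item/></root>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.xml"));
            zos.write("<root/>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // prints [a.xml:root, a.xml:item, b.xml:root]
        System.out.println(parseAll(new ByteArrayInputStream(sampleZip())));
    }
}
```

This variant needs no second thread, because the parser pulls bytes directly from the ZipInputStream instead of having them pushed through a pipe.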

--- edit ---

Here is what I meant:

import java.io.File;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Main {
    static class MyRunnable implements Runnable {

        private InputStream xmlStream;
        private SAXParser sParser;

        public MyRunnable(SAXParser p, InputStream is) {
            sParser = p;
            xmlStream = is;
        }

        public void run() {
            try {
                sParser.parse(xmlStream, new DefaultHandler() {
                    public void startElement(String uri, String localName, String qName, Attributes attributes)
                            throws SAXException {
                        System.out.println("\nStart Element :" + qName);
                    }

                    public void endElement(String uri, String localName, String qName) throws SAXException {
                        System.out.println("\nEnd Element :" + qName);
                    }
                });
                System.out.println("Done parsing..");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

    }

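    // deliberately tiny so the chunking is visible in the demo; use something like 8192 for real files
    final static int BUF_SIZE = 5;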
    public static void main(String argv[]) {

        try {

            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();

            ZipFile zip = new ZipFile(new File("D:\\Workspaces\\Indigo\\Test\\performance.zip"));
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                // in stream for parser..
                PipedInputStream xmlStream = new PipedInputStream();
                // out stream attached to in stream above.. we would read from zip file and write to this..
                // thus passing whatever we write to the parser..
                PipedOutputStream out = new PipedOutputStream(xmlStream);
                // Parser blocks in in stream, so put him on a different thread..
                Thread parserThread = new Thread(new Main.MyRunnable(saxParser, xmlStream));
                parserThread.start();

                ZipEntry entry = entries.nextElement();
                System.out.println("\nOpening zip entry: " + entry.getName());
                InputStream unzippedStream = zip.getInputStream(entry);

                byte buf[] = new byte[BUF_SIZE]; int bytesRead = 0;
                while ((bytesRead = unzippedStream.read(buf)) > 0) {
                    // write to err for different color in eclipse..
                    System.err.write(buf, 0, bytesRead);
                    out.write(buf, 0, bytesRead);
                    Thread.sleep(150); // theatrics...
                }

                // closing the write end of the pipe is what signals end-of-stream to the parser
                out.close();
                // wait for the parser to finish this entry before saxParser is reused for the next one
                parserThread.join();

                unzippedStream.close(); xmlStream.close();
            }

        } catch (Exception e) {
            e.printStackTrace();
        }

    }
}
