
I have a performance problem when trying to create a CSV file starting from another CSV file. This is how the original file looks:

country,state,co,olt,olu,splitter,ont,cpe,cpe.latitude,cpe.longitude,cpe.customer_class,cpe.phone,cpe.ip,cpe.subscriber_id
COUNTRY-0001,STATE-0001,CO-0001,OLT-0001,OLU0001,SPLITTER-0001,ONT-0001,CPE-0001,28.21487,77.451775,ALL,SIP:[email protected],SIP:[email protected],CPE_SUBSCRIBER_ID-QHLHW4
COUNTRY-0001,STATE-0002,CO-0002,OLT-0002,OLU0002,SPLITTER-0002,ONT-0002,CPE-0002,28.294018,77.068924,ALL,SIP:[email protected],SIP:[email protected],CPE_SUBSCRIBER_ID-AH8NJQ

Potentially there could be millions of lines like this; I have detected the problem with 1,280,000 lines.

This is the algorithm:

File csvInputFile = new File(csv_path);
int blockSize = 409600;
FileReader frCsvInputFile = new FileReader(csvInputFile);
BufferedReader brCsvInputFile = new BufferedReader(frCsvInputFile, blockSize);

String line = null;
StringBuilder sbIntermediate = new StringBuilder();
skipFirstLine(brCsvInputFile);
while ((line = brCsvInputFile.readLine()) != null) {
    createIntermediateStringBuffer(sbIntermediate, line.split(REGEX_COMMA));
}


private static void skipFirstLine(BufferedReader br) throws IOException {
    String line = br.readLine();
    String[] splitLine = line.split(REGEX_COMMA);
    LOGGER.debug("First line detected! ");
    createIndex(splitLine);
    createIntermediateIndex(splitLine);
}

private static void createIndex(String[] splitLine) {
    LOGGER.debug("START method createIndex.");
    for (int i = 0; i < splitLine.length; i++)
        headerIndex.put(splitLine[i], i);
    printMap(headerIndex);
    LOGGER.debug("COMPLETED method createIndex.");
}

private static void createIntermediateIndex(String[] splitLine) {

    LOGGER.debug("START method createIntermediateIndex.");
    com.tekcomms.c2d.xml.model.v2.Metadata_element[] metadata_element = null;
    String[] servicePath = newTopology.getElement().getEntity().getService_path().getLevel();

    if (newTopology.getElement().getMetadata() != null)
        metadata_element = newTopology.getElement().getMetadata().getMetadata_element();

    LOGGER.debug(Arrays.toString(servicePath));
    if (metadata_element != null)
        LOGGER.debug(Arrays.toString(metadata_element));

    headerIntermediateIndex.clear();
    int indexIntermediateId = 0;
    for (int i = 0; i < servicePath.length; i++) {
        String level = servicePath[i];
        LOGGER.debug("level is: " + level);
        headerIntermediateIndex.put(level, indexIntermediateId);
        indexIntermediateId++;
        // its identifier goes in the next slot
        headerIntermediateIndex.put(level + "ID", indexIntermediateId);
        indexIntermediateId++;
    }
    // adding cpe.latitude, cpe.longitude, cpe.customer_class; it would be
    // better if these were metadata as well.
    String labelLatitude = newTopology.getElement().getEntity().getLatitude();
    // indexIntermediateId++;
    headerIntermediateIndex.put(labelLatitude, indexIntermediateId);
    String labelLongitude = newTopology.getElement().getEntity().getLongitude();
    indexIntermediateId++;
    headerIntermediateIndex.put(labelLongitude, indexIntermediateId);
    String labelCustomerClass = newTopology.getElement().getCustomer_class();
    indexIntermediateId++;
    headerIntermediateIndex.put(labelCustomerClass, indexIntermediateId);

    // adding metadata
    // cpe.phone,cpe.ip,cpe.subscriber_id,cpe.vendor,cpe.model,cpe.customer_status,cpe.contact_telephone,cpe.address,
    // cpe.city,cpe.state,cpe.zip,cpe.bootfile,cpe.software_version,cpe.hardware_version
    // now i need to iterate over each Metadata_element belonging to
    // topology.element.metadata
    // are there any metadata?
    if (metadata_element != null && metadata_element.length != 0)
        for (int j = 0; j < metadata_element.length; j++) {
            String label = metadata_element[j].getLabel();
            label = label.toLowerCase();
            LOGGER.debug(" ==label: " + label + " index_pos: " + j);
            indexIntermediateId++;
            headerIntermediateIndex.put(label, indexIntermediateId);
        }

    printMap(headerIntermediateIndex);
    LOGGER.debug("COMPLETED method createIntermediateIndex.");
}

Reading the entire dataset of 1,280,000 lines takes only 800 ms! So the problem is in this method:

private static void createIntermediateStringBuffer(StringBuilder sbIntermediate, String[] splitLine)
        throws ClassCastException, NullPointerException {

    LOGGER.debug("START method createIntermediateStringBuffer.");
    long start, end;
    start = System.currentTimeMillis();
    ArrayList<String> hashes = new ArrayList<String>();
    com.tekcomms.c2d.xml.model.v2.Metadata_element[] metadata_element = null;

    String[] servicePath = newTopology.getElement().getEntity().getService_path().getLevel();
    LOGGER.debug(Arrays.toString(servicePath));

    if (newTopology.getElement().getMetadata() != null) {
        metadata_element = newTopology.getElement().getMetadata().getMetadata_element();
        LOGGER.debug(Arrays.toString(metadata_element));
    }

    for (int i = 0; i < servicePath.length; i++) {
        String level = servicePath[i];
        LOGGER.debug("level is: " + level);
        if (splitLine.length > getPositionFromIndex(level)) {
            String name = splitLine[getPositionFromIndex(level)];
            sbIntermediate.append(name);
            hashes.add(name);
            sbIntermediate.append(REGEX_COMMA).append(HashUtils.calculateHash(hashes)).append(REGEX_COMMA);
            LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
        }
    }

    //      end=System.currentTimeMillis();
    //      LOGGER.info("COMPLETED adding name hash. " + (end - start) + " ms. " + (end - start) / 1000 + " seg.");
    // adding cpe.latitude, cpe.longitude, cpe.customer_class; it would be
    // better if these were metadata as well.
    String labelLatitude = newTopology.getElement().getEntity().getLatitude();
    if (splitLine.length > getPositionFromIndex(labelLatitude)) {
        String lat = splitLine[getPositionFromIndex(labelLatitude)];
        sbIntermediate.append(lat).append(REGEX_COMMA);
    }

    String labelLongitude = newTopology.getElement().getEntity().getLongitude();
    if (splitLine.length > getPositionFromIndex(labelLongitude)) {
        String lon = splitLine[getPositionFromIndex(labelLongitude)];
        sbIntermediate.append(lon).append(REGEX_COMMA);
    }
    String labelCustomerClass = newTopology.getElement().getCustomer_class();
    if (splitLine.length > getPositionFromIndex(labelCustomerClass)) {
        String customerClass = splitLine[getPositionFromIndex(labelCustomerClass)];
        sbIntermediate.append(customerClass).append(REGEX_COMMA);
    }
    //      end=System.currentTimeMillis();
    //      LOGGER.info("COMPLETED adding lat,lon,customer. " + (end - start) + " ms. " + (end - start) / 1000 + " seg.");
    // watch out: metadata are optional and may appear as an empty string!
    if (metadata_element != null && metadata_element.length != 0)
        for (int j = 0; j < metadata_element.length; j++) {
            String label = metadata_element[j].getLabel();
            LOGGER.debug(" ==label: " + label + " index_pos: " + j);
            if (splitLine.length > getPositionFromIndex(label)) {
                String actualValue = splitLine[getPositionFromIndex(label)];
                if (!"".equals(actualValue))
                    sbIntermediate.append(actualValue).append(REGEX_COMMA);
                else
                    sbIntermediate.append("").append(REGEX_COMMA);
            } else
                sbIntermediate.append("").append(REGEX_COMMA);
            LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
        }//for
    sbIntermediate.append("\n");
    end = System.currentTimeMillis();
    LOGGER.info("COMPLETED method createIntermediateStringBuffer. " + (end - start) + " ms. ");
}

As you can see, this method reads every line from the input CSV file, calculates new data from those lines, and appends the generated line to the StringBuilder, so that at the end I can create the output file from that buffer.

I have run JConsole and I can see that there are no memory leaks; I can see the sawtooth pattern representing the creation of objects and the GC collecting garbage. It never crosses the heap memory threshold.

One thing I have noticed is that the time needed to append a new line to the StringBuilder starts within a very few ms (5, 6, 10), but rises with time to 100-200 ms, and I suspect it will keep growing, so this is probably the main bottleneck.

I have tried to analyze the code. I know that there are three for loops, but they are very short; the first loop iterates over only 8 elements:

for (int i = 0; i < servicePath.length; i++) {
    String level = servicePath[i];
    LOGGER.debug("level is: " + level);
    if (splitLine.length > getPositionFromIndex(level)) {
        String name = splitLine[getPositionFromIndex(level)];
        sbIntermediate.append(name);
        hashes.add(name);
        sbIntermediate.append(REGEX_COMMA).append(HashUtils.calculateHash(hashes)).append(REGEX_COMMA);
        LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
    }
}

I have measured the time needed to get the name from the split line and it is negligible, 0 ms; the same for the calculateHash method, 0 ms.

The other loops are practically the same; they iterate from 0 to n, where n is a very small int, 3 to 10 for example. So I do not understand why the method takes longer and longer to finish. The only explanation I can find is that appending a new line to the buffer is slowing the process down.

I am thinking about a producer/consumer multithreaded strategy: a reader thread that reads every line and puts it into a circular buffer, and other threads that take the lines one by one, process them, and add the precalculated lines to the StringBuffer, which is thread safe. When the file is fully read, the reader thread tells the other threads to stop. Finally I have to save this buffer to a file. What do you think? Is this a good idea?
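
A minimal sketch of what I have in mind, using java.util.concurrent classes, with an ArrayBlockingQueue as the circular buffer and a poison-pill sentinel to stop the workers (all names and sizes here are illustrative):

BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);   // the circular buffer
String POISON = "__POISON__";                                     // sentinel telling workers to stop
StringBuffer sb = new StringBuffer();                             // StringBuffer: appends are synchronized
int nWorkers = 4;
ExecutorService workers = Executors.newFixedThreadPool(nWorkers);

for (int i = 0; i < nWorkers; i++) {
    workers.submit(() -> {
        try {
            String taken;
            while (!(taken = queue.take()).equals(POISON)) {
                StringBuilder local = new StringBuilder(256);
                createIntermediateStringBuffer(local, taken.split(REGEX_COMMA));
                sb.append(local);                                 // one synchronized append per line
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
}

String line;
while ((line = brCsvInputFile.readLine()) != null)
    queue.put(line);
for (int i = 0; i < nWorkers; i++)
    queue.put(POISON);                                            // one sentinel per worker
workers.shutdown();
workers.awaitTermination(1, TimeUnit.HOURS);

One caveat I can already see: with several workers the output lines would not keep the input order.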

  • Seeing newTopology.getElement().getEntity() and all, you might use temp variables. And profile the application; the NetBeans IDE can do it out of the box, but in general that should be worth investigating. Some toString calls in logging might be costly. Also, writing to a file first could be better. At least provide an initial capacity based on file size, new StringBuilder(100000); (see the sketch after these comments). Commented Dec 4, 2014 at 13:46
  • Hi Joop, thanks for the response. newTopology.getElement().getEntity() is a property parsed from an XML file and it is calculated only once. I am going to follow your advice about the initial size of the StringBuilder. Commented Dec 4, 2014 at 14:06
  • Any time your StringBuilder needs to be expanded, Java creates a new char buffer, copies the current data into the new array, and then releases the old array. As the StringBuilder gets bigger, this process takes more and more time. Why don't you just write your results back out to a file immediately? Commented Dec 4, 2014 at 14:15
  • You should not cross-post the question, certainly not without linking the posts. Commented Dec 4, 2014 at 15:19
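
A sketch of that presizing suggestion; for an ASCII file, the byte length is a reasonable upper bound for the character count:

// Size the builder from the input file once, instead of letting it double repeatedly.
// For ASCII input, bytes ≈ chars, so the file length is a good upper bound.
long fileLength = csvInputFile.length();
StringBuilder sbIntermediate =
        new StringBuilder((int) Math.min(fileLength, Integer.MAX_VALUE - 8));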

2 Answers


I am thinking about a producer/consumer multithreaded strategy: a reader thread that reads every line and puts it into a circular buffer, and other threads that take the lines one by one, process them, and add the precalculated lines to the StringBuffer, which is thread safe. When the file is fully read, the reader thread tells the other threads to stop. Finally I have to save this buffer to a file. What do you think? Is this a good idea?

Maybe, but it's quite a lot of work, I'd try something simpler first.

line.split(REGEX_COMMA)

Your REGEX_COMMA is a string which gets compiled into a regex a million times. It's trivial, but I'd use a precompiled Pattern instead.
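
A minimal sketch, assuming REGEX_COMMA is just the literal ",":

import java.util.regex.Pattern;

class LineSplitter {
    // compiled once, reused for every line
    private static final Pattern COMMA = Pattern.compile(",");

    static String[] split(String line) {
        return COMMA.split(line);
    }
}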

You're producing a lot of garbage with your split. Maybe you should avoid it by manually splitting the input into a reused ArrayList<String> (it's just a few lines).
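
For example, a sketch of such a manual split; splitInto is a hypothetical helper that reuses the caller's list and allocates nothing but the substrings:

// Split line on ',' into a reused list: no regex, no per-line String[] allocation.
private static void splitInto(String line, List<String> out) {
    out.clear();
    int start = 0;
    int comma;
    while ((comma = line.indexOf(',', start)) >= 0) {
        out.add(line.substring(start, comma));
        start = comma + 1;
    }
    out.add(line.substring(start));   // last field (or the whole line if no comma)
}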

If all you need is to write the result into a file, it might be better to avoid building one huge String. Maybe a List<String> or even a List<StringBuilder> would be better; maybe writing directly to a buffered stream would do.
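
A sketch of the streaming variant, writing each transformed line straight out (the output path here is made up):

try (BufferedWriter out = Files.newBufferedWriter(
        Paths.get("intermediate.csv"), StandardCharsets.US_ASCII)) {
    String line;
    while ((line = brCsvInputFile.readLine()) != null) {
        StringBuilder sb = new StringBuilder(256);     // small, per-line buffer
        createIntermediateStringBuffer(sb, line.split(REGEX_COMMA));
        out.write(sb.toString());                      // the line already ends with '\n'
    }
}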

You seem to be working with ASCII only. Your encoding is platform dependent, which may mean you're using UTF-8, which is possibly slow. Switching to a simpler encoding could help.

Working with byte[] instead of String would most probably help. Bytes are half as big as chars and there's no conversion needed when reading a file. All the operations you do can be done with bytes equally easily.
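
A rough sketch of the byte-level idea; for ASCII data, field and record boundaries can be found directly in the byte buffer without decoding anything (buffer size and handling are illustrative):

byte[] buf = new byte[1 << 16];
try (InputStream in = new BufferedInputStream(new FileInputStream(csvInputFile))) {
    int n;
    while ((n = in.read(buf)) != -1) {
        for (int i = 0; i < n; i++) {
            if (buf[i] == ',') {
                // field boundary: process the bytes since the last boundary
            } else if (buf[i] == '\n') {
                // end of record
            }
        }
        // note: a record can span two read() calls, so real code
        // must carry the partial field across iterations
    }
}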

One thing I have noticed is that the time needed to append a new line to the StringBuilder starts within a very few ms (5, 6, 10), but rises with time to 100-200 ms, and I suspect it will keep growing, so this is probably the main bottleneck.

That's resizing, which could be sped up by using the suggested ArrayList<String>, as the amount of data to be copied is much lower. Writing the data out when the buffer gets big would do as well.
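
For example (writer here is a hypothetical BufferedWriter opened once; the threshold is arbitrary):

createIntermediateStringBuffer(sbIntermediate, splitLine);
if (sbIntermediate.length() > 1 << 20) {   // ~1M chars: flush before the next doubling
    writer.write(sbIntermediate.toString());
    sbIntermediate.setLength(0);           // keeps the backing array, resets the length
}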

I have measured the time needed to get the name from the split line and it is negligible, 0 ms; the same for the calculateHash method, 0 ms.

Never use currentTimeMillis for this, as nanoTime is strictly better. Use a profiler. The problem with a profiler is that it changes what it should measure. As a poor man's profiler, you can compute the sum of all the time spent inside the suspect method and compare it with the total time.
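
For example, wrapped around the suspect method:

long totalNanos = 0;
String line;
while ((line = brCsvInputFile.readLine()) != null) {
    long t0 = System.nanoTime();
    createIntermediateStringBuffer(sbIntermediate, line.split(REGEX_COMMA));
    totalNanos += System.nanoTime() - t0;  // accumulate only the time spent inside
}
LOGGER.info("total inside createIntermediateStringBuffer: "
        + totalNanos / 1_000_000 + " ms");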

What's the CPU load and what does GC do when running the program?



I used the Super CSV library in my project to handle large sets of lines. It is relatively fast compared to reading the lines manually.
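
A minimal sketch, assuming the super-csv jar is on the classpath:

// uses org.supercsv.io.CsvListReader and org.supercsv.prefs.CsvPreference
try (ICsvListReader reader = new CsvListReader(
        new FileReader(csv_path), CsvPreference.STANDARD_PREFERENCE)) {
    String[] header = reader.getHeader(true);   // consumes the header line
    List<String> row;
    while ((row = reader.read()) != null) {
        // each row arrives already parsed into fields
    }
}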

