2

I have a class that reads a CSV file, but when the file is large the program throws a Java heap space error. I need to split the file into pieces, writing a fixed number of lines to each smaller file.

For example, I have a file of 500 000 lines and I want to divide it into 5 files of 100 000 lines each, so that I can read them.

I couldn't find a way to do that, so it would be nice to see some example code.

  • Do you have to have all lines in memory? Otherwise you could read line by line and do your processing. Commented Mar 17, 2020 at 14:09
  • You could also try increasing the heap size. Commented Mar 17, 2020 at 14:11
  • @bwright I created a list of DTOs built from the lines, as you said. Splitting the file is my other option for reading that large CSV file. Do you have another option besides splitting the file into pieces? Commented Mar 17, 2020 at 14:18
  • You are supposed to show an honest attempt. The goals are to prove that you have researched and to ensure that any solution provided by someone else will fit smoothly into your application. Commented Mar 17, 2020 at 14:23
  • Java has mechanisms (for example Files.lines) to work with these files. Process it as a stream by reading line by line. Commented Mar 17, 2020 at 14:24
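
The Files.lines suggestion in the last comment could look roughly like the sketch below. It assumes a UTF-8 file whose first line is a header; the class name StreamingCsvReader and the method countRowsWithFirstColumn are hypothetical and stand in for whatever per-row processing the application needs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical example class: processes a CSV lazily instead of loading it into a list.
class StreamingCsvReader {

    // Counts the data rows whose first column equals the given value.
    // Files.lines reads lazily, so the whole file is never held in memory at once.
    static long countRowsWithFirstColumn(Path csv, String wanted) throws IOException {
        try (Stream<String> lines = Files.lines(csv, StandardCharsets.UTF_8)) {
            return lines
                    .skip(1)                           // assumption: the first line is a header
                    .map(line -> line.split(",", -1))  // naive split, ignores quoted commas
                    .filter(cols -> cols[0].equals(wanted))
                    .count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Paths.get(args[0]);
        System.out.println(countRowsWithFirstColumn(csv, args[1]));
    }
}
```

The try-with-resources block matters here: the stream returned by Files.lines keeps the file open until it is closed.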

3 Answers

3
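import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Scanner;

// Place this method inside any class.
public static void splitLargeFile(final String fileName,
                                   final String extension,
                                   final int maxLines,
                                   final boolean deleteOriginalFile) {

    final String source = String.format("%s.%s", fileName, extension);

    try (Scanner s = new Scanner(new FileReader(source))) {
        int file = 0;
        int cnt = 0;
        BufferedWriter writer = new BufferedWriter(new FileWriter(String.format("%s_%d.%s", fileName, file, extension)));

        // Read whole lines: Scanner.next() would split on any whitespace, not only on line breaks.
        while (s.hasNextLine()) {
            writer.write(s.nextLine() + System.lineSeparator());
            if (++cnt == maxLines && s.hasNextLine()) {
                // The current part is full and more input remains: start the next part.
                writer.close();
                writer = new BufferedWriter(new FileWriter(String.format("%s_%d.%s", fileName, ++file, extension)));
                cnt = 0;
            }
        }
        writer.close();
    } catch (Exception e) {
        e.printStackTrace();
        return; // keep the original file if the split failed
    }

    if (deleteOriginalFile) {
        // File.delete() reports failure through its return value rather than an exception.
        File f = new File(source);
        if (!f.delete()) {
            System.err.println("Could not delete " + source);
        }
    }
}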

0

If you're on Linux and you can run the CSV through a script first, then you can use "split":

$ split -l 100000 big.csv small-

This generates files named small-aa, small-ab, small-ac... To give these a .csv extension if needed:

$ for a in small-*; do
    mv "$a" "$a.csv"              # rename split files to .csv
    java MyCSVProcessor "$a.csv"  # or just process them right away
done

Try this for additional options (GNU split):

$ split --help

-a, --suffix-length=N    use suffixes of length N (default 2)
-b, --bytes=SIZE         put SIZE bytes per output file
-C, --line-bytes=SIZE    put at most SIZE bytes of lines per output file
-d, --numeric-suffixes   use numeric suffixes instead of alphabetic
-l, --lines=NUMBER       put NUMBER lines per output file

This is, however, a poor mitigation for your problem. Your CSV reader module is running out of memory either because it reads the whole file into memory before splitting it, or because it does that and also keeps your processed output in memory. To make your code more portable and universally runnable, you should consider processing one line at a time and splitting each line into fields yourself, so that only the current line has to be held in memory. (From https://stackabuse.com/reading-and-writing-csvs-in-java/)

// pathToCsv is the path of the CSV file to read
try (BufferedReader csvReader = new BufferedReader(new FileReader(pathToCsv))) {
    String row;
    while ((row = csvReader.readLine()) != null) {
        String[] data = row.split(",");
        // do something with the data
    }
}

A caveat with the above code is that commas inside quoted fields are treated as column separators. You will have to add some additional processing if your CSV data contains quoted commas.
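
As a rough illustration of that additional processing, the sketch below splits a single line while honouring double-quoted fields and doubled quotes ("") as an escaped quote. The class CsvLineSplitter and the method splitCsvLine are hypothetical names, and the sketch assumes a record never spans more than one line; for anything beyond that, a dedicated CSV library is probably the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one CSV line, honouring double-quoted fields.
class CsvLineSplitter {

    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // "" inside quotes is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are kept as data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);        // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitCsvLine("1,\"Doe, Jane\",\"said \"\"hi\"\"\""));
    }
}
```

In the reading loop, a call to splitCsvLine(row) would take the place of row.split(",").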

Of course, if you really want to use your existing code, and just want to split the file, you can adapt the above:

import java.io.*;

public class Split {

    static final String CSV_FILE = "test.csv";
    static final int MAX_LINES = 100000;  // maximum lines per output file

    public static void main(String[] args) throws IOException {
        PrintWriter csvWriter = null;
        try (BufferedReader csvReader = new BufferedReader(new FileReader(CSV_FILE))) {
            String row;
            int line = 0;
            while ((row = csvReader.readLine()) != null) {
                if (line % MAX_LINES == 0) {
                    // Start a new output file, named after the index of its first line
                    if (csvWriter != null) { csvWriter.close(); }
                    csvWriter = new PrintWriter("cut-" + line + CSV_FILE);
                }
                csvWriter.println(row);
                // String[] data = row.split(",");
                // do something with the data
                line++;
            }
        } finally {
            // csvWriter is still null if the input file was empty
            if (csvWriter != null) { csvWriter.close(); }
        }
    }
}

I chose PrintWriter over FileWriter or BufferedWriter because it automatically prints the relevant newlines, and I would presume that it's buffered. I've not written anything in Java in 20 years, so I bet you can improve on the above.
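
One possible improvement, offered only as a sketch: use java.nio.file with an explicit BufferedWriter, which reports write failures as IOException (PrintWriter suppresses them and only exposes checkError()), and repeat the header line in every part so that each file can be read on its own. The class name CsvSplitter, the part-N.csv naming scheme, and the assumption that the first line is a header are all illustrative and not taken from the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical variant of the splitter; names and file layout are illustrative.
class CsvSplitter {

    // Splits source into part files of at most maxRows data rows each,
    // repeating the header line at the top of every part. Returns the part count.
    static int split(Path source, Path outputDir, int maxRows) throws IOException {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        int parts = 0;
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String header = reader.readLine();
            if (header == null) {
                return 0; // empty input: nothing to write
            }
            BufferedWriter writer = null;
            try {
                int rowsInPart = 0;
                String row;
                while ((row = reader.readLine()) != null) {
                    if (writer == null || rowsInPart == maxRows) {
                        // Current part is full (or none is open yet): start a new one
                        if (writer != null) {
                            writer.close();
                        }
                        parts++;
                        Path part = outputDir.resolve("part-" + parts + ".csv");
                        writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                        writer.write(header);
                        writer.newLine();
                        rowsInPart = 0;
                    }
                    writer.write(row);
                    writer.newLine();
                    rowsInPart++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        int parts = split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println("Wrote " + parts + " part file(s)");
    }
}
```

Only one input line and one open writer are held at any time, so memory use stays flat regardless of the size of the source file.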


0

I created a simple function that creates a child CSV from a parent CSV, based on a start and an end line number. It can be used as a splitter based on a line range.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

// Copies lines startRange..lastRange (1-based, inclusive) of csvPath into newcsvPath.
public static void createcsv(String csvPath, String newcsvPath, int startRange, int lastRange) {
    csvPath = csvPath.trim();
    String childcsvPath = newcsvPath.trim();
    int count = 0;

    // try-with-resources closes both files, even if an exception is thrown
    try (Scanner sc = new Scanner(new File(csvPath));
         BufferedWriter writer = new BufferedWriter(new FileWriter(childcsvPath))) {

        while (sc.hasNextLine()) {
            String value = sc.nextLine();
            count++;
            if (count > lastRange) {
                break; // past the end of the range: stop reading
            }
            if (count >= startRange) {
                // Write each line straight to the child file instead of
                // collecting the whole range in a list first
                writer.write(value);
                writer.write("\n");
            }
        }
    } catch (IOException e) {
        System.out.println("Exception found: " + e);
    }
}
