I have large (100GB) CSV files that I periodically want to convert/reduce.

Each line is separated by a newline, where each line represents a record that I want to convert into a new record.

I therefore first have to split the line, extract my desired fields (+ convert some with logic), and save again as a reduced csv.

Each line contains 2500-3000 chars separated by a ^, with about 1000 fields. But I need only approx 120 fields from that split.

I discovered that the actual line splitting takes most of the processing time. I'm thus trying to find the best split method for one line.

My benchmark on a sample size of 10GB is as follows:

approx 5mins:

static final Pattern pattern = Pattern.compile("\\^");
String[] split = pattern.split(line);

2min37s:

String[] split = line.split("\\^");

2min18s (approx 12% faster than string.split()):

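// StringTokenizer takes a set of delimiter characters, not a regex, so "^" needs no escaping.
// Note: it skips empty fields, so field positions shift if a record contains empty values.
StringTokenizer tokenizer = new StringTokenizer(line, "^");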
String[] split = new String[tokenizer.countTokens()];
int j = 0;
while (tokenizer.hasMoreTokens()) {
    split[j] = tokenizer.nextToken();
    j++;
}

Question: could there be more room for improvement?

  • How long are the lines we are looking at? A file with ten million lines of 50 characters each should be a lot easier to deal with in the abstract than a file with 50 lines of ten million characters each. The latter will push the limits of Java's string type. Commented Nov 11, 2021 at 20:47
  • See my update, approx 3k chars per line Commented Nov 11, 2021 at 20:50
  • Have you tried String.indexOf() and String.substring()? Commented Nov 11, 2021 at 20:54
  • Not necessarily answering your question - any reason not to use a proper CSV reader? Commented Nov 11, 2021 at 20:55
  • I did not try indexOf because the CSV fields are NOT fixed length. I am not using a CSV reader because each library adds its own overhead, and I would have no control over the split method. Commented Nov 11, 2021 at 21:00

1 Answer

If the 120 relevant fields are the first ones in each line, it may be better to use the "limited" variant String::split(String regex, int limit), which stops splitting after those ~10% of the columns and leaves the rest of the line untouched.

// keep relevant 120 fields + 1 field for the tail part
String[] properFields = string.split("\\^", 121); 
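
To make the effect of the limit concrete, here is a minimal sketch on a short made-up line. The class and method names (LimitedSplitDemo, leadingFields) are only illustrative and not part of the original code; the helper drops the unsplit tail so that only the wanted leading fields remain.

```java
import java.util.Arrays;

// Hypothetical demo class, only for illustration.
final class LimitedSplitDemo {

    // Splits on '^' but stops after `wanted` fields; the unsplit tail
    // (if any) ends up in the last array element and is dropped here.
    static String[] leadingFields(String line, int wanted) {
        String[] parts = line.split("\\^", wanted + 1);
        return Arrays.copyOf(parts, Math.min(wanted, parts.length));
    }

    public static void main(String[] args) {
        // With limit 4 the raw split is [a, b, c, d^e]; the tail "d^e" is dropped.
        System.out.println(Arrays.toString(leadingFields("a^b^c^d^e", 3))); // prints [a, b, c]
    }
}
```

With a positive limit, empty fields inside the kept range are preserved, so the field positions stay stable.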

Update

If a prefix of N fields has to be skipped and only the next M fields need to be processed, a regular expression can describe these parts of the line so that only the relevant part is picked up.

This may be implemented with a Stream<MatchResult> obtained from Matcher.results() or Scanner.findAll() (both available since Java 9):

String str = "aaa^bbb^ccc^11^22^33^44^55^xyz\r\npp^qqq^sss^99^88^77^66^55^$313";

// N = M = 3
Pattern line = Pattern.compile("(?<prefix>(([^^\\r\\n]+\\^){3}))(?<body>(([^^\\r\\n]+\\^?){0,3})).*\\R?");
Pattern field = Pattern.compile("\\^"); // caret-separated fields

System.out.println("matcher results");
line.matcher(str)
    .results()
    .map(mr -> mr.group(4))
    .flatMap(s -> Arrays.stream(s.split("\\^")))
    .forEach(System.out::println);

System.out.println("scanner findall");
Scanner scan = new Scanner(str);
scan.findAll(line)
    .map(mr -> mr.group(4))
    .flatMap(field::splitAsStream)
    .forEach(System.out::println);

Output:

matcher results
11
22
33
99
88
77
scanner findall
11
22
33
99
88
77

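However, regular expressions may also hurt performance, so the simplest way to handle large strings may be a custom method that returns the substring between the N-th and the (N + M)-th occurrence of the delimiter:

// Returns the part of the line between the from-th and the to-th occurrence of delim
// (the to-th delimiter itself is included when it exists).
public static String substring(char delim, int from, int to, String line) {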
    int index = 0;
    int count = 0;
    int n = line.length();
    for (; index < n && count < from; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    int indexFrom = index;
    for (; index < n && count < to; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    return line.substring(indexFrom, index);
}

System.out.println("scanner plain");
scan = new Scanner(str);
while(scan.hasNextLine()) {
    System.out.println(substring('^', 3, 6, scan.nextLine()));
}

Output:

scanner plain
11^22^33^
99^88^77^
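
If the wanted fields are not contiguous but scattered across the line, the same counting idea can be extended to pick individual fields in a single pass with String.indexOf, without allocating the ~1000 unwanted substrings. The following is only a sketch under assumptions: the FieldPicker class and its pick method are hypothetical names, the positions are zero-based and must be sorted in strictly ascending order, and I have not benchmarked it against the variants in the question.

```java
import java.util.Arrays;

// Hypothetical helper, only for illustration.
final class FieldPicker {

    // Returns the fields at the given zero-based positions.
    // `wanted` must be sorted in strictly ascending order;
    // positions beyond the last field of the line yield null.
    static String[] pick(String line, char delim, int[] wanted) {
        String[] out = new String[wanted.length];
        int start = 0; // index of the first char of the current field
        int field = 0; // zero-based number of the current field
        int w = 0;     // index of the next wanted position
        while (w < wanted.length && start <= line.length()) {
            int end = line.indexOf(delim, start);
            if (end < 0) {
                end = line.length(); // last field has no trailing delimiter
            }
            if (field == wanted[w]) {
                out[w++] = line.substring(start, end);
            }
            start = end + 1;
            field++;
        }
        return out; // scanning stops as soon as the last wanted field is found
    }

    public static void main(String[] args) {
        String line = "aaa^bbb^ccc^11^22^33^44^55^xyz";
        int[] wanted = {3, 4, 5, 8};
        System.out.println(Arrays.toString(pick(line, '^', wanted))); // prints [11, 22, 33, xyz]
    }
}
```

Unlike StringTokenizer, this keeps empty fields, so the positions of the remaining fields do not shift when a record contains empty values.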

1 Comment

Great idea, but unfortunately the fields are distributed over the full length of the line. Anyway, do you know a way to extract only the part of the String between, e.g., delimiters 350 and 450 (delimiter count, not character index)? I could then extract only the portions of the string that contain the fields I need, and split only those parts. But how could I extract the part of the string between the nth and the (n+X)th delimiter?
