How to improve performance of String.split?

Question

I have large (100GB) CSV files that I periodically want to convert/reduce.

Each line is separated by a newline, where each line represents a record that I want to convert into a new record.

I therefore first have to split the line, extract my desired fields (+ convert some with logic), and save again as a reduced csv.

Each line contains 2500-3000 chars separated by a ^, with about 1000 fields. But I need only approx 120 fields from that split.

I discovered that the actual line splitting takes most of the processing time. I'm thus trying to find the best split method for one line.

My benchmark on a sample size of 10GB is as follows:

approx 5mins:

static final Pattern pattern = Pattern.compile("\\^");
String[] split = pattern.split(line);

2min37s:

String[] split = line.split("\\^");

2min18s (approx 12% faster than string.split()):

// StringTokenizer takes a set of delimiter characters, not a regex, so the caret needs no escaping
StringTokenizer tokenizer = new StringTokenizer(line, "^");
String[] split = new String[tokenizer.countTokens()];
int j = 0;
while (tokenizer.hasMoreTokens()) {
    split[j] = tokenizer.nextToken();
    j++;
}
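
One caveat when comparing these variants: StringTokenizer never returns empty tokens, so a record with empty fields yields fewer tokens than fields and every later position shifts. The sketch below illustrates the difference on a made-up record; the class name EmptyFieldDemo and the sample line are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class EmptyFieldDemo {
    // Collects tokens with "^" as the only delimiter character.
    static List<String> tokenize(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line, "^");
        List<String> tokens = new ArrayList<>(tokenizer.countTokens());
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "a^^c^"; // hypothetical record: second and last fields are empty

        // StringTokenizer skips both empty fields.
        System.out.println(tokenize(line));                          // [a, c]
        // split without a limit keeps inner empty fields but drops trailing ones.
        System.out.println(Arrays.asList(line.split("\\^")));        // [a, , c]
        // A negative limit keeps every field, including trailing empty ones.
        System.out.println(Arrays.asList(line.split("\\^", -1)));    // [a, , c, ]
    }
}
```

If fields are addressed by position, only the last variant keeps the positions stable for every record.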

Question: could there be more room for improvement?

How long are the lines we're looking at? A file with ten million lines of 50 characters each should be a lot easier to deal with in the abstract than a file with 50 lines of ten million characters each. The latter will push the limits of Java's string type.
– Silvio Mayolo, Commented Nov 11, 2021 at 20:47
Not necessarily answering your question, but is there any reason not to use a proper CSV reader?
– khachik, Commented Nov 11, 2021 at 20:55
I did not try indexOf because the CSV fields are NOT fixed length. I'm not using a CSV reader because each library adds its own overhead, and I would have no control over the split method.
– membersound, Commented Nov 11, 2021 at 21:00

Nowhere Man · Accepted Answer · 2021-11-15 17:27:26Z

If only the first 120 fields/columns in a line/row are relevant, it may be better to use the "limited" overload String::split(String regex, int limit), which splits off only these ~10% of the columns and leaves the rest of the line untouched.

// keep relevant 120 fields + 1 field for the tail part
String[] properFields = string.split("\\^", 121);
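
As a small illustration of how the limit behaves, here is a sketch on a short made-up line. The class LimitedSplitDemo and the helper firstFields are hypothetical names, not part of any library; the helper simply drops the tail element that the limited split leaves behind.

```java
import java.util.Arrays;

class LimitedSplitDemo {
    // Returns the first `wanted` fields of a caret-separated line.
    // The limit is wanted + 1 so the unsplit remainder lands in one extra tail element,
    // which is then discarded.
    static String[] firstFields(String line, int wanted) {
        String[] parts = line.split("\\^", wanted + 1);
        return parts.length > wanted ? Arrays.copyOf(parts, wanted) : parts;
    }

    public static void main(String[] args) {
        String line = "aaa^bbb^ccc^11^22^33"; // hypothetical sample record

        System.out.println(Arrays.toString(line.split("\\^", 4)));
        // [aaa, bbb, ccc, 11^22^33]
        System.out.println(Arrays.toString(firstFields(line, 3)));
        // [aaa, bbb, ccc]
    }
}
```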

Update

If a prefix of N fields is to be skipped and the next M fields need to be processed, a regular expression can describe the parts of the line so that only the relevant part is picked up.

This may be implemented using a Stream<MatchResult> obtained via Scanner::findAll or Matcher::results:

String str = "aaa^bbb^ccc^11^22^33^44^55^xyz\r\npp^qqq^sss^99^88^77^66^55^$313";

// N = M = 3: skip the first 3 fields, keep the next 3; capturing group 4 is the named group "body"
Pattern line = Pattern.compile("(?<prefix>(([^^\\r\\n]+\\^){3}))(?<body>(([^^\\r\\n]+\\^?){0,3})).*\\R?");
Pattern field = Pattern.compile("\\^"); // caret-separated fields

System.out.println("matcher results");
line.matcher(str)
    .results()
    .map(mr -> mr.group(4))
    .flatMap(s -> Arrays.stream(s.split("\\^")))
    .forEach(System.out::println);

System.out.println("scanner findall");
Scanner scan = new Scanner(str);
scan.findAll(line)
    .map(mr -> mr.group(4))
    .flatMap(field::splitAsStream)
    .forEach(System.out::println);

Output:

matcher results
11
22
33
99
88
77
scanner findall
11
22
33
99
88
77

However, regular expressions may also hurt performance, so the simplest way to handle large strings may be to implement a custom method that returns the substring between the Nth and the (N + M)th occurrence of the delimiter:

public static String substring(char delim, int from, int to, String line) {
    int index = 0;
    int count = 0;
    int n = line.length();
    for (; index < n && count < from; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    int indexFrom = index;
    for (; index < n && count < to; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    return line.substring(indexFrom, index);
}

System.out.println("scanner plain");
scan = new Scanner(str);
while(scan.hasNextLine()) {
    System.out.println(substring('^', 3, 6, scan.nextLine()));
}

Output:

scanner plain
11^22^33^
99^88^77^
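
When the wanted fields are scattered over the whole line rather than forming one contiguous range, the same single-pass idea can be applied per field. The following is a minimal sketch, not a benchmarked solution: the class FieldPicker and the method pick are hypothetical names, and the sketch assumes the wanted positions are zero-based, sorted ascending, and free of duplicates. It walks the line once with indexOf, creates substrings only for the wanted fields, and stops as soon as the last wanted field has been read.

```java
import java.util.Arrays;

class FieldPicker {
    // Extracts the fields at the given zero-based positions from a delimited line.
    // Assumption: `wanted` is sorted ascending without duplicates.
    // Positions beyond the last field of the line are left as null.
    static String[] pick(String line, char delim, int[] wanted) {
        String[] out = new String[wanted.length];
        int fieldIndex = 0; // position of the field that begins at `start`
        int start = 0;
        int w = 0;          // next entry of `wanted` to fill
        while (w < wanted.length) {
            int end = line.indexOf(delim, start);
            boolean last = end < 0; // no further delimiter: this is the final field
            if (last) {
                end = line.length();
            }
            if (fieldIndex == wanted[w]) {
                out[w++] = line.substring(start, end);
            }
            if (last) {
                break;
            }
            start = end + 1;
            fieldIndex++;
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "aaa^bbb^ccc^11^22^33^44^55^xyz";
        System.out.println(Arrays.toString(pick(line, '^', new int[] {0, 3, 5, 8})));
        // [aaa, 11, 33, xyz]
    }
}
```

Unlike StringTokenizer, this keeps empty fields in place, so positions stay stable. Whether it beats String.split on the real data would need to be measured.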

Great idea, but unfortunately the fields are distributed over the full length of the line. Anyway, do you know a way to extract only the part of the String between, e.g., delimiters 350 and 450 (delimiter count, not character index)? I could then extract only the portions of the string that contain the fields I need, and split only those. But how could I extract the part of the string between the nth and the (n+X)th delimiter?
