I have large (100GB) CSV files that I periodically want to convert/reduce.
Each line represents one record, delimited by a newline, that I want to convert into a new record.
To do that, I first have to split the line, extract the fields I need (converting some of them with additional logic), and write the result out again as a reduced CSV.
Each line contains 2500-3000 characters and about 1000 fields, separated by ^, but I only need approx. 120 of those fields.
I discovered that the actual line splitting takes most of the processing time, so I'm trying to find the fastest way to split a single line.
My benchmark results on a 10GB sample are as follows:
approx 5mins:
static final Pattern pattern = Pattern.compile("\\^"); // compiled once, reused for every line
String[] split = pattern.split(line);
2min37s:
String[] split = line.split("\\^");
2min18s (approx. 12% faster than String.split()):
StringTokenizer tokenizer = new StringTokenizer(line, "^"); // takes literal delimiter chars, not a regex ("\\^" would also split on backslashes)
String[] split = new String[tokenizer.countTokens()];
int j = 0;
while (tokenizer.hasMoreTokens()) {
    split[j] = tokenizer.nextToken();
    j++;
}
// note: StringTokenizer silently skips empty fields, so field positions can shift
Question: could there be more room for improvement?
What about String.indexOf() and String.substring()? indexOf is needed because the CSV fields are NOT fixed length. I'm not using a CSV reader because every library comes with its own overhead, and I'd have no control over the split method.
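For reference, here is a minimal sketch of what such a manual indexOf()/substring() scan could look like. The method name extractFields and the neededIndexes parameter (the sorted field positions to keep) are illustrative assumptions, not something from the post; the point is to walk the line once with a single-character search and only materialize the ~120 fields that are actually needed, instead of allocating all ~1000 substrings:

// Sketch only: one pass over the line with indexOf, copying out just the wanted fields.
// extractFields and neededIndexes (sorted ascending) are hypothetical names.
static String[] extractFields(String line, int[] neededIndexes) {
    String[] result = new String[neededIndexes.length];
    int fieldIndex = 0;   // index of the field currently being scanned
    int resultIndex = 0;  // next free slot in result
    int start = 0;        // start offset of the current field
    while (resultIndex < neededIndexes.length) {
        int end = line.indexOf('^', start);
        if (end == -1) {
            end = line.length(); // last field has no trailing delimiter
        }
        if (fieldIndex == neededIndexes[resultIndex]) {
            result[resultIndex++] = line.substring(start, end);
        }
        fieldIndex++;
        start = end + 1;
        if (end == line.length()) {
            break; // end of line reached
        }
    }
    return result;
}

Unlike StringTokenizer, this keeps empty fields addressable, and it avoids the regex machinery entirely; whether it actually beats StringTokenizer depends on the data, so it belongs in the same 10GB benchmark.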