Most efficient way of splitting String in Java

Question

For the sake of this question, let's assume I have a String containing the values Two;.Three;.Four (and so on), where the elements are separated by the two-character delimiter ;. (a semicolon followed by a period).

I know there are multiple ways of splitting a string, such as split() and StringTokenizer (the latter being faster and working well), but my input file is around 1 GB and I am looking for something slightly more efficient than StringTokenizer.

After some research, I found that indexOf and substring are quite efficient, but the examples I found either use a single-character delimiter or return only a single word/element.

Sample code using indexOf and substring:

String s = "quick,brown,fox,jumps,over,the,lazy,dog";
int from = s.indexOf(',');
int to = s.indexOf(',', from+1);
String brown = s.substring(from+1, to);

The above works for printing brown, but how can I use indexOf and substring to split a line with multiple delimiters and display all the items, as below?

Expected output

Two
Three
Four
....and so on

What are you trying to achieve? Have you run tests on various test cases to see which is "efficient"?
– Buhake Sindi, Commented Mar 25, 2015 at 22:36
@BuhakeSindi Yes, I have done tests. For a sample string on my machine, StringTokenizer took 8.0 µs and split() took 23 µs.
– user92038111111, Commented Mar 25, 2015 at 22:39
Just loop: indexOf() takes a start parameter, which should be the index just past the last match found.
– eckes, Commented Mar 25, 2015 at 22:57
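
A minimal sketch of the loop described in the last comment, applied to the ;. delimiter from the question. The class and method names (DelimiterSplit, split) are hypothetical and not from the thread, and this has not been benchmarked against StringTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names are illustrative only.
final class DelimiterSplit {

    // Splits line on every occurrence of a (possibly multi-character) delimiter
    // using only indexOf and substring. Empty tokens are kept.
    static List<String> split(String line, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int end = line.indexOf(delimiter, start);
        while (end >= 0) {
            tokens.add(line.substring(start, end));
            start = end + delimiter.length(); // skip past the whole delimiter
            end = line.indexOf(delimiter, start);
        }
        tokens.add(line.substring(start)); // text after the last delimiter
        return tokens;
    }

    public static void main(String[] args) {
        // Prints Two, Three and Four on separate lines.
        for (String token : split("Two;.Three;.Four", ";.")) {
            System.out.println(token);
        }
    }
}
```

Advancing start by delimiter.length() rather than by 1 is the only change needed to go from a single-character to a multi-character delimiter.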

Autumn Skye · Accepted Answer · 2017-08-30 15:46:39Z

7

This is the method I use for splitting large (1GB+) tab-separated files. It is limited to a char delimiter to avoid any overhead of additional method invocations (which may be optimized out by the runtime), but it can be easily converted to String-delimited. I'd be interested if anyone can come up with a faster method or improvements on this method.

public static String[] split(final String line, final char delimiter)
{
    // Worst case: every character is a delimiter, which yields
    // line.length() + 1 (possibly empty) fields.
    String[] temp = new String[line.length() + 1];
    int wordCount = 0;
    int i = 0;                          // start of the current field
    int j = line.indexOf(delimiter, 0); // first delimiter

    while (j >= 0)
    {
        temp[wordCount++] = line.substring(i, j);
        i = j + 1;                      // resume just past the delimiter
        j = line.indexOf(delimiter, i); // next delimiter
    }

    temp[wordCount++] = line.substring(i); // last field

    // Trim the oversized scratch array to the number of fields found.
    String[] result = new String[wordCount];
    System.arraycopy(temp, 0, result, 0, wordCount);

    return result;
}


2 Comments

Sport Over a year ago

You can further improve this by obtaining all the indexes at once, as indexOf loops through the String

Autumn Skye Over a year ago

@Sport Inside the loop, I start each search after the index of the previous occurrence (line.indexOf(delimiter, i)), so each character is only checked once. I could probably write an inline version of indexOf(char, int) to avoid the overhead of repeated method invocation.
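
A rough sketch of what such an inline scan could look like, assuming a single char delimiter as in the answer. The class name InlineScanSplit is hypothetical, and the extra counting pass (used to size the result array exactly) is a choice made for this sketch, not something from the thread. Whether it actually beats String.indexOf is untested and depends on the JVM.

```java
// Hypothetical helper class; names are illustrative only.
final class InlineScanSplit {

    // Splits on a single char without calling indexOf: the string is scanned
    // directly with charAt. Empty fields are kept.
    static String[] split(String line, char delimiter) {
        final int length = line.length();

        // First pass: count delimiters so the result array is sized exactly.
        int fields = 1;
        for (int k = 0; k < length; k++) {
            if (line.charAt(k) == delimiter) {
                fields++;
            }
        }

        // Second pass: cut a field each time a delimiter is seen.
        String[] result = new String[fields];
        int word = 0;
        int start = 0;
        for (int k = 0; k < length; k++) {
            if (line.charAt(k) == delimiter) {
                result[word++] = line.substring(start, k);
                start = k + 1;
            }
        }
        result[word] = line.substring(start); // last field
        return result;
    }

    public static void main(String[] args) {
        for (String field : split("one\ttwo\t\tfour", '\t')) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Measuring both variants on real input is the only reliable way to pick between them.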

user207421 · Accepted Answer · 2015-03-25 22:53:04Z

5

If you want the ultimate in efficiency, I wouldn't use Strings at all, let alone split them. I would do what compilers do: process the file a character at a time. Use a BufferedReader with a large buffer size, say 128 KB, and read a char at a time, accumulating the chars into, say, a StringBuilder until you reach a ; or a line terminator.
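
A minimal sketch of this approach, under some assumptions that are not in the answer: the names CharStreamTokens and forEachToken are hypothetical, each token is handed to a callback instead of being stored, empty tokens are skipped, and only a single-character delimiter is handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper class; names are illustrative only.
final class CharStreamTokens {

    // Reads one char at a time and passes each token to the handler as soon as
    // a delimiter or line terminator is reached. Empty tokens are skipped, so
    // "\r\n" does not produce an empty token.
    static void forEachToken(Reader source, char delimiter, Consumer<String> handler) {
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source, 128 * 1024)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == delimiter || c == '\n' || c == '\r') {
                    if (current.length() > 0) {
                        handler.accept(current.toString());
                        current.setLength(0); // reuse the same builder
                    }
                } else {
                    current.append((char) c);
                }
            }
            if (current.length() > 0) {
                handler.accept(current.toString()); // token before end of input
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A StringReader stands in for the 1 GB file; a FileReader would be
        // passed in the same way.
        forEachToken(new StringReader("Two;Three\nFour;Five"), ';', System.out::println);
    }
}
```

Because tokens are consumed as they are found, memory use stays bounded by the longest token rather than by the file size.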


3 Comments

user92038111111 Over a year ago

Okay will give this a try and report back. Thanks

user207421 Over a year ago

@AvinashRaj Your comment has nothing to do with my answer. Don't post irrelevant comments here.

user207421 Over a year ago

@AvinashRaj That doesn't have anything more to do with my answer than your previous comment.

user92038111111 · Accepted Answer · 2016-12-27 10:50:10Z

4

StringTokenizer is faster than StringBuilder.

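import java.util.StringTokenizer;

public class StringTokenizerDemo {

    public static void main(String[] args) {

        String str = "This is String , split by StringTokenizer, created by me";

        // With no delimiter argument, tokens are split on whitespace.
        StringTokenizer st = new StringTokenizer(str);

        System.out.println("---- Split by space ------");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }

        // The delimiter argument is a set of characters, not a substring:
        // passing ";." would split on either ';' or '.'.
        System.out.println("---- Split by comma ',' ------");
        StringTokenizer st2 = new StringTokenizer(str, ",");

        while (st2.hasMoreTokens()) {
            System.out.println(st2.nextToken());
        }
    }
}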


1 Comment

Yonathan W'Gebriel Over a year ago

According to the JDK docs, StringTokenizer has been considered a legacy class for a while now. The recommendation is to use String.split or something from the java.util.regex package.
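
A short sketch of the java.util.regex route for the ;. delimiter from the question. The class name RegexSplit is hypothetical. The pattern is compiled once and reused, and Pattern.quote is used because an unquoted . in a regex matches any character. This is not benchmarked against the other approaches in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class RegexSplit {

    // Compiled once so the regex is not re-parsed for every line.
    // Pattern.quote makes ";." match literally.
    private static final Pattern DELIMITER = Pattern.compile(Pattern.quote(";."));

    // Note: like String.split, Pattern.split drops trailing empty strings.
    static String[] split(String line) {
        return DELIMITER.split(line);
    }

    public static void main(String[] args) {
        // Prints Two, Three and Four on separate lines.
        for (String token : split("Two;.Three;.Four")) {
            System.out.println(token);
        }
    }
}
```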
