11

For the sake of this question, let's assume I have a String containing the values Two;.Three;.Four (and so on), where the elements are separated by the two-character delimiter ;. (a semicolon followed by a period).

I know there are several ways to split a string, such as split() and StringTokenizer (the faster of the two, and it works well), but my input file is around 1 GB and I am looking for something slightly more efficient than StringTokenizer.

After some research, I found that indexOf and substring are quite efficient, but the examples I found either use a single-character delimiter or return only a single word/element.

Sample code using indexOf and substring:

String s = "quick,brown,fox,jumps,over,the,lazy,dog";
int from = s.indexOf(',');
int to = s.indexOf(',', from+1);
String brown = s.substring(from+1, to);

The above works for printing brown, but how can I use indexOf and substring to split a line on a multi-character delimiter and display all the items, as below?

Expected output

Two
Three
Four
....and so on
6
  • What are you trying to achieve? Have you run tests on various test cases and seen which is "efficient"? Commented Mar 25, 2015 at 22:36
  • There's also an indexOf overload that takes String... Commented Mar 25, 2015 at 22:37
  • you mean this string.replaceAll(";\\.", "\n"); ? Commented Mar 25, 2015 at 22:37
  • @BuhakeSindi Yes, I have done tests. For a sample string on my machine, StringTokenizer took 8.0 µs and split() took 23 µs. Commented Mar 25, 2015 at 22:39
  • Just loop, indexOf() takes a start parameter which is supposed to be the last found index. Commented Mar 25, 2015 at 22:57

3 Answers

7

This is the method I use for splitting large (1GB+) tab-separated files. It is limited to a char delimiter to avoid any overhead of additional method invocations (which may be optimized out by the runtime), but it can be easily converted to String-delimited. I'd be interested if anyone can come up with a faster method or improvements on this method.

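public static String[] split(final String line, final char delimiter)
{
    // Worst case: every character is a delimiter, which yields length + 1 tokens.
    // (Sizing this at length / 2 + 1 overflows on input such as ",,,".)
    String[] temp = new String[line.length() + 1];
    int wordCount = 0;
    int i = 0;                          // start of the current token
    int j = line.indexOf(delimiter, 0); // first delimiter

    while (j >= 0)
    {
        temp[wordCount++] = line.substring(i, j);
        i = j + 1;                      // skip past the delimiter
        j = line.indexOf(delimiter, i); // next delimiter, searching from i
    }

    temp[wordCount++] = line.substring(i); // last token (may be empty)

    // Trim the scratch array down to the number of tokens actually found.
    String[] result = new String[wordCount];
    System.arraycopy(temp, 0, result, 0, wordCount);

    return result;
}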
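The answer says the method can be converted to a String delimiter. A minimal sketch of that conversion follows, assuming the ;. delimiter from the question. The class name DelimiterSplit and the choice to return a List instead of a pre-sized array are illustrative assumptions, not part of the original method, and no performance claim is made for it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for illustration only.
final class DelimiterSplit {

    // Splits line on every occurrence of a (possibly multi-character) delimiter,
    // using only indexOf and substring. Empty tokens are kept.
    static List<String> split(String line, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;                            // start of the current token
        int hit = line.indexOf(delimiter, start); // next delimiter position
        while (hit >= 0) {
            parts.add(line.substring(start, hit));
            start = hit + delimiter.length();     // skip the whole delimiter
            hit = line.indexOf(delimiter, start);
        }
        parts.add(line.substring(start));         // last token
        return parts;
    }

    public static void main(String[] args) {
        // Prints Two, Three and Four on separate lines.
        for (String part : split("Two;.Three;.Four", ";.")) {
            System.out.println(part);
        }
    }
}
```

The only change from the char version is that the search position advances by delimiter.length() rather than by 1, so each character is still examined once per search.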

2 Comments

You can further improve this by obtaining all the indexes at once, as indexOf loops through the String.
@Sport Inside the loop, I start each search after the index of the previous occurrence (line.indexOf(delimiter, i)), so each character is only checked once. I could probably write an inline version of indexOf(char, int) to avoid the overhead of repeated method invocation.
5

If you want the ultimate in efficiency, I wouldn't use Strings at all, let alone split them. I would do what compilers do: process the file a character at a time. Use a BufferedReader with a large buffer size, say 128 KB, and read a char at a time, accumulating the chars into, say, a StringBuilder until you reach a ; or a line terminator.
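A rough sketch of the character-at-a-time idea, under a few stated assumptions: the delimiter is the two-character ;. from the question, a carriage return only ever appears as part of CRLF, and tokens are handed to a callback. The names StreamingTokens and forEachToken are made up for illustration, and whether this beats the other approaches would need measuring on the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper class, named for illustration only.
final class StreamingTokens {

    // Reads one char at a time and passes each token to sink.
    // A token ends at ";." or at a line feed. The caller owns and closes the source.
    static void forEachToken(Reader source, Consumer<String> sink) {
        BufferedReader in = new BufferedReader(source, 128 * 1024); // 128 KB buffer
        StringBuilder token = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\r') {
                    continue; // assumption: CR only occurs as part of CRLF
                }
                boolean endsDelimiter = c == '.'
                        && token.length() > 0
                        && token.charAt(token.length() - 1) == ';';
                if (endsDelimiter) {
                    token.setLength(token.length() - 1); // drop the buffered ';'
                    sink.accept(token.toString());
                    token.setLength(0);
                } else if (c == '\n') {
                    sink.accept(token.toString());
                    token.setLength(0);
                } else {
                    token.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (token.length() > 0) {
            sink.accept(token.toString()); // final token without a terminator
        }
    }

    public static void main(String[] args) {
        // Prints Two, Three and Four on separate lines.
        forEachToken(new StringReader("Two;.Three;.Four"), System.out::println);
    }
}
```

For a real file, the StringReader would be replaced by a reader opened on that file; only one token is held in memory at a time.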

3 Comments

Okay will give this a try and report back. Thanks
@AvinashRaj Your comment has nothing to do with my answer. Don't post irrelevant comments here.
@AvinashRaj That doesn't have anything more to do with my answer than your previous comment.
4

StringTokenizer is faster than StringBuilder.

import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {

        String str = "This is String , split by StringTokenizer, created by me";
        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer st = new StringTokenizer(str);

        System.out.println("---- Split by space ------");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }

        System.out.println("---- Split by comma ',' ------");
        // Each character of the second argument is treated as a separate delimiter.
        StringTokenizer st2 = new StringTokenizer(str, ",");

        while (st2.hasMoreTokens()) {
            System.out.println(st2.nextToken());
        }
    }
}

1 Comment

According to the JDK docs, StringTokenizer has been considered a legacy class for a while now. The recommendation is to use String.split or something from the java.util.regex package.
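For reference, a small sketch of the java.util.regex route applied to the ;. delimiter from the question. Pattern.quote is used so that the period is matched literally rather than as a regex wildcard. The class name RegexSplit is made up for illustration, and the timings in the question's comments suggest this route may be slower than StringTokenizer, so it is shown for completeness only.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
final class RegexSplit {

    // Compiled once and reused for every line; quote() escapes the '.'.
    private static final Pattern DELIMITER = Pattern.compile(Pattern.quote(";."));

    // Note: Pattern.split drops trailing empty strings from the result.
    static String[] split(String line) {
        return DELIMITER.split(line);
    }

    public static void main(String[] args) {
        // Prints Two, Three and Four on separate lines.
        for (String part : split("Two;.Three;.Four")) {
            System.out.println(part);
        }
    }
}
```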
