7

My program currently has memory problems, and after profiling the app we found that the String.split() method uses a lot of memory. I tried using a StreamTokenizer, but that seems to make things even more complex.

Is there a better way to split long Strings into small Strings that uses less memory than the String.split() method?

5
  • How many times do you split this string really? Can you show some code? Commented Aug 9, 2012 at 12:47
  • Did you try StringTokenizer in java? docs.oracle.com/javase/1.4.2/docs/api/java/util/… Commented Aug 9, 2012 at 12:48
  • hey i will try StringTokenizer, thanks. Commented Aug 9, 2012 at 13:18
  • 2
    split creates a lot of garbage but doesn't use very much memory. I suspect your memory problem is elsewhere. What do you see when you use a memory profiler? Commented Aug 9, 2012 at 13:38
  • 1
    As I remember, I assign the value returned by String.split to a String[] strs and then put each String in strs into a HashMap as a key. That may be the cause of the memory issue. Commented Aug 9, 2012 at 13:50

4 Answers

1

It is highly unlikely that any realistic use of split would "consume lots of memory". Your input would have to be huge (many, many megabytes) and your result split into many millions of parts for it to even be noticed.

Here's some code that creates a random string of approximately 1.8 million characters and splits it into over 1 million Strings and outputs the memory used and time taken.

As you can see, it ain't much: about 71 MB consumed in just 350 ms.

public static void main(String[] args) throws Exception {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 99999; i++) {
        sb.append(Math.random());
    }
    long begin = System.currentTimeMillis();
    String string = sb.toString();
    sb = null;
    System.gc();
    long startFreeMem = Runtime.getRuntime().freeMemory();
    String[] strings = string.split("(?=[0-5])");
    long endFreeMem = Runtime.getRuntime().freeMemory();
    long execution = System.currentTimeMillis() - begin;

    System.out.println("input length = " + string.length() + "\nnumber of strings after split = " + strings.length + "\nmemory consumed due to split = "
            + (startFreeMem - endFreeMem) + "\nexecution time = " + execution + "ms");
}

Output (run on a fairly typical Windows box):

input length = 1827035
number of strings after split = 1072788
memory consumed due to split = 71740240
execution time = 351ms

Interestingly, without System.gc() the memory used was about 1/3:

memory consumed due to split = 29582328
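The figures above come from freeMemory() alone, which only reports the free part of the heap the JVM has reserved so far, so the reading can shift if the heap grows between the two calls. A minimal sketch of a variant that compares used heap (totalMemory() minus freeMemory()) before and after the split is shown here. The class name SplitMemorySketch and its helper methods are made up for illustration, System.gc() is only a hint to the JVM, and the numbers it prints will differ between JVMs and settings.

```java
import java.util.Random;

public class SplitMemorySketch {
    // Used heap = heap currently reserved by the JVM minus the free part of it.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Builds a string of n random digits (seeded, so runs are repeatable).
    static String randomDigits(int n, long seed) {
        Random rand = new Random(seed);
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(rand.nextInt(10));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = randomDigits(1_000_000, 42L);
        System.gc();                        // a hint only; the JVM may ignore it
        long before = usedHeap();
        String[] parts = input.split("(?=[0-5])");
        System.gc();                        // try to drop temporary garbage first
        long after = usedHeap();
        System.out.println("parts = " + parts.length);
        System.out.println("approx. retained bytes = " + (after - before));
    }
}
```

Because the second reading is taken after a requested collection, the difference is closer to what the split result retains than to the garbage created along the way.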

8 Comments

I would use -XX:-UseTLAB for more accurate memory usage and move the sb.toString() to before your start. Splitting on 60% of characters is a little crazy. How about [0-1]?
Pet hate: Don't use StringBuffer if you can use StringBuilder. ;)
@PeterLawrey Crap! I meant to use StringBuilder. And I moved the toString(). Bit less memory used by split(). Splitting on so much because I wanted to create a crazy number of String objects to prove split() is not the villain.
Did you try turning off the TLAB? (Just curious) I would also try System.gc() before getting the memory used to ignore temporary objects (OP appears to be concerned about retained objects)
@PeterLawrey Turning off TLAB reduced memory used to about 25Mb, however running gc() made memory used leap to 71Mb!
0

You need to use some kind of stream reader rather than burden memory with one big data string. Here is an example:

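import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public static void readString(String str) throws IOException {
    InputStream is = new ByteArrayInputStream(str.getBytes("UTF-8"));

    char[] buf = new char[2048];
    Reader r = new InputStreamReader(is, "UTF-8");

    // Read the input in chunks of up to 2048 characters
    while (true) {
        int n = r.read(buf);
        if (n < 0)
            break;

        /*
         StringBuilder s = new StringBuilder();
         s.append(buf, 0, n);
         ... now you can parse the StringBuilder ...
        */
    }
}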
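If the text is already held in a String, another way to avoid building the full String[] is to walk it with indexOf and hand each piece to a callback as it is found. The following is only a sketch under a few assumptions: Java 8 or later (for Consumer and lambdas), a single-character delimiter, and the hypothetical names IndexOfSplitSketch and forEachToken. Unlike split(), it also reports trailing empty tokens.

```java
import java.util.function.Consumer;

public class IndexOfSplitSketch {
    // Calls handler once per token; tokens are separated by a single character.
    // No array holding every token is built. Returns the number of tokens seen.
    static int forEachToken(String text, char delimiter, Consumer<String> handler) {
        int count = 0;
        int start = 0;
        while (true) {
            int end = text.indexOf(delimiter, start);
            if (end < 0) {
                handler.accept(text.substring(start)); // last token
                return count + 1;
            }
            handler.accept(text.substring(start, end));
            count++;
            start = end + 1; // continue just past the delimiter
        }
    }

    public static void main(String[] args) {
        int n = forEachToken("abc,def,,ghi", ',',
                token -> System.out.println("[" + token + "]"));
        System.out.println("tokens = " + n);
    }
}
```

Only one token is live at a time unless the handler stores it, so peak memory depends on what the caller keeps rather than on the number of tokens.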

Comments

0

split does not create brand new character data: it uses substring internally, which (in JDKs before Java 7 update 6) creates a new String object that points to the right portion of the original string without copying the underlying char[].

So apart from the (slight) overhead of object creation, it should not have a huge impact from a memory perspective.

ps: StringTokenizer uses the same technique so it would probably yield the same results as split.
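For reference, a minimal sketch of the StringTokenizer loop mentioned above. The class and method names are made up for illustration. It hands out one token at a time instead of returning an array, and unlike split() it treats a run of delimiters as one, so it never yields empty tokens.

```java
import java.util.StringTokenizer;

public class TokenizerSketch {
    // Walks the tokens one by one and returns how many there were.
    static int countTokens(String text, String delimiters) {
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        int count = 0;
        while (tokenizer.hasMoreTokens()) {
            tokenizer.nextToken(); // a real program would process the token here
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        StringTokenizer tokenizer = new StringTokenizer("abc,def,,ghi", ",");
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken());
        }
        System.out.println("tokens = " + countTokens("abc,def,,ghi", ","));
    }
}
```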

EDIT

To see that it is the case, you can use the sample code below. It splits abc,def into abc and def then prints the underlying char[] of the original string and of the split strings - the output shows that they are all the same.

Output:

Reference: [C@3590ed52  Content: [a, b, c, ,, d, e, f]
Reference: [C@3590ed52  Content: [a, b, c, ,, d, e, f]
Reference: [C@3590ed52  Content: [a, b, c, ,, d, e, f]

Code:

import java.lang.reflect.Field;
import java.util.Arrays;

public static void main(String[] args) throws InterruptedException, NoSuchFieldException, IllegalArgumentException, IllegalAccessException {
    String s = "abc,def";
    String[] ss = s.split(",");
    Field f = String.class.getDeclaredField("value");
    f.setAccessible(true);
    System.out.println("Reference: " + f.get(s) + "\tContent: " + Arrays.toString((char[])f.get(s)));
    System.out.println("Reference: " + f.get(ss[0]) + "\tContent: " + Arrays.toString((char[])f.get(ss[0])));
    System.out.println("Reference: " + f.get(ss[1]) + "\tContent: " + Arrays.toString((char[])f.get(ss[1])));
}

3 Comments

In the substring method, it finally creates a new String with new String(offset + beginIndex, endIndex - beginIndex, value), and when new String(xxx) is called, the JVM will make a copy of the char[] value in the cache pool if it does not already exist. Is that correct?
No, the new string will use the same char[] as the original one. Now if you split to only keep a small portion of the original string, you will actually carry the whole string around, as you keep a reference to its char[] in each of the substrings.
Sorry, I just read the source code, and I was wrong. It does use the copy to improve the memory performance.
0

split may affect memory if you only keep one or a few pieces of the long string: the whole long string then stays in memory. For example:

private static List<String> headlist = new ArrayList<String>();

String longstring = ".....";
headlist.add(longstring.split(" ")[0]);

Then longstring will always stay in memory. The JVM cannot garbage-collect it.

In this situation, I think you can try:

private static List<String> headlist = new ArrayList<String>();

String longstring = ".....";
headlist.add(new String(longstring.split(" ")[0]));

as in the following code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SplitTest {
    static Random rand = new Random();
    static List<String> head = new ArrayList<String>();

    /**
     * @param args
     */
    public static void main(String[] args) {
        int i = 0; // number of strings added so far
        while (true) {
            String a = constructLongString();
            head.add(a.split(" ")[0]); //1
            //head.add(new String(a.split(" ")[0])); //2
            i++;
            if (i % 1000 == 0)
                System.out.println("" + i);
            System.gc();
        }
    }

    private static String constructLongString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append(rand.nextInt(10));
        }
        sb.append(" ");
        for (int i = 0; i < 4096; i++) {
            sb.append(rand.nextInt(10));
        }
        return sb.toString();
    }
}

If you run it with -Xmx60M, it runs out of memory after about 6000 iterations. If you use line 2 and comment out line 1, it keeps running far beyond 6000.

1 Comment

hi, good point and thanks. I will check the code to see whether the mistake exists
