Memory issues with String.split()

Question

My program currently has memory problems, and on checking the app we discovered that the String.split() method uses a lot of memory. I tried using a StreamTokenizer, but that seems to make things even more complex.

Is there a better way to split long Strings into small Strings that uses less memory than the String.split() method?

How many times do you really split this string? Can you show some code?
– Oskar Kjellin, Commented Aug 9, 2012 at 12:47
Did you try StringTokenizer in Java? docs.oracle.com/javase/1.4.2/docs/api/java/util/…
– sundar, Commented Aug 9, 2012 at 12:48
split creates a lot of garbage but doesn't use very much memory. I suspect your memory problem is elsewhere. What do you see when you use a memory profiler?
– Peter Lawrey, Commented Aug 9, 2012 at 13:38
As I remember, I assign the value returned from String.split to a String[] strs and then put each String in strs into a HashMap as a key. That may be the cause of the memory issue.
– Myth Pro, Commented Aug 9, 2012 at 13:50

Bohemian · Accepted Answer · 2012-08-09 13:58:43Z

1

It is highly unlikely that any realistic use of split would "consume lots of memory". Your input would have to be huge (many, many megabytes) and your result split into many millions of parts for it to even be noticed.

Here's some code that creates a random string of approximately 1.8 million characters, splits it into over 1 million Strings, and prints the memory used and the time taken.

As you can see, it isn't much: about 71 MB consumed in about 350 ms.

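public static void main(String[] args) throws Exception {
    // Build a long string of random digits (about 1.8 million characters).
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 99999; i++) {
        sb.append(Math.random());
    }
    // The timer starts here, so it also covers toString() and System.gc().
    long begin = System.currentTimeMillis();
    String string = sb.toString();
    sb = null; // drop the builder so it can be collected before measuring
    System.gc(); // only a hint to the JVM
    long startFreeMem = Runtime.getRuntime().freeMemory();
    // Split before every digit 0-5 to produce a very large number of small Strings.
    String[] strings = string.split("(?=[0-5])");
    long endFreeMem = Runtime.getRuntime().freeMemory();
    long execution = System.currentTimeMillis() - begin;

    // The drop in free memory is a rough estimate; it assumes the heap did not grow in between.
    System.out.println("input length = " + string.length() + "\nnumber of strings after split = " + strings.length + "\nmemory consumed due to split = "
            + (startFreeMem - endFreeMem) + "\nexecution time = " + execution + "ms");
}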

Output (run on a fairly typical Windows box):

input length = 1827035
number of strings after split = 1072788
memory consumed due to split = 71740240
execution time = 351ms

Interestingly, without System.gc() the memory used was only about 40% of that:

memory consumed due to split = 29582328

edited Aug 9, 2012 at 13:58

answered Aug 9, 2012 at 13:21

Bohemian♦


8 Comments

Peter Lawrey Over a year ago

I would use -XX:-UseTLAB for more accurate memory usage and move the sb.toString() to before your start. Splitting on 60% of characters is a little crazy. How about [0-1]?

Peter Lawrey Over a year ago

Pet hate: Don't use StringBuffer if you can use StringBuilder. ;)

Bohemian Over a year ago

@PeterLawrey Crap! I meant to use StringBuilder. And I moved the toString(). Bit less memory used by split(). Splitting on so much because I wanted to create a crazy number of String objects to prove split() is not the villain.

Peter Lawrey Over a year ago

Did you try turning off the TLAB? (Just curious.) I would also try System.gc() before getting the memory used, to ignore temporary objects (the OP appears to be concerned about retained objects).

Bohemian Over a year ago

@PeterLawrey Turning off TLAB reduced memory used to about 25Mb, however running gc() made memory used leap to 71Mb!
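
Following the suggestion in the comments above to call System.gc() before each reading so that temporary objects are ignored, a rough sketch of measuring retained heap might look like the following. The names RetainedMemoryProbe, usedHeapAfterGc and buildInput are made up for illustration, System.gc() is only a hint to the JVM, and the figures will vary with the JVM and its settings, so treat the result as an estimate rather than an exact measurement.

```java
class RetainedMemoryProbe {
    /** Heap currently in use, read after asking the JVM to collect garbage. */
    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // a request, not a guarantee
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Builds the same kind of random digit string as the benchmark in the answer. */
    static String buildInput(int pieces) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pieces; i++) {
            sb.append(Math.random());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = buildInput(99999);
        long before = usedHeapAfterGc();
        String[] parts = input.split("(?=[0-5])");
        long after = usedHeapAfterGc();
        // parts is still referenced here, so the split result counts as retained.
        System.out.println("parts = " + parts.length);
        System.out.println("approx. bytes retained by split result = " + (after - before));
    }
}
```

Using totalMemory() minus freeMemory() instead of freeMemory() alone avoids a misleading reading if the heap grows between the two measurements.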


prilia · Accepted Answer · 2012-08-09 13:32:21Z

0

You need to use some kind of stream reader rather than filling memory with one big data string. Here is an example:

public static void readString(String str) throws IOException {
    InputStream is = new ByteArrayInputStream(str.getBytes("UTF-8"));

    // Read the data in fixed-size chunks instead of handling it all at once.
    char[] buf = new char[2048];
    Reader r = new InputStreamReader(is, "UTF-8");

    while (true) {
        int n = r.read(buf);
        if (n < 0)
            break;

        /*
         StringBuilder s = new StringBuilder();
         s.append(buf, 0, n);
         ... now you can parse the StringBuilder ...
        */
    }
}
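
As one possible way to fill in the parsing step, here is a minimal sketch that pulls tokens one at a time from a Reader with java.util.Scanner, so that no String[] of all tokens is ever built. This swaps in Scanner for the manual buffer loop above; the class name StreamingTokens and the method countAndTotalLength are hypothetical, and counting tokens is just a stand-in for whatever per-token work you actually need.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

class StreamingTokens {
    /** Returns {token count, total token length}, holding only one token at a time. */
    static long[] countAndTotalLength(Reader source, String delimiterRegex) {
        long count = 0;
        long total = 0;
        try (Scanner sc = new Scanner(source)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                String token = sc.next(); // eligible for GC after this iteration
                count++;
                total += token.length();
            }
        }
        return new long[] {count, total};
    }

    public static void main(String[] args) {
        // In real use the Reader would typically wrap a file or socket, not a String.
        long[] r = countAndTotalLength(new StringReader("alpha,beta,gamma"), ",");
        System.out.println(r[0] + " tokens, " + r[1] + " chars"); // prints "3 tokens, 14 chars"
    }
}
```

Scanner is not the fastest option, so this only pays off when the input really is too large to hold comfortably in memory.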

edited Aug 9, 2012 at 13:32

answered Aug 9, 2012 at 13:25

prilia


assylias · Accepted Answer · 2012-08-09 13:38:37Z

0

Split does not create brand new strings: it uses substring internally, which creates a new String object that points to the right part of the original string without copying the underlying char[].

So apart from the (slight) overhead of object creation, it should not have a huge impact from a memory perspective.

PS: StringTokenizer uses the same technique, so it would probably yield the same results as split.

EDIT

To see that this is the case, you can use the sample code below. It splits abc,def into abc and def, then prints the underlying char[] of the original string and of the split strings. The output shows that they are all the same array.

Output:

Reference: [C@3590ed52  Content: [a, b, c, ,, d, e, f]
Reference: [C@3590ed52  Content: [a, b, c, ,, d, e, f]
Reference: [C@3590ed52  Content: [a, b, c, ,, d, e, f]

Code:

import java.lang.reflect.Field;
import java.util.Arrays;

// Reads String's private "value" field via reflection; this relies on JDK internals
// (a char[] field named "value") and may not work on other JDK versions.
public static void main(String[] args) throws InterruptedException, NoSuchFieldException, IllegalArgumentException, IllegalAccessException {
    String s = "abc,def";
    String[] ss = s.split(",");
    Field f = String.class.getDeclaredField("value");
    f.setAccessible(true);
    // Print the identity and content of the backing array of each string.
    System.out.println("Reference: " + f.get(s) + "\tContent: " + Arrays.toString((char[])f.get(s)));
    System.out.println("Reference: " + f.get(ss[0]) + "\tContent: " + Arrays.toString((char[])f.get(ss[0])));
    System.out.println("Reference: " + f.get(ss[1]) + "\tContent: " + Arrays.toString((char[])f.get(ss[1])));
}
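
Whether the array is shared depends on the JDK. As far as I know, substring() shared the parent's char[] up to Java 7 update 5 and copies the characters from Java 7 update 6 onward, and recent JDKs may refuse reflective access to String's internals altogether. A more defensive variant of the check, which reports what it finds instead of assuming a char[], could be sketched like this (SharedBackingArrayProbe and probe are hypothetical names):

```java
import java.lang.reflect.Field;

class SharedBackingArrayProbe {
    /** Returns "shared", "copied", or "unknown (...)" if the internals cannot be read. */
    static String probe() {
        String original = "abc,def";
        String part = original.split(",")[0];
        try {
            Field f = String.class.getDeclaredField("value");
            f.setAccessible(true); // may be rejected on newer JDKs
            // Compare array identity only; the element type does not matter here.
            return f.get(original) == f.get(part) ? "shared" : "copied";
        } catch (ReflectiveOperationException | RuntimeException e) {
            return "unknown (" + e.getClass().getSimpleName() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty("java.version") + ": " + probe());
    }
}
```

If this prints "copied" or "unknown" on your JVM, the retention effect discussed in the comments below is unlikely to be what is causing your memory problem.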

edited Aug 9, 2012 at 13:38

answered Aug 9, 2012 at 12:54

assylias


3 Comments

Myth Pro Over a year ago

In the substring method, a new String is finally created with new String(offset + beginIndex, endIndex - beginIndex, value). When new String(xxx) is called, the JVM will make a copy of the char[] value in the cache pool if it does not already exist. Is that correct?

assylias Over a year ago

No, the new string will use the same char[] as the original one. Now if you split to only keep a small portion of the original string, you will actually carry the whole string around, as you keep a reference to its char[] in each of the substrings.

Myth Pro Over a year ago

Sorry, I just read the source code, and I was wrong. It does use the copy to improve memory performance.

andy · Accepted Answer · 2012-08-09 13:41:13Z

split may affect memory if you only want to keep one or a few pieces of the long string: the whole long string will stay in memory. For example:

private static List<String> headlist = new ArrayList<String>();

String longstring = ".....";
headlist.add(longstring.split(" ")[0]);

Then longstring will always stay in memory; the JVM cannot garbage-collect it.

In this situation, I think you can try:

private static List<String> headlist = new ArrayList<String>();

String longstring = ".....";
headlist.add(new String(longstring.split(" ")[0]));

as in the following code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SplitTest {
    static Random rand = new Random();
    static List<String> head = new ArrayList<String>();

    /**
     * @param args
     */
    public static void main(String[] args) {
        int i = 0; // number of head strings retained so far
        while(true) {
            String a = constructLongString();
            head.add(a.split(" ")[0]); //1
            //head.add(new String(a.split(" ")[0])); //2
            i++;
            if (i % 1000 == 0)
                System.out.println("" + i);
            System.gc();
        }
    }

    private static String constructLongString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append(rand.nextInt(10));
        }
        sb.append(" ");
        for (int i = 0; i < 4096; i++) {
            sb.append(rand.nextInt(10));
        }
        return sb.toString();
    }
}

If you run it with -Xmx60M, it runs out of memory after about 6000+ iterations. If you use line 2 instead (and comment out line 1), it keeps running far beyond 6000.

Hi, good point and thanks. I will check the code to see whether the mistake exists.
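
For the HashMap case mentioned in the question comments (split results used as map keys), the same idea could be sketched as below. CompactKeys and countTokens are hypothetical names, and counting tokens is only an example of what the map might hold. Whether the copy actually saves memory depends on the JDK: it matters on versions where substring shares the parent's char[], and is likely redundant where substring already copies.

```java
import java.util.HashMap;
import java.util.Map;

class CompactKeys {
    /** Counts tokens, storing a compact copy of each distinct token as the map key. */
    static Map<String, Integer> countTokens(String line, String delimiterRegex) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String token : line.split(delimiterRegex)) {
            Integer seen = counts.get(token);
            if (seen == null) {
                // First occurrence: copy, so the key does not pin the whole input line.
                counts.put(new String(token), 1);
            } else {
                // HashMap keeps the existing key object and only replaces the value.
                counts.put(token, seen + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("a b a c", " "));
    }
}
```

Copying only on first insertion keeps the extra allocation down to one small String per distinct key.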
