Java String Memory Performance Improvement for Text File Parser

Question

I'm trying to optimize my memory usage when dealing with many (millions) of strings. I have a file of ~1.5 million lines that I'm reading through that contains decimal numbers in columns. For example, a file may look like

16916576.643 4 -12312674.246 4        39.785 4  16916584.123 3  -5937726.325 3
    36.794 3
16399226.418 6  -4129008.232 6        43.280 6  16399225.374 4  -1891751.787 4
    39.885 4
12415561.671 9 -33057782.339 9        52.412 9  12415567.518 8 -25595925.487 8
    49.950 8
15523362.628 5 -12597312.619 5        40.579 5  15523369.553 5  -9739990.371 5
    42.003 5
12369614.129 8 -28797729.913 8        50.068 8         0.000           0.000  
     0.000  
....

Currently I'm using String.split("\\s+") to separate these numbers, then calling Double.parseDouble() on each one of the parts of the String[], which looks something like:

String[] data = line.split("\\s+");
double firstValue = Double.parseDouble(data[0]);
double secondValue = Double.parseDouble(data[1]);
double thirdValue = Double.parseDouble(data[2]);

This ends up creating a lot of String objects. I also may have whitespace at the beginning or end of the line so I have to call trim() on the line before I split it, which also creates another String object. The garbage collector disposes of these String objects, which results in slowdowns. Are there more memory efficient constructs in Java to do this? I was thinking about using a char[] instead of a String but I'm not sure whether there would be a substantial improvement from that.

What kind of problems are you encountering with the current implementation? — biziclop
– biziclop, Commented Jul 23, 2015 at 15:23
Are you reading the entire file into memory? You should read one line at a time, parse it, and write it to a database. — Sualeh Fatehi
– Sualeh Fatehi, Commented Jul 23, 2015 at 15:25
@SualehFatehi no I'm storing the whole file in memory. I'm reading line-by-line and parsing. — David M
– David M, Commented Jul 23, 2015 at 15:27
@biziclop it's using 100MB of RAM to parse and it's creating about 700,000 String objects — David M
– David M, Commented Jul 23, 2015 at 15:28
@DavidM You could consider reading the file using BufferedReader. It wont load the entire file at once, but reads it line by line. Would this be a problem in the application? — Some guy
– Some guy, Commented Jul 23, 2015 at 15:29

OldCurmudgeon · Accepted Answer · 2015-07-24 08:26:59Z

3

If you are really certain that this is a severe bottleneck you could always parse your string directly into Doubles.

// Keep track of my state.
private static class AsDoublesState {

    // null means no number waiting for add.
    Double d = null;
    // null means not seen '.' yet.
    Double div = null;
    // Is it negative.
    boolean neg = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            if (d == null) {
                d = v;
            } else {
                d = d * 10 + v;
            }
            // Count digits after the dot.
            if (div != null) {
                div *= 10;
            }
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div = 1.0;
        } else if (ch == '-') {
            // Negate!
            neg = true;
        } else {
            // Everything else completes the number.
            if (d != null) {
                if (neg) {
                    d = -d;
                }
                if (div != null) {
                    doubles.add(d / div);
                } else {
                    doubles.add(d);
                }
                // Clear down.
                d = null;
                div = null;
                neg = false;
            }
        }
    }
}

private static List<Double> asDoubles(String s) {
    // Grow the list.
    List<Double> doubles = new ArrayList<>();
    // Track my state.
    AsDoublesState state = new AsDoublesState();

    for (int i = 0; i < s.length(); i++) {
        state.consume(doubles, s.charAt(i));
    }
    // Pretend a space on the end to flush an remaining number.
    state.consume(doubles, ' ');
    return doubles;
}

public void test() {
    String s = "16916576.643 4 -12312674.246 4        39.785 4  16916584.123 3  -5937726.325 3    36.794 3";
    List<Double> doubles = asDoubles(s);
    System.out.println(doubles);
}

This will break badly if given bad data. E.g. 123--56...392.86 would be a perfectly valid number to it, and 6.0221413e+23 would be two numbers.

Here's an improved State using AtomicDouble to avoid creating all of those Double objects`.

// Keep track of my state.
private static class AsDoublesState {

    // Mutable double value.
    AtomicDouble d = new AtomicDouble();
    // Mutable double value.
    AtomicDouble div = new AtomicDouble();
    // Is there a number.
    boolean number = false;
    // Is it negative.
    boolean negative = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            d.set(d.get() * 10 + v);
            number = true;
            // Count digits after the dot.
            div.set(div.get() * 10);
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div.set(1.0);
        } else if (ch == '-') {
            // Negate!
            negative = true;
        } else {
            // Everything else completes the number.
            if (number) {
                double v = d.get();
                if (negative) {
                    v = -v;
                }
                if (div.get() != 0) {
                    v = v / div.get();
                }
                doubles.add(v);
                // Clear down.
                d.set(0);
                div.set(0);
                number = false;
                negative = false;
            }
        }
    }
}

edited Jul 24, 2015 at 8:26

answered Jul 23, 2015 at 16:02

OldCurmudgeon

66.1k18 gold badges126 silver badges220 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

biziclop Over a year ago

And I wouldn't try it close to the maximum/minimum range of double either as it probably accumulates error badly. But if all you do with the numbers is plot a graph with them, I wouldn't worry too much about it.

OldCurmudgeon Over a year ago

@biziclop - Very important point! Thanks for mentioning it. We could accumulate into a long or BigInteger and do just one divide at the end to side-step most error accumulations, or even use a BigDecimal.

maaartinus Over a year ago

@biziclop There's no roundoff error as long as the number with omitted decimal point fits into 53 bits. All inputs in the questions have at most 12 digits, so they do. Using long would be surely better.

maaartinus Over a year ago

You're generating much more garbage than the OP does. There's a new double for each input character, and after seeing the decimal point, there are two of them!

biziclop Over a year ago

@maaartinus Yes, it should use primitive types and an enum to keep track of where the parsing is (rather than relying on null checks).

|

Mikel Rychliski · Accepted Answer · 2015-07-23 16:02:52Z

1

Try using Pattern and Matcher to split the string with a compiled regular expression:

double[][] out = new double[2][2];
String[] data = new String[2];
data[0] = "1 2";
data[1] = "3 2";

Pattern pat = Pattern.compile("\\s*(\\d+\\.?\\d*)?\\s+?(\\d+\\.?\\d*)?\\s*");
Matcher mat = pat.matcher(data[0]);
mat.find();

out[0][0] = Double.parseDouble(mat.group(1));
out[0][1] = Double.parseDouble(mat.group(2));

mat = pat.matcher(data[1]);
mat.find();
out[1][0] = Double.parseDouble(mat.group(1));
out[1][1] = Double.parseDouble(mat.group(2));

edited Jul 23, 2015 at 16:02

answered Jul 23, 2015 at 15:35

Mikel Rychliski

3,6275 gold badges25 silver badges32 bronze badges

3 Comments

biziclop Over a year ago

This doesn't convert the values into doubles. But it eliminates the need for calling trim(), so it's a good step forward.

Mikel Rychliski Over a year ago

Yes that's correct, should have mentioned I didn't test that code segment.. Fixed the answer

maaartinus Over a year ago

You produce one less garbage object per line using Matcher#reset.

Community · Accepted Answer · 2017-05-23 11:51:11Z

1

We have similar problem in our application where lot of strings getting created and we did few things which help us fixing the issue.

Give more memory to Java if available e.g. -Xmx2G for 2gb.
If you're on 32 bit JVM then you only assign up to 4 gb - theoratical limit. So move to 64 bit.
Profile your application

You need to do it step by step:

Start with visualvm (click here for detail) and measure how many String, Double objects are getting created, size, time etc.
Use one of the Flyweight pattern and intern string objects. Guava library has Interner. Even, you can do even double also. This will avoid duplicate and cache the object using weak references, e.g. here

Interner<String> interner = Interners.newWeakInterner(); String a = interner.intern(getStringFromCsv()); String b = interner.intern(getStringFromCsv());

Copied from here

Profile your application again.

You can also use scanner to read double from file, your code will be cleaner and scanner also cache double values as it uses Double.ValueOf.

Here's the code

File file = new File("double_file.txt");
        Scanner scan = new Scanner(file);
        while(scan.hasNextDouble()) {
            System.out.println( scan.nextDouble() );
        }
        scan.close();

You can use this code and see if there is any GAIN in the performance or not.

edited May 23, 2017 at 11:51

CommunityBot

11 silver badge

answered Jul 23, 2015 at 15:41

TheCodingFrog

3,5343 gold badges24 silver badges27 bronze badges

1 Comment

David M Over a year ago

I'm actually doing this on Android so I'm not able to give myself more memory, unfortunately. Just did a profile using DDMS and parseDouble takes 38% of my execution time and split takes 15, my biggest chunks.

Community · Accepted Answer · 2017-05-23 12:21:43Z

1

You don't have to produce any garbage at all:

Use BufferedInputStream in order to avoid char[] and String creation. There are no non-ASCII characters, so you deal with bytes directly.
Write a parser similar to this solution, but avoiding any garbage.
Let your class provide a method like double nextDouble(), which reads characters until a next double is assembled.
If you need a line-wise processing, watch for \n (ignore \r as it's just a needless addendum to \n; lone \r used to be used as line separator long time ago).

edited May 23, 2017 at 12:21

CommunityBot

11 silver badge

answered Jul 23, 2015 at 17:11

maaartinus

46.8k40 gold badges176 silver badges343 bronze badges

Collectives™ on Stack Overflow

Java String Memory Performance Improvement for Text File Parser

4 Answers 4

8 Comments

3 Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

8 Comments

3 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related