2

I'm trying to optimize my memory usage when dealing with many (millions) of strings. I have a file of ~1.5 million lines that I'm reading through that contains decimal numbers in columns. For example, a file may look like

16916576.643 4 -12312674.246 4        39.785 4  16916584.123 3  -5937726.325 3
    36.794 3
16399226.418 6  -4129008.232 6        43.280 6  16399225.374 4  -1891751.787 4
    39.885 4
12415561.671 9 -33057782.339 9        52.412 9  12415567.518 8 -25595925.487 8
    49.950 8
15523362.628 5 -12597312.619 5        40.579 5  15523369.553 5  -9739990.371 5
    42.003 5
12369614.129 8 -28797729.913 8        50.068 8         0.000           0.000  
     0.000  
....

Currently I'm using String.split("\\s+") to separate these numbers, then calling Double.parseDouble() on each one of the parts of the String[], which looks something like:

String[] data = line.split("\\s+");
double firstValue = Double.parseDouble(data[0]);
double secondValue = Double.parseDouble(data[1]);
double thirdValue = Double.parseDouble(data[2]);

This ends up creating a lot of String objects. I also may have whitespace at the beginning or end of the line so I have to call trim() on the line before I split it, which also creates another String object. The garbage collector disposes of these String objects, which results in slowdowns. Are there more memory efficient constructs in Java to do this? I was thinking about using a char[] instead of a String but I'm not sure whether there would be a substantial improvement from that.

16
  • What kind of problems are you encountering with the current implementation? Commented Jul 23, 2015 at 15:23
  • Are you reading the entire file into memory? You should read one line at a time, parse it, and write it to a database. Commented Jul 23, 2015 at 15:25
  • @SualehFatehi no I'm storing the whole file in memory. I'm reading line-by-line and parsing. Commented Jul 23, 2015 at 15:27
  • @biziclop it's using 100MB of RAM to parse and it's creating about 700,000 String objects Commented Jul 23, 2015 at 15:28
  • @DavidM You could consider reading the file using BufferedReader. It wont load the entire file at once, but reads it line by line. Would this be a problem in the application? Commented Jul 23, 2015 at 15:29

4 Answers 4

3

If you are really certain that this is a severe bottleneck you could always parse your string directly into Doubles.

// Keep track of my state.
private static class AsDoublesState {

    // null means no number waiting for add.
    Double d = null;
    // null means not seen '.' yet.
    Double div = null;
    // Is it negative.
    boolean neg = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            if (d == null) {
                d = v;
            } else {
                d = d * 10 + v;
            }
            // Count digits after the dot.
            if (div != null) {
                div *= 10;
            }
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div = 1.0;
        } else if (ch == '-') {
            // Negate!
            neg = true;
        } else {
            // Everything else completes the number.
            if (d != null) {
                if (neg) {
                    d = -d;
                }
                if (div != null) {
                    doubles.add(d / div);
                } else {
                    doubles.add(d);
                }
                // Clear down.
                d = null;
                div = null;
                neg = false;
            }
        }
    }
}

private static List<Double> asDoubles(String s) {
    // Grow the list.
    List<Double> doubles = new ArrayList<>();
    // Track my state.
    AsDoublesState state = new AsDoublesState();

    for (int i = 0; i < s.length(); i++) {
        state.consume(doubles, s.charAt(i));
    }
    // Pretend a space on the end to flush an remaining number.
    state.consume(doubles, ' ');
    return doubles;
}

public void test() {
    String s = "16916576.643 4 -12312674.246 4        39.785 4  16916584.123 3  -5937726.325 3    36.794 3";
    List<Double> doubles = asDoubles(s);
    System.out.println(doubles);
}

This will break badly if given bad data. E.g. 123--56...392.86 would be a perfectly valid number to it, and 6.0221413e+23 would be two numbers.

Here's an improved State using AtomicDouble to avoid creating all of those Double objects`.

// Keep track of my state.
private static class AsDoublesState {

    // Mutable double value.
    AtomicDouble d = new AtomicDouble();
    // Mutable double value.
    AtomicDouble div = new AtomicDouble();
    // Is there a number.
    boolean number = false;
    // Is it negative.
    boolean negative = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            d.set(d.get() * 10 + v);
            number = true;
            // Count digits after the dot.
            div.set(div.get() * 10);
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div.set(1.0);
        } else if (ch == '-') {
            // Negate!
            negative = true;
        } else {
            // Everything else completes the number.
            if (number) {
                double v = d.get();
                if (negative) {
                    v = -v;
                }
                if (div.get() != 0) {
                    v = v / div.get();
                }
                doubles.add(v);
                // Clear down.
                d.set(0);
                div.set(0);
                number = false;
                negative = false;
            }
        }
    }
}
Sign up to request clarification or add additional context in comments.

8 Comments

And I wouldn't try it close to the maximum/minimum range of double either as it probably accumulates error badly. But if all you do with the numbers is plot a graph with them, I wouldn't worry too much about it.
@biziclop - Very important point! Thanks for mentioning it. We could accumulate into a long or BigInteger and do just one divide at the end to side-step most error accumulations, or even use a BigDecimal.
@biziclop There's no roundoff error as long as the number with omitted decimal point fits into 53 bits. All inputs in the questions have at most 12 digits, so they do. Using long would be surely better.
You're generating much more garbage than the OP does. There's a new double for each input character, and after seeing the decimal point, there are two of them!
@maaartinus Yes, it should use primitive types and an enum to keep track of where the parsing is (rather than relying on null checks).
|
1

Try using Pattern and Matcher to split the string with a compiled regular expression:

double[][] out = new double[2][2];
String[] data = new String[2];
data[0] = "1 2";
data[1] = "3 2";

Pattern pat = Pattern.compile("\\s*(\\d+\\.?\\d*)?\\s+?(\\d+\\.?\\d*)?\\s*");
Matcher mat = pat.matcher(data[0]);
mat.find();

out[0][0] = Double.parseDouble(mat.group(1));
out[0][1] = Double.parseDouble(mat.group(2));

mat = pat.matcher(data[1]);
mat.find();
out[1][0] = Double.parseDouble(mat.group(1));
out[1][1] = Double.parseDouble(mat.group(2));

3 Comments

This doesn't convert the values into doubles. But it eliminates the need for calling trim(), so it's a good step forward.
Yes that's correct, should have mentioned I didn't test that code segment.. Fixed the answer
You produce one less garbage object per line using Matcher#reset.
1

We have similar problem in our application where lot of strings getting created and we did few things which help us fixing the issue.

  1. Give more memory to Java if available e.g. -Xmx2G for 2gb.
  2. If you're on 32 bit JVM then you only assign up to 4 gb - theoratical limit. So move to 64 bit.

  3. Profile your application

You need to do it step by step:

  • Start with visualvm (click here for detail) and measure how many String, Double objects are getting created, size, time etc.

  • Use one of the Flyweight pattern and intern string objects. Guava library has Interner. Even, you can do even double also. This will avoid duplicate and cache the object using weak references, e.g. here

    Interner<String> interner = Interners.newWeakInterner(); String a = interner.intern(getStringFromCsv()); String b = interner.intern(getStringFromCsv());

Copied from here

  • Profile your application again.

You can also use scanner to read double from file, your code will be cleaner and scanner also cache double values as it uses Double.ValueOf.

Here's the code

File file = new File("double_file.txt");
        Scanner scan = new Scanner(file);
        while(scan.hasNextDouble()) {
            System.out.println( scan.nextDouble() );
        }
        scan.close();

You can use this code and see if there is any GAIN in the performance or not.

1 Comment

I'm actually doing this on Android so I'm not able to give myself more memory, unfortunately. Just did a profile using DDMS and parseDouble takes 38% of my execution time and split takes 15, my biggest chunks.
1

You don't have to produce any garbage at all:

  • Use BufferedInputStream in order to avoid char[] and String creation. There are no non-ASCII characters, so you deal with bytes directly.
  • Write a parser similar to this solution, but avoiding any garbage.
  • Let your class provide a method like double nextDouble(), which reads characters until a next double is assembled.
  • If you need a line-wise processing, watch for \n (ignore \r as it's just a needless addendum to \n; lone \r used to be used as line separator long time ago).

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.