
Hello,

I'm currently working on word prediction in Java. For this I'm using an n-gram based model, but I have some memory issues...

At first, I had a model like this:

public class NGram implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient int count;
    private int id;
    private NGram next;

    public NGram(int idP) {
        this.id = idP;
    }
}

But it takes a lot of memory, so I thought I needed to optimize: if I have "hello the world" and "hello the people", instead of keeping two separate n-grams, I could keep a single one that stores "hello the" and then has two possible continuations, "people" and "world".

To be clearer, this is my new model:

public class BNGram implements Serializable {
    private static final long serialVersionUID = 1L;
    private int id;
    private HashMap<Integer,BNGram> next;
    private int count = 1;

    public BNGram(int idP) {
        this.id = idP;
        this.next = new HashMap<Integer, BNGram>();
    }
}
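For concreteness, insertion into this trie-like structure could look like the sketch below; the `add` method, the demo class, and the word ids are my own illustration, not part of the original model:

```java
import java.util.HashMap;
import java.util.Map;

public class BNGramDemo {
    static class BNGram {
        int id;
        int count = 1;
        Map<Integer, BNGram> next = new HashMap<>();

        BNGram(int id) { this.id = id; }

        // Walk/create children for a sequence of word ids, sharing prefixes.
        void add(int[] wordIds, int from) {
            if (from >= wordIds.length) return;
            BNGram child = next.get(wordIds[from]);
            if (child == null) {
                child = new BNGram(wordIds[from]);
                next.put(wordIds[from], child);
            } else {
                child.count++;
            }
            child.add(wordIds, from + 1);
        }
    }

    public static void main(String[] args) {
        BNGram root = new BNGram(-1);
        // "hello the world" -> ids {0, 1, 2}, "hello the people" -> ids {0, 1, 3}
        root.add(new int[]{0, 1, 2}, 0);
        root.add(new int[]{0, 1, 3}, 0);
        // The shared prefix "hello the" is stored once; the "the" node has two children.
        BNGram the = root.next.get(0).next.get(1);
        System.out.println(the.count);       // 2
        System.out.println(the.next.size()); // 2
    }
}
```

Note that every node eagerly allocates its own `HashMap`, which is exactly where the memory problem below comes from.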

But it seems that my second model consumes twice as much memory... I think it's because of the HashMap, but I don't know how to reduce this. I tried different Map implementations such as Trove's, but it doesn't change anything.

To give you an idea: for a 9MB text with 57,818 different words (different words, not the total word count), after n-gram generation my javaw process consumes 1.2GB of memory... If I save the model with GZIPOutputStream, it takes around 18MB on disk.

So my question is: how can I use less memory? Can I do something with compression (as with the serialization)? I need to embed this in another application, so I have to reduce the memory usage first...

Thanks a lot, and sorry for my bad English...

ZiMath

5 Comments
  • Did you try to analyse your memory consumption? E.g. with jvisualvm? Commented Mar 21, 2015 at 12:11
  • I'm currently using Java Mission Control to get better information about memory consumption. Commented Mar 21, 2015 at 12:13
  • I'm not familiar with the n-gram problem; is a linked list (your first solution) a recommended approach? Commented Mar 21, 2015 at 12:16
  • Every object consumes a minimum of 32 bytes, and perhaps 48, depending on the JVM. A String is actually two objects (and, of course, the characters take up space as well). A HashMap entry is another object. For applications such as yours, the best way to reduce memory consumption is to use a database to hold the bulk data. Commented Mar 21, 2015 at 12:17
  • A LinkedList is not a recommended approach because it decreases prediction performance (HashMap is fast because get(...) operations are fast, so n-grams with HashMap are very time-efficient). I know that every object consumes memory, and that's why I don't store the words as Strings but as ints. Do you have an idea for an embedded database? I tried H2, but it also consumes a lot of memory! Commented Mar 21, 2015 at 12:20

2 Answers


You need a specialized structure to achieve what you want.

Take a look at Apache's PatriciaTrie. It's like a Map, but it's memory-efficient and works with Strings. It's also extremely fast: operations are O(k), where k is the number of bits of the largest key.

It has an operation that suits your immediate needs: prefixMap(), which returns a SortedMap view of the trie containing the Strings that start with the given prefix.

A brief usage example:

public class Patricia {

    public static void main(String[] args) {

        PatriciaTrie<String> trie = new PatriciaTrie<>();

        String world = "hello the world";
        String people = "hello the people";

        trie.put(world, null);
        trie.put(people, null);

        SortedMap<String, String> map1 = trie.prefixMap("hello");
        System.out.println(map1.keySet());  // [hello the people, hello the world]

        SortedMap<String, String> map2 = trie.prefixMap("hello the w");
        System.out.println(map2.keySet()); // [hello the world]

        SortedMap<String, String> map3 = trie.prefixMap("hello the p");
        System.out.println(map3.keySet());  // [hello the people]
    }
}

There are also the tests, which contain more examples.


7 Comments

Thanks a lot, I will try it and then post the result. Before your answer I tried a TIntObjectHashMap (from Trove) with on-demand map creation (the map is created only when an element is added, not in the constructor), and I got 600MB of usage. We will see if the PatriciaTrie can do better! Thank you
@zimath I can't tell which one consumes less memory. Please let me know when you have the results of the comparison. PatriciaTrie has the advantage of its operations. I'll add an example to my answer.
Thank you @Magnamag, that was a nice try, but with a simple test with only Strings (and without any extra information such as the count in my model), I get a memory usage of 900MB, so it doesn't meet my requirement. And when I think about it, my second solution is a kind of PatriciaTrie. Thanks again ;)
I will try, but I'm not sure that will save a lot of space, because in Java Strings are cached, so "hello" and "hello", even if one is a key and the other is a value, will point to the same String object.
@zimath A question regarding that 9MB text of ~60k different words: would you mind explaining what's in there, and how you are supposed to predict words based on that file's info? Just to have more context and better understand your problem.

Here, I'm primarily trying to explain why you are observing such excessive memory consumption, and what you could do about it (if you want to stick with the HashMap):

A HashMap created with the default constructor has an initial capacity of 16. This means it has space for 16 entries even when it is empty. Additionally, you create the map regardless of whether it is needed or not.

So one way to reduce the memory consumption in your case would be to

  • Create the map only when it is necessary
  • Create it with a smaller initial capacity

Applied to your class, this could roughly look like this:

public class BNGram {
    private int id;
    private Map<Integer,BNGram> next;

    public BNGram(int idP) {
        this.id = idP;
        // (Do not create a new `Map` here!)
    }

    void doSomethingWhereTheMapIsNeeded(Integer key, BNGram value) {

        // Create a map, when required, with an initial capacity of 1
        if (next == null) {
            next = new HashMap<Integer, BNGram>(1);
        }
        next.put(key, value);
    }
}

But...

... conceptually, it is questionable to have a large "tree" structure consisting of many, many maps, each with only a "few" entries. This suggests that a different data structure is more appropriate here. So you should definitely prefer a solution like the one in the answer by Magnamag, or (if that is not applicable for you, as suggested in your comments) look for an alternative data structure, maybe even by formulating this as a new question that does not suffer from the XY Problem.
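One illustrative middle ground is the single-child cache the asker mentions in the comments: hold the only child in a plain field and promote to a HashMap only when a second distinct child appears. This is a sketch under that assumption; the class and method names here are hypothetical, not from the original code:

```java
import java.util.HashMap;
import java.util.Map;

public class CompactNode {
    private final int id;
    private int count = 1;

    // Most nodes have a single successor: store it directly, with no map overhead.
    private CompactNode onlyChild;
    // Promoted lazily to a small HashMap only when a second child appears.
    private Map<Integer, CompactNode> children;

    public CompactNode(int id) { this.id = id; }

    public CompactNode getOrAddChild(int childId) {
        if (children != null) {
            CompactNode c = children.get(childId);
            if (c == null) {
                c = new CompactNode(childId);
                children.put(childId, c);
            } else {
                c.count++;
            }
            return c;
        }
        if (onlyChild == null) {
            onlyChild = new CompactNode(childId);
            return onlyChild;
        }
        if (onlyChild.id == childId) {
            onlyChild.count++;
            return onlyChild;
        }
        // Second distinct child: promote to a map with a small initial capacity.
        children = new HashMap<>(2);
        children.put(onlyChild.id, onlyChild);
        onlyChild = null;
        CompactNode c = new CompactNode(childId);
        children.put(childId, c);
        return c;
    }

    public int childCount() {
        if (children != null) return children.size();
        return onlyChild != null ? 1 : 0;
    }
}
```

This keeps chains of unique continuations (the common case in an n-gram trie) free of per-node map overhead, which is likely where most of the reported 1.2GB went.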

1 Comment

I already solved the map creation problem (as said in my last comment); it saved me around 300MB (by using a cache when I have only one child in my BNGram). But you're right, maybe I have to look for a different data structure... Thank you
