Hello,
I'm currently working on word prediction in Java. I'm using an N-gram based model for this, but I'm running into memory issues...
At first, my model looked like this:
public class NGram implements Serializable {

    private static final long serialVersionUID = 1L;

    private transient int count;
    private int id;
    private NGram next;

    public NGram(int idP) {
        this.id = idP;
    }
}
But this takes a lot of memory, so I thought I needed to optimize it. My idea: if I have "hello the world" and "hello the people", instead of keeping two separate n-grams, I could store the shared prefix "hello the" once and then keep the two possible continuations, "world" and "people".
To be clearer, this is my new model:
public class BNGram implements Serializable {

    private static final long serialVersionUID = 1L;

    private int id;
    private HashMap<Integer, BNGram> next;
    private int count = 1;

    public BNGram(int idP) {
        this.id = idP;
        this.next = new HashMap<Integer, BNGram>();
    }
}
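To make the idea concrete, here is roughly the kind of insert/lookup code I have in mind for this prefix tree (the class and method names here are simplified sketches, not my real code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative trie node: one map of child nodes keyed by word id,
// plus a count for n-grams ending at this node.
class TrieNode {
    final Map<Integer, TrieNode> next = new HashMap<>();
    int count = 0;
}

class NGramTrie {
    private final TrieNode root = new TrieNode();

    // Insert every n-gram of length n from a sequence of word ids.
    void addSentence(int[] tokenIds, int n) {
        for (int start = 0; start + n <= tokenIds.length; start++) {
            TrieNode node = root;
            for (int i = start; i < start + n; i++) {
                node = node.next.computeIfAbsent(tokenIds[i], k -> new TrieNode());
            }
            node.count++;
        }
    }

    // Count of the n-gram given by this exact id sequence.
    int count(int... tokenIds) {
        TrieNode node = root;
        for (int id : tokenIds) {
            node = node.next.get(id);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

With this, "hello the world" and "hello the people" share the nodes for "hello" and "the", and only the last word differs.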
But my second model seems to consume twice as much memory... I think it's because of the HashMap, but I don't know how to reduce it. I tried different Map implementations such as Trove and others, but it didn't change anything.
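One idea I had to avoid the per-node HashMap overhead is to keep the children in two sorted parallel arrays and look them up by binary search. This is only a sketch of the idea (names are mine, and the arrays are grown naively on each insert):

```java
import java.util.Arrays;

// Compact trie node: children stored in two parallel arrays sorted by
// word id, so there is no HashMap object, bucket table, or Entry objects.
class CompactNode {
    int[] childIds = new int[0];             // sorted word ids
    CompactNode[] children = new CompactNode[0];
    int count = 0;

    // Return the child for this word id, creating it if needed.
    CompactNode getOrAdd(int id) {
        int pos = Arrays.binarySearch(childIds, id);
        if (pos >= 0) {
            return children[pos];
        }
        int ins = -pos - 1;                  // insertion point keeping order
        int[] newIds = new int[childIds.length + 1];
        CompactNode[] newChildren = new CompactNode[children.length + 1];
        System.arraycopy(childIds, 0, newIds, 0, ins);
        System.arraycopy(children, 0, newChildren, 0, ins);
        CompactNode child = new CompactNode();
        newIds[ins] = id;
        newChildren[ins] = child;
        System.arraycopy(childIds, ins, newIds, ins + 1, childIds.length - ins);
        System.arraycopy(children, ins, newChildren, ins + 1, children.length - ins);
        childIds = newIds;
        children = newChildren;
        return child;
    }

    // Return the child for this word id, or null if absent.
    CompactNode get(int id) {
        int pos = Arrays.binarySearch(childIds, id);
        return pos >= 0 ? children[pos] : null;
    }
}
```

Lookup stays O(log k) per node (k = number of children), and each node only carries two array objects instead of a whole map, but I'm not sure whether that is the right trade-off here.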
To give you an idea: for a 9 MB text with 57,818 distinct words (distinct, not the total word count), my javaw process consumes about 1.2 GB of memory after the n-gram generation... If I save the model with a GZIPOutputStream, it takes around 18 MB on disk.
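For reference, this is roughly how I save and load the model (a sketch; my real code differs in details):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Serialize the model through a gzip-compressed object stream.
class ModelIO {
    static void save(Serializable model, File f) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(f))))) {
            out.writeObject(model);
        }
    }

    static Object load(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(f))))) {
            return in.readObject();
        }
    }
}
```

So the serialized form compresses very well, but of course that doesn't help the in-memory size.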
So my question is: how can I use less memory? Can I do something with compression (as with the serialization)? I need to add this to another application, so I have to reduce the memory usage first...
Thanks a lot, and sorry for my bad English...
ZiMath