My current project involves parsing natural language. One test reads text from a file, removes certain characters, tokenizes the text into single words, and then compares the number of unique words against an expected value. In Eclipse this test is green, but under Maven I get more words than expected. Comparing the two word lists, I see the following additional words:
- acquirer⊙s
- card⊙s
- institution⊙s
- issuer⊙s
- provider⊙s
- psam⊙s
- ⊜from⊝
- ⊜slot⊝
- ⊜to⊝
Looking at the source text, it contains the following characters, which should be filtered out: “ ” ’
This works in Eclipse but not in Maven. I am using UTF-8, and the files seem to be encoded correctly. In the Maven POM I specify the following:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <org.apache.lucene.version>3.6.0</org.apache.lucene.version>
</properties>
Edit: Here is the code that reads the file (which, according to Eclipse, is encoded as UTF-8):
BufferedReader reader = new BufferedReader(
        new FileReader(this.file));
String line = "";
while ((line = reader.readLine()) != null) {
    // the csv contains a text and a classification
    String[] reqCatType = line.split(";");
    String reqText = reqCatType[0].trim();
    String reqCategory = reqCatType[1].trim();
    // the tokenizer also removes unwanted characters:
    String[] sentence = this.filter.filterStopWords(this.tokenizer
            .tokenize(reqText));
    // we use this data to train a machine learning algorithm
    this.dataSet.learn(sentence, reqCategory);
}
reader.close();
Edit: The following information might be useful for analyzing the problem:
mvn -v
Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
Maven home: /usr/share/maven
Java version: 1.6.0_33, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.6.8", arch: "x86_64", family: "mac"
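
One thing that may be worth checking: `mvn -v` reports MacRoman as the platform encoding, and `new FileReader(file)` decodes with the platform default charset rather than a charset named in the code. The following is only a minimal diagnostic sketch, not the project's code. The class name `EncodingCheck`, the helper `readFirstLine`, and the sample text are hypothetical, and ISO-8859-1 is used purely as a stand-in for "some single-byte charset that is not UTF-8". It writes a UTF-8 line containing ’ to a temporary file and reads it back with two explicitly chosen charsets, so the effect of a mismatched decoder can be compared with the default charset of the JVM that runs the test.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.Charset;

public class EncodingCheck {

    // Hypothetical helper: reads the first line of a file, decoding the
    // bytes with an explicitly given charset instead of the platform default.
    public static String readFirstLine(File file, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset));
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset utf8 = Charset.forName("UTF-8");
        // Assumption: stand-in for a non-UTF-8 single-byte platform encoding.
        Charset singleByte = Charset.forName("ISO-8859-1");

        // Write a sample line containing a right single quotation mark (U+2019) as UTF-8.
        File sample = File.createTempFile("encoding-check", ".csv");
        sample.deleteOnExit();
        OutputStream out = new FileOutputStream(sample);
        try {
            out.write("card\u2019s;example".getBytes(utf8));
        } finally {
            out.close();
        }

        // The charset that FileReader would use in this JVM.
        System.out.println("default charset: " + Charset.defaultCharset());

        // The same bytes decode to different strings depending on the charset.
        String asUtf8 = readFirstLine(sample, utf8);
        String asSingleByte = readFirstLine(sample, singleByte);
        System.out.println("decoded as UTF-8:       length " + asUtf8.length());
        System.out.println("decoded as single-byte: length " + asSingleByte.length());
    }
}
```

If the default charset printed under Maven differs from the one printed in Eclipse, that difference would be a plausible explanation for characters such as ’ surviving the filter in only one of the two environments.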