My current project parses natural language. One test reads text from a file, removes certain characters, tokenizes the text into single words, and then compares the number of unique words against an expected count. In Eclipse this test is green; under Maven I get more words than expected. Comparing the word lists, I see the following additional words:

  • acquirer⊙s
  • card⊙s
  • institution⊙s
  • issuer⊙s
  • provider⊙s
  • psam⊙s
  • ⊜from⊝
  • ⊜slot⊝
  • ⊜to⊝

The source text contains the following characters, which should be filtered out: “ ” ’

This works in Eclipse but not in Maven. I am using UTF-8, and the files seem to be encoded correctly. In the Maven POM I specify the following:

<properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <org.apache.lucene.version>3.6.0</org.apache.lucene.version>
</properties>

Edit: Here is the code that reads the file (which, according to Eclipse, is encoded as UTF-8).

        BufferedReader reader = new BufferedReader(
                new FileReader(this.file));
        String line = "";
        while ((line = reader.readLine()) != null) {
            // the csv contains a text and a classification
            String[] reqCatType = line.split(";");
            String reqText = reqCatType[0].trim();
            String reqCategory = reqCatType[1].trim();
            // the tokenizer also removes unwanted characters:
            String[] sentence = this.filter.filterStopWords(this.tokenizer
                    .tokenize(reqText));
            // we use this data to train a machine learning algorithm
            this.dataSet.learn(sentence, reqCategory);
        }
        reader.close();

Edit: The following information might be useful for analyzing the problem:

mvn -v
Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
Maven home: /usr/share/maven
Java version: 1.6.0_33, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.6.8", arch: "x86_64", family: "mac"
  • show the code where you read the files. Commented Sep 5, 2012 at 0:18
  • Perhaps maven.apache.org/plugins/maven-resources-plugin/examples/… would be of help? Commented Sep 5, 2012 at 5:05
  • Thanks for the suggestion, @afk5min, but if I applied it correctly, it does not solve the issue. I added the maven-resources-plugin with the configuration from the example, but nothing changed. As before, the output of mvn install includes: "[INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] Copying 10 resources". Why did you think this would help? Commented Sep 5, 2012 at 5:24

1 Answer

So, your data file is in UTF-8. The Eclipse setting on that file has no bearing here: what matters is how the running Java program decodes the bytes it reads.

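FileReader always uses the platform default encoding, which is generally a bad idea. Eclipse is likely setting the "platform default" for you, whereas Maven is not. Your mvn -v output reports a platform encoding of MacRoman, so under Maven the UTF-8 bytes of “ ” ’ are decoded as MacRoman and no longer match the characters your filter looks for.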
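To see the effect in isolation, here is a minimal sketch that decodes the UTF-8 bytes of a word containing ’ with the wrong charset. The class and method names are made up for illustration, and ISO-8859-1 stands in for MacRoman because it is guaranteed to be available on every JVM; the exact garbage characters differ between the two, but the mechanism is the same.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the asker's project.
class DefaultCharsetDemo {

    // Encodes the text as UTF-8 and decodes the bytes with the given charset,
    // which is what FileReader does implicitly with the platform default.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(Charset.forName("UTF-8"));
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // The charset FileReader would use in this JVM.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String original = "issuer\u2019s"; // contains a right single quotation mark
        // ISO-8859-1 is an assumed stand-in for the MacRoman platform default.
        String wrong = decodeUtf8BytesAs(original, Charset.forName("ISO-8859-1"));
        String right = decodeUtf8BytesAs(original, Charset.forName("UTF-8"));

        // The quotation mark is three bytes in UTF-8, so the wrong decoding
        // yields three characters in its place and the filter never sees U+2019.
        System.out.println("wrong decoding equals original: " + wrong.equals(original));
        System.out.println("right decoding equals original: " + right.equals(original));
    }
}
```

Running this once from Eclipse and once from a Maven test shows which default each environment ends up with.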

Fix your code to specify the encoding.

See the FileReader Javadoc:

To specify these values yourself, construct an InputStreamReader on a FileInputStream.
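A minimal sketch of what that looks like, assuming the file really is UTF-8. The class and method names are hypothetical, and the code sticks to Java 6 idioms (try/finally, Charset.forName) because that is the JVM shown in your mvn -v output:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the asker's project.
class Utf8LineReader {

    // Reads all lines of the file, decoding the bytes explicitly as UTF-8
    // instead of relying on the platform default encoding.
    static List<String> readLines(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), Charset.forName("UTF-8")));
        try {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } finally {
            // Closing the BufferedReader also closes the underlying stream.
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the path of a UTF-8 encoded file as the first argument.
        if (args.length > 0) {
            for (String line : readLines(new File(args[0]))) {
                System.out.println(line);
            }
        }
    }
}
```

With the charset named explicitly, the result no longer depends on whether the test is started from Eclipse or from Maven.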

1 Comment

Thanks, that was the solution. Of course, I also had to change the part where I read the unwanted characters. The BufferedReader is now created as:

        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filename), Charset.forName("UTF-8")));

For the input file, I should probably implement automatic detection of the encoding as described here: link. I hate being fooled by smart tools.
