Why does maven give me different utf-8 characters than eclipse (test run in eclipse, fail in maven)?

Question

My current project is concerned with parsing natural language. One test reads text from a file, removes certain characters, and tokenizes the text into single words. The test actually compares the number of unique words. In eclipse, this test is "green", in maven, I get a higher number of words than expected. Comparing the lists of words, I see the following additional words:

acquirer⊙s
card⊙s
institution⊙s
issuer⊙s
provider⊙s
psam⊙s
⊜from⊝
⊜slot⊝
⊜to⊝

Looking at the text source, it contains the following characters which should be filtered away: “ ” ’

This works in Eclipse, but not in Maven. I am using UTF-8. The files seem to be encoded correctly, and in the Maven pom I specify the following:

<properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <org.apache.lucene.version>3.6.0</org.apache.lucene.version>
</properties>

Edit: Here is the code that reads the file (which is, according to eclipse, encoded as UTF-8).

        BufferedReader reader = new BufferedReader(
                new FileReader(this.file));
        String line = "";
        while ((line = reader.readLine()) != null) {
            // the csv contains a text and a classification
            String[] reqCatType = line.split(";");
            String reqText = reqCatType[0].trim();
            String reqCategory = reqCatType[1].trim();
            // the tokenizer also removes unwanted characters:
            String[] sentence = this.filter.filterStopWords(this.tokenizer
                    .tokenize(reqText));
            // we use this data to train a machine learning algorithm
            this.dataSet.learn(sentence, reqCategory);
        }
        reader.close();

Edit: The following information might be useful for analyzing the problem:

mvn -v
Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
Maven home: /usr/share/maven
Java version: 1.6.0_33, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.6.8", arch: "x86_64", family: "mac"

Perhaps maven.apache.org/plugins/maven-resources-plugin/examples/… would be of help?
– Tadas S, Sep 5, 2012 at 5:05
Thanks for the suggestion, @afk5min, but if I applied it correctly, this does not solve the issue. I added the maven-resources-plugin with the configuration in the example, but nothing changed. As before, mvn install prints, among other messages, the following: "[INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] Copying 10 resources " Why did you think that this would help?
– oerich, Sep 5, 2012 at 5:24

Tony K. · Accepted Answer · 2012-09-05 05:41:47Z

So, your data file is in UTF-8. The Eclipse setting for that file has no bearing here, because it is the running Java program that decides how the bytes are interpreted.

FileReader always uses the platform default encoding, which is generally a bad idea. Eclipse is likely setting the "platform default" for you, whereas Maven is not: your mvn -v output reports the platform encoding as MacRoman.
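To illustrate the effect, here is a minimal sketch that decodes the same UTF-8 bytes twice, once as UTF-8 and once with a different single-byte charset. The class name DecodingMismatchDemo is made up for this example, and ISO-8859-1 is only a stand-in for MacRoman, since every JVM is guaranteed to ship ISO-8859-1 while MacRoman support is not guaranteed. The exact garbage characters will differ from the ones in the question, but the mechanism is the same.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the question's project.
public class DecodingMismatchDemo {

    // Decode the given bytes with an explicitly named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "card's" with a typographic apostrophe (U+2019), written as a
        // Unicode escape so the example does not depend on source encoding.
        String original = "card\u2019s";
        byte[] utf8Bytes = original.getBytes(Charset.forName("UTF-8"));

        String right = decode(utf8Bytes, "UTF-8");
        // Assumption: ISO-8859-1 stands in for the MacRoman platform default.
        String wrong = decode(utf8Bytes, "ISO-8859-1");

        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 decode matches original: " + right.equals(original));
        System.out.println("ISO-8859-1 decode matches original: " + wrong.equals(original));
        System.out.println("lengths: " + right.length() + " vs " + wrong.length());
    }
}
```

The apostrophe takes three bytes in UTF-8, so a single-byte charset turns it into three separate characters. A filter that looks for the single character ’ then no longer finds it, which would explain the extra "words" in the Maven run.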

Fix your code to specify the encoding.

See JavaDoc:

To specify these values yourself, construct an InputStreamReader on a FileInputStream.
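A minimal sketch of what the JavaDoc suggests, assuming the file really is UTF-8. The class and method names (Utf8LineReader, readLines) are invented for illustration, and the code sticks to APIs available in Java 6, which is the version shown in the question.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the question's project.
public class Utf8LineReader {

    private static final Charset UTF_8 = Charset.forName("UTF-8");

    // Reads all lines of the file, always decoding as UTF-8
    // regardless of the platform default encoding.
    static List<String> readLines(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), UTF_8));
        try {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } finally {
            // Closing the outer reader also closes the underlying stream.
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the path of a UTF-8 encoded file as the first argument.
        for (String line : readLines(new File(args[0]))) {
            System.out.println(line);
        }
    }
}
```

With the charset fixed in the code, the result should no longer depend on whether the program is started from Eclipse or from Maven.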

1 Comment

oerich Over a year ago

Thanks, that was the solution. Of course, I also had to change the part where I read the unwanted characters. The BufferedReader is now created as:

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filename), Charset.forName("UTF-8")));

For the input file, I should probably implement automatic detection of the encoding as described here: link. I hate being fooled by smart tools.
