
I'm currently developing a parser and I'm stuck on one method.

I need to clean specific words out of some sentences, meaning replacing them with a whitespace or an empty string. For now, I came up with this code:

private void clean(String sentence)
{
    // Declared outside the try block so it is still in scope for the loop below
    List<String> wordList = new ArrayList<String>();

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(
                    ConfigHandler.getDefault(DictionaryType.CLEANING).getDictionaryFile())))) {
        String read;
        while ((read = reader.readLine()) != null) {
            wordList.add(read);
        }
    }
    catch (IOException e) {
        e.printStackTrace();
    }

    for (String s : wordList) {
        if (StringUtils.containsIgnoreCase(sentence, s)) { // this comes from Apache Lang
            sentence = sentence.replaceAll("(?i)" + s + "\\b", " ");
        }
    }

    cleanedList.add(sentence);
}

But when I look at the output, every occurrence of the word in my sentence is replaced by a whitespace, not only the whole-word matches.

Can anybody help me replace only the exact words in my sentence?

Thanks in advance!

  • sentence.replaceAll("(?i)\\b" + s + "\\b", " "); - you omitted the leading \b word boundary. Commented Mar 9, 2016 at 10:48

1 Answer

There are two problems in your code:

  • You are missing the \b before the string
  • You will run into issues if any of the words from the file has special characters

To fix both problems, construct your regex as follows:

sentence = sentence.replaceAll("(?i)\\b\\Q" + s + "\\E\\b", " ");

or

sentence = sentence.replaceAll("(?i)\\b" + Pattern.quote(s) + "\\b", " ");
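
As a minimal sketch of the difference, the snippet below compares the pattern from the question with the quoted, fully bounded one. The class name WordCleaner and the helper removeWord are hypothetical names used only for illustration; they are not part of the question's code.

```java
import java.util.regex.Pattern;

public class WordCleaner {
    // Hypothetical helper: replaces whole-word, case-insensitive
    // occurrences of `word` with a single space.
    static String removeWord(String sentence, String word) {
        // Pattern.quote makes every character of `word` literal;
        // \b on both sides restricts matches to whole words.
        return sentence.replaceAll("(?i)\\b" + Pattern.quote(word) + "\\b", " ");
    }

    public static void main(String[] args) {
        String sentence = "concat the cat";

        // Pattern from the question: no leading \b, so the "cat"
        // at the end of "concat" is replaced as well.
        System.out.println("[" + sentence.replaceAll("(?i)cat\\b", " ") + "]");

        // Bounded on both sides: only the standalone "cat" is replaced.
        System.out.println("[" + removeWord(sentence, "cat") + "]");

        // Quoting keeps "." literal, so "a.b" does not match "axb".
        System.out.println("[" + removeWord("axb a.b", "a.b") + "]");
    }
}
```

Note that \b only matches next to a letter, digit, or underscore, so a dictionary entry that starts or ends with punctuation would presumably need a different boundary check.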

5 Comments

I tried both of your solutions and neither gave me the right output. The complete regular expression is /\b(my_word)\b/gi. I think the only thing I'm missing in my code is the /g part, but I do not know whether it's implicit or not.
@TimmyMdfck Are you looking for myword literally enclosed in parentheses, e.g. "(brown)" in "Quick (brown) fox"? The solution above assumes that the search for words is verbatim, including all special characters, and treating them as non-special.
Actually, I have a list of French words in a *.dat file and a whole text in a *.txt file. My parser collects all sentences that are not questions and writes them to another txt file. After that, the clean method is run on the output file to erase all the words that are present in the dat file. That is where my problem is. I tried it in a regex tester (here's the link with everything in it: regex101.com/r/cU5lC2/507) and it works like a charm. I don't understand where I'm going wrong :(
@TimmyMdfck Are you using the loop the way your code shows, or do you concatenate strings with "|" and use as a single expression? Your regex from regex101 uses parentheses as metacharacter. This means that you should remove \\Q and \\E, and not use Pattern.quote, because your list of words needs to be interpreted as a regex.
For now, I'm doing it with the loop but I was thinking about doing it the same way as in the link.
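
The single-expression approach discussed in these comments could look roughly like the sketch below. It assumes the dictionary holds plain words rather than regex fragments, so each word is quoted before being joined with "|"; if the entries are meant to be interpreted as regexes, the quoting step would have to go. The names AlternationCleaner, buildPattern and clean are hypothetical. String.replaceAll and Matcher.replaceAll already replace every match, so the /g flag from regex101 needs no separate Java counterpart.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class AlternationCleaner {
    // Hypothetical helper: builds \b(?:w1|w2|...)\b once for the whole list,
    // quoting each word so that it is matched literally.
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
    }

    // Replaces every whole-word match in the sentence with a single space.
    static String clean(String sentence, Pattern pattern) {
        return pattern.matcher(sentence).replaceAll(" ");
    }

    public static void main(String[] args) {
        Pattern pattern = buildPattern(Arrays.asList("cat", "dog"));
        // "concatenate" is left alone; the standalone "Cat" and "dog" are replaced.
        System.out.println("[" + clean("concatenate Cat and dog", pattern) + "]");
    }
}
```

Compiling the pattern once and reusing it avoids rebuilding a regex for every word on every sentence, which the loop in the question does.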
