I'm trying to remove consecutive duplicate words from a text file, and someone mentioned that I could do something like this:

Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // lineOfWords is a List<String> holding each line read from the .txt file
    Matcher m = p.matcher(s.toUpperCase());
    // and then do something like
    while (m.find()) {
        // do something here
    }
}

I tried looking at m.end() to see whether I could build a new string, or remove the items where the matches are, but after reading the documentation I still wasn't sure how it works. For example, as a test case, I did:

if (m.find()) {
    System.out.println(s.substring(i, m.end()));
}

on a text file containing: This is an example example test test test.

Why is my output This is?

Edit:

Suppose I have an ArrayList lineOfWords that holds each line read from a .txt file, and I create a new ArrayList to hold the modified strings. For example:

List<String> newString = new ArrayList<String>();
for (String s : lineOfWords) {
    // regex and replacement taken from Kobi's answer
    s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    newString.add(s);
}

But this gives me the original s rather than the modified one. Is it because of shallow vs. deep copying?

  • What's i in that second fragment? There is no trace of it anywhere else in the code you show... Commented Aug 4, 2010 at 4:48
  • sorry, i is equal to 0, added it back in. Commented Aug 4, 2010 at 4:50
  • Hi, Crystal. It is best to ask a new question in that case; it really is another question on another subject. (On a related note: back when I studied Java, it had neither generics nor foreach loops :P) Commented Aug 6, 2010 at 9:26

2 Answers

Try something like:

s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");

That regex is a bit stricter than yours: it matches whole words only (no partial matches) and removes any number of consecutive repetitions.
The regex captures a first word: \b(\w+)\b, and then attempts to match spaces and repetitions of that word: (\s+\1)+. The final \b is to avoid partial matching of \1, as in "for formatting".
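
As a minimal sketch of how that replacement behaves, the snippet below applies the same regex to the sample sentence from the question. The class name CollapseRepeatsDemo and the helper collapse are hypothetical names used only for illustration; the regex and the "$1" replacement come from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex and "$1" come from the answer.
class CollapseRepeatsDemo {
    // \b(\w+)\b  : a whole word, captured as group 1
    // (\s+\1)+   : one or more repetitions of that word, each preceded by whitespace
    // \b         : the last repetition must end at a word boundary
    private static final Pattern REPEATED_WORD =
            Pattern.compile("\\b(\\w+)\\b(\\s+\\1)+\\b");

    // Replaces every run of a repeated word with a single copy of it.
    static String collapse(String line) {
        return REPEATED_WORD.matcher(line).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints: This is an example test.
        System.out.println(collapse("This is an example example test test test."));
        // prints: tips for formatting  (the final \b rejects the partial match)
        System.out.println(collapse("tips for formatting"));
    }
}
```

Compiling the pattern once and reusing it is a design choice for the sketch; calling s.replaceAll(...) directly, as in the answer, gives the same result for a single string.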

5 Comments

That helped out a lot. Is there a way to check for things that are different case? Like "test Test"?
@Crystal - Thanks! You can add (?i) at the beginning of the regex to make it case-insensitive; that seems to be the standard solution for replaceAll.
Another question, Kobi, if you have a second: if I am looping through an ArrayList that has my lines of words from a test file with a foreach loop, like for (String s: lineOfWords) { s = s.replaceAll..., how would I add this new "s" to the new ArrayList I want to return? I think it has to do with shallow vs. deep copy, but I'm not sure. I tried pseudo-coding it in my initial question above. Thx!
You mustn’t use \b and such in Java. They are super-broken. For example, the string élève is not matched by the pattern \b\w+\b anywhere whatsoever.
@tchrist - Hello! Yes, I've noticed you raise that unfortunate issue lately. I'll keep it in mind when Unicode support is necessary. I guess the best workaround here is not to use a monstrosity of a regex for every \b or \w, but to use a regex library that works :P
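
The points raised in these comments can be sketched together, under a few assumptions. DedupeLinesDemo, dedupe, and the two pattern constants are hypothetical names. The second pattern uses Pattern.UNICODE_CHARACTER_CLASS, which exists only in Java 7 and later and is not mentioned in the thread; it is shown as one possible way to make \w and \b treat accented letters as word characters, not as the fix the commenters had in mind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class, not code from the thread.
class DedupeLinesDemo {
    // (?i) makes the whole match, including the \1 backreference, case-insensitive.
    static final Pattern IGNORE_CASE =
            Pattern.compile("(?i)\\b(\\w+)\\b(\\s+\\1)+\\b");

    // Assumption: Java 7+; the flag makes \w and \b follow Unicode definitions.
    static final Pattern UNICODE_AWARE =
            Pattern.compile("\\b(\\w+)\\b(\\s+\\1)+\\b",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

    // Builds a new list; the input list is left unchanged.
    static List<String> dedupe(List<String> lines, Pattern pattern) {
        List<String> result = new ArrayList<>();
        for (String s : lines) {
            // Strings are immutable, so replaceAll returns a new String.
            // Adding that return value is enough; no copying is involved.
            result.add(pattern.matcher(s).replaceAll("$1"));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [test again]
        System.out.println(dedupe(Arrays.asList("test Test again"), IGNORE_CASE));
        // The escapes spell the accented word from the comment above, twice.
        List<String> accented = Arrays.asList("\u00e9l\u00e8ve \u00e9l\u00e8ve");
        // With the Unicode flag the repeated accented word collapses to one copy.
        System.out.println(dedupe(accented, UNICODE_AWARE));
    }
}
```

The replacement keeps the first spelling of the word, so "test Test" becomes "test".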

The first match is "ThIS IS an example...", so m.end() points to the end of the second "is". I'm not sure why you use i for the start index; try m.start() instead.
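
To make the offsets concrete, here is a small sketch that reproduces the question's test with the original pattern. MatchSpanDemo and its two helpers are hypothetical names. It also assumes, as the question's code does, that the upper-cased copy has the same length as the original line, which holds for this ASCII sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class reproducing the question's experiment.
class MatchSpanDemo {
    private static final Pattern P = Pattern.compile("(\\w+) \\1");

    // What the question printed: everything from index 0 up to the end of the first match.
    static String fromLineStart(String s) {
        Matcher m = P.matcher(s.toUpperCase());
        return m.find() ? s.substring(0, m.end()) : null;
    }

    // What the answer suggests: only the matched text, from m.start() to m.end().
    static String matchedText(String s) {
        Matcher m = P.matcher(s.toUpperCase());
        return m.find() ? s.substring(m.start(), m.end()) : null;
    }

    public static void main(String[] args) {
        String line = "This is an example example test test test.";
        System.out.println(fromLineStart(line)); // prints: This is
        System.out.println(matchedText(line));   // prints: is is
    }
}
```

The first match starts inside "This" (index 2) and ends after the second "is" (index 7), which is why cutting from index 0 yields "This is".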

To improve your regex, use \b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b). Otherwise, as you're seeing, you'll get matches inside of words.
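
A rough sketch of the effect of those word boundaries follows. WordBoundaryDemo and findAll are hypothetical names; the helper simply collects every match of a given regex so the patterns can be compared side by side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for comparing the patterns discussed here.
class WordBoundaryDemo {
    // Returns the text of every match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "This is an example example test test test.";
        // Without boundaries the first match starts inside "This".
        System.out.println(findAll("(\\w+) \\1", line));
        // prints: [is is, example example, test test]

        // With boundaries around the captured word, only whole words start a match.
        System.out.println(findAll("(\\b\\w+\\b) \\1", line));
        // prints: [example example, test test]

        // The repetition itself can still match a prefix of a longer word...
        System.out.println(findAll("(\\b\\w+\\b) \\1", "tips for formatting"));
        // prints: [for for]

        // ...unless a trailing \b is added as well.
        System.out.println(findAll("(\\b\\w+\\b) \\1\\b", "tips for formatting"));
        // prints: []
    }
}
```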
