Pattern, matcher in Java, REGEX help

Question

I'm trying to remove consecutive duplicate words from a text file, and someone mentioned that I could do something like this:

Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // lineOfWords is a List<String> holding each line read from the .txt file
    Matcher m = p.matcher(s.toUpperCase());
    // and then do something like
    while (m.find()) {
        // do something here
    }
}

I tried looking at m.end() to see if I could build a new string, or remove the item(s) where the matches are, but I wasn't sure how it works after reading the documentation. For example, as a test case to see how it worked, I did:

if (m.find()) {
    System.out.println(s.substring(i, m.end()));
}

I ran it on a text file that contains: This is an example example test test test.

Why is my output "This is"?

Edit:

I have an ArrayList, lineOfWords, that holds each line read from a .txt file, and I create a new ArrayList to hold the modified strings. For example:

List<String> newString = new ArrayList<String>();
for (String s : lineOfWords) {
    // regex and replacement taken from Kobi's answer
    s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    newString.add(s);
}

But this gives me the original s rather than the new one. Is it because of shallow vs. deep copy?

What's i in that second fragment? There is no trace of it anywhere else in the code you show...
– Alex Martelli, Commented Aug 4, 2010 at 4:48
Hi, Crystal. It is best to ask a new question in that case; it really is another question on another subject. (On a relevant note: back when I studied Java it had neither generics nor foreach loops :P)
– Kobi, Commented Aug 6, 2010 at 9:26
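
On the follow-up in the question's edit, here is a minimal sketch of that loop, assuming lineOfWords is a List<String> and using the regex from Kobi's answer. The names DedupeLines and dedupeAll are made up for illustration. Java strings are immutable, so replaceAll returns a new String and reassigning s only rebinds the loop variable; the new list receives the modified text and the original list is left untouched, so shallow vs. deep copying should not come into play here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DedupeLines {
    // Regex from Kobi's answer: a whole word followed by one or more repeats of itself.
    static final String REPEATED_WORD = "\\b(\\w+)\\b(\\s+\\1)+\\b";

    // Hypothetical helper: returns a new list with repeated words collapsed on each line.
    static List<String> dedupeAll(List<String> lineOfWords) {
        List<String> result = new ArrayList<String>();
        for (String s : lineOfWords) {
            // replaceAll returns a new String; s is only a local reference.
            s = s.replaceAll(REPEATED_WORD, "$1");
            result.add(s);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("This is an example example test test test.");
        System.out.println(dedupeAll(lines)); // prints: [This is an example test.]
        System.out.println(lines);            // the input list is unchanged
    }
}
```

If the new list still shows the original text, one possible cause is that the regex never matched (for example, repeats that differ in case), not the copy semantics.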

Kobi · Accepted Answer · 2010-08-04 04:58:43Z

3

Try something like:

s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");

That regex is a bit stronger than yours - it checks for whole words (no partial matches), and gets rid of any number of consecutive repetitions.
The regex captures a first word: \b(\w+)\b, and then attempts to match spaces and repetitions of that word: (\s+\1)+. The final \b is to avoid partial matching of \1, as in "for formatting".
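
A small sketch of how this behaves on the sample sentence from the question, and on the "for formatting" case. The class and method names (RepeatedWords, collapse) are assumptions for illustration only.

```java
class RepeatedWords {
    // Collapses runs of the same whole word into a single occurrence.
    static String collapse(String s) {
        return s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        // "example example" and "test test test" are each reduced to one word.
        System.out.println(collapse("This is an example example test test test."));
        // prints: This is an example test.

        // The final \b stops "for" from matching the start of "formatting".
        System.out.println(collapse("for formatting"));
        // prints: for formatting
    }
}
```

Note that the "is" inside "This" is left alone, because the leading \b requires the captured word to start at a word boundary.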

edited Aug 4, 2010 at 4:58

answered Aug 4, 2010 at 4:52


5 Comments

Crystal Over a year ago

That helped out a lot. Is there a way to check for things that are different case? Like "test Test"?

Kobi Over a year ago

@Crystal - Thanks! You can add (?i) at the beginning of the regex to make it case-insensitive, it seems like the standard solution for replaceAll.
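
A minimal sketch of the (?i) variant, with made-up names (RepeatedWordsIgnoreCase, collapse). With the flag set, the backreference \1 also compares case-insensitively, and $1 keeps the casing of the first occurrence.

```java
class RepeatedWordsIgnoreCase {
    // Same regex as in the answer, prefixed with (?i) for case-insensitive matching.
    static String collapse(String s) {
        return s.replaceAll("(?i)\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("test Test TEST done")); // prints: test done
        System.out.println(collapse("Test test"));           // prints: Test
    }
}
```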

Crystal Over a year ago

Another question, Kobi, if you have a second: if I am looping through an ArrayList that has my lines of words from a test file, using a foreach loop like for (String s : lineOfWords) { s = s.replaceAll..., how would I add this new "s" to the new ArrayList that I return? I think it has to do with shallow vs. deep copy, but I'm not sure. I tried pseudo-coding it in my initial question above. Thx!

tchrist Over a year ago

You mustn’t use \b and such in Java. They are super-broken. For example, the string élève is not matched by the pattern \b\w+\b anywhere whatsoever.

Kobi Over a year ago

@tchrist - Hello! Yes, I've noticed you raise that unfortunate issue lately. I'll keep it in mind when Unicode support is necessary. I guess the best workaround here is not to use a monstrosity of a regex for every \b or \w, but to use a regex library that works :P
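
A hedged note on the exchange above: from Java 7 onward the standard library has the Pattern.UNICODE_CHARACTER_CLASS flag (inline form (?U)), which makes \w and \b follow Unicode definitions. The sketch below assumes Java 7 or later; the class and method names are made up for illustration. Behavior without the flag is deliberately not shown, since it may differ between Java versions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWords {
    // \w and \b use Unicode definitions when this flag is set (Java 7+).
    static final Pattern WORD =
            Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the first whole word found, or null if there is none.
    static String firstWord(String s) {
        Matcher m = WORD.matcher(s);
        return m.find() ? m.group() : null;
    }

    // The duplicate-word regex from the answer, with the inline (?U) flag added.
    static String collapse(String s) {
        return s.replaceAll("(?U)\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        String eleve = "\u00e9l\u00e8ve"; // "élève", written with escapes to avoid encoding issues
        System.out.println(eleve.equals(firstWord(eleve)));            // prints: true
        System.out.println(eleve.equals(collapse(eleve + " " + eleve))); // prints: true
    }
}
```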

John Kugelman · Accepted Answer · 2010-08-04 04:51:31Z

1

The first match is "ThIS IS an example...", so m.end() points to the end of the second "is". I'm not sure why you use i for the start index; try m.start() instead.
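
To make the indices concrete, here is a small sketch using the question's pattern and sample line. FirstMatchDemo and firstMatchSpan are hypothetical names; the sample assumes plain ASCII text, so toUpperCase() does not change the string length.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatchDemo {
    // Returns {start, end} of the first match of the question's pattern, or null.
    static int[] firstMatchSpan(String s) {
        Matcher m = Pattern.compile("(\\w+) \\1").matcher(s.toUpperCase());
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        int[] span = firstMatchSpan(s);
        System.out.println(span[0] + ".." + span[1]);      // prints: 2..7
        System.out.println(s.substring(0, span[1]));       // prints: This is
        System.out.println(s.substring(span[0], span[1])); // prints: is is
    }
}
```

With i = 0 as the start index, substring(i, m.end()) returns everything from the beginning of the line up to the end of the match, which is why the output is "This is".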

To improve your regex, use \b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b). Otherwise, as you're seeing, you'll get matches inside of words.
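
A short sketch comparing the question's pattern with the boundary version suggested here. The names BoundaryDemo and firstRepeat are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Returns the text of the first match of the given regex in s, or null.
    static String firstRepeat(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        // Without boundaries the match starts inside "This".
        System.out.println(firstRepeat("(\\w+) \\1", s));         // prints: is is
        // With boundaries the captured word must be a whole word.
        System.out.println(firstRepeat("(\\b\\w+\\b) \\1", s));   // prints: example example
    }
}
```

This version still lets \1 match the start of a longer word (as in "for formatting"); the trailing \b in Kobi's regex covers that case.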

answered Aug 4, 2010 at 4:51
