3

I need to replace the duplicate characters in a string. I tried using

outputString = str.replaceAll("(.)(?=.*\\1)", "");

This replaces the duplicate characters but the position of the characters changes as shown below.

input

haih

output

aih

But I need to get an output hai. That is the order of the characters that appear in the string should not change. Given below are the expected outputs for some inputs.

input

aaaassssddddd

output

asd

input

cdddddggggeeccc

output

cdge

How can this be achieved?

6
  • 1
    What's your output if the input is hhaiha ? Commented Nov 8, 2014 at 3:36
  • @AvinashRaj hhaiha will give iha Commented Nov 8, 2014 at 3:38
  • 1
    so you want to remove the duplicates which are present at the first. Then your regex seems correct. Commented Nov 8, 2014 at 3:39
  • characters should only come once in the string and the order in which the characters appear in the string should not change Commented Nov 8, 2014 at 3:41
  • still i don't understand your point. A single example wouldn't be enough, so post some more along with the expected output. Commented Nov 8, 2014 at 3:43

2 Answers 2

5

It seems like your code is leaving the last character, so how about this?

outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai
Sign up to request clarification or add additional context in comments.

3 Comments

No? // ccceeeggggddddc // egdc // cdge
@AvinashRaj output for cdddddggggeeccc is cdge is correct
@J.Titus is there any other way to do this directly by a regex
2

Overview

It's possible with Oracle's implementation, but I wouldn't recommend this answer for many reasons:

  • It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF}, {n, 0x7FFFFFFF} respectively, which allows the look-behind to contains such quantifiers. Since it relies on a bug, there is no guarantee that it will work similarly in the future.

  • It is unmaintainable mess. Writing normal code and any people who have some basic Java knowledge can read it, but using the regex in this answer limits the number of people who can understand the code at a glance to people who understand the in and out of regex implementation.

Therefore, this answer is for educational purpose, rather than something to be used in production code.

Solution

Here is the one-liner replaceAll regex solution:

String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","")

Printing out the regex:

(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)

What we want to do is to look-behind to see whether the same character has appeared before or not. The capturing group (.) at the beginning captures the current character, and the look-behind group is there to check whether the character has appeared before. So far, so good.

However, since backreferences \1 doesn't have obvious length, it can't appear in the look-behind directly.

This is where we make use of the bug to look-behind up to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see (?<=(?=...).+).

This is not the end of the problem, though. While the non-assertion pattern inside look-behind .+ can't advance past the position after the character in (.), the look-ahead inside can. As a simple test:

"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"

To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position (?=\\1.*?\\1\\2$).

Can this be done in one replacement without using look-behind?

I think it is impossible. We need to differentiate the first appearance of a character with subsequent appearance of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for all characters in the string.

For your information, this is for removing all subsequent appearance of a fixed character (h is used here):

.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")

To do this for multiple characters, we must keep track of whether the character has appeared before or not, across matches and for all characters. The regex above shows the across matches part, but the other condition kinda makes this impossible.

We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.