How to remove duplicate characters in a string using regex?

Question

I need to replace the duplicate characters in a string. I tried using

outputString = str.replaceAll("(.)(?=.*\\1)", "");

This replaces the duplicate characters but the position of the characters changes as shown below.

input

haih

output

aih

But I need to get an output hai. That is the order of the characters that appear in the string should not change. Given below are the expected outputs for some inputs.

input

aaaassssddddd

output

asd

input

cdddddggggeeccc

output

cdge

How can this be achieved?

so you want to remove the duplicates which are present at the first. Then your regex seems correct. — Avinash Raj
– Avinash Raj, Commented Nov 8, 2014 at 3:39
characters should only come once in the string and the order in which the characters appear in the string should not change — Johny
– Johny, Commented Nov 8, 2014 at 3:41
still i don't understand your point. A single example wouldn't be enough, so post some more along with the expected output. — Avinash Raj
– Avinash Raj, Commented Nov 8, 2014 at 3:43

J. Titus · Accepted Answer · 2014-11-08 03:51:08Z

5

It seems like your code is leaving the last character, so how about this?

outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai

answered Nov 8, 2014 at 3:51

J. Titus

9,7201 gold badge37 silver badges46 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

J. Titus Over a year ago

No? // ccceeeggggddddc // egdc // cdge

Johny Over a year ago

@AvinashRaj output for cdddddggggeeccc is cdge is correct

Johny Over a year ago

@J.Titus is there any other way to do this directly by a regex

Community · Accepted Answer · 2017-05-23 11:47:12Z

Overview

It's possible with Oracle's implementation, but I wouldn't recommend this answer for many reasons:

It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF}, {n, 0x7FFFFFFF} respectively, which allows the look-behind to contains such quantifiers. Since it relies on a bug, there is no guarantee that it will work similarly in the future.
It is unmaintainable mess. Writing normal code and any people who have some basic Java knowledge can read it, but using the regex in this answer limits the number of people who can understand the code at a glance to people who understand the in and out of regex implementation.

Therefore, this answer is for educational purpose, rather than something to be used in production code.

Solution

Here is the one-liner replaceAll regex solution:

String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","")

Printing out the regex:

(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)

What we want to do is to look-behind to see whether the same character has appeared before or not. The capturing group (.) at the beginning captures the current character, and the look-behind group is there to check whether the character has appeared before. So far, so good.

However, since backreferences \1 doesn't have obvious length, it can't appear in the look-behind directly.

This is where we make use of the bug to look-behind up to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see (?<=(?=...).+).

This is not the end of the problem, though. While the non-assertion pattern inside look-behind .+ can't advance past the position after the character in (.), the look-ahead inside can. As a simple test:

"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"

To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position (?=\\1.*?\\1\\2$).

Can this be done in one replacement without using look-behind?

I think it is impossible. We need to differentiate the first appearance of a character with subsequent appearance of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for all characters in the string.

For your information, this is for removing all subsequent appearance of a fixed character (h is used here):

.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")

To do this for multiple characters, we must keep track of whether the character has appeared before or not, across matches and for all characters. The regex above shows the across matches part, but the other condition kinda makes this impossible.

We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.

Collectives™ on Stack Overflow

How to remove duplicate characters in a string using regex?

input

output

input

output

input

output

2 Answers 2

3 Comments

Overview

Solution

Can this be done in one replacement without using look-behind?

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

input

output

input

output

input

output

2 Answers 2

3 Comments

Overview

Solution

Can this be done in one replacement without using look-behind?

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related