Java - Remove duplicates from a string

Question

I have a string with a list of values separated by semicolon. I need some optimal method to remove duplicates. I have following regular expression:

\b(\w+);(?=.*\b\1;?)

This works, but it fails when there are white spaces. For example aaa bbb;aaa bbb;aaa bbb creates aaa aaa aaa bbb instead of aaa bbb.

Does solution need to use regex?

Pshemo
– Pshemo

2015-09-02 10:54:29 +00:00
Commented Sep 2, 2015 at 10:54 — Pshemo
– Pshemo, Commented Sep 2, 2015 at 10:54
No, I don't insist on that.

Martin Volek
– Martin Volek

2015-09-02 11:03:07 +00:00
Commented Sep 2, 2015 at 11:03 — Martin Volek
– Martin Volek, Commented Sep 2, 2015 at 11:03
Can we use Java 8 tools (streams/lambdas)?

Pshemo
– Pshemo

2015-09-02 11:05:16 +00:00
Commented Sep 2, 2015 at 11:05 — Pshemo
– Pshemo, Commented Sep 2, 2015 at 11:05

Pshemo · Accepted Answer · 2015-09-02 12:29:43Z

6

Probably simplest solution would be using Sets - collection which doesn't allow duplicates. Split your string on delimiter, and place it in set.

In Java 8 your code can look like:

String result = Stream.of(yourText.split(";"))          //stream of elements separated by ";"
                      .distinct()                       //remove duplicates in stream
                      .collect(Collectors.joining(";"));//create String joining rest of elements using ";"

Pre Java 8 solution can look like:

public String removeDuplicates(String yourText) {
    Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));

    Iterator<String> it = elements.iterator();

    StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
    while (it.hasNext()) {
        sb.append(';').append(it.next());
    }

    return sb.toString();
}

edited Sep 2, 2015 at 12:29

answered Sep 2, 2015 at 11:09

Pshemo

125k26 gold badges194 silver badges280 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Wiktor Stribiżew Over a year ago

Cool, looks like LINQ in C#.

anubhava Over a year ago

Good use of Stream +1

Martin Volek Over a year ago

As a .NET programmer I like the second method but unfortunately I cannot use Java 8.

Pshemo Over a year ago

@MartinVolek I updated my answer with pre Java 8 solution.

Frank Blaga · Accepted Answer · 2015-09-02 14:12:23Z

This can be implemented in multiple ways. As already mentioned previously, a HashSet is the right way to go. As you state that you need an "optimal" solution I took the time to optimize and benchmark several implementations.

We start with the pre-Java 8 solution by Pshemo:

public static String distinct0(String yourText) {
    Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));
    Iterator<String> it = elements.iterator();
    StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
    while (it.hasNext()) {
        sb.append(';').append(it.next());
    }
    return sb.toString();
}

This implementation uses String.split() which creates an array of Strings. This array is then converted to a List, which is added to a LinkedHashSet. The LinkedHashSet preserves the order in which elements have been added by maintaining an additional linked list. Next, an iterator is used to enumerate the elements from the set, which are then concatenated with a StringBuilder.

We can slightly optimize this method by realizing that we can already build the result while iterating over the individual elements in the input string. It is thus not necessary to store information about the order in which distinct strings have been found. This eliminates the need for a LinkedHashSet (and the Iterator):

public static String distinct1(String elements){
    StringBuilder builder = new StringBuilder();
    Set<String> set = new HashSet<String>();
    for (String value : elements.split(";")) {
        if (set.add(value)) {
            builder.append(set.size() != 1 ? ";" : "").append(value);
        }
    }
    return builder.toString();
}

Next, we can get rid of String.split() and thus avoid creating an intermediate array containing all elements from the input string:

public static String distinct2(String elements){

    char[] array = elements.toCharArray();
    StringBuilder builder = new StringBuilder();
    Set<String> set = new HashSet<String>();
    int last = 0;
    for (int index=0; index<array.length; index++) {
        if (array[index] == ';') {
            String value = new String(array, last, (index-last));
            if (set.add(value)) {
                builder.append(last != 0 ? ";" : "").append(value);
            }
            last = index + 1;
        }
    }
    return builder.toString();
}

Finally, we can get rid of unneccessary memory allocations by not constructing String objects for the individual elements, as the constructor String(array, offset, length) (which is also used by String.split()) will call Arrays.copyOfRange(...) to allocate a new char[]. To avoid this overhead, we can implement a wrapper around the input char[] which implements hashCode() and equals() for a given range. This can be used to detect wether a certain string is already contained in the result. Additionally, this method allows us to use StringBuilder.append(array, offset, length), which simply reads data from the provided array:

public static String distinct3(String elements){

    // Prepare
    final char[] array = elements.toCharArray();
    class Entry {
        final int from;
        final int to;
        final int hash;

        public Entry(int from, int to) {
            this.from = from;
            this.to = to;
            int hash = 0;
            for (int i = from; i < to; i++) {
                hash = 31 * hash + array[i];
            }
            this.hash = hash;
        }

        @Override
        public boolean equals(Object object) {
            Entry other = (Entry)object;
            if (other.to - other.from != this.to - this.from) {
                return false;
            }
            for (int i=0; i < this.to - this.from; i++) {
                if (array[this.from + i] != array[other.from + i]) {
                    return false;
                }
            }
            return true;
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    // Remove duplicates
    StringBuilder builder = new StringBuilder();
    Set<Entry> set = new HashSet<Entry>();
    int last = 0;
    for (int index=0; index<array.length; index++) {
        if (array[index] == ';') {
            Entry value = new Entry(last, index);
            if (set.add(value)) {
                builder.append(last != 0 ? ";" : "").append(array, last, index-last);
            }
            last = index + 1;
        }
    }
    return builder.toString();
}

I compared these implementations with the following code:

public static void main(String[] args) {

    int REPETITIONS = 10000000;
    String VALUE = ";aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
                   "aaa bbb;;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
                   "aaa bbb;aaa bbb;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb";

    long time = System.currentTimeMillis();
    String result = null;
    for (int i = 0; i < REPETITIONS; i++) {
        result = distinct0(VALUE);
    }
    System.out.println(result + " - " + (double) (System.currentTimeMillis() - time) /
                                        (double) REPETITIONS + " [ms] per call");
}

Which gave me the following results when running it on my machine with JDK 1.7.0_51:

distinct0: 0.0021881 [ms] per call
distinct1: 0.0018433 [ms] per call
distinct2: 0.0016780 [ms] per call
distinct3: 0.0012777 [ms] per call

While being undoubtedly much more complex and much less readable than the original version, the optimized implementation is almost twice as fast. If a simple and readable solution is needed, I would choose either the first or the second implementation, if a fast one is needed, I would choose the last implementation.

Wiktor Stribiżew · Accepted Answer · 2015-09-02 11:37:55Z

0

You can use

(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)

See demo

Replacing aaa bbb;aaa bbb;aaa bbb with space results in aaa bbb.

All the mutliple consecutive ;s will have to be replaced with 2 post-processing steps:

.replaceAll("^;+|;+$", "") - removes the leading/trailing semi-colons
.replaceAll(";+",";") - merges all multiple ; into 1 ;.

Here is the final IDEONE demo:

String s = "ccc;aaa bbb;aaa bbb;bbb";
s = s.replaceAll("(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)", "").replaceAll("^;+|;+$", "").replaceAll(";+",";");
System.out.println(s);

edited Sep 2, 2015 at 11:37

answered Sep 2, 2015 at 10:54

Wiktor Stribiżew

631k41 gold badges502 silver badges633 bronze badges

4 Comments

Pshemo Over a year ago

You are very close, but test your solution on aaa;aaa bbb;bbb (OP regex wasn't perfect for few more reasons).

Wiktor Stribiżew Over a year ago

Good point, I think I have got a headache, I will try improve it after lunch.

Martin Volek Over a year ago

Thanks for the reply but it doesn't work for aaa bbb;a;aaa bbb.

Wiktor Stribiżew Over a year ago

I updated the code. I do not insist on my solution, I just try to show how it could be done with regex. I like the lambdas and I use them extensively in C#.

Lucky · Accepted Answer · 2015-09-02 12:23:07Z

0

If optimal method == lesser computational complexity then

Parse the string from the begining, value by value, and create a parallel HashSet with the values you found. When a value exists in the set, you ignore it and go to the next. If a value does not exist in the set, emit it and add to the set.

Find and add at a HashSet are O(1) operations so this algorithm should be O(n).

It is also O(n) on memory consumption, it could be something to consider depending on the input.

edited Sep 2, 2015 at 12:23

Lucky

17.4k20 gold badges121 silver badges156 bronze badges

answered Sep 2, 2015 at 11:41

José

1365 bronze badges

Collectives™ on Stack Overflow

Java - Remove duplicates from a string

4 Answers 4

4 Comments

Comments

4 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

4 Comments

Comments

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related