0

I have a string with a list of values separated by semicolon. I need some optimal method to remove duplicates. I have following regular expression:

\b(\w+);(?=.*\b\1;?)

This works, but it fails when there are white spaces. For example aaa bbb;aaa bbb;aaa bbb creates aaa aaa aaa bbb instead of aaa bbb.

3
  • 2
    Does solution need to use regex? Commented Sep 2, 2015 at 10:54
  • No, I don't insist on that. Commented Sep 2, 2015 at 11:03
  • 2
    Can we use Java 8 tools (streams/lambdas)? Commented Sep 2, 2015 at 11:05

4 Answers 4

6

Probably simplest solution would be using Sets - collection which doesn't allow duplicates. Split your string on delimiter, and place it in set.

In Java 8 your code can look like:

String result = Stream.of(yourText.split(";"))          //stream of elements separated by ";"
                      .distinct()                       //remove duplicates in stream
                      .collect(Collectors.joining(";"));//create String joining rest of elements using ";"

Pre Java 8 solution can look like:

public String removeDuplicates(String yourText) {
    Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));

    Iterator<String> it = elements.iterator();

    StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
    while (it.hasNext()) {
        sb.append(';').append(it.next());
    }

    return sb.toString();
}
Sign up to request clarification or add additional context in comments.

4 Comments

Cool, looks like LINQ in C#.
Good use of Stream +1
As a .NET programmer I like the second method but unfortunately I cannot use Java 8.
@MartinVolek I updated my answer with pre Java 8 solution.
1

This can be implemented in multiple ways. As already mentioned previously, a HashSet is the right way to go. As you state that you need an "optimal" solution I took the time to optimize and benchmark several implementations.

We start with the pre-Java 8 solution by Pshemo:

public static String distinct0(String yourText) {
    Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));
    Iterator<String> it = elements.iterator();
    StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
    while (it.hasNext()) {
        sb.append(';').append(it.next());
    }
    return sb.toString();
}

This implementation uses String.split() which creates an array of Strings. This array is then converted to a List, which is added to a LinkedHashSet. The LinkedHashSet preserves the order in which elements have been added by maintaining an additional linked list. Next, an iterator is used to enumerate the elements from the set, which are then concatenated with a StringBuilder.

We can slightly optimize this method by realizing that we can already build the result while iterating over the individual elements in the input string. It is thus not necessary to store information about the order in which distinct strings have been found. This eliminates the need for a LinkedHashSet (and the Iterator):

public static String distinct1(String elements){
    StringBuilder builder = new StringBuilder();
    Set<String> set = new HashSet<String>();
    for (String value : elements.split(";")) {
        if (set.add(value)) {
            builder.append(set.size() != 1 ? ";" : "").append(value);
        }
    }
    return builder.toString();
}

Next, we can get rid of String.split() and thus avoid creating an intermediate array containing all elements from the input string:

public static String distinct2(String elements){

    char[] array = elements.toCharArray();
    StringBuilder builder = new StringBuilder();
    Set<String> set = new HashSet<String>();
    int last = 0;
    for (int index=0; index<array.length; index++) {
        if (array[index] == ';') {
            String value = new String(array, last, (index-last));
            if (set.add(value)) {
                builder.append(last != 0 ? ";" : "").append(value);
            }
            last = index + 1;
        }
    }
    return builder.toString();
}

Finally, we can get rid of unneccessary memory allocations by not constructing String objects for the individual elements, as the constructor String(array, offset, length) (which is also used by String.split()) will call Arrays.copyOfRange(...) to allocate a new char[]. To avoid this overhead, we can implement a wrapper around the input char[] which implements hashCode() and equals() for a given range. This can be used to detect wether a certain string is already contained in the result. Additionally, this method allows us to use StringBuilder.append(array, offset, length), which simply reads data from the provided array:

public static String distinct3(String elements){

    // Prepare
    final char[] array = elements.toCharArray();
    class Entry {
        final int from;
        final int to;
        final int hash;

        public Entry(int from, int to) {
            this.from = from;
            this.to = to;
            int hash = 0;
            for (int i = from; i < to; i++) {
                hash = 31 * hash + array[i];
            }
            this.hash = hash;
        }

        @Override
        public boolean equals(Object object) {
            Entry other = (Entry)object;
            if (other.to - other.from != this.to - this.from) {
                return false;
            }
            for (int i=0; i < this.to - this.from; i++) {
                if (array[this.from + i] != array[other.from + i]) {
                    return false;
                }
            }
            return true;
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    // Remove duplicates
    StringBuilder builder = new StringBuilder();
    Set<Entry> set = new HashSet<Entry>();
    int last = 0;
    for (int index=0; index<array.length; index++) {
        if (array[index] == ';') {
            Entry value = new Entry(last, index);
            if (set.add(value)) {
                builder.append(last != 0 ? ";" : "").append(array, last, index-last);
            }
            last = index + 1;
        }
    }
    return builder.toString();
}

I compared these implementations with the following code:

public static void main(String[] args) {

    int REPETITIONS = 10000000;
    String VALUE = ";aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
                   "aaa bbb;;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
                   "aaa bbb;aaa bbb;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb";

    long time = System.currentTimeMillis();
    String result = null;
    for (int i = 0; i < REPETITIONS; i++) {
        result = distinct0(VALUE);
    }
    System.out.println(result + " - " + (double) (System.currentTimeMillis() - time) /
                                        (double) REPETITIONS + " [ms] per call");
}

Which gave me the following results when running it on my machine with JDK 1.7.0_51:

  • distinct0: 0.0021881 [ms] per call
  • distinct1: 0.0018433 [ms] per call
  • distinct2: 0.0016780 [ms] per call
  • distinct3: 0.0012777 [ms] per call

While being undoubtedly much more complex and much less readable than the original version, the optimized implementation is almost twice as fast. If a simple and readable solution is needed, I would choose either the first or the second implementation, if a fast one is needed, I would choose the last implementation.

Comments

0

You can use

(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)

See demo

Replacing aaa bbb;aaa bbb;aaa bbb with space results in aaa bbb.

All the mutliple consecutive ;s will have to be replaced with 2 post-processing steps:

  • .replaceAll("^;+|;+$", "") - removes the leading/trailing semi-colons
  • .replaceAll(";+",";") - merges all multiple ; into 1 ;.

Here is the final IDEONE demo:

String s = "ccc;aaa bbb;aaa bbb;bbb";
s = s.replaceAll("(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)", "").replaceAll("^;+|;+$", "").replaceAll(";+",";");
System.out.println(s); 

4 Comments

You are very close, but test your solution on aaa;aaa bbb;bbb (OP regex wasn't perfect for few more reasons).
Good point, I think I have got a headache, I will try improve it after lunch.
Thanks for the reply but it doesn't work for aaa bbb;a;aaa bbb.
I updated the code. I do not insist on my solution, I just try to show how it could be done with regex. I like the lambdas and I use them extensively in C#.
0

If optimal method == lesser computational complexity then

Parse the string from the begining, value by value, and create a parallel HashSet with the values you found. When a value exists in the set, you ignore it and go to the next. If a value does not exist in the set, emit it and add to the set.

Find and add at a HashSet are O(1) operations so this algorithm should be O(n).

It is also O(n) on memory consumption, it could be something to consider depending on the input.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.