This can be implemented in multiple ways. As already mentioned previously, a HashSet is the right way to go. As you state that you need an "optimal" solution I took the time to optimize and benchmark several implementations.
We start with the pre-Java 8 solution by Pshemo:
public static String distinct0(String yourText) {
Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));
Iterator<String> it = elements.iterator();
StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
while (it.hasNext()) {
sb.append(';').append(it.next());
}
return sb.toString();
}
This implementation uses String.split() which creates an array of Strings. This array is then converted to a List, which is added to a LinkedHashSet. The LinkedHashSet preserves the order in which elements have been added by maintaining an additional linked list. Next, an iterator is used to enumerate the elements from the set, which are then concatenated with a StringBuilder.
We can slightly optimize this method by realizing that we can already build the result while iterating over the individual elements in the input string. It is thus not necessary to store information about the order in which distinct strings have been found. This eliminates the need for a LinkedHashSet (and the Iterator):
public static String distinct1(String elements){
StringBuilder builder = new StringBuilder();
Set<String> set = new HashSet<String>();
for (String value : elements.split(";")) {
if (set.add(value)) {
builder.append(set.size() != 1 ? ";" : "").append(value);
}
}
return builder.toString();
}
Next, we can get rid of String.split() and thus avoid creating an intermediate array containing all elements from the input string:
public static String distinct2(String elements){
char[] array = elements.toCharArray();
StringBuilder builder = new StringBuilder();
Set<String> set = new HashSet<String>();
int last = 0;
for (int index=0; index<array.length; index++) {
if (array[index] == ';') {
String value = new String(array, last, (index-last));
if (set.add(value)) {
builder.append(last != 0 ? ";" : "").append(value);
}
last = index + 1;
}
}
return builder.toString();
}
Finally, we can get rid of unneccessary memory allocations by not constructing String objects for the individual elements, as the constructor String(array, offset, length) (which is also used by String.split()) will call Arrays.copyOfRange(...) to allocate a new char[]. To avoid this overhead, we can implement a wrapper around the input char[] which implements hashCode() and equals() for a given range. This can be used to detect wether a certain string is already contained in the result. Additionally, this method allows us to use StringBuilder.append(array, offset, length), which simply reads data from the provided array:
public static String distinct3(String elements){
// Prepare
final char[] array = elements.toCharArray();
class Entry {
final int from;
final int to;
final int hash;
public Entry(int from, int to) {
this.from = from;
this.to = to;
int hash = 0;
for (int i = from; i < to; i++) {
hash = 31 * hash + array[i];
}
this.hash = hash;
}
@Override
public boolean equals(Object object) {
Entry other = (Entry)object;
if (other.to - other.from != this.to - this.from) {
return false;
}
for (int i=0; i < this.to - this.from; i++) {
if (array[this.from + i] != array[other.from + i]) {
return false;
}
}
return true;
}
@Override
public int hashCode() {
return hash;
}
}
// Remove duplicates
StringBuilder builder = new StringBuilder();
Set<Entry> set = new HashSet<Entry>();
int last = 0;
for (int index=0; index<array.length; index++) {
if (array[index] == ';') {
Entry value = new Entry(last, index);
if (set.add(value)) {
builder.append(last != 0 ? ";" : "").append(array, last, index-last);
}
last = index + 1;
}
}
return builder.toString();
}
I compared these implementations with the following code:
public static void main(String[] args) {
int REPETITIONS = 10000000;
String VALUE = ";aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
"aaa bbb;;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
"aaa bbb;aaa bbb;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb";
long time = System.currentTimeMillis();
String result = null;
for (int i = 0; i < REPETITIONS; i++) {
result = distinct0(VALUE);
}
System.out.println(result + " - " + (double) (System.currentTimeMillis() - time) /
(double) REPETITIONS + " [ms] per call");
}
Which gave me the following results when running it on my machine with JDK 1.7.0_51:
- distinct0: 0.0021881 [ms] per call
- distinct1: 0.0018433 [ms] per call
- distinct2: 0.0016780 [ms] per call
- distinct3: 0.0012777 [ms] per call
While being undoubtedly much more complex and much less readable than the original version, the optimized implementation is almost twice as fast. If a simple and readable solution is needed, I would choose either the first or the second implementation, if a fast one is needed, I would choose the last implementation.