Algorithm to efficiently identify duplicate in an array of strings in C++

Question

I have a list/array of IP addresses stored as strings. I need to detect whether the array contains any duplicates and log an error if it does. The array has about 20 elements. What is an efficient way to identify a duplicate?

Use a data structure with overloaded == and < operators instead, and store the IPs in a set (preferably hashed).
– Columbo, Commented Dec 7, 2014 at 13:15
Checking every element against every other element is only about 200 comparisons. This is a pretty small problem; why does it need to be efficient?
– Alan Stokes, Commented Dec 7, 2014 at 13:19
@AlanStokes - you are right. I am trying to write generic code so that if the number of elements to compare grows tomorrow, I would not have to rewrite it.
– Shrikanth N, Commented Dec 7, 2014 at 13:57
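
Columbo's suggestion could look roughly like the sketch below. It assumes IPv4 dotted-decimal input and C++17 (for std::optional); the names Ipv4, Ipv4Hash, parse_ipv4 and has_duplicate_ip are made up for illustration and do not come from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical value type for an IPv4 address (assumption: IPv4 only).
struct Ipv4 {
    std::uint32_t value = 0;
    friend bool operator==(const Ipv4& a, const Ipv4& b) { return a.value == b.value; }
    friend bool operator<(const Ipv4& a, const Ipv4& b) { return a.value < b.value; }
};

// Hash functor so that Ipv4 can be stored in std::unordered_set.
struct Ipv4Hash {
    std::size_t operator()(const Ipv4& ip) const noexcept {
        return std::hash<std::uint32_t>{}(ip.value);
    }
};

// Parses dotted-decimal text such as "10.0.0.1".
// Returns std::nullopt when the text is malformed or an octet is out of range.
std::optional<Ipv4> parse_ipv4(const std::string& text) {
    std::istringstream in(text);
    std::uint32_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        int octet = -1;
        char dot = '\0';
        if (!(in >> octet) || octet < 0 || octet > 255) return std::nullopt;
        if (i < 3 && (!(in >> dot) || dot != '.')) return std::nullopt;
        packed = (packed << 8) | static_cast<std::uint32_t>(octet);
    }
    // Reject trailing characters after the fourth octet.
    if (in.peek() != std::char_traits<char>::eof()) return std::nullopt;
    return Ipv4{packed};
}

// Returns true if two entries denote the same address.
// Entries that fail to parse are skipped in this sketch.
bool has_duplicate_ip(const std::vector<std::string>& addresses) {
    std::unordered_set<Ipv4, Ipv4Hash> seen;
    for (const auto& text : addresses) {
        const auto ip = parse_ipv4(text);
        if (ip && !seen.insert(*ip).second) return true;
    }
    return false;
}
```

Comparing parsed values rather than raw strings means that "10.0.0.1" and "10.0.0.01" count as the same address, which a plain string comparison would miss. The operator< is only needed if you prefer std::set over the hashed container.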

elcuco · Accepted Answer · 2014-12-07 13:21:04Z

2

1. Sort the original array.
2. Iterate over the sorted array and count the distinct values.
3. Create a new array whose size is the count from step 2.
4. Copy values from the original array to the new one, skipping duplicates.

The same idea as pseudocode in bash:

[user@linux ~]$ cat 1.txt
1
2
3
66
1
1
66
3
7
7
7
7
26

[user@linux ~]$ cat 1.txt | sort | uniq

1
2
26
3
66
7
[user@linux ~]$ cat 1.txt | sort | uniq | wc -l
       7
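
For the original question (detecting duplicates rather than removing them), the sort-then-scan idea could be sketched in C++ roughly as follows. The function name find_duplicates_sorted is made up for illustration, and the vector is taken by value on the assumption that the caller's ordering should stay untouched.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns each value that occurs more than once, in sorted order.
// The argument is a copy, so the caller's vector is not reordered.
std::vector<std::string> find_duplicates_sorted(std::vector<std::string> items) {
    std::sort(items.begin(), items.end());

    std::vector<std::string> duplicates;
    auto it = std::adjacent_find(items.begin(), items.end());
    while (it != items.end()) {
        duplicates.push_back(*it);
        // Skip the rest of this run of equal values before searching again.
        const std::string& value = duplicates.back();
        it = std::find_if(it, items.end(),
                          [&value](const std::string& s) { return s != value; });
        it = std::adjacent_find(it, items.end());
    }
    return duplicates;
}
```

The sort costs O(n log n) comparisons and the scan is linear, so the sort dominates. Note that the strings sort lexicographically, just like the `sort` output above, where 26 comes before 3.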


6 Comments

elcuco Over a year ago

No, just too lazy to write C++ code, and instead giving the guy a general idea... :)

E_net4 Over a year ago

I'm afraid that won't cut it. One might as well hand over a C++ book as an answer, huh?

elcuco Over a year ago

On third thought, you are probably right. The complexity of my solution is fairly high, roughly O(n log n) for the sort plus two O(n) passes, and I think that the hash map solution given above is much better suited. Leaving this for the whole internet to know how lame this answer is.

E_net4 Over a year ago

Perhaps you're better off removing it before someone downvotes it. ;)

elcuco Over a year ago

I am committed to my stupidity. No. Also, this is a good example of what not to do.


6502 · Accepted Answer · 2014-12-07 13:23:07Z

2

You can use a map<string, int> to mark used addresses and where an address appeared first:

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Reports every address that has already appeared earlier in the list.
void check_dups(const std::vector<std::string>& addresses) {
    // Maps each address to the index of its first occurrence.
    std::map<std::string, int> seen;
    for (int i=0,n=addresses.size(); i<n; i++) {
        std::map<std::string, int>::iterator it = seen.find(addresses[i]);
        if (it == seen.end()) {
            // Never used before, mark the position
            seen[addresses[i]] = i;
        } else {
            // Duplicated value, emit a warning
            std::cout << "Duplicate address at index " << i <<
                         " (present already at index " << it->second << ")\n";
        }
    }
}


6 Comments

Kris Over a year ago

Nice, but std::map is typically implemented as a red-black tree, so inserting into it has complexity O(log n). This means that overall we would get O(n log n), no better than simply sorting the array and looking for adjacent equal elements.

E_net4 Over a year ago

@Krystian If that is your main concern, just use std::unordered_map.

Kris Over a year ago

@E_net4: but you need C++11 support to use std::unordered_map

E_net4 Over a year ago

@Krystian C++11 is a mature standard by now. It is very likely that the OP's compiler supports it.

6502 Over a year ago

@Krystian: as an error message are you proposing "There are some duplicates somewhere, good luck finding them..."?
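
The std::unordered_map variant discussed in these comments might look like the sketch below (C++11 or later). It keeps the "where did it first appear" information from the answer but returns the positions instead of printing them; the name find_dup_positions is made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns (index of duplicate, index of first occurrence) pairs,
// in the order the duplicates are encountered.
std::vector<std::pair<std::size_t, std::size_t>>
find_dup_positions(const std::vector<std::string>& addresses) {
    std::unordered_map<std::string, std::size_t> first_seen;
    std::vector<std::pair<std::size_t, std::size_t>> dups;
    for (std::size_t i = 0; i < addresses.size(); ++i) {
        // emplace inserts only when the key is new;
        // result.second is false when the address was already present.
        const auto result = first_seen.emplace(addresses[i], i);
        if (!result.second) {
            dups.emplace_back(i, result.first->second);
        }
    }
    return dups;
}
```

Lookups and insertions in a hash map are O(1) on average, so the whole pass is expected O(n). For about 20 elements the difference from std::map is unlikely to be measurable.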


Richard Hodges · Accepted Answer · 2014-12-07 15:31:01Z

Here are three reasonably efficient ways, off the top of my head:

#include <iostream>
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>
#include <set>

// returns a sorted, de-duplicated copy
std::vector<std::string> de_duplicated(std::vector<std::string> vec)
{
    std::set<std::string> interim { vec.begin(), vec.end() };
    vec.assign(interim.begin(), interim.end());
    return vec;
}

// sorts and de-duplicates in place
void de_duplicate(std::vector<std::string>& vec)
{
    std::sort(std::begin(vec), std::end(vec));

    auto current = std::begin(vec);

    do {
        auto last = std::end(vec);
        current = std::adjacent_find(current, last);
        if (current != last) {
            auto last_same = std::find_if_not(std::next(current),
                                              last,
                                              [&current](const std::string& s) {
                                                  return s == *current;
                                              });
            current = vec.erase(std::next(current), last_same);
        }
    } while(current != std::end(vec));

}

// returns a de-duplicated copy, preserving order
std::vector<std::string> de_duplicated_stable(const std::vector<std::string>& vec)
{
    std::set<std::string> index;
    std::vector<std::string> result;
    for (const auto& s : vec) {
        if (index.insert(s).second) {
            result.push_back(s);
        }
    }

    return result;
}

using namespace std;

int main() {

    std::vector<std::string> addresses { "d", "a", "c", "d", "c", "a", "c", "d" };

    cout << "before" << endl;
    std::copy(begin(addresses), end(addresses), ostream_iterator<string>(cout, ", "));
    cout << endl;

    auto deduplicated = de_duplicated(addresses);
    cout << endl << "sorted, de-duplicated copy" << endl;
    std::copy(begin(deduplicated), end(deduplicated), ostream_iterator<string>(cout, ", "));
    cout << endl;

    deduplicated = de_duplicated_stable(addresses);
    cout << endl << "de-duplicated copy, order preserved" << endl;
    std::copy(begin(deduplicated), end(deduplicated), ostream_iterator<string>(cout, ", "));
    cout << endl;

    de_duplicate(addresses);
    cout << endl << "sorted, de-duplicated in-place" << endl;
    std::copy(begin(addresses), end(addresses), ostream_iterator<string>(cout, ", "));
    cout << endl;

    return 0;
}
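
As a side note, the in-place variant can probably be written more compactly with the standard erase/unique idiom. This is a sketch of an alternative, not the code from the answer, and the name de_duplicate_unique is made up for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts and de-duplicates in place using the erase/unique idiom.
void de_duplicate_unique(std::vector<std::string>& vec) {
    std::sort(vec.begin(), vec.end());
    // std::unique moves the first element of each run of equal values to the
    // front and returns the new logical end; erase then drops the leftover tail.
    vec.erase(std::unique(vec.begin(), vec.end()), vec.end());
}
```

std::unique only removes adjacent equal elements, which is why the sort has to come first.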
