
What is the best algorithm for removing duplicate values from a list? I've tried this:

for (int i = 0; i < AuthorCounter-1; i++)
{
    for (int j = 0; j < AuthorCounter-1; j++)
    {
        if (i != j)
        {
            if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
            {
                AuthorGroupNode.Nodes[j].Remove();
                AuthorCounter--;
            }

        }
    }
}

Here, AuthorGroupNode.Nodes is a list of nodes. It gets things right to some extent, but it's not perfect. Does anyone have a better solution?

  • 4
    Best based on what criteria? Lowest CPU, lowest memory, ease of coding? Commented Jul 17, 2012 at 4:25
  • 4
    Do you need order preserved or not? There are many other Stack Overflow questions on removing duplicates; most mention the criteria. BTW, for ease of coding: list.Distinct().ToList(), but that's probably not the fastest. Commented Jul 17, 2012 at 4:26
  • Possible duplicate: stackoverflow.com/questions/1388361/… Commented Jul 17, 2012 at 4:30
    The .NET version is 3.5, and "best" means the lowest CPU and memory usage. Commented Jul 17, 2012 at 4:34
  • Lowest CPU and lowest memory are sometimes at odds with each other. Commented Jul 17, 2012 at 4:38

4 Answers

6

Your current algorithm is O(N²), which will perform quite poorly for a large list.

If space is not an issue, you could keep a HashSet of the keys you are comparing on (for these nodes, the Text value; storing the key itself rather than just its hash avoids false positives from hash collisions). Traverse the list once. If the node's key is already in the HashSet, you know it is a duplicate; skip it. If it is not, add the node to a new list and add its key to the HashSet.

This runs in O(N) and requires memory for the original list, for a copy of the list minus any duplicates, and for the HashSet. The algorithm is non-destructive.
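
For example, a minimal sketch of that approach for the nodes in the question might look like this (assuming the nodes are WinForms TreeNode objects with a Text property, as the original code suggests; the variable names are illustrative):

// Keep only the first node seen for each Text value.
var seenTexts = new HashSet<string>();
var uniqueNodes = new List<TreeNode>();

foreach (TreeNode node in AuthorGroupNode.Nodes)
{
    // HashSet<T>.Add returns false when the value is already present,
    // i.e. this Text has been seen before and the node is a duplicate.
    if (seenTexts.Add(node.Text))
    {
        uniqueNodes.Add(node);
    }
}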

If you can use Linq, simply do

var distinctList = originalList.Distinct().ToList();

UPDATE

It turns out that's pretty much exactly how Jon Skeet re-implemented Distinct:

public static IEnumerable<TSource> Distinct<TSource>( 
    this IEnumerable<TSource> source) 
{ 
    return source.Distinct(EqualityComparer<TSource>.Default); 
} 

public static IEnumerable<TSource> Distinct<TSource>( 
    this IEnumerable<TSource> source, 
    IEqualityComparer<TSource> comparer) 
{ 
    if (source == null)  
    { 
        throw new ArgumentNullException("source"); 
    } 
    return DistinctImpl(source, comparer ?? EqualityComparer<TSource>.Default); 
} 

private static IEnumerable<TSource> DistinctImpl<TSource>( 
    IEnumerable<TSource> source, 
    IEqualityComparer<TSource> comparer) 
{ 
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer); 
    foreach (TSource item in source) 
    { 
        if (seenElements.Add(item)) 
        { 
            yield return item; 
        } 
    } 
}

https://codeblog.jonskeet.uk/2010/12/30/reimplementing-linq-to-objects-part-14-distinct/


4

This works like a treat:

var xs = new []
{
    2, 3, 2, 4, 3, 3, 5, 6,
};

var ys = xs
    .ToLookup(z => z, z => z)
    .Select(x => x.First());

For your code it would look something like this:

var nodes = AuthorGroupNode.Nodes
    .ToLookup(z => z.Text, z => z)
    .Select(x => x.First())
    .ToArray();

Can't be much simpler than that. :-)

6 Comments

@EricJ. - Distinct won't work on custom objects unless they implement Equals & GetHashCode which we don't know for these node objects.
@Enigmativity What does ToLookup return? Is this going to be faster than the HashSet approach?
@nawfal - It returns a System.Linq.ILookup<T1, T2> - which is essentially an IEnumerable<IEnumerable<T2>> but keyed by T1.
@nawfal - It also uses a hashset internally (from memory) so it is quite fast. What matters though is if it makes the code clear and if it is fast enough.
@Enigmativity thanks, I get it. I was more interested in speed than code clarity, as that was my requirement.
3

Piggybacking off of Eric J.'s answer... You'll want to implement an EqualityComparer to have complete control over how distinct items are identified.

class Program
{
    static void Main(string[] args)
    {
        var list = new List<SampleClass>();
        // add some items

        var distinctItems = list.Distinct(new SampleClass());
    }
}

public class SampleClass : EqualityComparer<SampleClass>
{
    public string Text { get; set; }

    public override bool Equals(SampleClass x, SampleClass y)
    {
        if (x == null || y == null) return false;
        return x.Text == y.Text;
    }

    public override int GetHashCode(SampleClass obj)
    {
        if (obj == null) return 0;
        if (obj.Text == null) return 0;
        return obj.Text.GetHashCode();
    }
}

more info: http://msdn.microsoft.com/en-us/library/bb338049
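
Applied to the scenario in the question, a separate comparer keyed on the node's Text might look like this (a sketch, assuming WinForms TreeNode objects; NodeTextComparer is a hypothetical name):

// Hypothetical comparer: two nodes are equal when their Text values match.
public class NodeTextComparer : IEqualityComparer<TreeNode>
{
    public bool Equals(TreeNode x, TreeNode y)
    {
        if (x == null || y == null) return false;
        return x.Text == y.Text;
    }

    public int GetHashCode(TreeNode obj)
    {
        if (obj == null || obj.Text == null) return 0;
        return obj.Text.GetHashCode();
    }
}

// Usage: materialize the non-generic collection, then deduplicate by Text.
var distinctNodes = AuthorGroupNode.Nodes
    .Cast<TreeNode>()
    .Distinct(new NodeTextComparer())
    .ToList();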


2

You never check the last element of the list; your second for loop needs to be changed to this to work:

for (int j = 0; j < AuthorCounter; j++)

You are also checking each pair of nodes twice: first with i = 0 and j = 1, then again with i = 1 and j = 0. There is no need to start j at or before i. When i = 0, the inner loop removes every duplicate of that element, so you know AuthorGroupNode.Nodes[0] is unique; the next time through the outer loop, AuthorGroupNode.Nodes[1] is unique, and so on. You can therefore start j at i + 1 and drop the i != j check. Finally, when you remove a node, j still advances to the next index, which skips the node that just shifted into position j. Either decrement j after a removal, or only increment j when you don't remove a node:

for (int j = i + 1; j < AuthorCounter;)
{
    if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
    {
        AuthorGroupNode.Nodes[j].Remove();
        AuthorCounter--;
    }
    else
    {
        j++;
    }
}
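
Putting both changes together, the whole loop would look something like this (same names as your original code):

for (int i = 0; i < AuthorCounter - 1; i++)
{
    // Everything before index i is already known to be unique.
    for (int j = i + 1; j < AuthorCounter; )
    {
        if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
        {
            // Removing shifts the following nodes down, so don't advance j.
            AuthorGroupNode.Nodes[j].Remove();
            AuthorCounter--;
        }
        else
        {
            j++;
        }
    }
}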

You say that works but not perfectly, so I'm assuming you're not using a standard List and that your nodes handle their own removal from the list via the Remove() method.

If the list is sorted by the field you're comparing, you could remove the inner for loop altogether and remove any duplicates of your current element until you find a different one:

for (int i = 0; i < AuthorCounter-1;)
{
    if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[i + 1].Text)
    {
        AuthorGroupNode.Nodes[i].Remove();
        AuthorCounter--;
    }
    else
    {
        i++;
    }
}

1 Comment

oh my GOD so many nice answers here, gonna help me in my project +fav! :)
