
What is the best algorithm for removing duplicate values from a list? I've tried this:

for (int i = 0; i < AuthorCounter-1; i++)
{
    for (int j = 0; j < AuthorCounter-1; j++)
    {
        if (i != j)
        {
            if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
            {
                AuthorGroupNode.Nodes[j].Remove();
                AuthorCounter--;
            }

        }
    }
}

Here, AuthorGroupNode.Nodes is a list of nodes. It gets things right to some extent, but it's not perfect. Does anyone have a better solution?

  • 4
    Best based on what criteria? Lowest CPU, lowest memory, ease of coding? Commented Jul 17, 2012 at 4:25
  • 4
    Do you need order preserved or not? There are many other Stack Overflow questions on removing duplicates; most mention the criteria. BTW, for ease of coding: list.Distinct().ToList(), but that's probably not the fastest. Commented Jul 17, 2012 at 4:26
  • Possible duplicate: stackoverflow.com/questions/1388361/… Commented Jul 17, 2012 at 4:30
    The .NET version is 3.5, and "best" means the lowest CPU and memory usage. Commented Jul 17, 2012 at 4:34
  • Lowest CPU and lowest memory are sometimes at odds with each other. Commented Jul 17, 2012 at 4:38

4 Answers

6

Your current algorithm is O(N²), which will perform quite poorly for a large list.

If space is not an issue, you could keep a HashSet of the keys you are comparing on (for these nodes, the Text value; storing the key itself rather than just its hash avoids false positives from hash collisions). Traverse the list once. If the node's key is already in the HashSet, you know it is a duplicate; skip it. If it is not, add the node to a new list and add its key to the HashSet.

This runs in O(N) and requires memory for the original list, for a copy of the list minus any duplicates, and for the HashSet. The algorithm is non-destructive.
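
For example, a minimal sketch of that approach for the nodes in the question might look like this (assuming the nodes are WinForms TreeNode objects with a Text property, as the original code suggests; the variable names are illustrative):

// Keep only the first node seen for each Text value.
var seenTexts = new HashSet<string>();
var uniqueNodes = new List<TreeNode>();

foreach (TreeNode node in AuthorGroupNode.Nodes)
{
    // HashSet<T>.Add returns false when the value is already present,
    // i.e. this Text has been seen before and the node is a duplicate.
    if (seenTexts.Add(node.Text))
    {
        uniqueNodes.Add(node);
    }
}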

If you can use Linq, simply do

var distinctList = originalList.Distinct().ToList();

UPDATE

It turns out that's pretty much exactly how Jon Skeet re-implemented Distinct:

public static IEnumerable<TSource> Distinct<TSource>( 
    this IEnumerable<TSource> source) 
{ 
    return source.Distinct(EqualityComparer<TSource>.Default); 
} 

public static IEnumerable<TSource> Distinct<TSource>( 
    this IEnumerable<TSource> source, 
    IEqualityComparer<TSource> comparer) 
{ 
    if (source == null)  
    { 
        throw new ArgumentNullException("source"); 
    } 
    return DistinctImpl(source, comparer ?? EqualityComparer<TSource>.Default); 
} 

private static IEnumerable<TSource> DistinctImpl<TSource>( 
    IEnumerable<TSource> source, 
    IEqualityComparer<TSource> comparer) 
{ 
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer); 
    foreach (TSource item in source) 
    { 
        if (seenElements.Add(item)) 
        { 
            yield return item; 
        } 
    } 
}

https://codeblog.jonskeet.uk/2010/12/30/reimplementing-linq-to-objects-part-14-distinct/


4

This works like a treat:

var xs = new []
{
    2, 3, 2, 4, 3, 3, 5, 6,
};

var ys = xs
    .ToLookup(z => z, z => z)
    .Select(x => x.First());

For your code it would look something like this:

var nodes = AuthorGroupNode.Nodes
    .ToLookup(z => z.Text, z => z)
    .Select(x => x.First())
    .ToArray();

Can't be much simpler than that. :-)

6 Comments

@EricJ. - Distinct won't work on custom objects unless they implement Equals & GetHashCode which we don't know for these node objects.
@Enigmativity What does ToLookup return? Is this going to be faster than the HashSet approach?
@nawfal - It returns a System.Linq.ILookup<T1, T2> - which is essentially an IEnumerable<IEnumerable<T2>> but keyed by T1.
@nawfal - It also uses a hashset internally (from memory) so it is quite fast. What matters though is if it makes the code clear and if it is fast enough.
@Enigmativity thanks, I get it. I was more interested in speed than code clarity, as that was my requirement.
3

Piggybacking off of Eric J.'s answer... You'll want to implement an EqualityComparer to have complete control over how distinct items are identified.

class Program
{
    static void Main(string[] args)
    {
        var list = new List<SampleClass>();
        // add some items

        var distinctItems = list.Distinct(new SampleClass());
    }
}

public class SampleClass : EqualityComparer<SampleClass>
{
    public string Text { get; set; }

    public override bool Equals(SampleClass x, SampleClass y)
    {
        if (x == null || y == null) return false;
        return x.Text == y.Text;
    }

    public override int GetHashCode(SampleClass obj)
    {
        if (obj == null) return 0;
        if (obj.Text == null) return 0;
        return obj.Text.GetHashCode();
    }
}

more info: http://msdn.microsoft.com/en-us/library/bb338049
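
Applied to the scenario in the question, a separate comparer keyed on the node's Text might look like this (a sketch, assuming WinForms TreeNode objects; NodeTextComparer is a hypothetical name):

// Hypothetical comparer: two nodes are equal when their Text values match.
public class NodeTextComparer : IEqualityComparer<TreeNode>
{
    public bool Equals(TreeNode x, TreeNode y)
    {
        if (x == null || y == null) return false;
        return x.Text == y.Text;
    }

    public int GetHashCode(TreeNode obj)
    {
        if (obj == null || obj.Text == null) return 0;
        return obj.Text.GetHashCode();
    }
}

// Usage: materialize the non-generic collection, then deduplicate by Text.
var distinctNodes = AuthorGroupNode.Nodes
    .Cast<TreeNode>()
    .Distinct(new NodeTextComparer())
    .ToList();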


2

You never check the last element of the list; your second for loop needs to be changed to this to work:

for (int j = 0; j < AuthorCounter; j++)

You are also checking each pair of nodes twice: first with i = 0 and j = 1, then again with i = 1 and j = 0. There is no need to start j at or before i. When i = 0, the inner loop removes every duplicate of that element, so you know AuthorGroupNode.Nodes[0] is unique; the next time through the outer loop, AuthorGroupNode.Nodes[1] is unique, and so on. You can therefore start j at i + 1 and drop the i != j check. Finally, when you remove a node, j still advances to the next index, which skips the node that just shifted into position j. Either decrement j after a removal, or only increment j when you don't remove a node:

for (int j = i + 1; j < AuthorCounter;)
{
    if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
    {
        AuthorGroupNode.Nodes[j].Remove();
        AuthorCounter--;
    }
    else
    {
        j++;
    }
}
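
Putting both changes together, the whole loop would look something like this (same names as your original code):

for (int i = 0; i < AuthorCounter - 1; i++)
{
    // Everything before index i is already known to be unique.
    for (int j = i + 1; j < AuthorCounter; )
    {
        if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
        {
            // Removing shifts the following nodes down, so don't advance j.
            AuthorGroupNode.Nodes[j].Remove();
            AuthorCounter--;
        }
        else
        {
            j++;
        }
    }
}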

You say that works but not perfectly, so I'm assuming you're not using a standard List and that your nodes handle their own removal from the list via the Remove() method.

If the list is sorted by the field you're comparing, you could remove the inner for loop altogether and remove any duplicates of your current element until you find a different one:

for (int i = 0; i < AuthorCounter-1;)
{
    if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[i + 1].Text)
    {
        AuthorGroupNode.Nodes[i].Remove();
        AuthorCounter--;
    }
    else
    {
        i++;
    }
}

1 Comment

oh my GOD so many nice answers here, gonna help me in my project +fav! :)
