Get unique strings based on a substring in C#

Question

I have a text file that contains almost 100,000 entries. All of them follow the same pattern:
word1 word2 word3 word4

However, a number of these entries are duplicates in which all the words are the same. To read the file and build an array or list of the unique entries, I am using an intermediate hash set, and that works fine.
What I would actually like to achieve is entries that are unique by word2 only. That is, if word2 is the same and the other words differ, I would like to keep just one of those entries.
For example:
cat dog lion tiger
cat dog deer bear
mouse rat bear deer
lion tiger cat dog
cat dog deer bear

The desired output in this case would be:
cat dog lion tiger
mouse rat bear deer
lion tiger cat dog

or
cat dog deer bear
mouse rat bear deer
lion tiger cat dog

Currently, the hash set gives:

cat dog lion tiger
cat dog deer bear
mouse rat bear deer
lion tiger cat dog

Any suggestions on how this can be achieved efficiently, given that the data set is large? Is using a regex the only option here? I am using C#.

asafrob · Accepted Answer · 2013-07-21 06:02:18Z


Go over the data and put the second word of each line in a dictionary, so you know whether it has appeared before. Code example:

    string[] file = {   "cat dog lion tiger",
                    "cat dog deer bear",
                    "mouse rat bear deer",
                    "lion tiger cat dog",
                    "cat dog deer bear"};

    Dictionary<string, string> dict = new Dictionary<string, string>();

    List<string> lst = new List<string>();

    foreach (string s in file)
    {
        string[] words = s.Split(' ');
        // Assumption: each line has at least 2 words; shorter lines are skipped here.
        if (words.Length < 2)
            continue;
        if (!dict.ContainsKey(words[1]))
        {
            lst.Add(s);
            dict.Add(words[1], words[1]);
        }
    }

    foreach (string s1 in lst)
        Console.WriteLine(s1);
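
A possible variation on the loop above: the dictionary's values are never read, so a `HashSet<string>` could hold the second words instead, and its `Add` method returns `false` when the word is already present. The sketch below is only an illustration under a few assumptions: words are separated by spaces, lines with fewer than two words are skipped, and the names `SecondWordFilter` and `FirstPerSecondWord` are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class SecondWordFilter
{
    // Hypothetical helper (not from the answer): yields the first line seen
    // for each distinct second word, preserving the input order.
    public static IEnumerable<string> FirstPerSecondWord(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (string line in lines)
        {
            string[] words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

            // Assumption: lines with fewer than two words are skipped.
            if (words.Length < 2)
                continue;

            // Add returns false when the second word was already recorded.
            if (seen.Add(words[1]))
                yield return line;
        }
    }
}
```

Because the method is an iterator, it could be combined with `File.ReadLines(path)` so that the file is streamed line by line rather than loaded as a whole; only the distinct second words stay in memory.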



Henry Herwin Garcia · Accepted Answer · 2013-07-21 07:48:44Z

You could create an auxiliary class that stores the string and implements the IEqualityComparer<T> interface, then pass an instance of it to the HashSet constructor.

Example:

        HashSet<WordsRow> list = new HashSet<WordsRow>(new WordsRow());

        list.Add(new WordsRow("cat dog lion tiger"));
        list.Add(new WordsRow("cat dog deer bear"));
        list.Add(new WordsRow("mouse rat bear deer"));
        list.Add(new WordsRow("lion tiger cat dog"));
        list.Add(new WordsRow("cat dog deer bear"));


        foreach (WordsRow row in list)
        {
            Console.WriteLine(row.Row);
        }

The WordsRow class must contain the following:

    public class WordsRow : IEqualityComparer<WordsRow>
    {
        public string Row { get; set; }

        public WordsRow() { }

        public WordsRow(string row)
        {
            this.Row = row;
        }

        public bool Equals(WordsRow x, WordsRow y)
        {
            return x.Row.Split(' ')[1] == y.Row.Split(' ')[1];
        }

        public int GetHashCode(WordsRow obj)
        {
            return obj.Row.Split(' ')[1].GetHashCode();
        }
    }
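
As a possible alternative to wrapping every line in a row object, the same comparison could live in a comparer that works directly on strings. This is a minimal sketch rather than the answer's own code: the class name `SecondWordComparer` is hypothetical, and it assumes that every non-null line contains at least two space-separated words.

```csharp
using System.Collections.Generic;

// Hypothetical comparer (not from the answer): two lines are treated as
// equal when their second space-separated words match.
public sealed class SecondWordComparer : IEqualityComparer<string>
{
    // Assumption: the line has at least two words; otherwise the indexer
    // throws IndexOutOfRangeException, as in the original WordsRow class.
    private static string SecondWord(string line)
    {
        return line.Split(' ')[1];
    }

    public bool Equals(string x, string y)
    {
        // Two nulls are equal; a null never equals a non-null line.
        if (x == null || y == null)
            return x == null && y == null;

        return SecondWord(x) == SecondWord(y);
    }

    public int GetHashCode(string line)
    {
        // Equal lines must produce equal hash codes, so hash the second word.
        return line == null ? 0 : SecondWord(line).GetHashCode();
    }
}
```

With this in place, `new HashSet<string>(lines, new SecondWordComparer())` keeps one line per second word without an extra wrapper type, and the same comparer instance could also be passed to LINQ's `Distinct`.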
