I have a text file containing almost 100,000 entries. Each entry follows the same pattern:
word1 word2 word3 word4

However, a number of these entries are duplicates, where all the words are the same. When I read the file and build an array or list of the unique entries, I use an intermediate hash set, and that works fine.
But what I would really like is for the entries to be unique on word2 only. That is, if word2 is the same and the other words differ, I would like to keep any one of those entries.
For example:
cat dog lion tiger
cat dog deer bear
mouse rat bear deer
lion tiger cat dog
cat dog deer bear

The desired output in this case would be:
cat dog lion tiger
mouse rat bear deer
lion tiger cat dog

or
cat dog deer bear
mouse rat bear deer
lion tiger cat dog

Currently, the hash set gives:

cat dog lion tiger
cat dog deer bear
mouse rat bear deer
lion tiger cat dog

Any suggestions on how to achieve this efficiently, given that the data set is large? Is regex the only option here? I am using C#.

2 Answers

Go over the data and put the second word of each line in a dictionary, so you can tell whether it has appeared before. Code example:

    string[] file = {   "cat dog lion tiger",
                    "cat dog deer bear",
                    "mouse rat bear deer",
                    "lion tiger cat dog",
                    "cat dog deer bear"};

    Dictionary<string, string> dict = new Dictionary<string, string>();

    List<string> lst = new List<string>();

    foreach (string s in file)
    {
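        string[] words = s.Split(' ');
        // Assumption: every line has at least 2 words; lines that do not are skipped.
        if (words.Length < 2)
            continue;
        // Keep the line only if its second word has not been seen before.
        if (!dict.ContainsKey(words[1]))
        {
            lst.Add(s);
            dict.Add(words[1], words[1]);
        }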
    }

    foreach (string s1 in lst)
        Console.WriteLine(s1);
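
Since only the presence of the second word matters, the same idea can also be sketched with a HashSet<string>, whose Add method returns false when the item is already in the set. This is a minimal sketch rather than a definitive implementation: the names SecondWordFilter and DistinctBySecondWord are hypothetical, words are assumed to be separated by spaces, and lines with fewer than two words are skipped.

```csharp
using System;
using System.Collections.Generic;

public static class SecondWordFilter
{
    // Yields the first line seen for each distinct second word.
    // Assumption: words are separated by spaces; shorter lines are skipped.
    public static IEnumerable<string> DistinctBySecondWord(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();
        foreach (string line in lines)
        {
            string[] words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length < 2)
                continue;
            // Add returns false if this second word was already recorded.
            if (seen.Add(words[1]))
                yield return line;
        }
    }
}
```

Because the method is lazy, it could be fed from File.ReadLines(path), which reads the file line by line instead of loading all of it at once.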

You could create an auxiliary class that stores the string and implements the IEqualityComparer<WordsRow> interface, then pass an instance of it to the HashSet constructor.

Example:

        HashSet<WordsRow> list = new HashSet<WordsRow>(new WordsRow());

        list.Add(new WordsRow("cat dog lion tiger"));
        list.Add(new WordsRow("cat dog deer bear"));
        list.Add(new WordsRow("mouse rat bear deer"));
        list.Add(new WordsRow("lion tiger cat dog"));
        list.Add(new WordsRow("cat dog deer bear"));


        foreach (WordsRow row in list)
        {
            Console.WriteLine(row.Row);
        }

The WordsRow class must contain the following:

public class WordsRow : IEqualityComparer<WordsRow>
{
    public string Row {get; set;}

    public WordsRow() { }

    public WordsRow(string row)
    {
        this.Row = row;                        
    }

    public bool Equals(WordsRow x, WordsRow y)
    {
        return x.Row.Split(' ')[1] == y.Row.Split(' ')[1];
    }

    public int GetHashCode(WordsRow obj)
    {
        return obj.Row.Split(' ')[1].GetHashCode();
    }
}
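
A possible variation keeps the comparer separate from the data it compares, so the set can hold the raw lines directly. The following is only a sketch under stated assumptions: SecondWordComparer, ComparerDemo and Deduplicate are hypothetical names, and every line is assumed to be non-null with at least two space-separated words.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: two lines are equal when their second words match.
public sealed class SecondWordComparer : IEqualityComparer<string>
{
    // Assumption: every line is non-null and has at least two space-separated words.
    private static string Key(string line) => line.Split(' ')[1];

    public bool Equals(string x, string y) => Key(x) == Key(y);

    public int GetHashCode(string line) => Key(line).GetHashCode();
}

public static class ComparerDemo
{
    public static List<string> Deduplicate(IEnumerable<string> lines)
    {
        // The set ignores any line whose second word is already present,
        // so the first line seen for each second word is the one kept.
        var set = new HashSet<string>(lines, new SecondWordComparer());
        return new List<string>(set);
    }
}
```

Like the class above, this splits the line on every comparison; for a large file it may be worth extracting the second word once per line instead.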
