I have a text file containing almost 100,000 entries. All of them follow the same pattern:
word1 word2 word3 word4
However, a number of these entries are exact duplicates, where all four words are the same. When I read the file to build an array or list of the unique entries, I use an intermediate hash set, and that works fine.
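A minimal sketch of that full-line approach, for reference. The class and method names and the file name `entries.txt` are placeholders, not the actual code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FullLineDedup
{
    // Keeps the first occurrence of each line; later identical lines are dropped.
    public static List<string> UniqueLines(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (var line in lines)
        {
            // HashSet<T>.Add returns false when the item is already present.
            if (seen.Add(line))
                result.Add(line);
        }
        return result;
    }
}

// Usage (hypothetical file name):
// var unique = FullLineDedup.UniqueLines(File.ReadLines("entries.txt"));
```

Because the whole line is the key, two entries that share word2 but differ elsewhere both survive.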
What I actually want, though, is for the entries to be unique by word2 only. That is, if several entries share the same word2 and differ in the other words, I would like to keep just one of them (any one will do).
For example:
cat dog lion tiger
cat dog deer bear
mouse rat bear deer
lion tiger cat dog
cat dog deer bear
The desired output in this case would be:
cat dog lion tiger
mouse rat bear deer
lion tiger cat dog
or
cat dog deer bear
mouse rat bear deer
lion tiger cat dog
Currently, the hash set gives:
cat dog lion tiger
cat dog deer bear
mouse rat bear deer
lion tiger cat dog
Any suggestions on how to achieve this efficiently, given that the data set is large? Is using a regex the only option here? I am using C#.
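One non-regex direction, as a rough sketch and not a confirmed solution: key the hash set on word2 alone instead of on the whole line. This assumes the words are separated by spaces or tabs and that lines with fewer than two words can be skipped; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class SecondWordDedup
{
    // Keeps the first line seen for each distinct second word.
    public static List<string> UniqueBySecondWord(IEnumerable<string> lines)
    {
        var seenKeys = new HashSet<string>();
        var result = new List<string>();
        foreach (var line in lines)
        {
            // Assumption: fields are separated by spaces or tabs.
            var parts = line.Split(new[] { ' ', '\t' },
                                   StringSplitOptions.RemoveEmptyEntries);

            // Assumption: lines without a second word are ignored.
            if (parts.Length < 2)
                continue;

            // Add returns false if this word2 was already seen.
            if (seenKeys.Add(parts[1]))
                result.Add(line);
        }
        return result;
    }
}
```

On the sample data above this would keep the first "dog" entry, so it would produce the first of the two acceptable outputs. It makes a single pass over the input with one hash lookup per line.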