
I am looking for an advanced version of this.

Basically, if I have a file with text:

abc
ghi
fed
jkl
abc
ghi
fed

I want the output to be (for n=3):

Duplicated Lines
abc
ghi
fed
Times = 2
  • What kind of language you are using? Commented Jun 12, 2015 at 15:16
  • I am not looking for a language specific solution, although, for the sake of your question let's say Python. Commented Jun 12, 2015 at 15:18
  • @s4san Do the lines in the output have to be in the order in which they appear in the input? In your example, would it be bad if you got abc\nfed\nghi\n in the output? If not, you could simply use the UNIX utilities sort and uniq. Commented Jun 12, 2015 at 15:24
  • @Jubobs The output needs to be ordered as I have to extract the duplicates. Commented Jun 12, 2015 at 15:26
  • What does n=3 mean? Commented Jun 12, 2015 at 15:33

2 Answers


One way is to split your text into groups based on your n, then count the occurrences of each group. For the counting you can use a hash-table-based data structure, such as a Python dictionary, which is very efficient for this kind of task.

The idea is to create a dictionary, which keeps its keys unique, then loop over the list of split groups and increment the count of an item every time you see it again.

In the end you'll have a dictionary containing the unique items as keys and their counts as values.

Some languages like Python provide good tools for this: Counter counts the elements within an iterable, and islice slices an iterable lazily, returning an iterator, which is very efficient for long iterables:

>>> from collections import Counter
>>> from itertools import islice

>>> s="""abc
... ghi
... fed
... jkl
... abc
... ghi
... fed"""
>>> sp=s.split()
>>> Counter('\n'.join(islice(sp,i,i+3)) for i in range(len(sp)))
Counter({'abc\nghi\nfed': 2, 'fed': 1, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1})
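Note that range(len(sp)) also produces the short trailing windows ('ghi\nfed' and 'fed') seen above. If you only want full n-line windows, and need the duplicates reported in the order they first appear (as the OP asked), a minimal sketch could look like this; the variable names and n=3 are illustrative assumptions, not part of the original answer:

```python
from collections import Counter

s = """abc
ghi
fed
jkl
abc
ghi
fed"""

n = 3
sp = s.split()
# Stop at len(sp) - n + 1 so every window has exactly n lines.
windows = ['\n'.join(sp[i:i + n]) for i in range(len(sp) - n + 1)]
counts = Counter(windows)

# Counter preserves insertion order (Python 3.7+), so duplicates
# are reported in the order they first appeared in the input.
for group, count in counts.items():
    if count > 1:
        print("Duplicated Lines")
        print(group)
        print("Times =", count)
```

For the sample input this prints the block `abc`, `ghi`, `fed` with `Times = 2`, matching the output format requested in the question.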

Or you can do it by hand:

>>> a=['\n'.join(sp[i:i+3]) for i in range(len(sp))]
>>> a
['abc\nghi\nfed', 'ghi\nfed\njkl', 'fed\njkl\nabc', 'jkl\nabc\nghi', 'abc\nghi\nfed', 'ghi\nfed', 'fed']
>>> d={}
>>> for i in a:
...    if i in d:
...       d[i]+=1
...    else :
...       d[i]=1
... 
>>> d
{'fed': 1, 'abc\nghi\nfed': 2, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1}

4 Comments

So, the keys here would be lines and I start the search with n = 1 and find all duplicates of n=1, then do a recursion by incrementing n?
@s4san No, you split your text based on n
Can you edit your answer, it says 'test' so I got confused :)
@s4san Oops sorry, just a typo!

So, something like this (in perl):

#!/usr/bin/perl
use strict;
use warnings;

my %seen; 
my @order; 

while ( my $line = <DATA> ) {
   chomp ( $line ); 
   push ( @order, $line ) unless $seen{$line}++; 

}

foreach my $element ( @order ) { 
    print "$element, $seen{$element}\n" if $seen{$element} > 1;
}

__DATA__
abc
ghi
fed
jkl
abc
ghi
fed

This can be turned into a one-liner:

perl -e 'while ( <> ) { push ( @order, $_ ) unless $seen{$_}++; } for (@order) {print if $seen{$_} > 1}' myfile
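Since the OP mentioned Python, the same idea (count whole lines while remembering first-seen order, then print the repeats) can be sketched there too; the hard-coded input list is an assumption standing in for reading a file:

```python
# Count each line and remember the order in which lines first appear,
# mirroring the %seen hash and @order array from the Perl script.
lines = ["abc", "ghi", "fed", "jkl", "abc", "ghi", "fed"]

seen = {}
order = []
for line in lines:
    if line not in seen:
        order.append(line)          # record first occurrence only
    seen[line] = seen.get(line, 0) + 1

for line in order:
    if seen[line] > 1:
        print(line, seen[line])     # e.g. "abc 2"
```

As in the Perl version, this counts duplicate single lines; combining it with the n-line windowing from the other answer handles groups of n lines.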

2 Comments

It's too easy using hashes. I think the OP wants a painful solution.
Well, I can always onelinerify it to make it proper "write only" perl.
