
I am looking for an advanced version of this.

Basically, if I have a file with text:

abc
ghi
fed
jkl
abc
ghi
fed

I want the output to be (for n=3):

Duplicated Lines
abc
ghi
fed
Times = 2
  • What kind of language you are using? Commented Jun 12, 2015 at 15:16
  • I am not looking for a language specific solution, although, for the sake of your question let's say Python. Commented Jun 12, 2015 at 15:18
  • @s4san Do the lines in the output have to be in the order in which they appear in the input? In your example, would it be bad if you got abc\nfed\nghi\n in the output? If not, you could simply use the UNIX utilities sort and uniq. Commented Jun 12, 2015 at 15:24
  • @Jubobs The output needs to be ordered as I have to extract the duplicates. Commented Jun 12, 2015 at 15:26
  • What does n=3 mean? Commented Jun 12, 2015 at 15:33

2 Answers


One way is to split your text into groups based on your n, then count the occurrences of each group. For the counting you can use a hash-table-based data structure, such as a Python dictionary, which is very efficient for this kind of task.

The idea is to create a dictionary, which keeps its keys unique, then loop over the list of split groups and increment the count of an item every time you see it again.

In the end you'll have a dictionary containing the unique items as keys and their counts as values.

Some languages like Python provide good tools for this: Counter counts the elements within an iterable, and islice slices an iterable lazily, returning an iterator, which is very efficient for long iterables:

>>> from collections import Counter
>>> from itertools import islice

>>> s="""abc
... ghi
... fed
... jkl
... abc
... ghi
... fed"""
>>> sp=s.split()
>>> Counter('\n'.join(islice(sp,i,i+3)) for i in range(len(sp)))
Counter({'abc\nghi\nfed': 2, 'fed': 1, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1})
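Note that range(len(sp)) also produces the short trailing windows ('ghi\nfed' and 'fed') seen above. If you only want full n-line windows, and need the duplicates reported in the order they first appear (as the OP asked), a minimal sketch could look like this; the variable names and n=3 are illustrative assumptions, not part of the original answer:

```python
from collections import Counter

s = """abc
ghi
fed
jkl
abc
ghi
fed"""

n = 3
sp = s.split()
# Stop at len(sp) - n + 1 so every window has exactly n lines.
windows = ['\n'.join(sp[i:i + n]) for i in range(len(sp) - n + 1)]
counts = Counter(windows)

# Counter preserves insertion order (Python 3.7+), so duplicates
# are reported in the order they first appeared in the input.
for group, count in counts.items():
    if count > 1:
        print("Duplicated Lines")
        print(group)
        print("Times =", count)
```

For the sample input this prints the block `abc`, `ghi`, `fed` with `Times = 2`, matching the output format requested in the question.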

Or you can do it by hand:

>>> a=['\n'.join(sp[i:i+3]) for i in range(len(sp))]
>>> a
['abc\nghi\nfed', 'ghi\nfed\njkl', 'fed\njkl\nabc', 'jkl\nabc\nghi', 'abc\nghi\nfed', 'ghi\nfed', 'fed']
>>> d={}
>>> for i in a:
...    if i in d:
...       d[i]+=1
...    else :
...       d[i]=1
... 
>>> d
{'fed': 1, 'abc\nghi\nfed': 2, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1}

4 Comments

So, the keys here would be lines and I start the search with n = 1 and find all duplicates of n=1, then do a recursion by incrementing n?
@s4san No, you split your text based on n
Can you edit your answer, it says 'test' so I got confused :)
@s4san Oops sorry, just a typo!

So, something like this (in perl):

#!/usr/bin/perl
use strict;
use warnings;

my %seen; 
my @order; 

while ( my $line = <DATA> ) {
   chomp ( $line ); 
   push ( @order, $line ) unless $seen{$line}++; 

}

foreach my $element ( @order ) { 
    print "$element, $seen{$element}\n" if $seen{$element} > 1;
}

__DATA__
abc
ghi
fed
jkl
abc
ghi
fed

This can be turned into a one-liner:

perl -e 'while ( <> ) { push ( @order, $_ ) unless $seen{$_}++; } for (@order) {print if $seen{$_} > 1}' myfile
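Since the OP mentioned Python, the same idea (count whole lines while remembering first-seen order, then print the repeats) can be sketched there too; the hard-coded input list is an assumption standing in for reading a file:

```python
# Count each line and remember the order in which lines first appear,
# mirroring the %seen hash and @order array from the Perl script.
lines = ["abc", "ghi", "fed", "jkl", "abc", "ghi", "fed"]

seen = {}
order = []
for line in lines:
    if line not in seen:
        order.append(line)          # record first occurrence only
    seen[line] = seen.get(line, 0) + 1

for line in order:
    if seen[line] > 1:
        print(line, seen[line])     # e.g. "abc 2"
```

As in the Perl version, this counts duplicate single lines; combining it with the n-line windowing from the other answer handles groups of n lines.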

2 Comments

It's too easy using hashes. I think the OP wants a painful solution.
Well, I can always onelinerify it to make it proper "write only" perl.
