Linux shell script to count occurrences of a character sequence in a text file?

Question

I have a large text file (over 70 MB) and need to count the number of times a character sequence occurs in it. I can find plenty of scripts to do this, but NONE OF THEM take into account that a sequence can start and finish on different lines. For the sake of efficiency (I actually have far more than one file to process), I cannot preprocess the files to remove newlines.

Example: If I am searching for "thisIsTheSequence", the following file would have 3 matches:

asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

Thanks for the help.

You can preprocess the files; just do it in a pipeline before your counting script: strip-newlines | count-matches.
– Roger Pate, Commented Oct 30, 2009 at 22:04

bdonlan · Accepted Answer · 2009-10-30 22:03:07Z

7

One option:

echo $((`tr -d "\n" < file | sed 's/thisIsTheSequence/\n/g' | wc -l` - 1))

There are probably more efficient methods using utilities outside the core of shell - particularly if you can fit the file in memory.
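A variant of the same pipeline, as a minimal sketch: it assumes a grep that supports `-o` (GNU and BSD grep do; POSIX does not require it), and `count_sequence` is a name made up here for illustration. Letting grep print one match per line avoids the `- 1` adjustment, whose correctness depends on whether the local sed appends a final newline to the newline-free stream.

```shell
#!/bin/bash
# Count non-overlapping occurrences of a literal sequence, ignoring newlines.
# Assumption: grep supports -o (print each match on its own line).
count_sequence() {
  local pattern=$1 file=$2
  # tr joins all lines; -F treats the pattern as a fixed string, not a regex;
  # wc -l then counts one line per match.
  tr -d '\n' < "$file" | grep -oF -- "$pattern" | wc -l
}
```

For several files, something like `for f in *.txt; do echo "$f: $(count_sequence thisIsTheSequence "$f")"; done` should work. Note that grep still has to hold the joined line in memory.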


ghostdog74 · Accepted Answer · 2009-10-31 13:50:39Z

2

A single awk script will do, since you will be processing a huge file. Chaining multiple pipes can slow things down.

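#!/bin/bash
awk 'BEGIN{
 search="thisIsTheSequence"
 len=length(search)
 total=0
}
# Count literal matches in the buffer s, then keep only a tail short
# enough that it can still be the start of a match in the next batch.
function drain(   i){
  while((i=index(s,search))>0){
    total++
    s=substr(s,i+len)
  }
  if(length(s)>=len) s=substr(s,length(s)-len+2)
}
{ s=s $0 }          # join lines, dropping the newline
NR%10==0{ drain() } # process in batches of 10 lines
END{
 drain()
 print "total count: "total
}' file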

output

$ more file
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasdaasdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

$ ./shell.sh
total count: 9


Artelius · Accepted Answer · 2009-10-30 22:12:16Z

0

Is there ever going to be more than one newline in your sequence?

If not, one solution would be to split your sequence in half and search for each half (e.g. search for "thisIsTh" and also for "eSequence"). Then go back to the occurrences you find and take a "closer look": strip out the newlines in that area and check for a full match.

Basically this is a kind of fast "filtering" of the data to find something interesting.
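The filtering idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: a match is broken by at most one newline, the sequence is treated as a literal string, and the function name `count_with_halves` and all variable names are invented here. For a sequence that can overlap itself, it may count an in-line match and a boundary match that share characters.

```shell
#!/bin/bash
# Sketch of the two-halves filter; assumes a match spans at most one newline.
count_with_halves() {
  awk -v search="$1" '
  # Count literal, non-overlapping occurrences of pat in str.
  function count(str, pat,    n, i) {
    n = 0
    while ((i = index(str, pat)) > 0) { n++; str = substr(str, i + length(pat)) }
    return n
  }
  BEGIN {
    len = length(search); mid = int(len / 2)
    first = substr(search, 1, mid)      # e.g. "thisIsTh"
    second = substr(search, mid + 1)    # e.g. "eSequence"
  }
  {
    total += count($0, search)          # matches wholly inside this line
    head = substr($0, 1, len - 1)
    # Cheap filter: a match broken by the line break leaves at least one
    # intact half on one side of it. Only then join and take a closer look.
    if (NR > 1 && (index(tail, first) || index(head, second)))
      total += count(tail head, search)
    # Remember the end of this line for the next boundary check.
    tail = (length($0) >= len) ? substr($0, length($0) - len + 2) : $0
  }
  END { print total + 0 }' "$2"
}
```

Called as `count_with_halves thisIsTheSequence file`, it prints a single number. The join step only runs on line breaks where one of the halves shows up, which is where the speed-up would come from if the halves are rare.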


2 Comments

jdc0589 Over a year ago

No, the sequence is 9 characters long. Lines with fewer than 9 characters are irrelevant to the search.

Artelius Over a year ago

In that case, you can search for the two halves of the sequence. If it's broken across two lines then you'll find at least ONE of the halves. This is basically a filtering technique that works well (fast) if the halves themselves are fairly rare. But it's a bit of effort to implement.

Preet Sangha · Accepted Answer · 2009-10-30 22:05:04Z

-1

Use something like:

head -n LL filename | tail -n YY | grep text | wc -l

where LL is the line number of the last line of the sequence and YY is the number of lines the sequence spans (i.e. LL minus the first line number, plus one).

