How to count consecutive (repeated) character in string in bash?

Question

I am wondering if there is a simple bash or AWK one-liner to get the number of repeated characters, per repeat.

For example considering this string:

AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA

Is it possible to get the number of Ns in the first repeat, the number of Ns in the second repeat, etc.?

Thanks!

Expected result: the length of each repeat, one per line.

What efforts did you make? Post them even if it did not solve your problem
– Inian, Commented Aug 31, 2017 at 10:55
At a minimum at least add your expected output - all on one line, spaces or commas between, on separate lines, etc...
– Ed Morton, Commented Aug 31, 2017 at 12:51
I was satisfied with the first answer from anubhava, see comments under his answer. I added expected results, as you asked for.
– benn, Commented Aug 31, 2017 at 12:57
We're not looking for a description of the expected results (though it's fine to have that too), we're looking for the actual expected output given the input you posted. This site isn't just for you to get an answer to your question, it's a repository for others to look up their questions to find answers so it's important that a question be a complete one (see How to Ask) to help everyone else in future.
– Ed Morton, Commented Aug 31, 2017 at 13:00

anubhava · Accepted Answer · 2017-08-31 11:48:35Z

7

You can use awk to split fields on each run of characters that are not N, then print each non-empty field and its length:

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print $i, length($i)}' <<< "$s"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

Another option is to use grep + awk:

grep -Eo 'N+' <<< "$s" | awk '{print $1, length($1)}'

And here is a pure BASH solution:

shopt -s extglob
while read -r line; do
    [[ -n $line ]] && echo "$line ${#line}"
done <<< "${s//+([!N])/$'\n'}"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

BASH solution details:

It uses the extended glob pattern +([!N]) to match one or more non-N characters and replaces each match with a line break: ${s//+([!N])/$'\n'}
Using a while loop, we iterate through each substring of N characters.
Inside the loop, we print each string and the length of that string.
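
Since the question asks only for the lengths, the steps above can be wrapped in a small function. This is just a sketch: run_lengths is a made-up helper name, and it assumes the character passed in is a single literal character with no special meaning inside a bracket expression.

```bash
shopt -s extglob          # enables the +(...) pattern used below

# run_lengths CHAR STRING: print the length of each run of CHAR, one per line.
# (run_lengths is a hypothetical helper name, not part of the answer.)
run_lengths() {
    local char=$1 str=$2 run nl=$'\n'
    # Replace every stretch of characters other than CHAR with a newline,
    # which leaves one run per line (plus empty lines at the edges).
    while read -r run; do
        [[ -n $run ]] && echo "${#run}"
    done <<< "${str//+([!$char])/$nl}"
    return 0              # the last line read is empty, so reset the status
}

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
run_lengths N "$s"
```

For the sample string this should print 5, 8 and 7 on separate lines.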

edited Aug 31, 2017 at 11:48

answered Aug 31, 2017 at 10:57

anubhava


9 Comments

anubhava Over a year ago

See working demo What output are you getting?

anubhava Over a year ago

Another option is to use: grep -Eo 'N{2,}' <<< "$s" | awk '{print $1, length($1)}'

benn Over a year ago

This worked: awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print length($i)}' <<< "$s"

anubhava Over a year ago

@raam86: Details added in answer.

raam86 Over a year ago

didn't realize you are referring to s defined earlier, thank you for the detailed answer


Rahul Verma · Answer · 2017-08-31 13:39:28Z

4

A simple solution:

echo "$string" | grep -oE "N+" | awk '{ print $0, length}'

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

EDIT:
As per suggestion of @Ed-Morton: Changing -P to -E.
Man page of grep says -P is "highly experimental" functionality.
We don't need PCREs to use +, just EREs are sufficient.
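
To illustrate the point about EREs, here is a small sketch that prints only the lengths, once with -E and once with a plain BRE. The function names lengths_ere and lengths_bre are invented for the example, and it assumes a grep that supports -o (GNU and BSD grep do).

```bash
s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

# ERE (-E): + means "one or more of the preceding item".
lengths_ere() { grep -oE 'N+' | awk '{ print length($0) }'; }

# BRE (grep's default): + is not special in POSIX BREs, so spell it as NN*.
lengths_bre() { grep -o 'NN*' | awk '{ print length($0) }'; }

printf '%s\n' "$s" | lengths_ere
printf '%s\n' "$s" | lengths_bre
```

Both pipelines should print 5, 8 and 7, one per line; neither needs -P.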

edited Aug 31, 2017 at 13:39

answered Aug 31, 2017 at 12:54

Rahul Verma


4 Comments

Ed Morton Over a year ago

You don't need PCREs to use +, just EREs, so use -E instead of -P so your grep isn't relying on "highly experimental" (see the man page!) functionality.

Rahul Verma Over a year ago

@EdMorton: Thanks Ed. Yeah I'll take care of that from next time. Let me edit too. And performance wise which is better according to you ?

Ed Morton Over a year ago

PCREs use a very different algorithm/regexp engine from BREs and EREs to accommodate look ahead/behind/whatever and that engine is much slower even if you don't use any PCRE-specific features so BRE and ERE are faster than PCRE. See swtch.com/~rsc/regexp/regexp1.html for details.

Rahul Verma Over a year ago

Okay. Yeah makes sense. (y).

Ed Morton · Answer · 2017-08-31 13:07:46Z

3

With GNU awk for multi-char RS:

$ awk -v RS='N+' 'RT{print length(RT)}' file
5
8
7

$ awk -v RS='N+' 'RT{print RT, length(RT)}' file
NNNNN 5
NNNNNNNN 8
NNNNNNN 7
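
If GNU awk is not available, RT is not set and the commands above print nothing. A rough equivalent for other awks is to walk each line with match(), which is a different technique from the RS/RT one. In this sketch n_run_lengths is a hypothetical name.

```bash
s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

# n_run_lengths: read lines on stdin, print the length of each run of N.
n_run_lengths() {
    awk '{
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost run of N
        while (match(rest, /N+/)) {
            print RLENGTH
            rest = substr(rest, RSTART + RLENGTH)   # continue after this run
        }
    }'
}

printf '%s\n' "$s" | n_run_lengths
```

This uses only POSIX awk features (match, RSTART, RLENGTH, substr), so it should behave the same in gawk, mawk and BSD awk.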

answered Aug 31, 2017 at 13:07

Ed Morton


3 Comments

benn Over a year ago

Thanks for your help, but I don't get results from your codes. How to use file?

Ed Morton Over a year ago

file is just a file containing the input string shown in your question. You could use echo 'AATGATGGAANNN...' | awk -v RS='N+' 'RT{print length(RT)}' instead. As it says, though, you've got to be using GNU awk.

Thor Over a year ago

This could be golfed to $0=length(RT)

tripleee · Answer · 2017-08-31 11:34:37Z

Here's a Perl one-liner:

perl -ne 'while (m/(.)(\1*)/g) { printf "%5i %s\n", length($2)+1, $1 }' <<<AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA
2 A
1 T
1 G
1 A
1 T
2 G
2 A
5 N
1 G
1 A
1 T
1 A
1 G
2 A
1 C
1 G
1 A
1 T
8 N
1 G
1 A
1 T
2 A
1 T
1 G
1 A
7 N
1 T
1 A
1 G
1 A
1 C
1 T
1 G
1 A

The m/(.)(\1*)/ successively matches as many identical characters as possible, with the /g causing the matching to pick up again on the next iteration for as long as the string still contains something which we have not yet matched. So we are looping over the string in chunks of identical characters, and on each iteration, printing the first character as well as the length of the entire matched string.

The first pair of parentheses capture a character at the beginning of the (remaining unmatched) line, and \1 says to repeat this character. The * quantifier matches this as many times as possible.

If you are interested in just the Ns, you could change the first parenthesis to (N), or you could add a conditional like printf("%5i %s\n", length($2)+1, $1) if $1 eq "N" (note eq rather than ==, because == compares numerically and treats every letter as 0). Similarly, if you want only hits where there are repeats (more than one occurrence), you can say \1+ instead of \1* or add a conditional like ... if length($2) >= 1.
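
The same chunk-by-chunk idea can be sketched in pure bash, without a regex, by walking the string one character at a time. The function name rle is made up for this example, and the output format only approximates the Perl one.

```bash
# rle STRING: print "count char" for every run of identical characters.
# (rle is a hypothetical helper name.)
rle() {
    local str=$1 i ch prev='' count=0
    for (( i = 0; i < ${#str}; i++ )); do
        ch=${str:i:1}
        if [[ $ch == "$prev" ]]; then
            count=$(( count + 1 ))          # still inside the same run
        else
            # a new run starts: report the previous one, if there was one
            (( count > 0 )) && printf '%5d %s\n' "$count" "$prev"
            prev=$ch
            count=1
        fi
    done
    (( count > 0 )) && printf '%5d %s\n' "$count" "$prev"   # final run
    return 0
}

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
rle "$s" | awk '$2 == "N" { print $1 }'    # keep only the runs of N
```

The awk filter at the end plays the role of the Perl conditional on $1. A character-by-character loop like this is likely to be slow on long sequences, so the Perl or awk versions are probably the better choice for large inputs.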

Johannes Riecken · Answer · 2017-08-31 11:39:50Z

1

As you asked for a sed solution, you can use this one if your chains of repeated characters are no longer than 9 characters and if your string doesn't contain any semicolons:

sed 's/$/;NNNNNNNNN0123456789/;:a;s/\(N\+\)\([^;]*;\1.\{9\}\)\(.\)\(.*\)/\2\3\4\n\3/;ta;s/[^\n]*\n//'

answered Aug 31, 2017 at 11:39

Johannes Riecken


Abhinandan prasad · Answer · 2017-08-31 12:07:09Z

1

try these two:

First one

sed 's/[^N]/ /g' file | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

Second One

cat file | tr -c 'N' ' ' | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

answered Aug 31, 2017 at 12:07

Abhinandan prasad


RomanPerekhrest · Answer · 2017-08-31 12:01:25Z

0

Short GNU awk approach:

str='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -v FPAT='N+' '{for(i=1;i<=NF;i++) print $i,length($i)}' <<< "$str"

The output:

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

answered Aug 31, 2017 at 12:01

RomanPerekhrest


HexaCrop · Answer · 2017-08-31 11:19:04Z

-1

You could use a regular expression.

This solution comes from the following question:

Count occurrences of a char in a string using Bash

needle=","
var="text,text,text,text"

number_of_occurrences=$(grep -o "$needle" <<< "$var" | wc -l)

As you can see, we get the number of occurrences of "$needle" pretty easily with wc -l, which counts the lines that grep -o prints (one line per match).

You can loop it to satisfy your demand.
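
Note that grep -o | wc -l gives the total number of occurrences, not the length of each run. The sketch below puts the two side by side. Both function names are invented here, and both assume the needle is a single character with no special meaning in a basic regular expression.

```bash
s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

# count_total CHAR STRING: total occurrences of CHAR (the approach above).
# tr strips the padding that some wc implementations add.
count_total() { printf '%s\n' "$2" | grep -o "$1" | wc -l | tr -d ' '; }

# count_per_run CHAR STRING: length of each consecutive run of CHAR.
# "$1$1*" is the BRE for "CHAR followed by zero or more CHAR".
count_per_run() { printf '%s\n' "$2" | grep -o "$1$1*" | awk '{ print length($0) }'; }

count_total N "$s"      # one number for the whole string
count_per_run N "$s"    # one number per run
```

For the sample string, count_total should report 20, while count_per_run should report 5, 8 and 7 on separate lines.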

answered Aug 31, 2017 at 11:19

HexaCrop


2 Comments

Ed Morton Over a year ago

@b.nota I guarantee if you had included the expected output in your question then Kevin wouldn't have misunderstood your requirements and wasted his time posting a solution to a different problem than the one you have (and got himself downvoted for his troubles - not by me).

benn Over a year ago

I didn't downvote either, I appreciate all the help here!
