0

I am wondering if there is a simple bash or AWK oneliner to get the number of repeated characters, per repeat.

For example considering this string:

AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA

Is it possible to get the number of Ns in the first repeat, the number of Ns in the second repeat, etc.?

Thanks!

Expected results, the length of each repeat on a new line.

  • What efforts did you make? Post them even if they did not solve your problem. Commented Aug 31, 2017 at 10:55
  • At a minimum at least add your expected output - all on one line, spaces or commas between, on separate lines, etc... Commented Aug 31, 2017 at 12:51
  • I was satisfied with the first answer from anubhava, see comments under his answer. I added expected results, as you asked for. Commented Aug 31, 2017 at 12:57
  • We're not looking for a description of the expected results (though it's fine to have that too), we're looking for the actual expected output given the input you posted. This site isn't just for you to get an answer to your question, it's a repository for others to look up their questions to find answers so it's important that a question be a complete one (see How to Ask) to help everyone else in future. Commented Aug 31, 2017 at 13:00

8 Answers

7

You can use awk to split fields on each run of characters that are not N and print each non-empty field and its length:

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print $i, length($i)}' <<< "$s"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

Another option is to use grep + awk:

grep -Eo 'N+' <<< "$s" | awk '{print $1, length($1)}'

And here is a pure Bash solution:

shopt -s extglob
while read -r line; do
    [[ -n $line ]] && echo "$line ${#line}"
done <<< "${s//+([!N])/$'\n'}"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

BASH solution details:

  1. It uses an extended glob pattern, +([!N]), to match one or more non-N characters and replaces each match with a line break in "${s//+([!N])/$'\n'}"
  2. Using a while loop, we iterate through each resulting line, skipping the empty ones, so only the substrings of N characters remain
  3. Inside the loop we print each string and the length of that string.
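
The three steps above could be wrapped into a small function that prints only the lengths, one per line, which is the format the question asks for. This is a minimal sketch rather than part of the original answer: the function name run_lengths is made up for illustration, and the newline is held in a variable (nl) instead of being written inline as $'\n'.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enables the +( ... ) extended glob pattern

# run_lengths: hypothetical helper name, used here for illustration only.
# Prints the length of each run of N in the string "$1", one per line.
run_lengths() {
    local line nl=$'\n'
    while read -r line; do
        # Non-N text at the start or end of the string leaves empty lines; skip them.
        if [[ -n $line ]]; then
            echo "${#line}"
        fi
    done <<< "${1//+([!N])/$nl}"
}

run_lengths 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
```

For the string from the question this should print 5, 8 and 7 on separate lines.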

9 Comments

See working demo What output are you getting?
Another option is to use: grep -Eo 'N{2,}' <<< "$s" | awk '{print $1, length($1)}'
This worked: awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print length($i)}' <<< "$s"
@raam86: Details added in answer.
didn't realize you are referring to s defined earlier, thank you for the detailed answer
4

A simple solution:

echo "$string" | grep -oE "N+" | awk '{ print $0, length}'

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

EDIT:
As per suggestion of @Ed-Morton: Changing -P to -E.
Man page of grep says -P is "highly experimental" functionality.
We don't need PCREs to use +, just EREs are sufficient.
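
To illustrate the point about EREs: the + quantifier is available with -E, and the same match can also be written as a basic regular expression using the interval \{1,\}. The sketch below assumes a grep that supports -o (GNU and BSD grep do); the function names runs_ere and runs_bre are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; both print each run of N on its own line.

# ERE: + means "one or more".
runs_ere() {
    printf '%s\n' "$1" | grep -oE 'N+'
}

# BRE: the interval \{1,\} expresses the same thing without -E.
runs_bre() {
    printf '%s\n' "$1" | grep -o 'N\{1,\}'
}

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
runs_ere "$s"
runs_bre "$s"
```

Both functions should list the same three runs, so neither needs -P.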

4 Comments

You don't need PCREs to use +, just EREs, so use -E instead of -P so your grep isn't relying on "highly experimental" (see the man page!) functionality.
@EdMorton: Thanks Ed. Yeah I'll take care of that from next time. Let me edit too. And performance wise which is better according to you ?
PCREs use a very different algorithm/regexp engine from BREs and EREs to accommodate look ahead/behind/whatever and that engine is much slower even if you don't use any PCRE-specific features so BRE and ERE are faster than PCRE. See swtch.com/~rsc/regexp/regexp1.html for details.
Okay. Yeah makes sense. (y).
3

With GNU awk for multi-char RS:

$ awk -v RS='N+' 'RT{print length(RT)}' file
5
8
7

$ awk -v RS='N+' 'RT{print RT, length(RT)}' file
NNNNN 5
NNNNNNNN 8
NNNNNNN 7

3 Comments

Thanks for your help, but I don't get results from your codes. How to use file?
file is just a file containing the input string shown in your question. You could use echo 'AATGATGGAANNN...' | awk -v RS='N+' 'RT{print length(RT)}' instead. As it says, though, you've got to be using GNU awk.
This could be golfed to $0=length(RT)
2

Here's a Perl one-liner:

perl -ne 'while (m/(.)(\1*)/g) { printf "%5i %s\n", length($2)+1, $1 }' <<<AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA
2 A
1 T
1 G
1 A
1 T
2 G
2 A
5 N
1 G
1 A
1 T
1 A
1 G
2 A
1 C
1 G
1 A
1 T
8 N
1 G
1 A
1 T
2 A
1 T
1 G
1 A
7 N
1 T
1 A
1 G
1 A
1 C
1 T
1 G
1 A

The m/(.)(\1*)/ successively matches as many identical characters as possible, with the /g causing the matching to pick up again on the next iteration for as long as the string still contains something which we have not yet matched. So we are looping over the string in chunks of identical characters, and on each iteration, printing the first character as well as the length of the entire matched string.

The first pair of parentheses capture a character at the beginning of the (remaining unmatched) line, and \1 says to repeat this character. The * quantifier matches this as many times as possible.

If you are interested in just the N:s, you could change the first parenthesis to (N), or you could add a conditional like printf("%5i %s\n", length($2)+1, $1) if ($1 eq "N"). Note that the comparison must use eq, because == compares numerically in Perl and would treat every letter as equal to "N". Similarly, if you want only hits where there are repeats (more than one occurrence), you can say \1+ instead of \1* or add a conditional like ... if length($2) >= 1.
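
For comparison, the same chunk-by-chunk idea can be sketched in the shell with standard tools instead of the Perl regex: fold puts each character on its own line, and uniq -c then counts each run of adjacent identical lines. This is a different technique from the one-liner above, it assumes single-byte characters, and the function name all_runs is hypothetical.

```shell
#!/usr/bin/env bash
# all_runs: hypothetical helper; prints "<count> <char>" for every run
# of identical characters in the string "$1".
all_runs() {
    # fold -w 1  : one character per line
    # uniq -c    : count adjacent identical lines
    # awk        : strip the padding that uniq -c adds before the count
    printf '%s\n' "$1" | fold -w 1 | uniq -c | awk '{print $1, $2}'
}

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
all_runs "$s"

# Keep only the runs of N and print just their lengths.
all_runs "$s" | awk '$2 == "N" {print $1}'
```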

1

As you asked for a sed solution, you can use this one if your chains of repeated characters are no longer than 9 characters and if your string doesn't contain any semicolons:

sed 's/$/;NNNNNNNNN0123456789/;:a;s/\(N\+\)\([^;]*;\1.\{9\}\)\(.\)\(.*\)/\2\3\4\n\3/;ta;s/[^\n]*\n//'

1

Try these two.

First one:

sed 's/[^N]/ /g' file | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

Second one:

tr -c 'N\n' ' ' < file | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

0

Short GNU awk approach:

str='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -v FPAT='N+' '{for(i=1;i<=NF;i++) print $i,length($i)}' <<< "$str"

The output:

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

-1

You could use grep with a regular expression.

This solution comes from the following question:

Count occurrences of a char in a string using Bash

needle=","
var="text,text,text,text"

number_of_occurrences=$(grep -o "$needle" <<< "$var" | wc -l)

As you can see, we get the number of occurrences of "$needle" pretty easily with the help of wc -l (line count), because grep -o prints each match on its own line.

You can loop over this to fit your needs.
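
Applied to the string from the question, the pipeline above yields the total number of Ns rather than one count per run. The sketch below shows one way the looping might look: match whole runs with grep -oE and measure each one. The function names are hypothetical, and the character passed in is assumed to be a single literal with no special meaning in a regular expression.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; $1 is assumed to be a single
# literal character with no special meaning in a regex.

# Total occurrences of the character $1 in the string $2.
total_count() {
    local n
    n=$(grep -o "$1" <<< "$2" | wc -l)
    echo "$((n))"    # arithmetic expansion drops any padding from wc
}

# One count per run: match whole runs, then measure each one.
per_run_counts() {
    local run
    while read -r run; do
        echo "${#run}"
    done < <(grep -oE "$1+" <<< "$2")
}

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
total_count N "$s"
per_run_counts N "$s"
```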

2 Comments

@b.nota I guarantee if you had included the expected output in your question then Kevin wouldn't have misunderstood your requirements and wasted his time posting a solution to a different problem than the one you have (and got himself downvoted for his troubles - not by me).
I didn't downvote either, I appreciate all the help here!
