How can I output the number of repeats of a pattern in regex?

Question

I would like to output the number of repeats of a pattern with regex. For example, convert "aaad" to "3xad", "bCCCCC" to "b5xC". I want to do this in sed or awk.

I know I can match a run with (.)\1+ or capture it with ((.)\1+). But how can I obtain the repeat count and insert that value back into the string using regex, sed, or awk?

Counting in sed is extremely cumbersome, see for example gnu.org/software/sed/manual/sed.html#wc-_002dc
– Benjamin W., Commented Sep 19, 2018 at 12:22
@BenjaminW. Thanks, I never knew counting was that difficult in sed. Can any other program do this easily?
– Wang, Commented Sep 19, 2018 at 12:24

choroba · Accepted Answer · 2018-09-19 12:29:41Z

4

Perl to the rescue!

perl -pe 's/((.)\2+)/length($1) . "x$2"/ge'

-p reads the input line by line and prints it after processing
s/// is the substitution similar to sed
/e makes the replacement evaluated as code

e.g.

aaadbCCCCCxx -> 3xadb5xC2xx
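
For comparison, the same idea (a replacement computed by code, as with Perl's /e) can be sketched in Python, where re.sub accepts a function in place of the replacement string. This is only an illustration: the function name compress_runs is made up here, and the pattern (.)\1+ is the one from the question.

```python
import re

def compress_runs(line: str) -> str:
    """Replace each run of 2+ identical characters with '<count>x<char>'."""
    # (.)\1+ matches a character followed by one or more copies of itself.
    # The lambda plays the role of Perl's /e: it computes the replacement text.
    return re.sub(r"(.)\1+", lambda m: f"{len(m.group(0))}x{m.group(1)}", line)

print(compress_runs("aaadbCCCCCxx"))  # 3xadb5xC2xx
```

Because the pattern is fixed and the repeated character is only ever captured, never spliced into a regex, metacharacters in the input need no special handling in this version.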

answered Sep 19, 2018 at 12:29

choroba


1 Comment

karakfa Over a year ago

sed on steroids!

James Brown · Accepted Answer · 2018-09-20 05:47:23Z

2

In GNU awk:

$ echo aaadbCCCCCxx |  awk -F '' '{
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i
        match(substr($0,i),c"+")
        b=b (RLENGTH>1?RLENGTH "x":"") c
    }
    print b
}'
3xadb5xC2xx

If regex metacharacters in the input need to be treated as literal characters, as noted in the comments, one could try to detect and escape them (the solution below is only a rough direction, not a complete fix):

$ echo \\\\\\..**aaadbCCCCC++xx |
awk -F '' '{
    for(i=1;i<=NF;i+=RLENGTH) { 
        c=$i                               
        # print i,c                        # for debugging
        if(c~/[*.\\]/)                     # if c is a regex metachar (not complete)
            c="\\"c                        # escape it
        match(substr($0,i),c"+")           # find all c:s
        b=b (RLENGTH>1?RLENGTH "x":"") $i  # buffer to b
    }
    print b
}'
3x\2x.2x*3xadb5xC2x+2xx

edited Sep 20, 2018 at 5:47

answered Sep 19, 2018 at 13:58

James Brown


3 Comments

Ed Morton Over a year ago

That would misbehave if $i was a regexp metachar such as . or an escape char `\`. It's unclear if the OP can have non-alphabetic chars in their input or not though so idk if it's a real issue or not.

James Brown Over a year ago

... AND it supports regex... ;D

Ed Morton Over a year ago

wrt the 2nd script - escaping characters turns some of them into the chars they represent when escaped rather than literal, e.g. t -> \t = <tab>. Try printf 'foo\tbar\n' | awk '{c="t"; c="\\"c; print match($0,c)}'. You need to put all chars except ^ inside square brackets instead and you need to escape only ^. See the answers at stackoverflow.com/q/29613304/1745001 which do this job for sed.
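
The pitfall described in this comment is not specific to awk. Below is a rough Python illustration, using Python's re module as a stand-in for awk's regex engine (an assumption made for this sketch). Blindly prefixing a backslash turns t into the tab escape, while re.escape only escapes characters that need it. Note that re.escape is a swapped-in technique: the comment itself recommends bracket expressions for awk and sed. The helper run_length is hypothetical.

```python
import re

text = "foo\tbar"

# Naive escaping: "\" + "t" is the regex \t, which matches a TAB, not "t".
naive = "\\" + "t"
print(re.search(naive, text).start())  # 3 (the position of the tab)

# re.escape leaves ordinary characters alone and escapes metacharacters.
print(re.escape("t"), re.escape("."), re.escape("*"))  # t \. \*

def run_length(s: str, start: int) -> int:
    """Length of the run of s[start] that begins at index start."""
    # Build the regex from the escaped literal character, then match the run.
    return re.match(re.escape(s[start]) + "+", s[start:]).end()

print(run_length("..**aaa", 0), run_length("..**aaa", 2))  # 2 2
```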

Thor · Accepted Answer · 2018-09-19 14:05:14Z

1

Just for fun.

With sed it is cumbersome but doable. Note that this example relies on GNU sed:

parse.sed

/(.)\1+/ {
  : nextrepetition
  /((.)\2+)/ s//\n\1\n/             # delimit the repetition with new-lines
  h                                 # and store the delimited version
  s/^[^\n]*\n|\n[^\n]*$//g          # now remove prefix and suffix
  b charcount                       # count repetitions
  : aftercharcount                  # return here after counting
  G                                 # append the new-line delimited version

  # Reorganize pattern space to the desired format
  s/^([^\n]+)\n([^\n]*)\n(.)[^\n]+\n/\2\1x\3/

  # Run again if more repetitions exist
  /(.)\1+/b nextrepetition
}

b

# Adapted from the wc -c example in the sed manual
# Ref: https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc
: charcount

s/./a/g

# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g

: done

# On the last line, convert back to decimal

: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/

y/bcdefgh/abcdefg/
/[a-h]/ b loop

b aftercharcount

Run it like this:

sed -Ef parse.sed infile

With an infile like this:

aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The output is:

3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa
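
To make the charcount routine easier to follow, here is a rough Python re-enactment of its steps. It is a teaching aid, not a replacement for the sed script. It assumes the input is a single run containing no newline, and the name sed_style_count is invented for this sketch.

```python
import re

def sed_style_count(run: str) -> str:
    """Count characters the way the sed charcount routine does."""
    s = "a" * len(run)                        # s/./a/g : one 'a' per character
    # Carry: ten of one symbol become one of the next (a=units, b=tens, ...).
    for low, high in zip("abcdefg", "bcdefgh"):
        s = s.replace(low * 10, high)
    s = s.replace("h" * 10, "")               # last carry step simply drops overflow
    # Convert back to decimal, one digit per pass, least significant first.
    while True:
        if "a" not in s:
            # No units left: put a 0 after the leading run of higher symbols.
            s = re.sub(r"[b-h]*", r"\g<0>0", s, count=1)
        else:
            units = s.count("a")              # at most 9 after the carry
            s = s.replace("a" * units, str(units), 1)
        # y/bcdefgh/abcdefg/ : shift every symbol down one decimal place.
        s = s.translate(str.maketrans("bcdefgh", "abcdefg"))
        if not re.search(r"[a-h]", s):
            return s

print(sed_style_count("CCCCC"))    # 5
print(sed_style_count("x" * 13))   # 13
```

For 13 characters the string goes from thirteen a's to baaa (one ten, three units), then b3, then a3, and finally 13.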

answered Sep 19, 2018 at 14:05

Thor


5 Comments

Ed Morton Over a year ago

echo 'xx' | sed -Ef parse.sed seems to send that into an infinite loop.

Thor Over a year ago

@EdMorton: this stems from the choice of repetition indicator (x) and that my solution looks at the whole string after each replacement. Either choose a different indicator or modify the solution to only look at the rest of the string

Ed Morton Over a year ago

That was complete and utter dumb luck, I had no idea there was anything special about an x! Are there any other characters or strings that aren't allowed to appear in the input? I couldn't modify that script if I wanted to - way too complicated for my sed abilities!

Thor Over a year ago

@EdMorton: No. Thinking about the repetition indicator issue, I realized that having plural lettered repetitions with the same number, e.g. 11, 22, etc., would also cause erroneous output. The latter solution I suggested above seems to be the correct course of action, it would however complicate things further :-). I may take a stab at it when I have more procrastination time

Ed Morton Over a year ago

You will be well and truly mentally exercised at the end of this endeavour :-). I'm looking forward to the OP telling us that her "patterns" aren't necessarily single characters and can actually be multi-character strings... that will make things a whole lot more interesting.

Ed Morton · Accepted Answer · 2018-09-20 00:54:30Z

I was hoping we'd have a MCVE by now but we don't so what the heck - here is my best guess at what you're trying to do:

$ cat tst.awk
{
    out = ""
    for (pos=1; pos<=length($0); pos+=reps) {
        char = substr($0,pos,1)
        for (reps=1; char == substr($0,pos+reps,1); reps++);
        out = out (reps > 1 ? reps "x" : "") char
    }
    print out
}

$ awk -f tst.awk file
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa

The above was run against the sample input that @Thor kindly provided:

$ cat file
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The above will work for any input characters using any awk in any shell on any UNIX box. If you need to make it case-insensitive just throw a tolower() around each side of the comparison in the innermost for loop. If you need it to work on multi-character strings then you'll have to tell us how to identify where the substrings you're interested in start/end.
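
For readers who want the same regex-free logic outside awk, a minimal Python sketch follows. It is an illustration only: the function name compress_runs_plain and the ignore_case parameter are made up here, with ignore_case standing in for the tolower() suggestion above.

```python
from itertools import groupby

def compress_runs_plain(line: str, ignore_case: bool = False) -> str:
    """Regex-free encoding: '<count>x<char>' for runs of 2+, else the char itself."""
    key = str.lower if ignore_case else None
    out = []
    for _, group in groupby(line, key=key):
        run = list(group)
        # Emit the first character of the run exactly as it appeared in the input.
        out.append((f"{len(run)}x" if len(run) > 1 else "") + run[0])
    return "".join(out)

for line in ["aaad", "daaadaaa", "fsdfjs", "bCCCCC", "aaadaaa"]:
    print(compress_runs_plain(line))
# 3xad
# d3xad3xa
# fsdfjs
# b5xC
# 3xad3xa
```

Since characters are compared directly rather than compiled into a pattern, inputs such as . or * or a backslash are handled like any other character.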
