0

I would like to replace each run of a repeated character with its repeat count, using a regex. For example, "aaad" should become "3xad" and "bCCCCC" should become "b5xC". I want to do this in sed or awk.

I know I can match a run with (.)\1+ or even capture it with ((.)\2+). But how can I obtain the number of repetitions and insert that value back into the string, in a regex or in sed or awk?

  • Counting in sed is extremely cumbersome, see for example gnu.org/software/sed/manual/sed.html#wc-_002dc Commented Sep 19, 2018 at 12:22
  • @BenjaminW. Thanks, I never knew counting was that difficult in sed. Can any other program do this easily? Commented Sep 19, 2018 at 12:24
  • Perl comes to mind. Commented Sep 19, 2018 at 12:25
  • awk's fine too. Commented Sep 19, 2018 at 12:25
  • @revo can you provide an answer based on awk? Thanks! Commented Sep 19, 2018 at 12:28

4 Answers

4

Perl to the rescue!

perl -pe 's/((.)\2+)/length($1) . "x$2"/ge'
  • -p reads the input line by line and prints each line after processing
  • s/// is the substitution operator, similar to the one in sed
  • /e evaluates the replacement as Perl code, and /g applies it to every run on the line

e.g.

aaadbCCCCCxx -> 3xadb5xC2xx

1 Comment

sed on steroids!
2

In GNU awk:

$ echo aaadbCCCCCxx |  awk -F '' '{
    b=""                                   # reset the buffer for each input line
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i
        match(substr($0,i),c"+")
        b=b (RLENGTH>1?RLENGTH "x":"") c
    }
    print b
}'
3xadb5xC2xx

If regex metacharacters in the input should be treated as literal characters, as noted in the comments, one could try to detect and escape them (the solution below is only a rough sketch):

$ echo \\\\\\..**aaadbCCCCC++xx |
awk -F '' '{
    b=""                               # reset the buffer for each input line
    for(i=1;i<=NF;i+=RLENGTH) {
        c=$i                               
        # print i,c                        # for debugging
        if(c~/[*.\\]/)                     # if c is a regex metachar (not complete)
            c="\\"c                        # escape it
        match(substr($0,i),c"+")           # find all c:s
        b=b (RLENGTH>1?RLENGTH "x":"") $i  # buffer to b
    }
    print b
}'
3x\2x.2x*3xadb5xC2x+2xx

3 Comments

That would misbehave if $i was a regexp metachar such as . or an escape char `\`. It's unclear if the OP can have non-alphabetic chars in their input or not though so idk if it's a real issue or not.
... AND it supports regex... ;D
wrt the 2nd script - escaping characters turns some of them into the chars they represent when escaped rather than literal, e.g. t -> \t = <tab>. Try printf 'foo\tbar\n' | awk '{c="t"; c="\\"c; print match($0,c)}'. You need to put all chars except ^ inside square brackets instead and you need to escape only ^. See the answers at stackoverflow.com/q/29613304/1745001 which do this job for sed.
1

Just for fun.

With sed it is cumbersome but doable. Note that this example relies on GNU sed:

parse.sed

/(.)\1+/ {
  : nextrepetition
  /((.)\2+)/ s//\n\1\n/             # delimit the repetition with new-lines
  h                                 # and store the delimited version
  s/^[^\n]*\n|\n[^\n]*$//g          # now remove prefix and suffix
  b charcount                       # count repetitions
  : aftercharcount                  # return here after counting
  G                                 # append the new-line delimited version

  # Reorganize pattern space to the desired format
  s/^([^\n]+)\n([^\n]*)\n(.)[^\n]+\n/\2\1x\3/

  # Run again if more repetitions exist
  /(.)\1+/b nextrepetition
}

b

# Adapted from the wc -c example in the sed manual
# Ref: https://www.gnu.org/software/sed/manual/sed.html#wc-_002dc
: charcount

s/./a/g

# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g

: done

# Convert the count back to decimal

: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/

y/bcdefgh/abcdefg/
/[a-h]/ b loop

b aftercharcount

Run it like this:

sed -Ef parse.sed infile

With an infile like this:

aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The output is:

3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa

5 Comments

echo 'xx' | sed -Ef parse.sed seems to send that into an infinite loop.
@EdMorton: this stems from the choice of repetition indicator (x) and that my solution looks at the whole string after each replacement. Either choose a different indicator or modify the solution to only look at the rest of the string
That was complete and utter dumb luck, I had no idea there was anything special about an x! Are there any other characters or strings that aren't allowed to appear in the input? I couldn't modify that script if I wanted to - way too complicated for my sed abilities!
@EdMorton: No. Thinking about the repetition indicator issue, I realized that multi-digit repetition counts made of a repeated digit, e.g. 11, 22, etc., would also cause erroneous output. The latter fix I suggested above (only looking at the rest of the string) seems to be the correct course of action; it would, however, complicate things further :-). I may take a stab at it when I have more procrastination time.
You will be well and truly mentally exercised at the end of this endeavour :-). I'm looking forward to the OP telling us that her "patterns" aren't necessarily single characters and can actually be multi-character strings... that will make things a whole lot more interesting.
1

I was hoping we'd have a MCVE by now but we don't so what the heck - here is my best guess at what you're trying to do:

$ cat tst.awk
{
    out = ""
    for (pos=1; pos<=length($0); pos+=reps) {
        char = substr($0,pos,1)
        for (reps=1; char == substr($0,pos+reps,1); reps++);
        out = out (reps > 1 ? reps "x" : "") char
    }
    print out
}

$ awk -f tst.awk file
3xad
d3xad3xa
fsdfjs
b5xC
3xad3xa

The above was run against the sample input that @Thor kindly provided:

$ cat file
aaad
daaadaaa
fsdfjs
bCCCCC
aaadaaa

The above will work for any input characters using any awk in any shell on any UNIX box. If you need to make it case-insensitive just throw a tolower() around each side of the comparison in the innermost for loop. If you need it to work on multi-character strings then you'll have to tell us how to identify where the substrings you're interested in start/end.
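
A minimal sketch of the case-insensitive variant mentioned above, with tolower() on each side of the comparison. The function name rle_nocase is hypothetical, and printing the run in the case of its first character is an assumption, since the answer does not say which case to keep.

```shell
# Hypothetical wrapper: case-insensitive version of tst.awk, reading stdin.
rle_nocase() {
    awk '{
        out = ""
        for (pos = 1; pos <= length($0); pos += reps) {
            char = substr($0, pos, 1)
            # Count how many following characters equal char, ignoring case.
            for (reps = 1; tolower(char) == tolower(substr($0, pos + reps, 1)); reps++)
                ;
            # Assumption: the run is printed in the case of its first character.
            out = out (reps > 1 ? reps "x" : "") char
        }
        print out
    }'
}

printf 'aAad\nbCcCcC\n' | rle_nocase
```

With this input it should print 3xad and b5xC.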
