2

I've seen the Stack Overflow question "How to use regex for multiple line pattern in shell script", but it doesn't do exactly what I want. I'm looking for a terminal-based way of doing an in-place sed (or perl) regex substitution that will automatically change some files for me. (I could probably do it with XML libraries etc., but I would prefer to use the terminal.)

The file I have

Some text
<div class="firstClass secondClass" something="else">
    Some random stuff
</div>
Random Text
<div class="thirdClass fifthClass" something="else">
    Some random stuff
    < is something
    < but not /> This
</div>
<div class="fourthClass">
    Some random stuff
</div>
Final Text

I tried to make the example arbitrary enough to show a few different use cases. I'm trying to convert it into something like the following:

Some text
<!-- firstClass start -->
    Some random stuff
<!-- firstClass end -->
Random Text
<!-- thirdClass start -->
    Some random stuff
    < is something
    < but not /> This
<!-- thirdClass end -->
<!-- fourthClass start -->
    Some random stuff
<!-- fourthClass end -->
Final Text

I am trying the following code:

sed -n '/<div class="\([^ "]*\)[^>]*>/,/<\/div>/{s/<div class="\([^ "]*\)[^>]*>/<!-- \1 start -->/;/<\/div>/d;p}' file

but since in the previous stackoverflow question the person didn't want the final line, the answers deleted it, which is not what I want. As can be seen, I want that first text repeated before and after the inside contents.

The regex above properly fixes the first line (changes the div to a comment), but I can't seem to replicate that below the text. I tried adjusting the regular expression, but I can't get it to work. It's also cutting out the very first line and the last lines, although I'd like to keep them. Any ideas how to do something like this?

(PS, yes, I know we need sed -i for an in-place command, but I want to test it out before I actually run through with it for obvious reasons)

Edit: A little addendum as to the idea of what I'm trying to do. Although the above is HTML, this code is not necessarily exclusively for HTML (hence why I don't want HTML/XML processing). The idea is:

Some random text before my pattern
PATTERN "info ...
  random stuffs
END PATTERN
Some random stuff after pattern

I'd like this to be converted to

Some random text before my pattern
NEW PATTERN - info 
  random stuffs
END NEW PATTERN - info
Some random stuff after pattern

So no html necessarily. Just something that takes a pattern above some text, replicates it below. The only condition is that random stuffs will not have the text END PATTERN and so that's what I want to base it off of. random stuffs will 100% never ever have the END PATTERN text. There's no nesting involved nor any edge cases. It's always the same pattern as shown above. The only "edge" case is that the first line PATTERN "info ... might have some extra text up until a line break which I don't care about. That can always be deleted. I only care about the word info (aka up until the first space character or first " character.)

  • That's deep in HTML processing and, of course, you want a library. For example, Mojo::DOM is great while I've used HTML::TreeBuilder nicely as well. Then, you'll want to articulate your requirements as precisely as possible. (Always/only div elements? Any nesting? ...) Commented Oct 24, 2023 at 7:33
  • No nesting, always div. The problem is that it's only "html" processing because the example I gave uses <div> instead of ##some or some other string. Which is why I want to stay away from HTML processing. I want to use this with things other than html as well. I'll try and add an addendum with a little more about the idea. Commented Oct 24, 2023 at 7:42
  • "...it's only "html" processing because the example I gave uses <div> instead of ##some or some other string." -- huh? But you want to capture between <div> and </div> tags? Call it whatever but that's structured text. How does it go with "##some"? What's the closing element? It'd better be some known format where you can use libraries or you'll have to write a parser and it won't be a one-liner I'm afraid. (Unless you actually have a trivial case of start--text--stop) Commented Oct 24, 2023 at 7:55
  • "add an addendum with a little more about the idea" -- by all means, since what you posted now is crystal clear: use HTML parser (and specify the problem better if you want answers). But, again, I suggest to try to be specific. What you mentioned in a comment is very open-ended Commented Oct 24, 2023 at 7:59
  • If you are certain that it's dirt simple (no nesting, known start-stop tags, no edge cases, etc) then state that clearly. In that case, yeah you can have a simple regex, if that's really all there is to it. Commented Oct 24, 2023 at 8:01

4 Answers

2

For starters, here is a simple take that works in my tests on the particular posted text

s{<div\s+ class="([^\s"]+) [^>]*> (.*?) </div>}{<!-- $1 start -->$2<!-- $1 end -->}sxg;

The modifiers are: s so that . matches a linefeed as well (normally it doesn't), x so that literal spaces in the pattern are ignored, which helps readability, and g so that this keeps going through the string, matching and substituting. The first capture stops at the first space or quote, and [^>]*> consumes the rest of the opening tag so that it doesn't end up in $2.

I'd recommend a program in a file for this, not a command-line one ("one-liner"), but since that was specifically asked for in the question, here it is

perl -0777 -wpe'
    s{<div\s+ class="([^\s"]+) [^>]*> (.*?) </div>}{<!-- $1 start -->$2<!-- $1 end -->}sxg' file

The -0777 switch makes it read the entire file into the $_ variable, which is default for many things in Perl -- regex's s{}{} operator in this case. See switches in perlrun.


In a larger and more structured program you could perhaps have beginning and end patterns in variables, for

s{$pbeg (.*?) $pend}{...}sxg

where for this case it would be

# /x is needed here too: a qr// pattern keeps its own modifiers when interpolated
my $pbeg = qr{<div\s+ class="([^\s"]+) [^>]*>}x;
my $pend = qr{</div>};

However, this could turn unwieldy if those patterns get complex.


1

This might work for you (GNU sed):

sed -E '/^<div class="([^ "]*).*/{
          s//<!-- \1 start -->/;h;:a;n;/^<\/div>$/!ba;g;s/\bstart/end/}' file

Match a start div.

Manipulate that line into the desired format and make a copy.

Print/fetch the next line until the ending div.

Replace that line with the copy and replace start with end and print the result.

Repeat.
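
The same steps can be sketched for the question's second, non-HTML example. This is only an illustration under assumptions: the function name convert_with_sed is made up, and both delimiters are assumed to start at the beginning of a line.

```shell
# Hypothetical helper: the hold-space approach applied to PATTERN / END PATTERN.
convert_with_sed() {
    sed -E '/^PATTERN "([^ "]*).*/{
        # rewrite the opening line and keep a copy in the hold space
        s//NEW PATTERN - \1/
        h
        # print the current line and fetch the next one until the end marker
        :a
        n
        /^END PATTERN/!ba
        # swap in the saved opening line and turn it into the closing line
        g
        s/^NEW/END NEW/
    }'
}

printf '%s\n' 'before' 'PATTERN "info extra' '  random stuffs' 'END PATTERN' 'after' |
    convert_with_sed
```

Because the closing line is built from the saved opening line, only the first substitution needs to know how to extract the key.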


1

Using GNU awk for the 3rd arg to match() and strongly typed regexp constants on your first example:

$ cat defs1.awk
BEGIN {
    begReg = @/<div\s+class="([^" ]+)/
    endReg = @/<\/div>/
    begFmt = "<!-- %s start -->"
    endFmt = "<!-- %s end -->"
}

$ cat common.awk
match($0,begReg,a) {
    key = a[1]
    $0 = sprintf(begFmt,key)
}
match($0,endReg,a) {
    $0 = sprintf(endFmt,key)
}
{ print }

$ awk -f defs1.awk -f common.awk file1
Some text
<!-- firstClass start -->
    Some random stuff
<!-- firstClass end -->
Random Text
<!-- thirdClass start -->
    Some random stuff
    < is something
    < but not /> This
<!-- thirdClass end -->
<!-- fourthClass start -->
    Some random stuff
<!-- fourthClass end -->
Final Text

and for your second example we just need a new definitions file but can reuse common.awk from above:

$ cat defs2.awk
BEGIN {
    begReg = @/PATTERN "([^" ]+)/
    endReg = @/END PATTERN/
    begFmt = "NEW PATTERN - %s"
    endFmt = "END NEW PATTERN - %s"
}

$ awk -f defs2.awk -f common.awk file2
Some random text before my pattern
NEW PATTERN - info
  random stuffs
END NEW PATTERN - info
Some random stuff after pattern

Note we just define in the BEGIN sections in the 2 defs*.awk files the desired input regexps and output format, we don't change the rest of the code in common.awk. All that depends on is that you can define the first capture group in the regexp that matches your beginning delimiter to contain the key info you want retained/printed in the begin and end lines.

You don't strictly need match() for the endReg match, but I used it in case you need to tweak it for other ending delimiter formats in future.

Just change awk to awk -i inplace to do the same pseudo-inplace editing that all the other tools do.
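
Since the question mentions wanting to test before editing in place, here is one possible way to preview a change first. It is a sketch rather than a feature of any of the tools: the helper name preview_then_apply and the APPLY variable are made up, and it assumes the command reads the file on stdin and writes the result to stdout.

```shell
# Hypothetical helper: preview_then_apply FILE COMMAND [ARGS...]
# Runs COMMAND with FILE on stdin, shows a diff, and rewrites FILE only when APPLY=1.
preview_then_apply() {
    file=$1; shift
    tmp=$(mktemp) || return 1
    if ! "$@" < "$file" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    diff -u "$file" "$tmp" || :     # a non-zero status normally just means "files differ"
    if [ "${APPLY:-0}" = 1 ]; then
        cat "$tmp" > "$file"        # overwrite the contents, keeping the file's permissions
    fi
    rm -f "$tmp"
}
```

For example, preview_then_apply file1 awk -f defs1.awk -f common.awk prints the diff only, and the same call after setting APPLY=1 also rewrites file1.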


0

Here's a simple Awk script which extracts the first token after class=" and uses that in the replacement text.

awk '/<div class="/ { sub(/.*<div class="/, ""); sub(/[" ].*/, "");
    class=$0; print "<!--", class, "start -->"; next }
  /<\/div>/ { print "<!--", class, "end -->"; class=""; next }
  1' file >new

There is nothing "multi-line" here in terms of regex matching, just a simple facility for remembering some state between lines. Awk is still examining one line at a time (though it's not hard to change that either if you need to; see RS).
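
The same state-keeping idea can be sketched for the question's generic PATTERN example using only POSIX awk features. The function name convert_with_awk is made up for illustration, and the delimiters are assumed to start at the beginning of a line.

```shell
# Hypothetical helper: remember the key from the opening line, reuse it on the closing line.
convert_with_awk() {
    awk '
        /^PATTERN "/ {
            key = $0
            sub(/^PATTERN "/, "", key)   # drop the opening delimiter
            sub(/[" ].*/, "", key)       # keep text up to the first space or quote
            print "NEW PATTERN - " key
            next
        }
        /^END PATTERN/ { print "END NEW PATTERN - " key; key = ""; next }
        { print }
    '
}

printf '%s\n' 'before' 'PATTERN "info extra' '  random stuffs' 'END PATTERN' 'after' |
    convert_with_awk
```

Lines outside a block fall through to the final { print } rule unchanged.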
