Extracting Substring from String with Multiple Special Characters Using Sed

Question

I have a text file with a line that reads:

<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>

I want to use sed or awk to extract the substring between the outer single quotes, so that it prints only:

Any phrase's characters can go here!

I want the phrase to be delimited as I have above, starting after the single quote and ending at the single-quote immediately followed by a parenthesis and then semicolon. The following sed command with a capture group doesn't seem to be working for me. Suggestions?

sed '/^<div id="page_footer"><div><? print(\'\(.\+\)\');/ s//\1/p' /home/foobar/testfile.txt

Unless you're using unicode or another character set such that the apostrophe is not exactly the same character as the single quote, or use some other form of context or anchors, this will be ambiguous. However, you could grab text between the (' and ') sequences instead. Quite possibly, your version of sed doesn't grok the same implementation of regular expression syntax you're trying to use there...
– twalberg, Commented Oct 22, 2015 at 20:55
Yeah using (' and ') as anchors would be perfectly fine. Any suggestions for how to best implement this solution using sed or awk?
– user2150250, Commented Oct 22, 2015 at 21:34

Walter A · Accepted Answer · 2015-10-23 19:28:59Z

Using cut like this would be incorrect:

 grep "page_footer" /home/foobar/testfile.txt | cut -d "'" -f2

It goes wrong when the string itself contains single quotes, because cut splits the line at every one of them. Counting the single quotes first would turn a simple solution into an over-complicated one.
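To make the failure concrete, here is a minimal sketch that runs the cut command on the sample line from the question. The variable names `line` and `cut_result` are only for this illustration, and the line is fed through a pipe instead of being read from the file.

```shell
# Sample line from the question, kept verbatim via a quoted here-document
line=$(cat <<'EOF'
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
EOF
)

# cut treats every single quote as a delimiter, so field 2 stops
# at the apostrophe inside the phrase
cut_result=$(printf '%s\n' "$line" | cut -d "'" -f2)
printf '%s\n' "$cut_result"   # prints: Any phrase
```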

A solution with sed is better: remove everything up to the first single quote and everything after the last one. Putting a single quote inside a single-quoted sed script is messy, because you have to close the quoted string, add an escaped single quote, and then open a new quoted string:

grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*//' -e 's/[^'\'']*$//'

That is not yet the full solution, because you also want to remove the first and last quotes themselves:

grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*'\''//' -e 's/'\''[^'\'']*$//'
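The `'\''` sequence in the commands above is plain shell quoting, not sed syntax. A small sketch of how the shell assembles it, using made-up sample strings (`it's` and `a'b'c` are not from the question):

```shell
# 'it' then \' then 's' are three adjacent pieces that the shell
# joins into a single argument: it's
quoted=$(printf '%s' 'it'\''s')
printf '%s\n' "$quoted"

# The same trick inside a sed script: delete everything up to and
# including the first single quote
stripped=$(printf '%s\n' "a'b'c" | sed -e 's/[^'\'']*'\''//')
printf '%s\n' "$stripped"
```

By the time sed runs, it only sees the script `s/[^']*'//`; the quote juggling has already been resolved by the shell.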

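Writing the sed scripts in double-quoted strings and using the . wildcard to match the single quote makes the line shorter. Inside double quotes the single quote needs no backslash; a backslash there would become a literal member of the bracket expression:

grep page_footer /home/foobar/testfile.txt | sed -e "s/^[^']*.//" -e "s/.[^']*$//"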
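As an alternative to stripping both ends, the (' and '); anchors suggested in the comments can be used with a capture group, which is closer to what the question attempted. This is a sketch of that idea, not the answer's own method; it uses a double-quoted sed script so the single quotes need no escaping, and the variable names are only for illustration.

```shell
# Sample line from the question, kept verbatim via a quoted here-document
line=$(cat <<'EOF'
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
EOF
)

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded. \( \) is a basic-regex capture group, and the
# greedy .* inside it runs up to the last '); on the line.
phrase=$(printf '%s\n' "$line" | sed -n "s/.*print('\(.*\)');.*/\1/p")
printf '%s\n' "$phrase"   # prints: Any phrase's characters can go here!
```

To read from the file instead, the same sed command can take /home/foobar/testfile.txt as its argument in place of the pipe.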

Vytenis Bivainis · Accepted Answer · 2015-10-22 22:06:07Z

Using a grep that supports Perl-compatible regular expressions through -P (such as GNU grep on Linux), this might be what you are looking for:

grep -Po "(?<=').*?(?='\);)"

2 Comments

user2150250 Over a year ago

I'm not very familiar with Perl regular expressions. Could you please explain your answer somewhat? It seems like it's just using (' and ') as anchors and extracting the substring. Could this be expanded to also incorporate all the text to the left of the first anchor (<div id="page_footer"><div><? print)? Thank you.

Vytenis Bivainis Over a year ago

There are two reasons I used Perl-like expressions: the non-greedy match .*? (so that you could grab several print statements on the same line) and lookaheads/lookbehinds (regular-expressions.info/lookaround.html). Lookaheads/lookbehinds differ from normal capturing groups in that they do not include the matching parts in the output; they only check that those parts exist.
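A sketch of how the pattern behaves on the sample line, and one possible way to include text to the left of the first anchor as asked in the comment. It assumes a grep with -P support (GNU grep built with PCRE). The second pattern uses the PCRE `\K` escape, which drops everything matched so far from the reported match; that choice and the variable names are this example's, not the answer author's.

```shell
# Sample line from the question, kept verbatim via a quoted here-document
line=$(cat <<'EOF'
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
EOF
)

# Pattern from the answer: lookbehind for a quote, lazy match,
# lookahead for '); so the anchors are checked but not printed
lookaround=$(printf '%s\n' "$line" | grep -Po "(?<=').*?(?='\);)")
printf '%s\n' "$lookaround"

# Variant with left context: require page_footer ... print(' before
# the phrase, then \K discards that prefix from the output
with_prefix=$(printf '%s\n' "$line" | grep -Po "page_footer.*print\('\K.*?(?='\);)")
printf '%s\n' "$with_prefix"
```

Both commands print the phrase without the surrounding quotes. A longer literal prefix can be placed before `\K` in the same way.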
