0

I have a text file with a line that reads:

<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>

I want to use sed or awk to extract the substring between the outer single quotes above, so that it prints just:

Any phrase's characters can go here!

I want the phrase delimited as shown above: starting after the first single quote and ending at the single quote that is immediately followed by a closing parenthesis and a semicolon. The following sed command with a capture group doesn't work for me. Suggestions?

sed '/^<div id="page_footer"><div><? print(\'\(.\+\)\');/ s//\1/p' /home/foobar/testfile.txt
2
  • Unless you're using unicode or another character set such that the apostrophe is not exactly the same character as the single quote, or use some other form of context or anchors, this will be ambiguous. However, you could grab text between the (' and ') sequences instead. Quite possibly, your version of sed doesn't grok the same implementation of regular expression syntax you're trying to use there... Commented Oct 22, 2015 at 20:55
  • Yeah using (' and ') as anchors would be perfectly fine. Any suggestions for how to best implement this solution using sed or awk? Commented Oct 22, 2015 at 21:34

2 Answers

1

Using cut like this would be incorrect:

 grep "page_footer" /home/foobar/testfile.txt | cut -d "'" -f2

It goes wrong as soon as the string itself contains a single quote, because cut stops at the first one it finds. Counting the single quotes first to work around that would turn a simple solution into an over-complicated one.
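
To make the failure concrete, here is a small sketch that runs the cut pipeline against the sample line from the question. The temporary file and the variable name cut_result are only for illustration; they are not part of the original post.

```shell
# Write the sample line from the question to a temporary file
# (illustrative stand-in for /home/foobar/testfile.txt).
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
EOF

# Field 2 ends at the apostrophe in "phrase's", so the phrase is cut short.
cut_result=$(grep "page_footer" "$tmpfile" | cut -d "'" -f2)
printf '%s\n' "$cut_result"   # prints: Any phrase

rm -f "$tmpfile"
```

Everything after the apostrophe ends up in fields 3 and 4, which is why picking a single field cannot work here.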

A sed solution is better: remove everything up to the first single quote and everything after the last one. Putting a literal single quote inside a single-quoted sed script is messy, because you have to close the quoted string, add an escaped single quote, and then open a new quoted string ('\''):

grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*//' -e 's/[^'\'']*$//'

That is not the full solution yet, because you also want to remove the first and last quotes themselves:

grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*'\''//' -e 's/'\''[^'\'']*$//'

Writing the sed scripts in double-quoted strings, where a single quote needs no escaping, and using the . wildcard to match the single quote itself makes the line shorter:

grep page_footer /home/foobar/testfile.txt | sed -e "s/^[^']*.//" -e "s/.[^']*$//"
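
For comparison, here is a minimal sketch of the capture-group approach the question was attempting, using (' and '); as the anchors suggested in the comments. It assumes the line contains exactly one print('...'); call. One likely reason the original attempt failed is that in a POSIX shell a backslash cannot escape a single quote inside a single-quoted string, so this version puts the script in double quotes instead. The temporary file and the name sed_result are illustrative only.

```shell
# Write the sample line from the question to a temporary file
# (illustrative stand-in for /home/foobar/testfile.txt).
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
EOF

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded. \( \) is a basic-regex capture group, and the
# unescaped ( and ) around it match literal parentheses.
sed_result=$(sed -n "/page_footer/ s/.*print('\(.*\)');.*/\1/p" "$tmpfile")
printf '%s\n' "$sed_result"

rm -f "$tmpfile"
```

Because .* is greedy, the capture runs to the last '); on the line, so an apostrophe inside the phrase does not end the match early.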

1

Using a grep with Perl-compatible regular expression support (the -P option), such as GNU grep on Linux, this might be what you are looking for:

grep -Po "(?<=').*?(?='\);)"

2 Comments

I'm not very familiar with Perl regular expressions. Could you please explain your answer a bit? It seems like it's just using (' and ') as anchors and extracting the substring. Could this be expanded to also require all the text to the left of the first anchor (<div id="page_footer"><div><? print)? Thank you.
There are two reasons I used Perl-like expressions: the non-greedy match .*? (so that you could grab several print statements on the same line) and lookaheads/lookbehinds (regular-expressions.info/lookaround.html). Lookaheads/lookbehinds differ from normal capturing groups in that they do not include the text they match in the output; they only check that it is present.
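
One possible way to also require the text to the left of the first anchor, as asked in the comment above, is PCRE's \K, which discards everything matched so far from the reported match. This is only a sketch and assumes a GNU grep built with -P support; the temporary file and the name grep_result are illustrative.

```shell
# Write the sample line from the question to a temporary file
# (illustrative stand-in for /home/foobar/testfile.txt).
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
EOF

# The prefix must match literally (? and ( are escaped), \K then drops it
# from the output, .*? matches lazily, and the lookahead (?='\);) stops
# the match just before the closing ');
grep_result=$(grep -Po "<div id=\"page_footer\"><div><\? print\('\K.*?(?='\);)" "$tmpfile")
printf '%s\n' "$grep_result"

rm -f "$tmpfile"
```

\K is used here instead of a lookbehind because it keeps the long prefix readable as ordinary pattern text.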
