I want to get the string between <sometag param=' and '>

I tried the method from "Get any string between 2 string and assign a variable in bash" to get the "x":

 echo "<sometag param='x'><irrelevant stuff='nonsense'>" | tr "'" _ | sed -n 's/.*<sometag param=_\(.*\)_>.*/\1/p'

The problem (apart from the inefficient tr workaround, which I only use because I cannot manage to escape the apostrophe correctly for sed) is that sed matches greedily, i.e. the output is:

 x_><irrelevant stuff=_nonsense

but the correct output would be the minimal match, in this example just "x".

Thanks for your help

  • For structured data, use a tool which understands the structure; see man xsltproc. (Commented Dec 19, 2012 at 5:41)

2 Answers

You are probably looking for something like this:

sed -n "s/.*<sometag param='\([^']*\)'>.*/\1/p"

Test:

echo "<sometag param='x'><irrelevant stuff='nonsense'>" | sed -n "s/.*<sometag param='\([^']*\)'>.*/\1/p"

Results:

x

Explanation:

  • Instead of a greedy capture like .*, use a negated character class: [^']* matches any character except ', any number of times, so the capture cannot run past the closing quote. (POSIX sed has no non-greedy quantifier.) To anchor the pattern, this is followed by: '>.
  • Putting the sed script in double quotes means you don't need to escape the single quotes. If you wanted to keep the script in single quotes, you'd do this:

... | sed -n 's/.*<sometag param='\''\([^'\'']*\)'\''>.*/\1/p'

Notice that the single quotes aren't really escaped. The single-quoted sed expression is closed, an escaped single quote (\') is inserted, and the expression is re-opened. Think of '\'' as a four-character escape sequence.
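
To make the quoting point concrete, here is a minimal sketch that runs both forms of the sed command on the sample input from the question and compares the results. The variable names (in, a, b, c) are only illustrative.

```shell
in="<sometag param='x'><irrelevant stuff='nonsense'>"

# Double-quoted sed script: the single quotes are ordinary characters.
a=$(printf '%s\n' "$in" | sed -n "s/.*<sometag param='\([^']*\)'>.*/\1/p")

# Single-quoted sed script: each '\'' closes the string, adds \' and reopens it.
b=$(printf '%s\n' "$in" | sed -n 's/.*<sometag param='\''\([^'\'']*\)'\''>.*/\1/p')

# The same trick outside sed: the shell joins the adjacent pieces into one word.
c='it'\''s'

echo "$a $b $c"
```

Both sed invocations receive the same script text, so they should print the same value; only the shell-level quoting differs.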


Personally, I'd use GNU grep. It would make for a slightly shorter solution. Run like:

... | grep -oP "(?<=<sometag param=').*?(?='>)"

Test:

echo "<sometag param='x'><irrelevant stuff='nonsense'>" | grep -oP "(?<=<sometag param=').*?(?='>)"

Results:

x
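
Since -P is a GNU grep option, a script that has to run elsewhere could probe for it and fall back to the sed command above. The following is only a sketch under that assumption; extract_param is a hypothetical helper name, not part of grep or sed.

```shell
# extract_param: hypothetical helper, reads lines on stdin.
extract_param() {
    # Probe: does this grep accept -P (PCRE)? Errors are silenced on purpose.
    if printf 'a\n' | grep -qP 'a' 2>/dev/null; then
        grep -oP "(?<=<sometag param=').*?(?='>)"
    else
        # Fallback without -P: negated character class instead of lazy .*?
        sed -n "s/.*<sometag param='\([^']*\)'>.*/\1/p"
    fi
}

out=$(printf '%s\n' "<sometag param='x'><irrelevant stuff='nonsense'>" | extract_param)
echo "$out"
```

The two branches agree on input like the sample, with one matching tag per line. If a line contains several matching tags, grep -o prints every match while the sed command prints only one, so the fallback is not a drop-in replacement in that case.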

2 Comments

Thanks, the grep-based solution is what I was looking for.
FYI: The last grep test expression doesn't execute with the grep implementation on OS X 10.11. It may not work on BSDs in general. It DOES work on Ubuntu. :)
You don't have to assemble regexes in cases like this; you can just use ' as the field separator:

in="<sometag param='x'><irrelevant stuff='nonsense'>"

IFS="'" read -r x whatiwant y <<< "$in"         # bash; -r keeps backslashes literal
echo "$whatiwant"

awk -F\' '{print $2}' <<< "$in"                 # awk
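
One thing to keep in mind: splitting on ' returns the first quoted value on the line and does not look at the tag name. The sketch below uses a made-up input (the <other a='first'> tag is an assumption for illustration) to show how this differs from a pattern tied to sometag.

```shell
in="<other a='first'><sometag param='x'>"

# Field splitting on ': field 2 is the first quoted value, whatever the tag.
by_field=$(printf '%s\n' "$in" | awk -F\' '{print $2}')

# A sed pattern tied to the tag and attribute name picks the intended value.
by_tag=$(printf '%s\n' "$in" | sed -n "s/.*<sometag param='\([^']*\)'>.*/\1/p")

echo "$by_field $by_tag"
```

So the field-separator approach is a good fit when the wanted value is always the first quoted string on the line, as in the question's example.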
