I have an XML file of the form:

...
<element1>
<element2>
<group1>
<tag1>value</tag1>
<tag2>value</tag2>
</group1>
<group1>
<tag1>value</tag1>
<tag2>value</tag2>
</group1>
<element2>
...

I used

sed -n '/\<group1\>/,/\<\/group1>/p' filename

to extract all content of the group1 tags, including all children. This is exactly what I want:

<group1>
<tag1>value</tag1>
<tag2>value</tag2>
</group1>
<group1>
<tag1>value</tag1>
<tag2>value</tag2>
</group1>

However, if the input XML is of the form

...
<element1>
<element2>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
<element3>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
...

and I try to extract the following content

<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>

The sed command above just returns:

<group2>
     <group2>value</group2>

It treats the inner </group2> as the stop pattern and extracts nothing further. I'm quite confused here: why doesn't it continue extracting from the next <group2>, as it does in the <group1> case? Is there any way to make this work with sed? Are there any other alternatives?

  • Regular expressions do not deal well with recursive structures. I'd suggest choosing a language with a proper XML parser available. Commented Oct 9, 2013 at 17:32
  • Obligatory link to stackoverflow.com/a/1732454/78845 Commented Oct 9, 2013 at 17:40

3 Answers

It is far better to use XPath with a command-line XPath interpreter such as xpath, xmlstarlet, my xidel, or xmllint.

All group1 elements on the 3rd level:

/element1/*/group1

All group2 elements that do not contain a group2 child:

//group2[not(group2)]
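
As a rough sketch of how such an expression could be run from the shell, assuming xmllint (from libxml2) is installed. The file name sample.xml, the wrapping <root> element, and the self-closed <element2/> and <element3/> tags are made up here so that the fragment is well-formed. The predicate not(parent::group2) is a variation on the expression above: it picks the group2 elements that are not nested inside another group2, which are the outer blocks from the question.

```shell
# Build a small, well-formed sample file (hypothetical name: sample.xml).
cat > sample.xml <<'EOF'
<root>
<element2/>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
<element3/>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
</root>
EOF

# Print every group2 whose parent is not a group2 (the outer blocks).
# The guard only avoids a confusing error when xmllint is not installed.
if command -v xmllint >/dev/null 2>&1; then
    xmllint --xpath '//group2[not(parent::group2)]' sample.xml
else
    echo "xmllint not found; install the libxml2 utilities to try this" >&2
fi
```

Because the selection is done on the parsed tree, it does not depend on how the tags are indented or split across lines.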

2 Comments

I opted for xmlstarlet, this provides some other great stuff. Thanks
Well, my Xidel has the advantage that it supports XPath 2 (and XQuery). Xmlstarlet has only XPath 1 ...
1

You can change your sed command like this, anchoring the closing pattern to the start of the line:

sed -n '/\<group1\>/,/^<\/group1>/p' filename  | grep -v 'element3'

1 Comment

This does not work. It even prints the <element3> that sits between the two <group2> blocks.
0

Something like this?

awk '/^<group2>/,/^<\/group2>/' file
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>

This works only when the inner tags are indented differently from the outer ones. If every tag is aligned to the left margin, it will not work.
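
If the indentation cannot be relied on, one possible workaround is to count nesting depth instead of using a range. The following is only a sketch under simplifying assumptions: the tags carry no attributes, are never self-closing, and the function name extract_group2 and file name flat.xml are invented for the example. It is still text matching, not real XML parsing.

```shell
# Sample input with every tag aligned to the left margin (hypothetical file).
cat > flat.xml <<'EOF'
<element1>
<element2>
<group2>
<group2>value</group2>
<otherTag>value</otherTag>
</group2>
<element3>
<group2>
<group2>value</group2>
<otherTag>value</otherTag>
</group2>
EOF

# Print each line that lies inside a group2 block, tracking nesting depth.
extract_group2() {
    awk '
    {
        line = $0
        opens  = gsub(/<group2>/, "", line)    # opening tags on this line
        closes = gsub(/<\/group2>/, "", line)  # closing tags on this line
        if (depth > 0 || opens > 0) print      # inside a block, or one starts here
        depth += opens - closes                # update depth after printing
    }' "$1"
}

extract_group2 flat.xml
```

The depth is updated after the print, so the outermost closing tag is still printed while lines such as <element3> between two blocks are skipped.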
