Shell script to extract recursive xml tags

Question

I have an XML file of the form:

...
<element1>
<element2>
<group1>
<tag1>value</tag1>
<tag2>value</tag2>
</group1>
<group1>
<tag1>value</tag1>
<tag2>value</tag2>
</group1>
<element2>
...

I used

sed -n '/<group1>/,/<\/group1>/p' filename

to extract all content of the group1 tags, including all children. This is exactly what I want:

<group1>
<tag1>value</tag1>
<tag2>value</tag2>
</group1>
<group1>
<tag1>value</tag1>
<tag2>value</tag2>
</group1>

However, if the input XML is of the form

...
<element1>
<element2>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
<element3>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
...

and I try to extract the following content

<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>

The sed command above just returns:

<group2>
     <group2>value</group2>

It treats the nested </group2> as the stop pattern and extracts nothing more. I'm quite confused here: why doesn't it continue extracting the next <group2>, as in the <group1> case? Is there any way to make this work with sed, or are there other alternatives?

Regular expressions do not deal well with recursive structures. I'd suggest choosing a language with a proper XML parser available.
– chepner, Commented Oct 9, 2013 at 17:32
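
To illustrate the point: a sed address range closes at the first later line that matches the end pattern, with no notion of nesting. The following is a minimal sketch, assuming a hypothetical file sample.xml that mirrors the second input in the question (the file names are illustrative only):

```shell
# sample.xml is a hypothetical stand-in for the second input in the question.
cat > sample.xml <<'EOF'
<element1>
<element2>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
<element3>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
EOF

# The range opens on a line containing <group2> and closes on the first
# following line containing </group2>. That line is the nested one-line
# element, so <otherTag> and the outer closing tag are never printed.
sed -n '/<group2>/,/<\/group2>/p' sample.xml > range_output.txt
cat range_output.txt
```

With group1 the problem never shows up because no nested line contains </group1>, so the first match of the end pattern really is the outer closing tag.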

BeniBela · Accepted Answer · 2013-10-09 21:18:36Z

1

It is far better to use XPath with a command-line XPath interpreter such as xpath, xmlstarlet, xmllint, or my own Xidel.

All group1 elements on the 3rd level:

/element1/*/group1

All group2 elements that do not themselves contain a group2 child:

//group2[not(group2)]
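
Conversely, the outer group2 elements that the question asks for, i.e. those that do contain a group2 child, could be selected with //group2[group2]. Below is a sketch using xmllint's --xpath option. It assumes xmllint (from libxml2) is installed and uses a hypothetical file nested.xml; the closing tags are added because XPath tools expect well-formed XML, which the fragment in the question is not.

```shell
# nested.xml is a hypothetical, well-formed version of the input in the question.
cat > nested.xml <<'EOF'
<element1>
<element2>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
</element2>
<element3>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
</element3>
</element1>
EOF

# Select every group2 that has a group2 child, i.e. the outer elements.
if command -v xmllint >/dev/null 2>&1; then
    xmllint --xpath '//group2[group2]' nested.xml > outer_groups.txt
    cat outer_groups.txt
else
    echo "xmllint not found; install the libxml2 command-line tools" >&2
fi
```

With xmlstarlet, the equivalent should be along the lines of xmlstarlet sel -t -c '//group2[group2]' nested.xml.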

2 Comments

TPT Gin Over a year ago

I opted for xmlstarlet, which also provides some other great features. Thanks.

BeniBela Over a year ago

Well, my Xidel has the advantage that it supports XPath 2 (and XQuery). Xmlstarlet has only XPath 1 ...

CS Pei · Answer · 2013-10-10 12:57:41Z

1

You can change your sed command like this:

sed -n '/\<group1\>/,/^<\/group1>/p' filename  | grep -v 'element3'

edited Oct 10, 2013 at 12:57

answered Oct 9, 2013 at 17:34

1 Comment

TPT Gin Over a year ago

This does not work. It even prints the <element3> between the two <group2> elements.

Jotne · Answer · 2013-10-09 17:35:41Z

0

Something like this?

awk '/^<group2>/,/^<\/group2>/' file
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>
<group2>
     <group2>value</group2>
     <otherTag>value</otherTag>
</group2>

This works only if the nested tags are indented differently from the outer ones. If every tag is flush left, it will not work.
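
When all tags are flush left, an anchored range cannot tell the inner closing tag from the outer one. One possible workaround, which is not part of the answer above, is to track nesting depth instead of matching a range. The sketch below assumes the tags are written exactly as <group2> and </group2> (no attributes, no self-closing form) and uses a hypothetical file sample_flat.xml; it is still line-based text matching, not real XML parsing.

```shell
# sample_flat.xml is a hypothetical input with every tag flush left.
cat > sample_flat.xml <<'EOF'
<element1>
<element2>
<group2>
<group2>value</group2>
<otherTag>value</otherTag>
</group2>
<element3>
<group2>
<group2>value</group2>
<otherTag>value</otherTag>
</group2>
EOF

awk '
{
    line = $0
    opens  = gsub(/<group2>/, "&", line)     # number of opening tags on this line
    closes = gsub(/<\/group2>/, "&", line)   # number of closing tags on this line
    if (depth > 0 || opens > 0) print        # inside, or entering, a group2 block
    depth += opens - closes                  # update nesting depth for the next line
}' sample_flat.xml > depth_output.txt
cat depth_output.txt
```

Because the depth only returns to zero at the outer closing tag, the nested one-line element no longer ends the extraction, and lines between blocks such as <element3> are skipped.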
