
I have the following XML file. I want to edit it by removing the url and title attributes from every <doc> element. I am looking for a Unix command that can do this, rather than writing a whole program.


<documents>
<doc id="852" url="http://en.wikipedia.org/wiki?curid=852" title="...">
<text>
 Some text...
</text>
</doc>

<doc id="853" url="http://en.wikipedia.org/wiki?curid=853" title="...">
<text>
 Some text...
</text>
</doc>

<doc id="854" url="http://en.wikipedia.org/wiki?curid=854" title="...">
<text>
 some text...
</text>
</doc>

</documents>
    I'm thinking sed could do this. Commented Jul 29, 2015 at 14:22

1 Answer

If the XML is as consistent as this, a simple command that could work is:

sed -r 's/^(<doc .*) url=".*/\1>/' myfile.xml

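That says to match lines that start with a <doc tag, capture everything up to the url attribute, discard the rest of the line, and close the tag again with a new >. It relies on url and title being the last attributes on the line, as they are in the sample.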

You could get more careful with the regex, but sed is a good tool for this, IF the XML is totally predictable.

If you want to change the file in place, add -i to the sed invocation (BSD/macOS sed needs -i '' instead).
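
As a sketch of what "more careful" might look like, the variant below deletes the url and title attributes individually, so it does not depend on their order or on them being last. It assumes each <doc> start tag sits on a single line and that attribute values contain no double quotes. The file names sample.xml and cleaned.xml are placeholders, and -E is used in place of -r because both GNU and BSD sed accept it.

```shell
# Build a small sample input (sample.xml is a placeholder name).
cat > sample.xml <<'EOF'
<documents>
<doc id="852" url="http://en.wikipedia.org/wiki?curid=852" title="...">
<text>
 Some text...
</text>
</doc>
</documents>
EOF

# Only on lines that start with "<doc ", delete every url="..." or
# title="..." attribute together with the space in front of it.
# [^"]* stops at the closing quote, so other attributes are left alone.
sed -E '/^<doc /s/ (url|title)="[^"]*"//g' sample.xml > cleaned.xml

cat cleaned.xml
```

This is still a text substitution, not XML parsing, so it carries the same caveat: it is only safe when the layout of the file is predictable.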
