I'm working with a WordPress XML dump, and for whatever reason WordPress has exported every user in our database as an "author" of each post. To make the XML file easier to work with, I would like to remove all of the author nodes except for one.

Here's an example of what I have:

<rss version="2.0" xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="http://wordpress.org/export/1.2/">
<wp:author>
    <wp:author_id>35622</wp:author_id>
    <wp:author_login>some_username_1</wp:author_login>
    <wp:author_email>[email protected]</wp:author_email>
    <wp:author_display_name><![CDATA[some_username_1]]></wp:author_display_name>
    <wp:author_first_name><![CDATA[]]></wp:author_first_name>
    <wp:author_last_name><![CDATA[]]></wp:author_last_name>
</wp:author>
<wp:author>
    <wp:author_id>35290</wp:author_id>
    <wp:author_login>my_unique_username</wp:author_login>
    <wp:author_email>[email protected]</wp:author_email>
    <wp:author_display_name><![CDATA[my_unique_username]]></wp:author_display_name>
    <wp:author_first_name><![CDATA[]]></wp:author_first_name>
    <wp:author_last_name><![CDATA[]]></wp:author_last_name>
</wp:author>
<wp:author>
    <wp:author_id>35289</wp:author_id>
    <wp:author_login>some_username_2</wp:author_login>
    <wp:author_email>[email protected]</wp:author_email>
    <wp:author_display_name><![CDATA[some_username_2]]></wp:author_display_name>
    <wp:author_first_name><![CDATA[]]></wp:author_first_name>
    <wp:author_last_name><![CDATA[]]></wp:author_last_name>
</wp:author>
<wp:author>
    <wp:author_id>33404</wp:author_id>
    <wp:author_login>some_username_3</wp:author_login>
    <wp:author_email>[email protected]</wp:author_email>
    <wp:author_display_name><![CDATA[some_username_3]]></wp:author_display_name>
    <wp:author_first_name><![CDATA[]]></wp:author_first_name>
    <wp:author_last_name><![CDATA[]]></wp:author_last_name>
</wp:author>

This pattern repeats for a few thousand more entries.

I would like to remove all of the nodes except for this one:

<wp:author>
    <wp:author_id>35290</wp:author_id>
    <wp:author_login>my_unique_username</wp:author_login>
    <wp:author_email>[email protected]</wp:author_email>
    <wp:author_display_name><![CDATA[my_unique_username]]></wp:author_display_name>
    <wp:author_first_name><![CDATA[]]></wp:author_first_name>
    <wp:author_last_name><![CDATA[]]></wp:author_last_name>
</wp:author>

I'm attempting to do this in a shell script, but I've never used xmlstarlet before and I'm not really sure where to start, so I would appreciate any help.

Update: the sample above now shows the correct root element, and this is the solution I found:

xmlstarlet ed -d "//wp:author[wp:author_id != '35290']" file.xml > out.xml

2 Answers

The solution I found is as follows:

xmlstarlet ed -d "//wp:author[wp:author_id != '35290']" file.xml > out.xml
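
As a quick sanity check, the sketch below runs the same deletion on a small stand-in file and confirms that only one author survives. The file name `sample.xml`, its trimmed-down contents, and the explicit `-N` namespace binding are assumptions for illustration; if your xmlstarlet build picks up the root element's namespaces on its own, `-N` should be redundant but harmless.

```shell
# Hypothetical miniature of the export, just enough structure to test against.
cat > sample.xml <<'EOF'
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
<channel>
<wp:author><wp:author_id>35622</wp:author_id><wp:author_login>some_username_1</wp:author_login></wp:author>
<wp:author><wp:author_id>35290</wp:author_id><wp:author_login>my_unique_username</wp:author_login></wp:author>
<wp:author><wp:author_id>35289</wp:author_id><wp:author_login>some_username_2</wp:author_login></wp:author>
</channel>
</rss>
EOF

# Delete every wp:author whose id differs from 35290.
# -N binds the wp prefix explicitly so the XPath does not depend on auto-detection.
xmlstarlet ed -N wp="http://wordpress.org/export/1.2/" \
  -d "//wp:author[wp:author_id != '35290']" sample.xml > out.xml

# Count the authors left in the output and read back the surviving login.
remaining=$(xmlstarlet sel -N wp="http://wordpress.org/export/1.2/" \
  -t -v "count(//wp:author)" out.xml)
login=$(xmlstarlet sel -N wp="http://wordpress.org/export/1.2/" \
  -t -v "//wp:author/wp:author_login" out.xml)
echo "authors left: $remaining, login: $login"
```

One caveat on the predicate: `wp:author_id != '35290'` only matches authors that actually have a `wp:author_id` child, so an author node with no id at all would be kept. If that matters for your data, `not(wp:author_id = '35290')` deletes those as well.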

Taking just a snippet out of an XML file doesn't really give us enough to provide a complete answer. I wrapped this sample data in a root tag:

<root xmlns:wp="some.url">
...
</root>

Then you can provide an XPath expression to find the node you're looking for: all "wp:author" nodes that contain a "wp:author_id" child with the specific value.

$ xmlstarlet sel -t -c '//wp:author[wp:author_id = "35289"]' file.xml
<wp:author xmlns:wp="some.url">
    <wp:author_id>35289</wp:author_id>
    <wp:author_login>some_username_2</wp:author_login>
    <wp:author_email>[email protected]</wp:author_email>
    <wp:author_display_name>some_username_2</wp:author_display_name>
    <wp:author_first_name></wp:author_first_name>
    <wp:author_last_name></wp:author_last_name>
</wp:author>

I've found this page of XPath examples helpful
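
The same kind of predicate can also be inverted to preview which authors a deletion would remove before touching the file. This is only a sketch: `preview.xml` is a hypothetical trimmed copy of the wrapped sample above, and the `-N wp="some.url"` binding matches the placeholder namespace used there, not the real export.

```shell
# Hypothetical trimmed copy of the wrapped sample data.
cat > preview.xml <<'EOF'
<root xmlns:wp="some.url">
<wp:author><wp:author_id>35622</wp:author_id><wp:author_login>some_username_1</wp:author_login></wp:author>
<wp:author><wp:author_id>35290</wp:author_id><wp:author_login>my_unique_username</wp:author_login></wp:author>
<wp:author><wp:author_id>35289</wp:author_id><wp:author_login>some_username_2</wp:author_login></wp:author>
</root>
EOF

# -m iterates over each author whose id is NOT 35290,
# -v prints that author's login, -n ends each entry with a newline.
doomed=$(xmlstarlet sel -N wp="some.url" -t \
  -m "//wp:author[wp:author_id != '35290']" \
  -v "wp:author_login" -n preview.xml)
echo "$doomed"
```

Nothing is modified here; `sel` only reads the file, so the list can be checked by eye first.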

3 Comments

My apologies, I'm only adept enough at XML to work with it in PHP. The correct data root is: <rss version="2.0" xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="http://wordpress.org/export/1.2/">
I did eventually find what I was looking for, which is very similar to the example you gave: xmlstarlet ed -d "/rss/channel/wp:author[wp:author_id != '35289']" file.xml > file2.xml
Good one: that will keep all the enclosing XML tags which my answer does not. I recommend you provide an answer to your own question and then accept it. That will guide future readers to the correct solution.
