Delete tag from xml using bash tools

Question

I have an application that creates logs in the format

2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>111</e></d><d><e>123</e></d><d><e>222</e></d><d><e>333</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4

Using bash tools, I would like to remove all tags that do not have the descendant <e>123</e>, so that the log ends up in this form:

2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>123</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4

I tried to do this using awk and sed, but I failed. Please help me write a script, or point me to other tools that can do this.

Info (moved from comment)

At the moment, this is the best solution I have found:

echo '2014-09-01 12: 01: 01.899;some app logs 2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>111</e></d><d><e>123</e></d><d><e>222</e></d><d><e>333</e></d></c></b></a>;some app logs3 2014-09-01 12: 01: 03,625;some app logs4' | awk '{print "<d" $0}' RS="<d" | sed -n '1 s/^<d// ; /^<d/ ! p; /^<d.*>123</ p'

Regards

Krzysiek

What have you tried? Where are you stuck at? Please share your code. Also, wouldn't it be better to change the way your application provides the logs?
– fedorqui, Commented Sep 4, 2014 at 9:17
I tried to use sed multiline pattern to remove unnecessary portions of XML: cat test.log | awk '{print "<" $0 }' RS="<" | awk '{print $0 ">"}' RS=">" | sed '/^\s*$/d' | sed '/<[^\/][^>]*>/ {x; s/.*//; x}; {H; g;}; /<[^>/]*>123<\/[^>]*>/ ! d; /<\/[^>]*$/ {p; x; s/.*//; x;}'
– kawu, Commented Sep 4, 2014 at 11:34
Edit your question with this information. It is not practical to write code in comments.
– fedorqui, Commented Sep 5, 2014 at 9:08
Any time you find yourself using more than s, g, and p (with -n) in sed you have the wrong approach. All of the sed commands to operate on multi-line input became obsolete in the mid-1970s when awk was invented.
– Ed Morton, Commented Sep 19, 2014 at 13:47
Define bash tools: Perl? Python? xpath? sed and awk only? gawk?
– dawg, Commented Sep 19, 2014 at 17:14

Ed Morton · Accepted Answer · 2014-09-19 13:51:40Z

3

+100

Try this:

$ awk -v t="<d><e>123</e></d>" '{gsub(t,RS); gsub("<d><e>[^<]+</e></d>",""); gsub(RS,t)}1' file
2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>123</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4

The above simply takes each line of your input file and replaces all occurrences of your target string <d><e>123</e></d> with a newline (which obviously cannot be present within the original line), then removes every other string that matches <d><e>[^<]+</e></d>, then replaces all newlines with the target string (i.e. restores the newlines we added earlier to their original values).

If that's not what you want, edit your question to clarify your requirements and provide a more representative example.

edited Sep 19, 2014 at 13:51

answered Sep 19, 2014 at 13:08

Ed Morton


4 Comments

Makyen Over a year ago

This solution does not remove empty <a><b><c></c></b></a> tags. Input:

2014-09-01 12: 01: 04,045;some app logs5;<a><b><c><d><e>111</e></d><d><e>222</e></d><d><e>333</e></d></c></b></a>;some app logs6

Output:

2014-09-01 12: 01: 04,045;some app logs5;<a><b><c></c></b></a>;some app logs6

I'm not sure that is a requirement, but the question did say "remove all tags that do not have descendant <e>123</e>"

Ed Morton Over a year ago

You can consider all sorts of potential input, e.g. <a></a> but all we can do is hope the OP's posted sample input truly represents their use cases and rely on the OP telling us if they find they forgot to show us something.

Makyen Over a year ago

Hmmm... Good point. In this case, though, the question (but not the test input) did cover the case of "all tags that do not have descendant <e>123</e>", and the input implied that having no <e>123</e> within the tags at all was a reasonable possibility. Might I suggest updating your answer to cover cases of all tags that don't contain <e>123</e>?

Ed Morton Over a year ago

I expect just changing the + to a * is all that's required, if the OP posts that they need it then I'll code and test it.
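The case raised in this comment thread (wrappers left empty once every <d> block is gone) can be handled with a loop after the three gsub calls. This is only a sketch, not the answerer's tested code: it assumes single-letter lowercase tag names without attributes, and the function name strip_tags is made up for illustration.

```shell
#!/bin/sh
# Sketch: keep <d><e>123</e></d>, drop the other <d> blocks, then peel off
# tag pairs that were left empty.
# Assumptions: single-letter lowercase tag names, no attributes.
strip_tags() {
    awk -v t='<d><e>123</e></d>' '{
        gsub(t, RS)                        # park the block to keep as a newline
        gsub(/<d><e>[^<]*<\/e><\/d>/, "")  # drop every other <d><e>..</e></d>
        gsub(RS, t)                        # restore the parked block
        # gsub returns the number of replacements, so this repeats until
        # no empty pair is left (innermost pairs go first)
        while (gsub(/<[a-z]><\/[a-z]>/, "") > 0) { }
        print
    }'
}

printf '%s\n' 'x;<a><b><c><d><e>111</e></d><d><e>222</e></d></c></b></a>;y' | strip_tags
```

Note that the pattern does not check that the opening and closing names agree, and that the doubled semicolon left behind (x;;y here) is not cleaned up.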

Avinash Raj · Accepted Answer · 2014-09-19 16:29:40Z

You could simply do this through perl,

$ perl -pe 's/<e>(?:(?!\b123\b).)*?<\/e>//g; s/<([^><]*)><\/\1>//g' file
2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>123</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4

Explanation:

<e>(?:(?!\b123\b).)*?<\/e> matches every <e> element whose content does not contain 123 as a whole word, so <e>123</e> is left alone. The first substitution removes all the matched <e> elements.
The second substitution, <([^><]*)><\/\1>, removes every tag that is closed immediately (i.e. an opening tag directly followed by its own closing tag).

OR

This would remove all the <e> tags which don't contain the exact string 123.

perl -pe 's/(?:(?!<e>123<\/e>)<e>.*?<\/e>)//g; s/<([^><]*)><\/\1>//g' file

Example:

$ cat file
2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>111</e></d><d><e>123</e><e>1234</e><e>123:4</e></d><d><e>222</e></d><d><e>333</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4

$ perl -pe 's/(?:(?!<e>123<\/e>)<e>.*?<\/e>)//g; s/<([^><]*)><\/\1>//g' file
2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>123</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4

Makyen · Accepted Answer · 2014-09-19 23:32:10Z

1

The following assumes the input text is in a file called 'test.log' and that you wanted a solution in the form of something you are piping the input into and out (i.e. cat 'test.log' is used instead of specifying it as the input).

Using a placeholder value:

When you want a regular expression to act on everything that closely resembles a pattern you want to keep, it is often easier to first change the text you want left alone into a placeholder value that is easily distinguished from the patterns you do want to act upon:

cat test.log | sed -e "s/Q/Qz/g" -e "s/<e>123<\/e>/Qa/g" -e "s/<e>[^<]*<\/e>//g" -e "s/Qa/<e>123<\/e>/g" -e "s/Qz/Q/g" -e "s/<[^e]>[^<]*<\/[^e]>;\?//g" -e "s///g" -e "s///g" -e "s///g" -e "s///g" -e "s///g"

The trick is realizing that the data you are manipulating does not have to keep the form it was in throughout the intermediate forms you are manipulating. It is only the output that matters. Thus, the transformations of the data are:

Input (Added a line where there is no <e>123</e> at all in the tags. It is a case that we probably need to handle):

2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>111</e></d><d><e>123</e></d><d><e>222</e></d><d><e>333</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4
2014-09-01 12: 01: 04,045;some app logs5;<a><b><c><d><e>111</e></d><d><e>222</e></d><d><e>333</e></d></c></b></a>;some app logs6

Intermediary form 1 (just exists line by line within sed): Same as input because no "Q" existed in test data.

Intermediary form 2 (within sed): change text we want to keep to placeholder:

2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>111</e></d><d>Qa</d><d><e>222</e></d><d><e>333</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4
2014-09-01 12: 01: 04,045;some app logs5;<a><b><c><d><e>111</e></d><d><e>222</e></d><d><e>333</e></d></c></b></a>;some app logs6

Intermediary form 3 (remove <e></e> tags which don't contain 123):

2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d></d><d>Qa</d><d></d><d></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4
2014-09-01 12: 01: 04,045;some app logs5;<a><b><c><d></d><d></d><d></d></c></b></a>;some app logs6

Intermediary form 4 (substitute <e>123</e> back from placeholder):

2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d></d><d><e>123</e></d><d></d><d></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4
2014-09-01 12: 01: 04,045;some app logs5;<a><b><c><d></d><d></d><d></d></c></b></a>;some app logs6

Intermediary form 5 (revert the clearing of the placeholder): same as form 4, as there is no "Q".

Output (after substitutions to remove empty tags):

2014-09-01 12: 01: 01.899;some app logs
2014-09-01 12: 01: 02,045;some app logs2;<a><b><c><d><e>123</e></d></c></b></a>;some app logs3
2014-09-01 12: 01: 03,625;some app logs4
2014-09-01 12: 01: 04,045;some app logs5;some app logs6

It was assumed that we should not leave a "some app logs5;;some app logs6" but "some app logs5;some app logs6". If that is not the case, the regular expression can be adjusted.
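In the sed command above, each trailing "s///g" has an empty regular expression, which sed treats as "reuse the last regular expression used", so the empty-tag removal runs six times in total. A sketch of the same pipeline with a loop instead of a fixed number of repeats follows. It assumes a sed that supports -E (GNU and BSD sed do), and the function name strip_with_placeholder and the label name again are made up for illustration.

```shell
#!/bin/sh
# Sketch: placeholder technique with a loop for the empty-tag removal.
strip_with_placeholder() {
    sed -E \
        -e 's/Q/Qz/g' \
        -e 's/<e>123<\/e>/Qa/g' \
        -e 's/<e>[^<]*<\/e>//g' \
        -e 's/Qa/<e>123<\/e>/g' \
        -e 's/Qz/Q/g' \
        -e ':again' \
        -e 's/<[^e]>[^<]*<\/[^e]>;?//g' \
        -e 't again'
}
# Steps, in order: clear existing uses of Q, park <e>123</e> as Qa, drop every
# other <e>..</e>, restore <e>123</e>, revert the clearing, then repeat the
# empty-tag removal ("t" branches back while a substitution succeeded).

printf '%s\n' 'some app logs5;<a><b><c><d><e>111</e></d><d><e>222</e></d></c></b></a>;some app logs6' | strip_with_placeholder
```

The loop stops on its own once a pass makes no substitution, so nesting deeper than six levels needs no extra expressions.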

Issues when using a placeholder

If your placeholder is not unique, then changing back from the placeholder corrupts the data. To have a unique placeholder in unknown input data, you have to spend one substitution clearing out any existing uses of the placeholder and another reverting that clearing. To do this you can use a substitution such as: sed -e "s/Q/Qz/g"

After it, the only two-letter combination starting with Q that can occur in the text is "Qz". You then have a large number of potential unique two-letter placeholders (e.g. "Qa", "Qb", "Qc", "QA", etc.). After you are done using them, you can change back to your text by reversing the substitution: sed -e "s/Qz/Q/g"

With multiple unique placeholders available, it is possible to use them to represent multiple other strings. With this method, every operation that matches text while the placeholders are in use must take into account that the initial clearing has been performed.

In some instances, if you know the characteristics of your input data, you can choose a placeholder which will never occur in that data. This saves the CPU cost of the two substitution operations and the additional memory that clearing out the two-character placeholder can cost. However, with log files one of the things that you are looking for is corruption, so using a short placeholder that you are only assuming is not in the data is a bad idea.

If you do not know exactly which characters your input contains, but you do know some of its characteristics, you can choose to save those two substitutions by using a placeholder which is very unlikely to exist in your input but is not guaranteed to be unique. This does introduce some risk. In that case, the more complex the placeholder string, and the less it resembles possible input, the lower the risk that the placeholder already exists in your input.

For this example, the text "lOnG3Rep5LacEN2eV7E9rE4xIST" is very unlikely to exist in the input log file even if it was corrupted.

The following assumes the input text is in a file called 'test.log' for convenience. Also, it assumes that "lOnG3Rep5LacEN2eV7E9rE4xIST" will not exist in the input. What is actually used for the intermediary string can, of course, be anything you want which will be unique:

cat test.log | sed -e "s/<e>123<\/e>/lOnG3Rep5LacEN2eV7E9rE4xIST/g" -e "s/<e>[^<]*<\/e>//g" -e "s/lOnG3Rep5LacEN2eV7E9rE4xIST/<e>123<\/e>/g" -e "s/<[^e]>[^<]*<\/[^e]>;\?//g" -e "s///g" -e "s///g" -e "s///g" -e "s///g" -e "s///g"

Choosing to use a placeholder that you have not guaranteed does not exist in the input data is a risk. You should not do so unless you understand the risk and have chosen to accept it. It is much more reasonable to accept such risk when the output is going to be immediately reviewed by a human who would catch any such problems.

Thanks go to Ed Morton who reminded me that I had gotten into the habit of accepting that risk without enough consideration.

Using a regular expression to define what something is not:

Character by character:

Because the pattern "123" is quite simple and exact, it is relatively easy to define a regular expression that matches everything except that string. Note that this becomes much more complex with a more complex pattern that you are attempting to exclude from matching:

cat test.log | sed -e "s/<e>\(\|[^1<][^<]*\|1[^2<][^<]*\|12[^3<][^<]*\)<\/e>//g" -e "s/<[^e]>[^<]*<\/[^e]>;\?//g" -e "s///g" -e "s///g" -e "s///g" -e "s///g" -e "s///g"

This builds up a regular expression from sub-patterns, each one character longer than the last, that together match everything which is not the pattern you want to keep.
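As written, the alternation above does not match the contents 1 or 12 on their own, nor longer values that begin with 123 (such as 1234), so those elements would be kept. A sketch that adds the missing alternatives is shown below. It assumes a sed that supports -E, makes the whole group optional to cover empty content, and uses the made-up function name drop_not_123.

```shell
#!/bin/sh
# Sketch: remove every <e>..</e> whose content is not exactly 123.
# Alternatives, in order: "1", "12", anything not starting with 1,
# 1 + non-2, 12 + non-3, and 123 followed by at least one more character.
# The trailing ? lets the group be absent, which covers <e></e>.
drop_not_123() {
    sed -E 's/<e>(1|12|[^1<][^<]*|1[^2<][^<]*|12[^3<][^<]*|123[^<]+)?<\/e>//g'
}

printf '%s\n' '<e></e><e>1</e><e>12</e><e>123</e><e>1234</e><e>13</e><e>124</e><e>999</e>' | drop_not_123
```

The number of alternatives grows with the length of the string being excluded, which is why this approach becomes unwieldy for anything longer than a few characters.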

Negative look ahead/look behind:

Many implementations of regular expression syntax provide a negative look-ahead or look-behind operator. These can be used to generate more complex matches of "not this string".

edited Sep 19, 2014 at 23:32

answered Sep 19, 2014 at 18:19

Makyen♦


8 Comments

Ed Morton Over a year ago

The way to temporarily replace foo with a string that can't exist in the input using sed is: sed -e 's/a/aA/g' -e 's/foo/aB/g' file | do stuff | sed -e 's/aB/foo/g' -e 's/aA/a/g'. Just replace a with a character that doesn't exist in string foo if necessary.

Makyen Over a year ago

The complexity of the placeholder string depends on what you know about the possible contents of the input text. I was intending to illustrate a concept, not provide a minimum solution. Your suggestion assumes a known single character that can not exist in the input, or that the short string "aA" does not exist. I was illustrating using a longer string, which is much less likely to exist in the input. The more you know about what can not exist in the input, the simpler your intermediary string can be. If a single character is known not to exist then your placeholder can be that character.

Ed Morton Over a year ago

No, none of that is true. aA can exist just fine in the input, the character a can exist in the input, etc. - none of that matters, the solution will work as-is no matter what the input file contains. Try it and think about it a bit.

Makyen Over a year ago

Yes, you are correct. I responded too fast. aA becomes aAA which is then changed back to aA.

Ed Morton Over a year ago

Right. The first substitution just ensures that afterwards there CANNOT BE an occurrence of aB (or aC or a% or anything else) in the input file afterwards because every aX (where X is any or no character) becomes aAX and it likewise converts aA to aAA. So, after that you can replace any string(s) you want to with aB, a:, or any other 2-char string starting with a except aA since you just guaranteed they are not otherwise present in the input. Then after doing whatever you want with the result, you just unwind the original transformations by doing them in reverse order.
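The round trip described in these comments can be checked with a short sketch. The function names protect and restore are made up for illustration, and foo stands for whatever string is being parked.

```shell
#!/bin/sh
# Sketch: escape the lead character first so that "aB" cannot already exist,
# park foo as aB, then undo both steps in reverse order.
protect() { sed -e 's/a/aA/g' -e 's/foo/aB/g'; }
restore() { sed -e 's/aB/foo/g' -e 's/aA/a/g'; }

# The input deliberately contains both aB and aA.
printf '%s\n' 'aB foo aA bar' | protect            # intermediate form
printf '%s\n' 'aB foo aA bar' | protect | restore  # back to the original
```

In the intermediate form the original aB appears as aAB and the parked foo as aB, which is why the two cannot be confused when restoring.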


Kokkie · Accepted Answer · 2014-09-22 07:03:40Z

1

Maybe this awk example can lead you in the right direction:

$ awk -F';' '{gsub("<d><e>[^0-9]*</e></d>", "", $3)} {print}' some.log
2014-09-01 12: 01: 01.899; And, some app logs
2014-09-01 12: 01: 02,045  And, some app logs2 <a><b><c><d><e>123456789</e></d></c></b></a> some app logs3
2014-09-01 12: 01: 03,625; And, some app logs4

Explanation
-F';' field separator is a semicolon
gsub("<d><e>[^0-9]*</e></d>", "", $3) do a global substitution if the data in column 3 between the tag <e> is not a number

edited Sep 22, 2014 at 7:03

answered Sep 4, 2014 at 10:40

Kokkie


2 Comments

kawu Over a year ago

Thanks to your response, I see that I need to improve the input data to better present my problem. All <e> tags have numeric data and I'm only interested in one particular <e>123</e>.

Ed Morton Over a year ago

The name for the symbol ; is semi-colon, not dot-comma.
