xml parsing with simple shell scripting

Question

Can someone please help me get XML data into a shell script?

Here is my requirement.

I need to print the CHILD value, along with the CHILD's attribute value and its parent, if the CHILD value is greater than 100.

Here is my data:

<mydata>
    <parent detail="school1">
        <CHILD attribute="0">0</CHILD>
        <CHILD attribute="1">1932</CHILD>
        <CHILD attribute="2">0</CHILD>
        <CHILD attribute="3">500</CHILD>
        <CHILD attribute="4">0</CHILD>
        <CHILD attribute="5">0</CHILD>
        <CHILD attribute="6">7819</CHILD>
        <CHILD attribute="7">0</CHILD>
        <CHILD attribute="8">299</CHILD>
        <CHILD attribute="9">0</CHILD>
    </parent>
    <parent detail="school2">
        <CHILD attribute="0">1</CHILD>
        <CHILD attribute="1">7000</CHILD>
        <CHILD attribute="2">0</CHILD>
        <CHILD attribute="3">0</CHILD>
        <CHILD attribute="4">600</CHILD>
        <CHILD attribute="5">0</CHILD>
        <CHILD attribute="6">11674</CHILD>
        <CHILD attribute="7">0</CHILD>
        <CHILD attribute="8">489</CHILD>
        <CHILD attribute="9">0</CHILD>
    </parent>
</mydata>

My external file, childvalue_limits.txt, contains values like this:

attribute0=100
attribute1=60
attribute3=80
attribute4=90
attribute5=100
attribute6=90
attribute7=50
attribute8=80
attribute9=70

I need to pass this file as an argument to the script and use these values dynamically in the condition.

current code

sed 's|><|>\n<|g' $WORKING_PATH/xml_detail.log | awk -F'"|<|>' '/parent detail/{p=$3} /CHILD attribute/{att=$3;val=$5;if(val>100)print  "child value on " p, "attribute "att,"is at value: "val ,"\n"}'

current output

child value on school2 attribute 1 is at value 1000
child value on school2 attribute 4 is at value 600
.....
.....

required output should be like this

child value on school2 attribute 1 is at value 1000 and threshold is 60
child value on school2 attribute 4 is at value 600 and threshold is 90
.....
.....

Please note: the threshold is a dynamic value passed to the if condition through a separate file called childvalue_limits.txt.

Your question is ambiguous. Do you mean you need the child value, attribute and parent for all children whose value exceeds 100? Or do you mean you need the child value and attribute, and, where the child value exceeds 100, the parent as well? What have you tried so far?
– Mark Setchell, Commented Jul 18, 2014 at 14:45
Show an example output along with what you have done so far.
– konsolebox, Commented Jul 18, 2014 at 17:13
If you want to parse XML, use an XML parser (which of course can be run within a shell script). Using awk or any other regular-expression-based program will use a regular grammar, whereas XML is context-free and can therefore by definition not be correctly parsed by regex.
– dirkk, Commented Jul 21, 2014 at 12:50

dirkk · Accepted Answer · 2014-07-21 14:18:59Z

1

You cannot (correctly) parse XML using regular expressions. XML is a context-free language, which is more expressive than anything a grammar based on regular expressions can describe. See the Chomsky hierarchy for details. That is also the reason why you run into trouble with newlines when using regular expressions.
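To make the layout dependence concrete, here is a minimal sketch that feeds the awk one-liner from this thread two renderings of the same small document: one element per line, and everything on a single line. The function name over_100 and the shortened sample data are assumptions made for illustration only.

```shell
#!/bin/sh
# The line-oriented one-liner from this thread, wrapped in a function.
# (The function name is hypothetical.)
over_100() {
    awk -F'"|<|>' '/parent detail/{p=$3} /CHILD attribute/{att=$3;val=$5;if(val>100)print p,att,val}'
}

# Layout 1: one element per line, as in the question.
pretty=$(printf '%s\n' \
    '<mydata>' \
    '    <parent detail="school1">' \
    '        <CHILD attribute="1">1932</CHILD>' \
    '        <CHILD attribute="3">500</CHILD>' \
    '    </parent>' \
    '</mydata>' | over_100)

# Layout 2: the same document with the whitespace between tags removed.
compact=$(printf '%s\n' \
    '<mydata><parent detail="school1"><CHILD attribute="1">1932</CHILD><CHILD attribute="3">500</CHILD></parent></mydata>' | over_100)

printf 'one element per line:\n%s\n' "$pretty"
printf 'single line:\n%s\n' "$compact"
```

Both inputs are the same XML as far as a parser is concerned, yet only the first layout puts the attribute in field 3 and the value in field 5, so the second run no longer reports the two children correctly.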

Hence, it is better (and easier and more stable) to use a proper XML parser. As I am most familiar with BaseX (full disclosure: I am also associated with the project), I will use it.

When using the ZIP version, you can simply run the file bin/basex. The following XPath 3.0 expression should give you the correct output by simply concatenating the different values:

for $c in /mydata/parent/CHILD[. > 100] return $c/parent::parent/@detail || " " || $c/@attribute || " " || $c/data() || "&#10;"

Assuming your XML file is named mydata.xml, you can execute this XPath expression by issuing the following command (i.e. this can be done in your shell script):

basex -i mydata.xml -q 'for $c in /mydata/parent/CHILD[. > 100] return $c/parent::parent/@detail || " " || $c/@attribute || " " || $c/data() || "&#10;"'

answered Jul 21, 2014 at 14:18

dirkk

3 Comments

dirkk Over a year ago

@MarkSetchell You are very welcome - Welcome to the magical and mysterious journey that is XPath/XQuery processing ;-) Yes, we do have a simple Homebrew install; for all Debian (or Debian-derived) users it should also be in the central repository (although slightly outdated, if I remember correctly).

Mark Setchell Over a year ago

+1 Excellent - that works nicely, thank you. For any OSX Mac users out there, I installed basex very simply with brew install basex.

Mark Setchell Over a year ago

Fixed a typo... and re-commented.

Mark Setchell · Accepted Answer · 2014-08-05 08:37:35Z

0

EDITED AGAIN

OK, I have changed the code to read a file of input limits. It looks complicated but it is not: you can remove all the lines that have the word "DEBUG" in them if you want to. The # is the start of a comment.

#!/bin/bash

awk -F'"|<|>' '
   FNR==NR           {
                       split($0,f,"=");  # Split line on "=" sign into array f[]
                       gsub(/[[:alpha:]]/,"",f[1]); # Strip the letters, leaving only the attribute number
                       limits[f[1]]=f[2]; # Save for comparison later
                       print "DEBUG: limits[",f[1],"]=",f[2];
                       next
                     }
   /parent detail/   {
                       p=$3
                       print "DEBUG: parent detail=",p;
                     }
   /CHILD attribute/ {
                       att=$3;val=$5;
                       print "DEBUG: att=",att,",val=",val; 
                       if(val>limits[att])print p,att,val,limits[att]
                     }
   ' limits.txt xml

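At the end of the script you can see that awk reads BOTH of your files: limits.txt first, then xml. The block in curly braces that starts with FNR==NR runs only while awk is reading the first file, limits.txt, because that is the only time the per-file record number (FNR) equals the overall record number (NR). The next at the end of that block stops those lines from reaching the XML rules.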
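If the FNR==NR test is unfamiliar, the following small sketch shows it in isolation. The two throwaway files and their contents are made up for the demonstration and have nothing to do with your data.

```shell
#!/bin/sh
# Minimal demonstration of the FNR==NR idiom with two throwaway files.
tmpdir=$(mktemp -d)
printf 'a=1\nb=2\n' > "$tmpdir/first.txt"
printf 'x\ny\nz\n'  > "$tmpdir/second.txt"

# NR counts records across all files; FNR restarts at 1 for each file.
# They are therefore equal only while the first file is being read.
idiom_out=$(awk '
    FNR == NR { print "first file:  " $0; next }   # first file only
              { print "second file: " $0 }         # every later file
' "$tmpdir/first.txt" "$tmpdir/second.txt")

printf '%s\n' "$idiom_out"
rm -rf "$tmpdir"
```

One caveat: if the first file is empty, FNR==NR is also true while the second file is read, so the idiom relies on the limits file containing at least one line.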

If you want to see the output without DEBUG messages, just run

./script | grep -v DEBUG
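For reference, here is a sketch of the same approach without the DEBUG lines, taking both file names as arguments and printing the sentence format asked for in the question. The function name check_limits, the temporary file names and the shortened sample data are assumptions. It also differs from the script above in two ways: it strips the literal prefix "attribute" with sub() instead of using a [[:alpha:]] class, and it skips attributes that have no entry in the limits file rather than comparing them against an empty value.

```shell
#!/bin/sh
# Sketch: report CHILD values that exceed a per-attribute threshold.
# check_limits is a hypothetical name; $1 = limits file, $2 = XML file.
check_limits() {
    awk -F'"|<|>' '
        FNR == NR {                        # first file: the limits
            split($0, f, "=")
            sub(/^attribute/, "", f[1])    # "attribute1" -> "1"
            limits[f[1]] = f[2]
            next
        }
        /parent detail/   { p = $3 }       # remember the current parent
        /CHILD attribute/ {
            att = $3; val = $5
            # Ignore attributes with no configured limit; compare as numbers.
            if ((att in limits) && val + 0 > limits[att] + 0)
                printf "child value on %s attribute %s is at value %s and threshold is %s\n",
                       p, att, val, limits[att]
        }
    ' "$1" "$2"
}

# Small self-contained run with made-up sample files.
tmpdir=$(mktemp -d)
printf 'attribute1=60\nattribute4=90\n' > "$tmpdir/childvalue_limits.txt"
cat > "$tmpdir/mydata.xml" <<'EOF'
<mydata>
    <parent detail="school2">
        <CHILD attribute="0">1</CHILD>
        <CHILD attribute="1">7000</CHILD>
        <CHILD attribute="2">0</CHILD>
        <CHILD attribute="4">600</CHILD>
    </parent>
</mydata>
EOF
report=$(check_limits "$tmpdir/childvalue_limits.txt" "$tmpdir/mydata.xml")
printf '%s\n' "$report"
rm -rf "$tmpdir"
```

In a real script you would replace the sample-file section with something like check_limits "$1" "$2", so that the limits file is passed on the command line. It still relies on each element sitting on its own line, exactly like the original.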

EDITED

Your code works fine for me with your revised data. Here is my output:

node2 1 1932
node2 6 7819
node1 1 1924
node1 6 11674

I assume you mean you want to avoid XML parsers and just use standard tools like awk and sed to achieve this, so I'll go with awk:

awk -F'"|<|>' '/parent detail/{p=$3} /CHILD attribute/{att=$3;val=$5;if(val>100)print p,att,val}' xml

Output:

school1 1 1932
school1 3 500
school1 6 7819
school1 8 299
school2 1 7000
school2 4 600
school2 6 11674
school2 8 489

So, it sets the separator to any of ", < or >. Then, when it sees lines with the words "parent detail" it saves the value in p. When it sees lines with the words CHILD attribute it extracts the attribute and value. If the value is over 100, it prints the parent, attribute and value.
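If you want to see the splitting for yourself, this small sketch prints every field awk produces for one CHILD line with that separator. The sample line is copied from the question; the variable name is arbitrary.

```shell
#!/bin/sh
# Show how awk splits one CHILD line when the separator is ", < or >.
fields=$(printf '%s\n' '        <CHILD attribute="1">1932</CHILD>' |
    awk -F'"|<|>' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

Field 3 holds the attribute (1) and field 5 holds the value (1932). Field 4 is empty because the closing quote and the > sit next to each other, which is why the script skips from $3 to $5.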

It assumes your XML is in a file called xml.

edited Aug 5, 2014 at 8:37

answered Jul 18, 2014 at 15:12

Mark Setchell

23 Comments

Praveen Over a year ago

Thanks Mark, but the above code is not working after changing the values for my XML data: awk -F'"|<|>' '/ALLQUEUEDEPTHS server/{p=$3} /QUEUE_DEPTH queue/{att=$3;val=$5;if(val>100)print p,att,val}' ./myfile.xml

Mark Setchell Over a year ago

Can you click edit underneath your question and paste in an XML file that my code doesn't work for please?

Praveen Over a year ago

It doesn't allow me to paste the XML file. Is there any way to send the XML data to you?

Mark Setchell Over a year ago

Put it in the same way as you put the original data in.

Mark Setchell Over a year ago

I have updated my answer - are you using GNU awk, or can you try using it - installed as gawk maybe?
