
I have written a script that takes an XML file as input and extracts the text of specific XML tags, and it works. However, it is not smart enough to handle multiline text or special characters. It is very important that the text keeps exactly the formatting it has inside the tags.

Below is the XML input:

<nick>Deminem</nick>
<company>XYZ Solutions</company>
<description>
  /**
   * 
   *  «Lorem» ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
   *  tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. 
   *  At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd 
   *  no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit 
   *  consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore
   *  magna aliquyam erat, sed diam voluptua.
   *
   **/
</description> 

The script below extracts the text of each listed tag and assigns it to valueArray. My command of sed is basic, but I am always willing to go the extra mile.

tagsArray=( nick company description )
noOfElements=${#tagsArray[@]}

for (( i = 0; i < noOfElements; i++ )); do
    # Find the lines mentioning the tag, drop tabs, then strip the surrounding tags
    OUT=$(grep "${tagsArray[i]}" filename.xml | tr -d '\t' | sed -e 's/^<.*>\([^<].*\)<.*>$/\1/')
    valueArray[i]=${OUT}
done

2 Answers

3

Parsing XML with regular expressions eventually leads to trouble, as you have just experienced. Take the time to learn enough XSL (there are many tutorials) to transform the XML properly, using for example xsltproc.

Edit:

After trying out a few command-line XML utilities, I think xmlstarlet could be the tool for you. The following is untested and assumes that filename.xml is a proper XML file (i.e. has a single root element).

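tagsArray=( nick company description )
noOfElements=${#tagsArray[@]}

for (( i = 0; i < noOfElements; i++ )); do
    # No spaces around "=" in an assignment; the array subscript needs braces.
    # The XPath assumes the single root element is named "root".
    valueArray[i]=$(xmlstarlet sel -t -v "/root/${tagsArray[i]}" filename.xml)
done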

7 Comments

@AnderLindahi - Yes, that's true: parsing XML with sed/awk is not easy because these tools are not meant for smart XML processing. Unfortunately, my requirement is to stick with a shell script using sed.
@AnserLindahi - Does xsltproc come preinstalled on Mac OS X and Unix?
@Deminem: Making it a requirement to use shell script is like requiring someone to cut down a tree with a screwdriver. It can be done, but it's not pretty.
@Jim: Sticking to shell script matters in my scenario because I don't want to depend on installing a third-party tool just to install some custom templates, which can easily be done with a shell script. The only remaining piece is reading the config settings, which are in XML format. If you have a better suggestion for replacing my config data format with something that holds the same <key, value> pairs, please let me know.
@Deminem: Is it up to you how the configuration is stored? Is your shell script the only thing that will read it?
0
#!/bin/sh
filePath=$1  # XML file path
tagName=$2   # Tag name to fetch values for
awk '!/<.*>/' RS="<"$tagName">|</"$tagName">" "$filePath"
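
Since the question asks for sed specifically, here is a minimal sed-only sketch of the same idea. It is not from the original answer: extract_tag is a hypothetical helper name, and it assumes the tag has no attributes, is never nested inside itself, and has a simple name with no regex metacharacters. It keeps appending lines to the pattern space until the closing tag shows up, then strips the tags and prints what is left, so multiline text and special characters come through with their line breaks and indentation intact.

```shell
#!/bin/sh
# extract_tag FILE TAG: print the text between <TAG> and </TAG>.
# Hypothetical helper; assumes no attributes and no same-name nesting.
extract_tag() {
    sed -n "/<$2>/{
        :a
        /<\/$2>/b e
        N
        b a
        :e
        s/.*<$2>//
        s/<\/$2>.*//
        p
    }" "$1"
}
```

In the question's loop this could be used as valueArray[i]=$(extract_tag filename.xml "${tagsArray[i]}"). Note that command substitution drops trailing newlines, so a blank line just before the closing tag will not survive the assignment.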

1 Comment

The RS definition is a bit quirky: variables are substituted inside double quotes, so there's no reason to keep them outside the quoted string. If you want to be more explicit about the variable names, you can always put them in curly braces, e.g. RS="<${tagName}>|</${tagName}>". But all that aside, regex is insufficient for parsing XML because elements can nest. E.g., if the same-named tag can appear inside itself, this code will fail.
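
The nesting problem is easy to reproduce. The snippet below is an illustrative sketch, not part of the original thread: the one-line input with an a element nested inside another a is made up, and RS is passed with -v instead of as a trailing assignment. A multi-character, regex-valued RS is an extension that gawk and mawk support; POSIX leaves the behaviour unspecified.

```shell
#!/bin/sh
# Made-up input: an <a> element nested inside another <a>.
nested='<a>outer <a>inner</a> tail</a>'

# Same record-separator trick as the answer, with the tag name fixed to "a".
result=$(printf '%s\n' "$nested" | awk -v RS='<a>|</a>' '!/<.*>/')

# The outer element's content does not come back as one value.
printf '%s\n' "$result"
```

With gawk or mawk, every opening and closing tag ends a record, so the text is printed as separate fragments instead of one value that still contains the inner element.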
