Extract text from XML tags using sed - shell script

Question

I have already written a script that takes an XML file as input and extracts the text of specific XML tags, and it works. However, it cannot handle multi-line text or special characters. It is very important that the text keeps exactly the formatting it has inside the tags.

Below is the XML input:

<nick>Deminem</nick>
<company>XYZ Solutions</company>
<description>
  /**
   * 
   *  «Lorem» ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
   *  tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. 
   *  At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd 
   *  no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit 
   *  consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore
   *  magna aliquyam erat, sed diam voluptua.
   *
   **/
</description>

The script below extracts the text of each tag and assigns it to a new valueArray. My command of sed is basic, but I am always willing to go the extra mile.

tagsArray=( nick company description )
noOfElements=${#tagsArray[@]}

for (( i=0;i<$noOfElements;i++)); do

OUT=`grep ${tagsArray[${i}]} filename.xml | tr -d '\t' | sed -e 's/^<.*>\([^<].*\)<.*>$/\1/' `

valueArray[${i}]=${OUT}
done

Anders Lindahl · Accepted Answer · 2011-04-27 19:48:55Z

3

Parsing XML with regexp leads to trouble eventually, just as you have experienced. Take the time to learn enough XSL (there are many tutorials) to transform the XML properly, using for example xsltproc.

Edit:

After trying out a few command line xml utilities, I think xmlstarlet could be the tool for you. The following is untested, and assumes that filename.xml is a proper xml file (i.e. has a single root element).

tagsArray=( nick company description )
noOfElements=${#tagsArray[@]}

for (( i=0; i<noOfElements; i++ )); do
    # No spaces around "=" in an assignment; array elements need ${...} to expand.
    valueArray[i]=$(xmlstarlet sel -t -v "/root/${tagsArray[i]}" filename.xml)
done

edited Apr 27, 2011 at 19:48

answered Apr 27, 2011 at 19:11

Anders Lindahl


7 Comments

Deminem Over a year ago

@AndersLindahl - Yes, that's true: parsing XML with sed/awk is not easy, because these tools are not meant for smart XML processing. Unfortunately, my requirement is to stick with a shell script using sed.

Deminem Over a year ago

@AndersLindahl - Does xsltproc come preinstalled on Mac OS X and Unix?

Jim Garrison Over a year ago

@Deminem: Making it a requirement to use shell script is like requiring someone to cut down a tree with a screwdriver. It can be done but it's not pretty.

Deminem Over a year ago

@Jim: The shell-script requirement matters in my scenario because I don't want to depend on installing any third-party tool just to install some custom templates, which can easily be done with a shell script. The only remaining problem is reading the config settings, which are in XML format. If you have a better suggestion for replacing my config data format with something that keeps the same <key & value> pairs, please let me know.

Anders Lindahl Over a year ago

Deminem: Is it up to you how the configuration is stored? Is your shell script the only thing that will read it?
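For the sed-only constraint raised in the comments above, here is a minimal sketch rather than a real XML parser. The function name extract_tag_sed is hypothetical, and it assumes that tags carry no attributes, are not nested inside themselves, and that the tag name contains no regex metacharacters. It first handles an element that opens and closes on one line, then uses a sed address range for multi-line elements, printing the inner lines unchanged so the formatting stays intact.

```shell
#!/bin/sh
# extract_tag_sed FILE TAG   (hypothetical helper, see assumptions above)
# 1st block: <TAG>text</TAG> on a single line -> print the text, skip the range.
# 2nd block: lines from <TAG> through </TAG> -> drop lines holding only the
#            tag itself, strip the tags from mixed lines, print the rest as-is.
extract_tag_sed() {
    sed -n "
        /<$2>.*<\/$2>/{
            s/.*<$2>\(.*\)<\/$2>.*/\1/p
            d
        }
        /<$2>/,/<\/$2>/{
            /^[[:space:]]*<$2>[[:space:]]*\$/d
            /^[[:space:]]*<\/$2>[[:space:]]*\$/d
            s/.*<$2>//
            s/<\/$2>.*//
            p
        }
    " "$1"
}

# Usage: description=$(extract_tag_sed filename.xml description)
```

The single-line case is handled separately because sed only starts looking for the end of a range on the line after the one that opened it; without that first block, a one-line element would open a range that runs on past its own closing tag.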


Michael Petrotta · Accepted Answer · 2012-04-19 05:43:21Z

0

#!/bin/sh
filePath=$1 #XML file path
tagName=$2  #Tag name to fetch values
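# A multi-character RS is treated as a regular expression by gawk and mawk; POSIX leaves this unspecified.
awk '!/<.*>/' RS="<"$tagName">|</"$tagName">" "$filePath"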
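A usage sketch of the script above, wrapped as a function so it can be called once per tag. The name extract_tag is hypothetical, and the extra NF condition is an addition that skips the blank records left over around the tags. It assumes an awk that treats a multi-character RS as a regular expression, and a tag name without regex metacharacters.

```shell
#!/bin/sh
# extract_tag FILE TAG   (hypothetical helper)
# Split the input into records at <TAG> and </TAG>, then print only the
# records that contain no other tag and are not blank.
extract_tag() {
    awk '!/<.*>/ && NF' RS="<$2>|</$2>" "$1"
}

# Usage: nick=$(extract_tag filename.xml nick)
```

Because the record is printed as a whole, a multi-line value keeps its internal line breaks and indentation, including the newline that follows the opening tag.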

edited Apr 19, 2012 at 5:43

Michael Petrotta


answered Apr 19, 2012 at 5:39

Sanjay


1 Comment

danfuzz Over a year ago

The RS definition is quirky: variables are expanded inside double quotes, so there is no reason to keep them outside the quoted string. If you want to be more explicit about the variable names, you can put them in curly braces, e.g. RS="<${tagName}>|</${tagName}>". All that aside, regex is insufficient for parsing XML because elements can nest. E.g., if a tag can contain another tag of the same name, this code will fail.
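The nesting problem can be illustrated with a small sketch. The sample data and the item tag are hypothetical, and the NF condition is added only to drop blank records. Splitting on the tags treats every opening or closing tag as a separator, so the text of the outer and inner elements comes out as sibling chunks with nothing to show which element each belongs to.

```shell
#!/bin/sh
# Hypothetical sample: an <item> element nested inside another <item>.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<item>
outer start
<item>
inner
</item>
outer end
</item>
EOF

# Same record-splitting idea as the answer's awk command.
# "outer start", "inner" and "outer end" are printed as separate records,
# so the nesting is lost.
flattened=$(awk '!/<.*>/ && NF' RS="<item>|</item>" "$sample")
printf '%s\n' "$flattened"
rm -f "$sample"
```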
