I need to read lines from an XML file and parse them into fields. A line is defined as text starting with < and ending with />. It may sit on a single physical line or span multiple lines separated by CR/LF. Here is a typical line:

<Label Name="lblIncidentTypeContent" Increasable="true" Left="140" Top="60"
 Width="146 SpeechField="IncidentType_V" TextAlign="MiddleLeft" WidthPixel="-180"
 WidthPercent="50" />

Once I've read the line, I then need to parse it into fields such as Name, Left, Width, etc. I then want to output a CSV with the data in a particular order. Then read the next line until EOF.

It's been a long time since I did Perl (or any other kind of) programming. Any help is welcome.

2 Comments

  • If you are using Perl, please tag your question with it. Commented Oct 23, 2013 at 19:58
  • I would suggest using a full-fledged XML parser library for your language rather than parsing it by hand. Commented Oct 23, 2013 at 20:00

2 Answers

3

Don't view XML as line-based data, as it isn't. Rather, use a good XML parser, of which Perl has plenty.

Do not use XML::Simple!

Its own documentation says it is deprecated:

The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces. In particular, XML::LibXML is highly recommended.

The major problems with this module are the large number of options and the arbitrary ways in which these options interact - often with unexpected results.

So we're gonna use the XML::LibXML module, which interfaces with the external libxml2 library from the GNOME project. This has the advantage that we can use XPath expressions to query our data. For reading from or writing to CSV, the Text::CSV module should be used.

use strict; use warnings;
use XML::LibXML;
use Text::CSV;

# load the data
my $data = XML::LibXML->load_xml(IO => \*STDIN) or die "Can't parse the XML";

# prepare CSV output:
my $csv = Text::CSV->new({ binary => 1, escape_char => "\\", eol => "\n" });
# Text::CSV doesn't like bareword filehandles
open my $output, '>&:utf8', STDOUT or die "Can't dup STDOUT: $!";

my @cols  = qw/ name left width /; # the column names in the CSV
my @attrs = qw/ Name Left Width /; # the corresponding attr names in the XML

# print the header
$csv->print($output, \@cols);

# extract data
for my $label ($data->findnodes('//Label')) {
  my @fields = map { $label->getAttribute($_) } @attrs;
  $csv->print($output, \@fields);
}

Test data (I took the liberty of adding the missing closing quote to the value of the Width attribute):

<foo>
  <Label Name="lblIncidentTypeContent" Increasable="true" Left="140" Top="60"
    Width="146" SpeechField="IncidentType_V" TextAlign="MiddleLeft" WidthPixel="-180"
    WidthPercent="50" />
  <Label Name="Another TypeContent" Increasable="true"
         Width="123"                SpeechField="IncidentType_V"
         Left="41,42"               Top="13"
         TextAlign="TopLeft"        WidthPixel="-180"
         WidthPercent="50"
  />
</foo>

Output:

name,left,width
lblIncidentTypeContent,140,146
"Another TypeContent","41,42",123
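
As a small illustration of the XPath point above, here is a minimal sketch that selects only some of the Label elements. The sample document and the two filter expressions are made up for this example; only load_xml, findnodes and getAttribute are taken from the script above.

```perl
use strict; use warnings;
use XML::LibXML;

# illustrative sample data, not the asker's real file
my $xml = <<'XML';
<foo>
  <Label Name="a" Left="140" TextAlign="MiddleLeft" />
  <Label Name="b" Left="41,42" TextAlign="TopLeft" />
  <Label Name="c" TextAlign="TopLeft" />
</foo>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# only <Label> elements whose TextAlign attribute equals "TopLeft"
my @top_left = $doc->findnodes('//Label[@TextAlign="TopLeft"]');

# only <Label> elements that carry a Left attribute at all
my @with_left = $doc->findnodes('//Label[@Left]');

print "TopLeft: ",  join(' ', map { $_->getAttribute('Name') } @top_left),  "\n";
print "Has Left: ", join(' ', map { $_->getAttribute('Name') } @with_left), "\n";
```

With this sample, the first line lists b and c and the second lists a and b. Either expression could replace '//Label' in the loop above if only a subset of records should reach the CSV.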

2 Comments

This looks like what I'm looking for. One question though: does this code assume that the attributes are always in the same order for each record? If so, that won't work for my data.
@EdWall No, the code does not assume any order: It uses the order in @attrs. A map {BLOCK} @items is just a fancy foreach (@items) {BLOCK} loop. XML attributes are essentially unordered.
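
To make the map/foreach point concrete, here is a minimal sketch that uses a plain hash as a stand-in for one element's attributes (the hash and its values are illustrative and do not come from XML::LibXML). Both forms walk @attrs, so the output order follows @attrs no matter how the attributes were written:

```perl
use strict; use warnings;

# hypothetical stand-in for one <Label>'s attributes, deliberately "out of order"
my %label = (Width => 123, Left => '41,42', Name => 'Another TypeContent');
my @attrs = qw/ Name Left Width /;

# map form, as used in the answer
my @via_map = map { $label{$_} } @attrs;

# equivalent explicit loop
my @via_loop;
foreach my $attr (@attrs) {
  push @via_loop, $label{$attr};
}

print join('|', @via_map),  "\n";
print join('|', @via_loop), "\n";
```

Both lines print the values in Name, Left, Width order.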
1

Well, this being Perl you have several ways to do it:

  • Brute force. Slurp the file in, and track when you come across an opening < angle bracket. When you do, start collecting name/value pairs. When you see the closing />, stop. This is not as easy as it sounds, because you have to handle possibly nested XML elements.
  • Slight force. Load the file using a basic library like XML::Simple and then spit it out in a format of your choosing using Data::Dumper. The former gives you a hash, and then you can play with the keys and values all you like.
  • Use an XML library. There are quite a few on CPAN, ranging from ones that are very close to the underlying libxml semantics to ones that are very abstract.
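
For a sense of what the brute-force option involves, here is a rough sketch under heavy assumptions: it only handles self-closing elements with double-quoted attribute values, ignores nesting entirely, and does not decode entities such as &amp;. The parse_elements helper and the sample text are hypothetical names made up for this illustration.

```perl
use strict; use warnings;

# Fragile by design: self-closing elements and double-quoted values only.
sub parse_elements {
  my ($text) = @_;
  my @records;
  # an element: "<", a name, then everything up to the first "/>"
  while ($text =~ m{<(\w+)\b([^<>]*?)/>}g) {
    my ($tag, $body) = ($1, $2);
    my %attr;
    # name="value" pairs; [^"]* lets a value contain commas or newlines
    while ($body =~ /([\w:.-]+)\s*=\s*"([^"]*)"/g) {
      $attr{$1} = $2;
    }
    push @records, { tag => $tag, attr => \%attr };
  }
  return @records;
}

my $sample = <<'XML';
<Label Name="lblIncidentTypeContent" Left="140"
  Width="146" WidthPercent="50" />
XML

# join is for demonstration only; it does no CSV quoting
for my $rec (parse_elements($sample)) {
  print join('|', map { $rec->{attr}{$_} // '' } qw/ Name Left Width /), "\n";
}
```

For the sample this prints lblIncidentTypeContent|140|146. Note the weakness: input with an unclosed quote, like the Width attribute in the question, is silently mis-parsed here, whereas a real XML parser would report it as an error.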
