
I'm encoding a text in XML for an ehumanities project using Oxygen.

The file came pre-encoded with several tags, most of which were wrongly placed, so I had to tidy it up a lot. Most of it is done, but one major issue remains.

The page breaks <pb n="number"/> are wrongly numbered. Strictly speaking, each value is exactly one too low, which means <pb n="3"/> is supposed to be <pb n="4"/>.

There are over 300 of these page breaks.

Is there a way of incrementing every value with a Perl substitution?

I've managed to find every value with this regex pattern:

<pb n="(\d+)"/>

and could replace it with:

<pb n="$1"/>

But how do I do a +1 operation on each value?

I'm not familiar with XPath and XSLT but am willing to learn it.

  • RegEx can't do this kind of logic, but most languages allow you to do some sort of replace callback where you can reference the match and perform a ++ operation. Commented May 14, 2014 at 20:10
  • How does Oxygen affect this question? Commented May 14, 2014 at 20:29

2 Answers


When working with XML, it's almost always advantageous to use an XML Parser. However, given the information provided, I think this "might" be an instance where it's reasonable to just use a regex.

Using a Perl one-liner and a regular expression:

perl -i -pe 's{<pb n="\K(\d+)(?="/>)}{$1+1}eg' file.xml
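
To try this kind of substitution without modifying a file, here is a minimal sketch that applies it to an in-memory string. The helper name bump_page_breaks is hypothetical and used only for illustration. The replacement is evaluated as Perl code (the /e flag) and computes $1 + 1, because capture variables such as $1 are read-only and cannot be incremented in place.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of $xml in which every
# <pb n="N"/> with a purely numeric N has been increased by one.
sub bump_page_breaks {
    my ($xml) = @_;

    # \K drops '<pb n="' from the matched text, so only the digits are replaced.
    # (?="/>) requires the closing quote and '/>' without consuming them.
    # /e evaluates the replacement as code; /g repeats it for every match.
    $xml =~ s{<pb n="\K(\d+)(?="/>)}{$1 + 1}eg;

    return $xml;
}

print bump_page_breaks('text <pb n="3"/> more <pb n="10"/>'), "\n";
# prints: text <pb n="4"/> more <pb n="11"/>
```

Non-numeric values such as <pb n="number"/> are left untouched, because \d+ does not match them.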

For an XML parser, I'd recommend using either XML::Twig or XML::LibXML.


2 Comments

I think your answer pairs with mine quite nicely
It seriously bothers me to allow for the suggestion of a regex in this situation. It could work given the information that's provided, but it's always the information that isn't stated that we have to worry about. I was going to also provide a parser solution, but you already did it, so yes, they do pair quite nicely :)

While you may have found a regex pattern that will match all the elements that you want to change, it is far from being reliable. An XML document could vary wildly from your example while still containing the equivalent data, but your code wouldn't pick it up.

For that reason it is always best to employ a proper XML parser.
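
To illustrate the point about equivalent markup, here is a small sketch that tests the pattern from the question against several spellings of the same page break. The sample strings are made up for illustration (the id attribute in particular is hypothetical), but each is a legal way of writing an empty pb element whose n attribute is 3.

```perl
use strict;
use warnings;

# The pattern from the question.
my $pattern = qr{<pb n="(\d+)"/>};

# Returns 1 if the question's pattern matches the given markup, else 0.
sub regex_finds_pb {
    my ($xml) = @_;
    return $xml =~ $pattern ? 1 : 0;
}

my @variants = (
    '<pb n="3"/>',          # the exact form shown in the question
    q{<pb n='3'/>},         # single-quoted attribute value
    '<pb n="3" />',         # whitespace before the closing slash
    '<pb n="3"></pb>',      # explicit end tag instead of an empty-element tag
    '<pb id="x" n="3"/>',   # another attribute first (hypothetical)
);

for my $xml (@variants) {
    printf "%-22s %s\n", $xml, regex_finds_pb($xml) ? 'matched' : 'missed';
}
```

Only the first variant is matched by the pattern, whereas an XML parser treats all five as a pb element with n equal to 3.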

I have used XML::LibXML here. XML::Twig is also a good choice.

Note that I have grabbed a part of your question and enclosed it in a root element for use as sample input data. It is always best if you can supply your own representative data in a question.

The XPath expression finds all attributes named n that belong to elements named pb. Each of these attributes is checked within the loop to see whether it consists of just one or more digits, in which case the value is incremented.

use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML->load_xml(IO => *DATA);

for my $pb_n ( $doc->findnodes('//pb/@n') ) {
  my $val = $pb_n->getValue;
  if ( $val =~ /\A(\d+)\z/a ) {
    $pb_n->setValue($1 + 1);
  }
}

print $doc->toString;

__DATA__
<root>
  The page breaks <pb n="number"/> are wrong numbered. Strictly speaking 
  their value is exactly one too little, which means <pb n="3"/> is 
  supposed to be <pb n="4"/>.
</root>

output

<?xml version="1.0"?>
<root>
  The page breaks <pb n="number"/> are wrong numbered. Strictly speaking 
  their value is exactly one too little, which means <pb n="4"/> is 
  supposed to be <pb n="5"/>.
</root>

7 Comments

Thank you for your answer. Unfortunately, I have no clue how to use an XML parser, so I'm going to learn that first and then give your suggestion a try.
Is there a way to do this in XSLT? Because I just can't figure out how to use Perl or an XML parser.
@Basti: What is the problem? My solution does what you asked, using XML::LibXML to parse the XML data
Sorry, but I am very new to XML and just don't know what an XML parser is or how to use it. I even managed to download LibXML but I have no idea how to use it.
@Basti: I don't understand what is confusing you. My solution shows you how to use XML::LibXML. You don't need to know any more than that, although ideally you would understand how it works. The only change you need to make is to open the real file that you want to use as input and put it in place of DATA in the load_xml method call.
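
A minimal sketch of that change, assuming XML::LibXML is installed: it reads the document from a file through the location option of load_xml instead of from DATA, and writes the result to a second file. The function name increment_page_breaks and the file names in the usage comment are hypothetical. One further assumption: the document has no default namespace. If it declares one, as TEI files commonly do, the plain XPath //pb/@n will find nothing and a namespace-aware expression such as //*[local-name()="pb"]/@n would be needed.

```perl
use strict;
use warnings;

use XML::LibXML;

# Hypothetical helper: parse $in, add one to every numeric pb/@n,
# and write the modified document to $out.
sub increment_page_breaks {
    my ($in, $out) = @_;

    # location => ... parses a file by name instead of a file handle.
    my $doc = XML::LibXML->load_xml(location => $in);

    for my $pb_n ( $doc->findnodes('//pb/@n') ) {
        my $val = $pb_n->getValue;
        # Only touch values made up entirely of ASCII digits.
        $pb_n->setValue($val + 1) if $val =~ /\A\d+\z/a;
    }

    # Writing to a separate file keeps the original intact for comparison.
    $doc->toFile($out);
    return;
}

# Example call (file names are placeholders):
# increment_page_breaks('file.xml', 'file-renumbered.xml');
```

Save it as a script, uncomment the call with your own file names, and run it with perl from the command line.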
