I am trying to use Perl to convert the input text file format below into the output text file format shown, but without success.

Can anyone help?

Input:

row1 multiline 1
row1 multiline 2
row1 multiline 3
row2 multiline 1
row2 multiline 2

Expected Output:

row1 multiline 1 multiline 2 multiline 3
row2 multiline 1 multiline 2
1
  • Will the upper limit on the rows for a given key be 3 lines of input, or is that 'indefinite'? Is the row tag always a single word delimited by white space? Is space all that's required between the elements that were on separate lines? Commented Jul 3, 2015 at 19:44

4 Answers

3

This will do as you ask. It checks whether the first field on each line has changed to decide whether to continue the current output line or to start a new one.

It expects the path to the input file as a parameter on the command line.

use strict;
use warnings;

my $row;

while ( <> ) {

    next unless /\S/;
    chomp;

    my ( $new_row, $rest ) = split ' ', $_, 2;

    if ( defined $row and $row eq $new_row ) {
        print ' ', $rest;
    }
    else {
        print "\n" if defined $row;
        print $_;
        $row = $new_row;
    }
}

print "\n";

output

row1 multiline 1 multiline 2 multiline 3
row2 multiline 1 multiline 2

2 Comments

If you use my $row = '';, you could avoid testing for defined $row in the body of the loop, could you not?
@JonathanLeffler: Probably, but I have coded for the possibility that $new_row may be the string 0, which is false
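
To make the point about a key of 0 concrete, here is a minimal sketch of the same loop wrapped in a sub so it can be called on a list of lines. The name join_rows and the sample keys are hypothetical and not part of the answer; the guard on $rest is an added assumption for lines that contain only a key.

```perl
use strict;
use warnings;

# Hypothetical helper: same first-field comparison as the loop in the answer,
# but it builds and returns a string instead of printing as it goes.
sub join_rows {
    my @lines = @_;
    my $row;          # undef until the first non-blank line is seen
    my $out = '';

    for my $line (@lines) {
        next unless $line =~ /\S/;
        chomp $line;

        my ( $new_row, $rest ) = split ' ', $line, 2;

        if ( defined $row and $row eq $new_row ) {
            # Same key as the previous line: append only the remainder
            $out .= " $rest" if defined $rest;
        }
        else {
            # New key: close the previous output line, if there was one
            $out .= "\n" if defined $row;
            $out .= $line;
            $row = $new_row;
        }
    }

    $out .= "\n" if defined $row;
    return $out;
}

# A key of "0" is false in boolean context, but it is still defined
print join_rows( "0 a\n", "0 b\n", "1 c\n" );
```

Because the tests use defined rather than truthiness, the two lines keyed 0 are joined like any other key.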
1

In one regex? Not very likely. The same regex multiple times however is plausible. Just match against this until it stops matching:

while ($input =~ s/row(\d+)((?: multiline \d+)+)\n+row\1/row$1$2/gm){}

Each pass roughly halves the number of unmerged lines, so the loop runs only O(log n) times.

You can see it in action here: https://ideone.com/RP30h6
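
As a rough way to watch the passes, the substitution can be wrapped in a sub that counts how many times it changed something. This is only a sketch: merge_by_passes is a hypothetical name, and the \b after \1 is an addition not in the one-liner above, on the assumption that row1 should not match the start of row10.

```perl
use strict;
use warnings;

# Hypothetical helper: repeat the substitution until it stops matching.
# Returns the merged text and the number of passes that changed something.
sub merge_by_passes {
    my ($input) = @_;
    my $passes = 0;

    # s///g returns the number of substitutions made, which is false when
    # nothing matched, so the loop ends on the first pass with no change.
    $passes++
        while $input =~ s/row(\d+)((?: multiline \d+)+)\n+row\1\b/row$1$2/g;

    return ( $input, $passes );
}

my ( $merged, $passes ) = merge_by_passes(
    "row1 multiline 1\nrow1 multiline 2\nrow1 multiline 3\n"
  . "row2 multiline 1\nrow2 multiline 2\n"
);

print $merged;
print "passes: $passes\n";
```

For the sample input from the question this needs two passes: the first merges line pairs, the second folds the leftover row1 line into the merged one.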


The above solution is more esoteric than practical. Here is what a real solution might look like:

my $row_number = 0;
my ($row, $column);

while ($input =~ /(row(\d+) multiline (\d+))/gm) {
  if ($row_number != $2) {
    $row_number = $2;
  } else {
    $row = $1;
    $column = $3;
    $input =~ s/\n+$row/ multiline $column/g;
  }
}

Demo: https://ideone.com/Mk2QqZ

6 Comments

Did anyone mention doing it in a single regex?
It seems likely that row and multiline in the example data are simply placeholders. If not then the data isn't very much use!
@Borodin, unless OP gives us something more specific we can't provide a general solution. For example what happens with your solution if row contains spaces? This is why I hardcoded the values and looked into the numbers as the only variable thing. And yes - I got challenged to do it in one regex :)
Was that challenge a private one, because I can't see it anywhere
Yes, but I put in the effort, so why shouldn't I share?
1

This can be done using a replacement callback.
In Perl, this is typically accomplished by using the s///e evaluation form.

The main regex captures a block of consecutive lines that share a row label.
Buffer 1 is the first row; buffer 3 holds the remaining rows with the same label.

These are passed to the merge sub.
The merge sub strips the repeated label from each remaining row via another regex,
then joins the first row with what is left.
The result is passed back as the replacement.

Perl code:

use strict;
use warnings;

$/ = undef;

my $input = <DATA>;

sub mergeRows {
    my ($first_row, $other_rows) = @_;
    $other_rows =~ s/(?m)\s*^\w+\s*(.*)(?<!\s)\s*/$1 /g;
    return $first_row . " " . $other_rows . "\n";
}

$input =~ s/(?m)(^(\w+).*)(?<!\s)\s+((?:\s*^\2.*)+)/ mergeRows($1,$3) /eg;

print $input, "\n";

__DATA__
row1 multiline 1

row1 multiline 2

row1 multiline 3

row2 multiline 1

row2 multiline 2

Output:

row1 multiline 1 multiline 2 multiline 3

row2 multiline 1 multiline 2

Main regex:

 (?m)                          # Multi-line mode
 (                             # (1 start), First of common row
      ^ 
      ( \w+ )                       # (2), common row label
      .* 
 )                             # (1 end)
 (?<! \s )                     # Force trim of trailing spaces
 \s+                           # Consume a newline, also get all the next whitespaces
 (                             # (3 start), Remaining common rows
      (?:
           \s* ^ \2  .* 
      )+
 )                             # (3 end)

Merge sub regex:

 (?m)                          # Multi-line mode
 \s*                           # remove
 ^ \w+ \s*                     # remove
 ( .* )                        # (1), What will be saved
 (?<! \s )                     # remove, force trim of trailing spaces
 \s*                           # remove, possibly many newlines (whitespace)
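
The s///e replacement callback is easier to see in isolation on a much smaller case. The following is only an illustrative sketch; double_it and the sample text are made up for the example and have nothing to do with the row data.

```perl
use strict;
use warnings;

# Hypothetical callback: receives the captured text, returns the replacement
sub double_it {
    my ($n) = @_;
    return $n * 2;
}

my $text = "a1 b20 c300";

# With /e the replacement side is evaluated as Perl code for every match,
# so each run of digits is replaced by whatever double_it returns.
( my $doubled = $text ) =~ s/(\d+)/ double_it($1) /eg;

print "$doubled\n";    # a2 b40 c600
```

The main substitution above works the same way, except that the callback receives two capture buffers and itself runs a second regex.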

2 Comments

I would be very surprised if the real first fields start with row!
@Borodin - Changed row\d+ to \w+. It's just a placeholder example, not really the thrust of the OP's request. But it should be defined better; right now it's unknown.
1

You have a key field as the first word, and then the rest of the line as a value.

So I would approach your problem like this:

#!/usr/bin/env perl
use strict;
use warnings;

my %rows;
while (<DATA>) {
    # Skip blank or malformed lines so that $key is never undefined
    my ( $key, $rest_of_line ) = m/^(\w+) (.*)/ or next;
    push @{ $rows{$key} }, $rest_of_line;
}

foreach my $key ( sort keys %rows ) {
    print "$key ", join( " ", @{ $rows{$key} } ), "\n";
}

__DATA__
row1 multiline 1
row1 multiline 2
row1 multiline 3
row2 multiline 1
row2 multiline 2

It's a slightly different approach from the others, in that we read each line into a hash, then output the hash.

It doesn't maintain the order of your original file, but instead sorts in 'row value' order.
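
If the original order matters, one option is to remember each key the first time it appears and print in that order instead of sorting. A minimal sketch, assuming keys are any run of non-whitespace; collate_in_order and @order are hypothetical names introduced here.

```perl
use strict;
use warnings;

# Hypothetical helper: collate values per key, keeping first-seen key order
sub collate_in_order {
    my @lines = @_;
    my ( %rows, @order );

    for my $line (@lines) {
        # Skip lines that do not look like "key value..."
        my ( $key, $rest ) = $line =~ /^(\S+)\s+(.*)/ or next;

        # Record the key only on its first appearance
        push @order, $key unless exists $rows{$key};
        push @{ $rows{$key} }, $rest;
    }

    return map { join( ' ', $_, @{ $rows{$_} } ) . "\n" } @order;
}

print collate_in_order( "row2 b 1\n", "row1 a 1\n", "row2 b 2\n" );
```

Here row2 is printed before row1 because it was seen first, and its two values are still collected together even though another key sits between them.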

2 Comments

The disadvantage of this is that it reads the whole file into memory before producing any output. The advantage is that even if the keys (values in column 1) are not sorted, so there may be entries tagged row1 at lines 1, 2, 3, 30, 300, and 253,231, they will all be put together in the output.
Yes, quite true. It's not suitable for every use case - but in some, collating across the file might be desirable.
