How to use Perl to merge multiple lines into a single line

Question

I am trying to use Perl to convert a text file from the input format to the output format shown below, but without success.

Can anyone help?

Input:

row1 multiline 1
row1 multiline 2
row1 multiline 3
row2 multiline 1
row2 multiline 2

Expected Output:

row1 multiline 1 multiline 2 multiline 3
row2 multiline 1 multiline 2

Will the upper limit on the rows for a given key be 3 lines of input, or is that 'indefinite'? Is the row tag always a single word delimited by white space? Is space all that's required between the elements that were on separate lines?
– Jonathan Leffler, Commented Jul 3, 2015 at 19:44

Borodin · Accepted Answer · 2015-08-05 06:16:32Z

3

This will do as you ask. It checks whether the first field on each line has changed to decide whether to continue outputting the current line or to start a new one.

It expects the path to the input file as a parameter on the command line.

use strict;
use warnings;

my $row;

while ( <> ) {

    next unless /\S/;
    chomp;

    my ( $new_row, $rest ) = split ' ', $_, 2;

    if ( defined $row and $row eq $new_row ) {
        print ' ', $rest;
    }
    else {
        print "\n" if defined $row;
        print $_;
        $row = $new_row;
    }
}

print "\n";

output

row1 multiline 1 multiline 2 multiline 3
row2 multiline 1 multiline 2

edited Aug 5, 2015 at 6:16

answered Jul 3, 2015 at 19:40

Borodin


2 Comments

Jonathan Leffler Over a year ago

If you use my $row = '';, you could avoid testing for defined $row in the body of the loop, could you not?

Borodin Over a year ago

@JonathanLeffler: Probably, but I have coded for the possibility that $new_row may be the string 0, which is false
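
A minimal sketch of the variant discussed in these two comments, not code from the answer. It assumes `$row` starts as an empty string and uses `length` instead of a plain truth test, so that a key of `0` is still treated as "a row has been seen". The sub name `merge_adjacent` and the sample keys are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines, returns the merged text.
sub merge_adjacent {
    my @lines = @_;
    my $row   = '';    # empty string instead of undef, as suggested in the comment
    my $out   = '';

    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip blank lines
        chomp $line;

        my ( $new_row, $rest ) = split ' ', $line, 2;
        $rest = '' unless defined $rest;    # line that holds only a key

        if ( $row eq $new_row ) {
            $out .= " $rest";
        }
        else {
            # length() rather than "if $row": the string "0" is false in Perl
            $out .= "\n" if length $row;
            $out .= $line;
            $row = $new_row;
        }
    }

    $out .= "\n" if length $out;
    return $out;
}

print merge_adjacent(
    "0 a\n", "0 b\n",
    "row1 multiline 1\n", "row1 multiline 2\n",
);
```

With a plain `if $row` test, the newline between the `0` group and the `row1` group would be lost, which is the case the answer's `defined` check guards against.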

ndnenkov · Accepted Answer · 2015-07-03 17:59:10Z

1

In one regex? Not very likely. The same regex multiple times however is plausible. Just match against this until it stops matching:

while ($input =~ s/row(\d+)((?: multiline \d+)+)\n+row\1/row$1$2/gm){}

The loop will halve the number of unmerged lines with every iteration. Hence it will loop only O(ln(n)) times.

You can see it in action here: https://ideone.com/RP30h6
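
To make the halving claim concrete, here is a small sketch that wraps the same substitution in a hypothetical sub, `merge_with_count`, and counts how many passes actually changed the text. The eight-line sample is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: repeats the substitution until it stops matching and
# returns the merged text together with the number of passes that changed it.
sub merge_with_count {
    my ($input) = @_;
    my $passes = 0;

    # s///g returns a true value only when at least one replacement was made
    $passes++
        while $input =~ s/row(\d+)((?: multiline \d+)+)\n+row\1/row$1$2/gm;

    return ( $input, $passes );
}

# Eight input lines that all share the key "row1"
my $sample = join '', map { "row1 multiline $_\n" } 1 .. 8;

my ( $merged, $passes ) = merge_with_count($sample);
print $merged;
print "passes: $passes\n";    # 8 lines -> 4 -> 2 -> 1, so three passes
```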

The above solution is more esoteric than practical. Here is what a real solution might look like:

my $row_number = 0;
my ($row, $column);

while ($input =~ /(row(\d+) multiline (\d+))/gm) {
  if ($row_number != $2) {
    $row_number = $2;
  } else {
    $row = $1;
    $column = $3;
    $input =~ s/\n+$row/ multiline $column/g;
  }
}

Demo: https://ideone.com/Mk2QqZ

edited Jul 3, 2015 at 17:59

answered Jul 3, 2015 at 17:28

ndnenkov


6 Comments

Borodin Over a year ago

Did anyone mention doing it in a single regex?

Borodin Over a year ago

It seems likely that row and multiline in the example data are simply placeholders. If not then the data isn't very much use!

ndnenkov Over a year ago

@Borodin, unless OP gives us something more specific we can't provide a general solution. For example what happens with your solution if row contains spaces? This is why I hardcoded the values and looked into the numbers as the only variable thing. And yes - I got challenged to do it in one regex :)

Borodin Over a year ago

Was that challenge a private one, because I can't see it anywhere

ndnenkov Over a year ago

Yes, but I put in the effort, so why shouldn't I share?


user557597 · Accepted Answer · 2015-07-03 20:01:18Z

1

This can be done using a replacement callback.
In Perl, this is typically accomplished by using the s///e evaluation form.

This just captures the block of rows that share a label in capture buffers.
Buffer 1 is the first row, buffer 3 holds the remaining rows with the same label.

These are passed to the merge sub.
The merge sub strips the repeated label from those rows via another regex,
then combines the first row with what remains.
The result is then passed back as the replacement.
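
As a minimal illustration of the s///e form on its own, separate from the answer's code: with the /e flag the replacement side is evaluated as Perl code for every match. The sub name `double_numbers` is hypothetical and only serves the example.

```perl
use strict;
use warnings;

# Hypothetical helper: doubles every integer found in a string.
sub double_numbers {
    my ($text) = @_;

    # /e evaluates the replacement as an expression; /g repeats it per match
    $text =~ s/(\d+)/ $1 * 2 /eg;

    return $text;
}

print double_numbers("row1 multiline 3"), "\n";    # prints "row2 multiline 6"
```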

Perl code:

use strict;
use warnings;

$/ = undef;

my $input = <DATA>;

sub mergeRows {
    my ($first_row, $other_rows) = @_;
    $other_rows =~ s/(?m)\s*^\w+\s*(.*)(?<!\s)\s*/$1 /g;
    return $first_row . " " . $other_rows . "\n";
}

$input =~ s/(?m)(^(\w+).*)(?<!\s)\s+((?:\s*^\2.*)+)/ mergeRows($1,$3) /eg;

print $input, "\n";

__DATA__
row1 multiline 1

row1 multiline 2

row1 multiline 3

row2 multiline 1

row2 multiline 2

Output:

row1 multiline 1 multiline 2 multiline 3

row2 multiline 1 multiline 2

Main regex:

 (?m)                          # Multi-line mode
 (                             # (1 start), First of common row
      ^ 
      ( \w+ )                       # (2), common row label
      .* 
 )                             # (1 end)
 (?<! \s )                     # Force trim of trailing spaces
 \s+                           # Consume a newline, also get all the next whitespaces
 (                             # (3 start), Remaining common rows
      (?:
           \s* ^ \2  .* 
      )+
 )                             # (3 end)

Merge sub regex:

 (?m)                          # Multi-line mode
 \s*                           # remove
 ^ \w+ \s*                     # remove
 ( .* )                        # (1), What will be saved
 (?<! \s )                     # remove, force trim of trailing spaces
 \s*                           # remove, possibly many newlines (whitespace)

edited Jul 3, 2015 at 20:01

answered Jul 3, 2015 at 19:19

user557597

2 Comments

Borodin Over a year ago

I would be very surprised if the real first fields start with row!

user557597 Over a year ago

@Borodin - Changed row\d+ to \w+. It's just a placeholder example, not really the thrust of the OP's request. But it should be defined better; right now it's unknown.

Sobrique · Accepted Answer · 2015-07-03 20:58:48Z

1

You have a key field as the first word, and then the rest of the line as a value.

So I would approach your problem like this:

#!/usr/bin/env perl
use strict;
use warnings;

my %rows;
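while (<DATA>) {
    # Capture the key (first word) and the rest of the line;
    # skip blank or otherwise non-matching lines
    my ( $key, $rest_of_line ) = m/^(\w+) (.*)/ or next;
    push( @{ $rows{$key} }, $rest_of_line );
}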

foreach my $key ( sort keys %rows ) {
    print "$key ", join( " ", @{ $rows{$key} } ), "\n";
}

__DATA__
row1 multiline 1
row1 multiline 2
row1 multiline 3
row2 multiline 1
row2 multiline 2

It's a slightly different approach from the others, in that we read each line into a hash, then output the hash.

It doesn't maintain the order of your original file, but instead sorts in 'row value' order.
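
If the original order matters, one possible variation (a sketch, not part of the answer above) is to record each key the first time it appears and print in that order. The sub name `merge_by_key` and the `@order` array are hypothetical additions.

```perl
use strict;
use warnings;

# Hypothetical helper: groups lines by their first word and returns one
# merged line per key, in the order the keys first appeared.
sub merge_by_key {
    my @lines = @_;
    my ( %rows, @order );

    for my $line (@lines) {
        # Skip lines that do not look like "key rest-of-line"
        my ( $key, $rest ) = $line =~ /^(\w+)\s+(.*)/ or next;

        push @order, $key unless exists $rows{$key};    # remember first sighting
        push @{ $rows{$key} }, $rest;
    }

    return map { join( ' ', $_, @{ $rows{$_} } ) . "\n" } @order;
}

print merge_by_key(
    "row2 multiline 1\n",
    "row1 multiline 1\n",
    "row2 multiline 2\n",
);
```

Here `row2` is printed before `row1` because it was seen first. Like the hash version above, this still holds the whole input in memory before producing output.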

answered Jul 3, 2015 at 20:58

Sobrique


2 Comments

Jonathan Leffler Over a year ago

The disadvantage of this is that it reads the whole file into memory before producing any output. The advantage is that even if the keys (values in column 1) are not sorted, so there may be entries tagged row1 at lines 1, 2, 3, 30, 300, and 253,231, they will all be put together in the output.

Sobrique Over a year ago

Yes, quite true. It's not suitable for every use case - but in some, collating across the file might be desirable.
