File splitting using Perl

Question

I'm trying to split a large text file into several text files. I found another thread from a few years ago with a very similar premise but couldn't find my exact situation.

https://unix.stackexchange.com/a/64691/183674

How would I split the following data if the first line didn't start with 00:00:00:00?

00:00:00:00 00:00:05:00 01SC_001.jpg
00:00:14:29 00:00:19:29 01SC_002.jpg
00:01:07:20 00:01:12:20 01SC_003.jpg
00:00:00:00 00:00:03:25 02MI_001.jpg
00:00:03:25 00:00:08:25 02MI_002.jpg
00:00:35:27 00:00:40:27 02MI_003.jpg
00:00:00:00 00:00:05:00 03Bi_001.jpg
00:00:05:19 00:00:10:19 03Bi_002.jpg
00:01:11:17 00:01:16:17 03Bi_003.jpg
00:00:00:00 00:00:05:00 04CG_001.jpg
00:00:11:03 00:00:16:03 04CG_002.jpg
00:01:12:25 00:01:17:25 04CG_003.jpg

Here's the code for reference:

#!/usr/bin/env perl

use strict;
use warnings;

open(my $infh, '<', 'ABC_TabDelim.txt') or die $!;

my $outfh;
my $filecount = 0;
while ( my $line = <$infh> ) {
    if ( $line =~ /^00:00:00:00/ ) {
        close($outfh) if $outfh;
        open($outfh, '>', sprintf('ABC%02d_TabDelim.txt', ++$filecount)) or die $!;        
    }
    print {$outfh} $line or die "Failed to write to file: $!";
}

close($outfh);
close($infh);

I tried adding print $line; on the line after the while statement, to make it read line by line as shown in other tutorials, but this did not fix the issue.

I would appreciate any input.

edit: So for an example like

    00:01:16:17 00:00:05:00 01SC_001.jpg
    00:00:14:29 00:00:19:29 01SC_002.jpg
    00:01:07:20 00:01:12:20 01SC_003.jpg
    00:00:00:00 00:00:03:25 02MI_001.jpg
    00:00:03:25 00:00:08:25 02MI_002.jpg
    00:00:35:27 00:00:40:27 02MI_003.jpg
    00:00:00:00 00:00:05:00 03Bi_001.jpg
    00:00:05:19 00:00:10:19 03Bi_002.jpg
    00:01:11:17 00:01:16:17 03Bi_003.jpg
    00:00:00:00 00:00:05:00 04CG_001.jpg
    00:00:11:03 00:00:16:03 04CG_002.jpg
    00:01:12:25 00:01:17:25 04CG_003.jpg

I would like to get three separate files, respectively containing

00:00:00:00 00:00:03:25 02MI_001.jpg
00:00:03:25 00:00:08:25 02MI_002.jpg
00:00:35:27 00:00:40:27 02MI_003.jpg

00:00:00:00 00:00:05:00 03Bi_001.jpg
00:00:05:19 00:00:10:19 03Bi_002.jpg
00:01:11:17 00:01:16:17 03Bi_003.jpg

00:00:00:00 00:00:05:00 04CG_001.jpg
00:00:11:03 00:00:16:03 04CG_002.jpg
00:01:12:25 00:01:17:25 04CG_003.jpg

discarding the first three lines.

I expect the code to make a file for every occurrence of 00:00:00:00, ending just before the next instance. How would I implement this if all of the lines with 00:00:00:00's were shifted down a few lines?
– wittywater, Commented Aug 8, 2016 at 14:14

You should show us the expected output from your sample data, and your sample data should illustrate any corner cases that have to be dealt with (not having 00:00:00:00 in the first column of the first row, for example).
– Jonathan Leffler, Commented Aug 8, 2016 at 14:25

Jonathan Leffler · Accepted Answer · 2016-08-08 15:41:04Z

Does modifying the condition in the loop like this not do the job?

if ($line =~ /^00:00:00:00/ || !$outfh)

Suppose the first line does not start with 00:00:00:00 (a 'zero marker'). The regex match fails, but no output file is open yet, so the || !$outfh condition is true. The if body skips the close (there is nothing to close), opens the first output file, and the line is then written to it. Thereafter, a file is always open, so the second half of the condition doesn't change the decision making (except to slow it down marginally and probably immeasurably).
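For reference, here is a minimal sketch of the whole script with that condition in place. The sub name split_keep_leading, the list of created file names it returns, and the command-line handling are my own additions for illustration; the loop itself is the code from the question with only the condition changed.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (not in the question): split $infile into numbered
# files named by the sprintf pattern $pattern. A new file starts at each
# zero marker, and also at the very first line if no file is open yet.
sub split_keep_leading {
    my ($infile, $pattern) = @_;
    open(my $infh, '<', $infile) or die "Cannot open $infile: $!";

    my $outfh;
    my $filecount = 0;
    my @created;
    while ( my $line = <$infh> ) {
        # "|| !$outfh" is true only until the first output file is opened.
        if ( $line =~ /^00:00:00:00/ || !$outfh ) {
            close($outfh) if $outfh;
            my $name = sprintf($pattern, ++$filecount);
            open($outfh, '>', $name) or die "Cannot open $name: $!";
            push @created, $name;
        }
        print {$outfh} $line or die "Failed to write to file: $!";
    }

    close($outfh) if $outfh;    # $outfh is undefined if the input was empty
    close($infh);
    return @created;
}

# Run only when an input file is named on the command line.
if (@ARGV) {
    print "$_\n" for split_keep_leading($ARGV[0], 'ABC%02d_TabDelim.txt');
}
```

With the sample data from the edit, this should give four files: the three leading lines end up in the first one, and each zero marker starts a new file after that.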

The question was clarified since I first proffered my solution. If you want to discard the rows before the first zero marker, modify the print to print only if the file handle is open (instead of the modified condition to open the file if the first line does not start with a zero marker).

print $outfh $line or die "Failed to write to file: $!" if $outfh;
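Put together, the discarding variant might look like the sketch below. As before, the sub name split_discard_leading, its return value, and the command-line handling are assumptions of mine, not part of the question. I have also guarded the final close, since $outfh stays undefined when the input contains no zero marker at all.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (not in the question): split $infile into numbered
# files named by the sprintf pattern $pattern, one per zero marker.
# Lines that come before the first zero marker are discarded.
sub split_discard_leading {
    my ($infile, $pattern) = @_;
    open(my $infh, '<', $infile) or die "Cannot open $infile: $!";

    my $outfh;
    my $filecount = 0;
    my @created;
    while ( my $line = <$infh> ) {
        if ( $line =~ /^00:00:00:00/ ) {
            close($outfh) if $outfh;
            my $name = sprintf($pattern, ++$filecount);
            open($outfh, '>', $name) or die "Cannot open $name: $!";
            push @created, $name;
        }
        # Write only once a file is open; earlier lines are dropped.
        if ($outfh) {
            print {$outfh} $line or die "Failed to write to file: $!";
        }
    }

    close($outfh) if $outfh;    # no zero marker at all means no open file
    close($infh);
    return @created;
}

# Run only when an input file is named on the command line.
if (@ARGV) {
    print "$_\n" for split_discard_leading($ARGV[0], 'ABC%02d_TabDelim.txt');
}
```

With the sample data from the edit, this should give the three files listed in the question, with the first three lines dropped.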

edited Aug 8, 2016 at 15:41

answered Aug 8, 2016 at 14:28


4 Comments

wittywater Over a year ago

It's working with your proposed change, now I just need to understand the significance of the second condition :)

Jonathan Leffler Over a year ago

Suppose the first line starts 01. The regex match fails, but the file isn't open so the or condition is true. The code skips the close and opens the new file and the line is written. Thereafter, the file is open so the second half of the condition doesn't change the decision making (except to slow it down marginally and probably immeasurably).

wittywater Over a year ago

That clarifies my confusion, I appreciate the help.

Jonathan Leffler Over a year ago

The question was clarified since I proffered my solution. If you want to discard the rows before the first zero marker, modify the print to print only if the file handle is open.
