
I have a Perl program that takes over 13 hours to run. I think it could benefit from multithreading, but I have never done this before and I'm at a loss as to how to begin.

Here is my situation: I have a directory of hundreds of text files. I loop through every file in the directory with a basic for loop and do some processing (text processing on the file itself, calling an outside program on it, and compressing it). When that's done I move on to the next file. I continue this way, handling each file one after the other, in serial fashion. The files are completely independent of each other, and the process returns no values (other than success/failure codes), so this seems like a good candidate for multithreading.
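A minimal sketch of the current serial version, for context. The directory name and the body of `process_file` are placeholders; the real program does the text processing, external call, and compression there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder for the per-file work: text processing, calling the
# external program, and compressing the result.
sub process_file {
    my ($file) = @_;
    # ... actual processing goes here ...
    return 1;    # success/failure code
}

# Each file is handled one after the other, in serial fashion.
my @files = glob("input_dir/*.txt");    # hypothetical directory
for my $file (@files) {
    my $ok = process_file($file);
    warn "failed: $file\n" unless $ok;
}
```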

My questions:

  1. How do I rewrite my basic loop to take advantage of threads? There appear to be several modules for threading out there.
  2. How do I control how many threads are currently running? If I have N cores available, how do I limit the number of threads to N or N - n?
  3. Do I need to manage the thread count manually or will Perl do that for me?

Any advice would be much appreciated.

  • Grab the list of files, then use a Parallel::ForkManager loop in which the processor is launched using exec. Commented Nov 18, 2014 at 15:18
  • If your program is IO-bound (and it sounds like it might be), then multithreading is not going to speed up your program. It might actually slow it down! Commented Nov 18, 2014 at 15:18
  • @AKHolland, File compression is usually CPU bound Commented Nov 18, 2014 at 15:19
  • @ikegami It depends, and is certainly worth doing some profiling before diving into rewriting his program. Commented Nov 18, 2014 at 15:21
  • @AKHolland, Profiling? You mean benchmarking. Hard to do accurately because of caching, but the following would give an idea: time bash -c 'extprog file1; extprog file2' vs time bash -c 'extprog file1 & extprog file2' Commented Nov 18, 2014 at 15:28

2 Answers


Since your threads would simply launch a process and wait for it to end, it's best to bypass the middleman and just use processes. Unless you're on a Windows system, I'd recommend Parallel::ForkManager for your scenario.

use Parallel::ForkManager qw( );

use constant MAX_PROCESSES => ...;

my $pm = Parallel::ForkManager->new(MAX_PROCESSES);

my @qfns = ...;

for my $qfn (@qfns) {
   my $pid = $pm->start and next;   # parent: move on to the next file
   exec("extprog", $qfn)            # child: replace itself with the external program
      or die("Couldn't exec: $!");
}

$pm->wait_all_children();

If you wanted to avoid needless intermediary threads on Windows, you'd have to use something akin to the following:

use constant MAX_PROCESSES => ...;

my @qfns = ...;

my %children;
for my $qfn (@qfns) {
   # Once we're at the limit, reap a child before spawning another.
   while (keys(%children) >= MAX_PROCESSES) {
      my $pid = wait();
      delete $children{$pid};
   }

   # system(1, ...) spawns the program without waiting for it (Windows).
   my $pid = system(1, "extprog", $qfn);
   ++$children{$pid};
}

while (keys(%children)) {
   my $pid = wait();
   delete $children{$pid};
}

5 Comments

Thanks much. It is a Windows OS, unfortunately, since the external program I'm calling is Windows-based. I can't say I completely understand your comment about Parallel::ForkManager and its performance on Windows, but it sounds like it might still be an option for my situation. I will give it a whirl. Many thanks...
Threads when using Windows? Btw, is there some significant difference between forks and Parallel::ForkManager?
I've had some decent performance improvements on Windows using Parallel::ForkManager. Especially with a bulk copying program I wrote. I highly recommend it. The biggest problem with Windows is... Windows (apologies, I'm a *nix groupie)
Parallel::ForkManager isn't significantly different, but it does iron out a few of the gotchas (like cascading forks).
@Sobrique, you might be interested in the update to my answer.

Someone's given you a forking example. Forks aren't native on Windows, so I'd tend to prefer threading.

For the sake of completeness, here's a rough idea of how threading works (and IMO a worker-pool model like this is one of the better approaches, rather than respawning a thread per task).

#!/usr/bin/perl

use strict;
use warnings;

use threads;

use Thread::Queue;

my $nthreads = 5;

my $process_q = Thread::Queue->new();
my $failed_q  = Thread::Queue->new();

#This is a subroutine, but it runs 'as a thread'.
#When it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is declared as 'shared'.
#Behind the scenes, Thread::Queue objects are 'shared' arrays.

sub worker {
    #NB - this will sit in a loop indefinitely until you close the
    #queue using $process_q->end.
    #We do this once we've queued everything we want to process,
    #so the sub completes and exits neatly.
    #However, if you _don't_ end it, this will sit waiting forever.
    while ( my $server = $process_q->dequeue() ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}

#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);

#We 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns undef once the queue is emptied.
#This means our worker threads (in their 'while' loop) will then exit.
$process_q->end();

#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}

#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}

#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
    print "$server failed to ping\n";
}

If you need to move complicated data structures around, I'd recommend having a look at Storable - specifically freeze and thaw. These will let you shuffle objects, hashes, arrays, etc. through queues easily.
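A minimal sketch of the freeze/thaw round trip - the hash contents here are just illustrative. You'd freeze before enqueueing on a Thread::Queue and thaw on the other side.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Storable qw(freeze thaw);

# A nested structure of the sort Thread::Queue can't share directly.
my $task = {
    file    => 'report_001.txt',
    options => { compress => 1, level => 9 },
};

# freeze() serializes to a plain byte string - safe to pass through
# a queue; thaw() reconstructs a deep copy on the receiving end.
my $frozen = freeze($task);
my $copy   = thaw($frozen);

print "$copy->{file}\n";    # prints "report_001.txt"
```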

Note, though - with any parallel processing option you get good CPU utilisation, but you don't get more disk IO - and that's often the limiting factor.

12 Comments

:shared should perform better than Storable.
It probably would, but I find it gets a bit unpleasant when it comes to nested hashes and objects.
@mpapec, Thread::Queue shares the value, but that's no good for blessed variables. That's when you use Storable. If you want to use Storable, use Thread::Queue::Any instead of Thread::Queue, as it uses Storable to serialize queued values.
@mpapec, Between threads in a process: You could transfer the file descriptor number and reopen it in the other thread. (The tricky part is making sure the sender doesn't close it before the receiver reopens it.) Between processes: Aside from parent to child inheritance, some unix have a system call that can send a file handle from one process to another (sendmsg? can't remember). I don't know if Windows has something similar.
@mpapec, No. You can have two Perl handles use the same system handle, and you can have two system handles that are dups of each other (e.g. a child's STDOUT is often a dup of its parent's). In both cases, both handles are useable, subject to the collisions you'd expect if both try to use it at the same time.
