
I have a Perl program that takes over 13 hours to run. I think it could benefit from multithreading, but I have never done this before and I'm at a loss as to how to begin.

Here is my situation: I have a directory of hundreds of text files. I loop through every file in the directory with a basic for loop and do some processing (text processing on the file itself, calling an outside program on it, and compressing it). When that's done I move on to the next file. I continue this way, handling each file one after the other, in serial fashion. The files are completely independent of each other, and the process returns no values (other than success/failure codes), so this seems like a good candidate for multithreading.
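A minimal sketch of the current serial version, for context. The directory name and the body of `process_file` are placeholders; the real program does the text processing, external call, and compression there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder for the per-file work: text processing, calling the
# external program, and compressing the result.
sub process_file {
    my ($file) = @_;
    # ... actual processing goes here ...
    return 1;    # success/failure code
}

# Each file is handled one after the other, in serial fashion.
my @files = glob("input_dir/*.txt");    # hypothetical directory
for my $file (@files) {
    my $ok = process_file($file);
    warn "failed: $file\n" unless $ok;
}
```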

My questions:

  1. How do I rewrite my basic loop to take advantage of threads? There appear to be several modules for threading out there.
  2. How do I control how many threads are currently running? If I have N cores available, how do I limit the number of threads to N or N - n?
  3. Do I need to manage the thread count manually or will Perl do that for me?

Any advice would be much appreciated.

  • Grab the list of files, then use a Parallel::ForkManager loop in which the processor is launched using exec. Commented Nov 18, 2014 at 15:18
  • If your program is IO-bound (and it sounds like it might be), then multithreading is not going to speed up your program. It might actually slow it down! Commented Nov 18, 2014 at 15:18
  • @AKHolland, File compression is usually CPU bound Commented Nov 18, 2014 at 15:19
  • @ikegami It depends, and is certainly worth doing some profiling before diving into rewriting his program. Commented Nov 18, 2014 at 15:21
  • @AKHolland, Profiling? You mean benchmarking. Hard to do accurately because of caching, but the following would give an idea: time bash -c 'extprog file1; extprog file2' vs time bash -c 'extprog file1 & extprog file2' Commented Nov 18, 2014 at 15:28

2 Answers


Since your threads would simply launch a process and wait for it to end, it's best to bypass the middleman and just use processes. Unless you're on a Windows system, I'd recommend Parallel::ForkManager for your scenario.

use Parallel::ForkManager qw( );

use constant MAX_PROCESSES => ...;

my $pm = Parallel::ForkManager->new(MAX_PROCESSES);

my @qfns = ...;

for my $qfn (@qfns) {
   my $pid = $pm->start and next;   # parent: move on to the next file
   exec("extprog", $qfn)            # child: replace itself with the external program
      or die("Couldn't exec: $!");
}

$pm->wait_all_children();

If you wanted to avoid needless intermediary threads on Windows, you'd have to use something akin to the following:

use constant MAX_PROCESSES => ...;

my @qfns = ...;

my %children;
for my $qfn (@qfns) {
   # Once we're at the limit, reap a child before spawning another.
   while (keys(%children) >= MAX_PROCESSES) {
      my $pid = wait();
      delete $children{$pid};
   }

   # system(1, ...) spawns the program without waiting for it (Windows).
   my $pid = system(1, "extprog", $qfn);
   ++$children{$pid};
}

while (keys(%children)) {
   my $pid = wait();
   delete $children{$pid};
}

5 Comments

Thanks much. It is a Windows OS, unfortunately, since the external program I'm calling is Windows-based. I can't say I completely understand your comment about Parallel::ForkManager and its performance on Windows, but it sounds like it might still be an option for my situation. I will give it a whirl. Many thanks...
Threads when using Windows? Btw, is there some significant difference between forks and Parallel::ForkManager?
I've had some decent performance improvements on Windows using Parallel::ForkManager. Especially with a bulk copying program I wrote. I highly recommend it. The biggest problem with Windows is... Windows (apologies, I'm a *nix groupie)
Parallel::ForkManager isn't significantly different, but it does iron out a few of the gotchas (like cascading forks).
@Sobrique, you might be interested in the update to my answer.

Someone's given you a forking example. Forks aren't native on Windows, so I'd tend to prefer threading.

For the sake of completeness, here's a rough idea of how threading works (and IMO a worker-pool model like this is one of the better approaches, rather than respawning a thread per task).

#!/usr/bin/perl

use strict;
use warnings;

use threads;

use Thread::Queue;

my $nthreads = 5;

my $process_q = Thread::Queue->new();
my $failed_q  = Thread::Queue->new();

#This is a subroutine, but it runs 'as a thread'.
#When it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is declared as 'shared'.
#Behind the scenes, Thread::Queue objects are 'shared' arrays.

sub worker {
    #NB - this will sit in a loop indefinitely until you close the
    #queue using $process_q->end.
    #We do this once we've queued everything we want to process,
    #so the sub completes and exits neatly.
    #However, if you _don't_ end it, this will sit waiting forever.
    while ( my $server = $process_q->dequeue() ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}

#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);

#We 'end' process_q - when we do, no more items may be inserted,
#and 'dequeue' returns undef once the queue is emptied.
#This means our worker threads (in their 'while' loop) will then exit.
$process_q->end();

#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}

#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}

#collate results. ('synchronise' operation)
while ( my $server = $failed_q->dequeue_nb() ) {
    print "$server failed to ping\n";
}

If you need to move complicated data structures around, I'd recommend having a look at Storable - specifically freeze and thaw. These will let you shuffle objects, hashes, arrays, etc. through queues easily.
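A minimal sketch of the freeze/thaw round trip - the hash contents here are just illustrative. You'd freeze before enqueueing on a Thread::Queue and thaw on the other side.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Storable qw(freeze thaw);

# A nested structure of the sort Thread::Queue can't share directly.
my $task = {
    file    => 'report_001.txt',
    options => { compress => 1, level => 9 },
};

# freeze() serializes to a plain byte string - safe to pass through
# a queue; thaw() reconstructs a deep copy on the receiving end.
my $frozen = freeze($task);
my $copy   = thaw($frozen);

print "$copy->{file}\n";    # prints "report_001.txt"
```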

Note, though - with any parallel processing option you get good CPU utilisation, but you don't get more disk IO - and that's often the limiting factor.

12 Comments

:shared should perform better than Storable.
It probably would, but I find it gets a bit unpleasant when it comes to nested hashes and objects.
@mpapec, Thread::Queue shares the value, but that's no good for blessed variables. That's when you use Storable. If you want to use Storable, use Thread::Queue::Any instead of Thread::Queue, as it uses Storable to serialize queued values.
@mpapec, Between threads in a process: You could transfer the file descriptor number and reopen it in the other thread. (The tricky part is making sure the sender doesn't close it before the receiver reopens it.) Between processes: Aside from parent to child inheritance, some unix have a system call that can send a file handle from one process to another (sendmsg? can't remember). I don't know if Windows has something similar.
@mpapec, No. You can have two Perl handles use the same system handle, and you can have two system handles that are dups of each other (e.g. a child's STDOUT is often a dup of its parent's). In both cases, both handles are useable, subject to the collisions you'd expect if both try to use it at the same time.
