
I would like to parse an Apache access log with respect to IP addresses. I used the following code, but it took nearly 90 seconds.

grep "^$CLIENT_IP" /var/log/http/access.log > /tmp/access-$CLIENT_IP.log

Then I tried the following alternative:

sed -i -e "/^$CLIENT_IP/w /tmp/access-$CLIENT_IP.log" -e '//d' /var/log/http/access.log

Even this took 60+ seconds.

There are 1200 IPs to parse. Is there any way to implement parallelism to reduce the runtime?

  • Why did you use -i with sed? Did that not modify the original log file? Commented Apr 10, 2018 at 17:09
  • Is it a requirement to save the parsed log to separate files? Commented Apr 10, 2018 at 17:38
  • -i is used for moving the lines from one file to another. Yes, that's the requirement. Commented Apr 10, 2018 at 19:24
  • No, -i is for doing in-place editing of the input file. Commented Apr 10, 2018 at 19:25
  • I'm deleting those copied lines from the input file. Commented Apr 10, 2018 at 19:39

3 Answers


I'm assuming that you're doing this in a shell loop over all IP addresses, possibly with the IP addresses coming from a text file. Yes, that would be slow, with one invocation of sed or grep per IP address.

Instead, you may get away with a single use of sed, if you prepare carefully.

First, we have to create a sed script, and we do that from a file ip.list which contains the IP addresses, one address per line:

sed -e 'h' \
    -e 's/\./\\./g' \
    -e 's#.*#/^&[[:blank:]]/w /tmp/access-#' \
    -e 'G' \
    -e 's/\n//' \
    -e 's/$/.log/' ip.list >ip.sed

This sed program does the following for each IP address:

  1. Copy the address to the "hold space" (an extra buffer in sed).
  2. Change . in the "pattern space" (the input line) into \. (to match the dots properly, your code did not do this).
  3. Prepend ^ and append [[:blank:]]/w /tmp/access- to the pattern space.
  4. Append the unmodified input line from the hold space to the pattern space with a newline in-between.
  5. Delete that newline.
  6. Append .log to the end of the line (and implicitly output the result).

For a file that contains

127.0.0.1
10.0.0.1
10.0.0.100

this would create the sed script

/^127\.0\.0\.1[[:blank:]]/w /tmp/access-127.0.0.1.log
/^10\.0\.0\.1[[:blank:]]/w /tmp/access-10.0.0.1.log
/^10\.0\.0\.100[[:blank:]]/w /tmp/access-10.0.0.100.log

Note that you will have to match a blank character (space or tab) after the IP address, otherwise the log entries for 10.0.0.100 would go into the /tmp/access-10.0.0.1.log file. Your code omitted this.

This can then be used on your log file (no looping):

sed -n -f ip.sed /var/log/http/access.log

I haven't ever tested writing to 1200 files from one and the same sed script. If it doesn't work, then try the below awk variation instead.
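Should a single script run into the system's open-file limit, one possible workaround (an untested sketch; the chunk size of 250 is arbitrary) is to split the generated script into pieces and run sed once per piece, at the cost of reading the log once per chunk:

split -l 250 ip.sed ip.sed.part.
for part in ip.sed.part.*; do
    sed -n -f "$part" /var/log/http/access.log
done
rm -f ip.sed.part.*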


A similar solution with awk involves reading the IP addresses into an array first and then matching them against each row. This requires one single awk invocation:

awk 'FNR == NR  { list[$1] = 1; next }
     $1 in list { name = "/tmp/access-" $1 ".log"; print >> name; close(name) }' ip.list /var/log/http/access.log

Here, we give awk both the IP list and the log file at the same time. When NR == FNR we know we're still reading the first file (the list), and we add the IP numbers into the associative array list as keys, and continue with the next line of input.

If the FNR == NR condition is not true, we're reading from the second file (the log file) and we test whether the very first field of the input line is a key in list (this is a plain string comparison, not a regular expression match). If it is, we append the line to the appropriately named file.

We have to be careful with closing the output file, as we might otherwise run out of open file descriptors. So there's going to be a lot of opening and closing of files for appending, but it's still going to be faster than calling awk (or any utility) once per IP address.
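If all that opening and closing turns out to be the bottleneck, a possible variation (untested sketch) is to sort the log on its first field first, so that each output file only ever needs to be opened and closed once. Note that -s (stable sort, to preserve the original order of each IP's lines) is a GNU/BSD extension, and sorting a large log has a cost of its own:

sort -s -k1,1 /var/log/http/access.log |
awk 'FNR == NR  { list[$1] = 1; next }
     $1 in list { name = "/tmp/access-" $1 ".log"
                  if (name != prev) { if (prev != "") close(prev); prev = name }
                  print >> name }' ip.list -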


I'd be interested in knowing if these things work for you and what the approximate running time might be. I have tested the solutions only on extremely small sets of data.


Of course, we could go with your idea of just brute-forcing it by throwing multiple instances of e.g. grep at the system in parallel:

Ignoring the fact that we don't match the dots in the IP addresses correctly, we might do something like

xargs -P 4 -n 100 sh -c '
    for n do
        grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
    done' sh <ip.list

Here, xargs will give at most 100 IP addresses at a time from the ip.list file to a short shell script. It will arrange for four parallel invocations of the script.

The short shell script:

for n do
    grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
done

This will just iterate over the 100 IP addresses that xargs gives it on its command line, and apply pretty much the same grep command that you had, the difference is that there will be four of these loops running in parallel.

Increase -P 4 to -P 16 or something related to the number of CPUs that you have. The speedup probably would not be linear as each parallel instance of grep would read from and write to the same disk.
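To tie the parallelism to the number of available CPUs, you could for example use nproc (a GNU coreutils tool, not POSIX; on BSD systems something like sysctl -n hw.ncpu would serve the same purpose):

xargs -P "$(nproc)" -n 100 sh -c '
    for n do
        grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
    done' sh <ip.list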

Except for the -P flag to xargs, all things in this answer should be able to run on any POSIX system. The -P flag for xargs is non-standard but implemented in GNU xargs and on BSD systems.


For various approaches, see https://stackoverflow.com/questions/9066609/fastest-possible-grep

In addition to that, if you're doing this a lot then an SSD is probably the way to go. Touching the hard disk is the killer for something like this.

You have a large number of different greps to run. Make a script that launches the grep commands (say, one per core) in the background, tracks when they're done, and launches more as they complete.

When I did this I could get all 12 cores running at 100% CPU usage, but you may find your limiting resource to be something else. Given that all your jobs want the same file, if you're not on an SSD you might want to copy that file around so they're not all sharing it.
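A minimal sketch of such a launcher, assuming the IPs live in a file ip.list (one per line) and using plain background jobs with wait; NJOBS is just a knob for how many greps run at once. This version launches jobs in batches; bash's wait -n would let you refill the pool as soon as any single job finishes:

NJOBS=12    # roughly one job per core
running=0
while read -r ip; do
    grep "^$ip[[:blank:]]" /var/log/http/access.log >"/tmp/access-$ip.log" &
    running=$((running + 1))
    if [ "$running" -ge "$NJOBS" ]; then
        wait            # let the current batch finish before starting more
        running=0
    fi
done <ip.list
wait                    # wait for the final batch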


If /var/log/http/access.log is bigger than RAM and thus cannot be cached, then running more processes in parallel can be a good alternative to reading access.log multiple times, especially if you have multiple cores. This will run one grep per IP in parallel (plus a couple of helper wrapper processes).

pargrep() {
    # Send standard input to grep with different match strings in parallel
    # This command would be enough if you only have 250 match strings
    parallel --pipe --tee grep ^{} '>' /tmp/access-{}.log ::: "$@"
}
export -f pargrep
# Standard input is tee'ed to several pargreps.
# Each pargrep gets 250 match strings and thus starts 250 processes.
# For 1200 ips this starts 3600 processes taking around 1 GB RAM,
# but it reads access.log only once
cat /var/log/http/access.log |
  parallel --pipe --tee -N250 pargrep {} :::: ips
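
Here ips is assumed to be a file with the 1200 IP addresses, one per line (the same format as ip.list above); the :::: tells parallel to read its arguments from that file.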
