
I have a bash script which iterates over many files: f1.gz, f2.gz, ..., fn.gz. Each file contains millions of lines, and each line can match one pattern out of a set: p1, p2, ..., pn. Depending on which pattern matches, the line should go to a specific output file. The patterns are obtained with date manipulations.

I wrote a couple of versions of this, but I'm not satisfied with them at all, and I would like to ask whether a better way/solution can be achieved without resorting to a compiled language.

Here's what I have:

for FILE in `ls f*.gz`
do

    echo "uncompressing only once per file -- $FILE: " 
    gzcat $FILE > .myfile.txt

    while IFS='' read -r LINE || [[ -n "$LINE" ]]; do

        for DATE in "$@" # I pass to my script several dates like 20201015, 20201014, etc
        do
            for i in {0..23}; 
            do
                p="DATE_PATTERNS_$DATE[$i]" # I prepared these before to avoid running "date" millions of times
                echo $LINE | awk -v pat=${!p} -F '"' '$1 ~ pat {print $2" "$4" "$6}' >> $DATE.txt
            done
        done

    done < .myfile.txt
done 

Thanks

  • So many issues. Start by checking shellcheck.net Commented Oct 15, 2020 at 20:20
  • Yes. Quoting expansions will prevent filename expansion and word splitting, which by itself means less work for the shell. How is DATE_PATTERNS_$DATE[$i] generated? How did you prepare it? echo $LINE - is this just a constant pattern? Commented Oct 15, 2020 at 20:24
  • I suspect your task can be done much faster, and in a more robust way, with one awk command for all files. It would help if you added some sample lines of input and expected output. Commented Oct 15, 2020 at 20:31
  • Load your 120 patterns into a file, then pass 2 files to awk: awk '{ ....}' pattern_file myfile.txt. Have awk load the first file into an array, then, while parsing the 2nd file, look for the desired field 'in' the array; a search on awk load array files FNR==NR NR==FNR will bring up a ton of hits (a minimal sketch follows these comments). Net result: one awk per file, each line scanned just once. Commented Oct 15, 2020 at 20:53
  • I don't have time to take all of that in, but I did notice gzcat $ifile > tmp. Look at processing like gunzip -c $file | awk -v inputList="....." ' ...', where your awk accepts a list of dates/conditions that it will filter for and uses internal print $0 > "/path/to/data/file.txt" commands to generate your output. Good luck. Commented Oct 15, 2020 at 21:14

1 Answer


If you don't want to replace the code with a single awk that loops over the dates, you can start by removing the while loop (and opening the output file less often):

for FILE in f*.gz; do
   echo "uncompressing only once per file -- $FILE: "
   gzcat "$FILE" > .myfile.txt

   # I pass to my script several dates like 20201015, 20201014, etc
   for DATE in "$@"; do
      for i in {0..23}; do
         p="DATE_PATTERNS_$DATE[$i]"
         awk -v pat="${!p}" -F '"' '$1 ~ pat {print $2" "$4" "$6}' .myfile.txt
      done >> "$DATE.txt"
   done
done

Once you have tried this and still want improvements, consider moving the for DATE and for i loops into awk, and/or starting with gzcat f*.gz > .mycombinedfiles.txt (when disk space is no issue).
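As a rough illustration of that direction (a sketch, not the answer's code): assuming the precomputed patterns live in bash arrays named DATE_PATTERNS_<date> as in the question, that a pattern contains no whitespace, and that each line matches at most one pattern, the whole job can collapse into one dump of the patterns plus one awk pass:

# 1. Dump every date/pattern pair once (array names taken from the question,
#    file name .patterns.txt is made up).
for DATE in "$@"; do
   ref="DATE_PATTERNS_$DATE[@]"
   for pat in "${!ref}"; do
      printf '%s %s\n' "$DATE" "$pat"
   done
done > .patterns.txt

# 2. One pass over all the data; awk itself routes each line to <date>.txt.
gzcat f*.gz | awk '
   NR == FNR { date[$2] = $1; next }        # first file: pattern -> date map
   {
      for (pat in date)
         if ($1 ~ pat) { print $2" "$4" "$6 >> (date[pat] ".txt"); break }
   }
' .patterns.txt FS='"' -

The FS='"' between the two file arguments switches the field separator only for the piped data, so the pattern file is still read with the default whitespace separator.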


1 Comment

Thanks. I did part of this yesterday and the script is now workable. As a note, it's actually faster to do a grep pattern | awk than an awk with regex matching. I'll try the >> $DATE.txt too.
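For reference, such a prefilter looks roughly like this (illustrative only; note grep matches anywhere on the line, whereas the original awk tested field 1 only, so this is equivalent only when that difference doesn't matter):

grep "${!p}" .myfile.txt | awk -F '"' '{print $2" "$4" "$6}'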
