
I have a bash script which iterates over many files: f1.gz, f2.gz, ..., fn.gz. Each file contains millions of lines, and each line can match one pattern out of a set: p1, p2, ..., pn. Depending on which pattern matches, the line should go to a specific output file. The patterns are obtained with date manipulations.

I wrote a couple of versions of this, but I'm not satisfied with them at all, and I would like to ask whether a better way/solution can be achieved without resorting to a compiled language.

Here's what I have:

for FILE in `ls f*.gz`
do

    echo "uncompressing only once per file -- $FILE: " 
    gzcat $FILE > .myfile.txt

    while IFS='' read -r LINE || [[ -n "$LINE" ]]; do

        for DATE in "$@" # I pass to my script several dates like 20201015, 20201014, etc
        do
            for i in {0..23}; 
            do
                p="DATE_PATTERNS_$DATE[$i]" # I prepared these before to avoid running "date" millions of times
                echo $LINE | awk -v pat=${!p} -F '"' '$1 ~ pat {print $2" "$4" "$6}' >> $DATE.txt
            done
        done

    done < .myfile.txt
done 

Thanks

  • So many issues. Start by checking shellcheck.net Commented Oct 15, 2020 at 20:20
  • Yes. Quoting expansions will prevent filename expansion and word splitting, which by itself means less work for the shell. How is DATE_PATTERNS_$DATE[$i] generated? How did you prepare it? echo $LINE - is this just a constant pattern? Commented Oct 15, 2020 at 20:24
  • I suspect your task can be done much faster, and in a more robust way, with one awk command for all files. It would help if you added some sample lines of input and expected output. Commented Oct 15, 2020 at 20:31
  • Load your 120 patterns into a file, then pass 2 files to awk: awk '{ ....}' pattern_file myfile.txt. Have awk load the first file into an array, then, while parsing the 2nd file, look for the desired field 'in' the array; a search on awk load array files FNR==NR NR==FNR will bring up a ton of hits (a minimal sketch follows these comments). Net result: one awk per file, each line scanned just once. Commented Oct 15, 2020 at 20:53
  • I don't have time to take all of that in, but I did notice gzcat $ifile > tmp. Look at processing like gunzip -c $file | awk -v inputList="....." ' ...', where your awk accepts a list of dates/conditions that it will filter for and uses internal print $0 > "/path/to/data/file.txt" commands to generate your output. Good luck. Commented Oct 15, 2020 at 21:14

1 Answer


If you don't want to replace the code with a single awk that loops over the dates, you can start by removing the while loop (and opening the output file less often):

for FILE in f*.gz; do
   echo "uncompressing only once per file -- $FILE: "
   gzcat "$FILE" > .myfile.txt

   # I pass to my script several dates like 20201015, 20201014, etc
   for DATE in "$@"; do
      for i in {0..23}; do
         p="DATE_PATTERNS_$DATE[$i]"
         awk -v pat="${!p}" -F '"' '$1 ~ pat {print $2" "$4" "$6}' .myfile.txt
      done >> "$DATE.txt"
   done
done

Once you have tried this and still want improvements, consider moving the for DATE and for i loops into awk, and/or starting with gzcat f*.gz > .mycombinedfiles.txt (when disk space is no issue).
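As a rough illustration of that direction (a sketch, not the answer's code): assuming the precomputed patterns live in bash arrays named DATE_PATTERNS_<date> as in the question, that a pattern contains no whitespace, and that each line matches at most one pattern, the whole job can collapse into one dump of the patterns plus one awk pass:

# 1. Dump every date/pattern pair once (array names taken from the question,
#    file name .patterns.txt is made up).
for DATE in "$@"; do
   ref="DATE_PATTERNS_$DATE[@]"
   for pat in "${!ref}"; do
      printf '%s %s\n' "$DATE" "$pat"
   done
done > .patterns.txt

# 2. One pass over all the data; awk itself routes each line to <date>.txt.
gzcat f*.gz | awk '
   NR == FNR { date[$2] = $1; next }        # first file: pattern -> date map
   {
      for (pat in date)
         if ($1 ~ pat) { print $2" "$4" "$6 >> (date[pat] ".txt"); break }
   }
' .patterns.txt FS='"' -

The FS='"' between the two file arguments switches the field separator only for the piped data, so the pattern file is still read with the default whitespace separator.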


1 Comment

Thanks. I did part of this yesterday and the script is now workable. As a note, it's actually faster to do a grep pattern | awk than an awk with regex matching. I'll try the >> $DATE.txt too.
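For reference, such a prefilter looks roughly like this (illustrative only; note grep matches anywhere on the line, whereas the original awk tested field 1 only, so this is equivalent only when that difference doesn't matter):

grep "${!p}" .myfile.txt | awk -F '"' '{print $2" "$4" "$6}'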
