Performance issues with bash script

Question

I have written a bash script that is responsible for 'collapsing' a log file. Given a log file of the format:

21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line 
message
that may continue 
several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message

The script should collapse each entry onto a single line, joining continuation lines with a separator:

21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line; message; that may continue; several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message

The following bash script achieves this goal, but at an excruciatingly slow pace. A 500 MB input log can take 30 minutes on an 8-core, 32 GB machine.

while read -r line; do

  if [ -z "$line" ]; then
    BUFFER+=$LINE_SEPERATOR
    continue
  fi

  POSSIBLE_DATE='cut -c1-11 <<< $line'
  if [ "$PREV_DATE" == "$POSSIBLE_DATE" ]; then # Usually date won't change, big comparison saving.
    if [ -n "$BUFFER" ]; then
      echo $BUFFER
      BUFFER=""
    fi

    BUFFER+="$line"
  elif [[ "$POSSIBLE_DATE" =~ ^[0-3][0-9]\ [A-Za-z]{3}\ 2[0-9]{3} ]]; then # Valid date.
    PREV_DATE="$POSSIBLE_DATE"
    if [ -n "$BUFFER" ]; then
      echo $BUFFER
      BUFFER=""
    fi

    BUFFER+="$line"
  else
    BUFFER+="$line"
  fi
done

Any ideas how I can optimize this script? The regex doesn't appear to be the bottleneck: after my first optimization (the PREV_DATE comparison), that branch is rarely hit.

Most of the lines in the log file are single lines, so it's just a straight comparison of the first 11 characters, which doesn't seem like it should be so computationally expensive.

Thanks.

Just use Python. It'll be so much better than spawning processes every time you read one line. Or use AWK.
– John Zwinck, Commented Oct 21, 2017 at 10:57
POSSIBLE_DATE='cut -c1-11 <<< $line' unless there is a copy-paste problem, your condition isn't testing what you want it to...
– Mat, Commented Oct 21, 2017 at 10:57
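
Both comments can be addressed without leaving the shell. With single quotes, POSSIBLE_DATE holds the literal text of the command rather than its output; with command substitution it would start a `cut` process for every line. The following is a minimal sketch, not the asker's script: it assumes every entry starts with a date shaped like `21 Oct 2017`, and the function name `collapse_log` and the `; ` separator are illustrative choices. It relies only on `case` patterns and parameter expansion, which run inside the shell.

```shell
#!/usr/bin/env bash
# Sketch: collapse multi-line log entries using only shell builtins.
# Assumption: an entry starts with a date such as "21 Oct 2017 ".
# collapse_log is a hypothetical name, not part of the original script.
collapse_log() {
  local line buffer="" sep="; "
  while IFS= read -r line || [ -n "$line" ]; do
    # Strip trailing whitespace so "Multi line " joins cleanly.
    line="${line%"${line##*[![:space:]]}"}"
    case $line in
      [0-3][0-9]" "[A-Za-z][A-Za-z][A-Za-z]" "2[0-9][0-9][0-9]" "*)
        # New entry: flush the previous one, if any.
        if [ -n "$buffer" ]; then printf '%s\n' "$buffer"; fi
        buffer=$line
        ;;
      "")
        # Skip blank lines.
        ;;
      *)
        # Continuation line: append it with the separator.
        buffer="${buffer:+$buffer$sep}$line"
        ;;
    esac
  done
  # Flush the last entry.
  if [ -n "$buffer" ]; then printf '%s\n' "$buffer"; fi
}
```

It could be run as `collapse_log < input.log > collapsed.log`. A shell `read` loop is still likely to be considerably slower than awk or sed on a file of this size, so treat this as an illustration of where the time goes rather than a recommended fix.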

Rahul Verma · Accepted Answer · 2017-10-21 12:16:04Z

2

Using awk

It will be much faster, as it doesn't spawn a new process for every line.

$ awk '/^[^0-9]/{ORS="; "} /^[0-9]/{$0=(FNR==1)?$0:RS $0; ORS=""} END{printf RS}1' file
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line message; that may continue ; several lines; 
21 Oct 2017 12:38:07 [DEBUG] Single line message

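/^[^0-9]/{ORS="; "}: If the line starts with a non-digit, set the Output Record Separator (ORS) to "; " instead of the default \n.

/^[0-9]/{$0=(FNR==1)?$0:RS $0; ORS=""}: If the line starts with a digit, set ORS="" and prepend RS (i.e. \n) to the record, except on the first line (FNR==1), where we don't want a newline at the start.

The trailing 1 is an always-true pattern whose default action prints the record followed by the current ORS, and END{printf RS} writes the final newline. As the sample output shows, the first continuation line is joined to the dated line without a separator and each multi-line entry ends with a trailing "; ".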
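
If the output has to match the format requested in the question exactly (a separator before every continuation line and no trailing "; "), the same idea can be written with an explicit buffer. This is a sketch rather than the one-liner above; it assumes entries start with a date shaped like `21 Oct 2017`, and the `sep`, `buf` and `have` variables and the `collapse_awk` wrapper name are illustrative.

```shell
# Sketch: buffer each log entry in awk and print it when the next entry starts.
# Assumption: a new entry begins with "DD Mon YYYY "; adjust the regex as needed.
# collapse_awk is a hypothetical wrapper name.
collapse_awk() {
  awk -v sep='; ' '
    # Drop trailing whitespace from every line.
    { sub(/[ \t\r]+$/, "") }

    # A dated line starts a new entry: flush the previous one first.
    /^[0-3][0-9] [A-Za-z][A-Za-z][A-Za-z] 2[0-9][0-9][0-9] / {
      if (have) print buf
      buf = $0
      have = 1
      next
    }

    # Any other non-empty line continues the current entry.
    $0 != "" {
      buf = (have ? buf sep : "") $0
      have = 1
    }

    # Print the last buffered entry.
    END { if (have) print buf }
  ' "$@"
}
```

It could be run as `collapse_awk file > collapsed.log`. The date regex avoids interval expressions such as `{3}` because not every awk implementation supports them.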

answered Oct 21, 2017 at 12:16

Jamie Over a year ago

Thanks! This works great. I had to modify the regex to be a little more aggressive in parsing the start dates, but it's working brilliantly. Thanks for the explanation.

ctac_ · Accepted Answer · 2017-10-21 12:10:40Z

1

You can use sed

sed ':B;/^[0-9][0-9]* /N;/\n[0-9][0-9]* /!{s/\n/; /;bB};h;s/\n.*//p;x;s/.*\n//;tB' infile

You can adjust the regex '[0-9][0-9]* ' to suit your log format. It appears twice in the command, and both occurrences need to change together.
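
As one possible adjustment, the generic pattern can be swapped in both places for one that only matches a full date. This is a sketch assuming GNU sed and dates shaped like `21 Oct 2017`; the `collapse_sed` wrapper name and the `d` variable are illustrative. Like the command above, it assumes the first line of the file starts a log entry.

```shell
# Sketch: the sed command above with a stricter date pattern (assumes GNU sed).
# collapse_sed is a hypothetical wrapper name.
collapse_sed() {
  # Basic regular expression for "DD Mon YYYY ", used in both places.
  local d='[0-3][0-9] [A-Za-z]\{3\} 2[0-9]\{3\} '
  # The script stays single-quoted; only the date pattern is spliced in.
  sed ':B;/^'"$d"'/N;/\n'"$d"'/!{s/\n/; /;bB};h;s/\n.*//p;x;s/.*\n//;tB' "$@"
}
```

It could be run as `collapse_sed infile > collapsed.log`. Trailing spaces in the input are kept as they are, so a line ending in a space is joined as "Multi line ; message".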

answered Oct 21, 2017 at 12:10
