
I have written a bash script that is responsible for 'collapsing' a log file. Given a log file of the format:

21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line 
message
that may continue 
several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message

Collapse the log file so that each entry occupies a single line, with continuation lines joined by a separator character:

21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line; message; that may continue; several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message

The following bash script achieves this goal, but at an excruciatingly slow pace. A 500 MB input log may take 30 minutes on an 8-core, 32 GB machine.

while read -r line; do

  if [ -z "$line" ]; then
    BUFFER+=$LINE_SEPERATOR
    continue
  fi

  POSSIBLE_DATE='cut -c1-11 <<< $line'
  if [ "$PREV_DATE" == "$POSSIBLE_DATE" ]; then # Usually date won't change, big comparison saving.
    if [ -n "$BUFFER" ]; then
      echo $BUFFER
      BUFFER=""
    fi

    BUFFER+="$line"
  elif [[ "$POSSIBLE_DATE" =~ ^[0-3][0-9]\ [A-Za-z]{3}\ 2[0-9]{3} ]]; then # Valid date.
    PREV_DATE="$POSSIBLE_DATE"
    if [ -n "$BUFFER" ]; then
      echo $BUFFER
      BUFFER=""
    fi

    BUFFER+="$line"
  else
    BUFFER+="$line"
  fi
done

Any ideas on how I can optimize this script? The regex doesn't appear to be the bottleneck: since my first optimization (comparing against the previous date), that branch is rarely hit.

Most of the lines in the log file are single-line entries, so it's just a straight comparison of the first 11 characters, which doesn't seem like it should be so computationally expensive.

Thanks.

  • Just use Python. It'll be so much better than spawning processes every time you read one line. Or use AWK. Commented Oct 21, 2017 at 10:57
  • POSSIBLE_DATE='cut -c1-11 <<< $line' unless there is a copy-paste problem, your condition isn't testing what you want it to... Commented Oct 21, 2017 at 10:57
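
The per-line process spawning mentioned in the comments can be avoided without leaving the shell. The following is a minimal sketch, not the asker's script: the function name `collapse_log` and the variables `sep` and `buffer` are made up for illustration, and it assumes every record starts with a date shaped like `21 Oct 2017`. In bash, `${line:0:11}` would replace the `cut` call directly; the sketch goes one step further and matches the start of the line with a `case` glob pattern, which also runs entirely inside the shell. Like the original loop, it uses `read -r` with the default `IFS`, so leading and trailing whitespace on each line is dropped.

```shell
# collapse_log is a hypothetical helper; it reads a log on stdin.
collapse_log() {
  local sep buffer line
  sep='; '
  buffer=''
  # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
  while read -r line || [ -n "$line" ]; do
    case $line in
      # Assumed record start: "DD Mon YYYY". Matched by the shell itself,
      # so no subprocess is started for the line.
      [0-3][0-9]\ [A-Za-z][A-Za-z][A-Za-z]\ 2[0-9][0-9][0-9]*)
        if [ -n "$buffer" ]; then printf '%s\n' "$buffer"; fi
        buffer=$line
        ;;
      *)
        # Continuation line: append it after the separator.
        buffer=${buffer:+$buffer$sep}$line
        ;;
    esac
  done
  # Flush the last record.
  if [ -n "$buffer" ]; then printf '%s\n' "$buffer"; fi
}
```

It could be run as `collapse_log < app.log > collapsed.log`, where both file names are placeholders. A shell loop like this is still likely to be slower than awk or sed on large inputs, since `read` processes the input one line at a time.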

2 Answers


Using awk

It will be much faster, as it won't spawn a new process for every line.

$ awk '/^[^0-9]/{ORS="; "} /^[0-9]/{$0=(FNR==1)?$0:RS $0; ORS=""} END{printf RS}1' file
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line message; that may continue ; several lines; 
21 Oct 2017 12:38:07 [DEBUG] Single line message

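/^[^0-9]/{ORS="; "} : If the line starts with a non-digit, set the Output Record Separator (ORS) to "; " instead of the default \n.

/^[0-9]/{$0=(FNR==1)?$0:RS $0; ORS=""} : If the line starts with a digit, set ORS="" and prepend RS (i.e. \n) to the record, except on the first line (FNR==1), where we don't want a newline at the start.

The trailing 1 is an always-true pattern, so every record is printed followed by the current ORS, and END{printf RS} emits the final newline.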
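
In the sample output above, the multi-line record ends with a trailing "; " and its first two lines are joined without a separator. If the exact format from the question is wanted, one possible variant buffers each record and prints it when the next one starts. This is a sketch under assumptions rather than a drop-in replacement: the wrapper name `collapse_awk` and the variables `sep`, `buf` and `have` are made up for illustration, and the record-start pattern assumes dates shaped like `21 Oct 2017`. The three-letter month is spelled out as three bracket expressions because not every awk supports `{3}` intervals.

```shell
# collapse_awk is a hypothetical wrapper; it takes file names or reads stdin.
collapse_awk() {
  awk -v sep='; ' '
    # Drop trailing blanks so the separator sits right after the text.
    { sub(/[ \t]+$/, "") }

    # Assumed record start: "DD Mon YYYY " at the beginning of the line.
    /^[0-3][0-9] [A-Za-z][A-Za-z][A-Za-z] 2[0-9][0-9][0-9] / {
      if (have) print buf      # flush the previous record
      buf = $0
      have = 1
      next
    }

    # Continuation line: append it to the current record.
    {
      buf = have ? buf sep $0 : $0
      have = 1
    }

    # Flush the last record, if any input was seen.
    END { if (have) print buf }
  ' "$@"
}
```

For example, `collapse_awk app.log > collapsed.log`, with both file names as placeholders. Passing the separator through `-v sep=...` keeps it in one place if a different character is needed.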


1 Comment

Thanks! This works great. I had to modify the regex to be a little more aggressive in parsing the start dates, but it's working brilliantly. Thanks for the explanation.

You can use sed

sed ':B;/^[0-9][0-9]* /N;/\n[0-9][0-9]* /!{s/\n/; /;bB};h;s/\n.*//p;x;s/.*\n//;tB' infile

You can adjust the regex '[0-9][0-9]* ' to suit your needs.
