I have written a bash script that "collapses" a log file. Given a log file of the format:
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line
message
that may continue
several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message
the script collapses each multi-line entry onto a single line, joining continuation lines with a separator character:
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line; message; that may continue; several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message
The following bash script achieves this goal, but at an excruciatingly slow pace: a 500 MB input log may take 30 minutes on an 8-core, 32 GB machine.
while read -r line; do
    if [ -z "$line" ]; then
        BUFFER+=$LINE_SEPARATOR
        continue
    fi
    POSSIBLE_DATE=$(cut -c1-11 <<< "$line")
    if [ "$PREV_DATE" == "$POSSIBLE_DATE" ]; then # Usually the date won't change, big comparison saving.
        if [ -n "$BUFFER" ]; then
            echo "$BUFFER"
            BUFFER=""
        fi
        BUFFER+="$line"
    elif [[ "$POSSIBLE_DATE" =~ ^[0-3][0-9]\ [A-Za-z]{3}\ 2[0-9]{3} ]]; then # Valid date.
        PREV_DATE="$POSSIBLE_DATE"
        if [ -n "$BUFFER" ]; then
            echo "$BUFFER"
            BUFFER=""
        fi
        BUFFER+="$line"
    else
        BUFFER+="$LINE_SEPARATOR$line" # Continuation line: append with separator.
    fi
done
if [ -n "$BUFFER" ]; then # Flush the final buffered entry.
    echo "$BUFFER"
fi
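For what it's worth, the dominant cost in a loop like this is usually not the string comparison but the command substitution around cut, which forks a subshell plus an external process for every single input line. Bash's substring parameter expansion does the same 11-character extraction in-process. A minimal sketch (the sample line is taken from the log format above):

```shell
line='21 Oct 2017 12:38:03 [DEBUG] Single line message'

# Forks a subshell and a cut process on every iteration:
slow=$(cut -c1-11 <<< "$line")

# Pure-bash substring expansion: same result, no fork:
fast=${line:0:11}

printf '%s\n' "$slow" "$fast"   # both print: 21 Oct 2017
```

Over millions of lines, eliminating one fork per line is typically the difference between minutes and seconds.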
Any ideas how I can optimize this script? The regex doesn't appear to be the bottleneck (that was my first optimization), as that condition is now rarely hit.
Most of the lines in the log file are single-line entries, so it's just a straight comparison of the first 11 characters; it doesn't seem like that should be so computationally expensive.
Thanks.
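One approach worth considering: line-by-line record collapsing is exactly what awk is built for, and it processes the whole file in a single process instead of iterating in the shell. A sketch, assuming "; " as the separator and the same leading-date test as the script above (the printf merely feeds in the sample lines from the question; in practice you would pass the log file to awk directly):

```shell
printf '%s\n' \
  '21 Oct 2017 12:38:03 [DEBUG] Single line message' \
  '21 Oct 2017 12:38:05 [DEBUG] Multi line' \
  'message' \
  'that may continue' \
  'several lines' \
  '21 Oct 2017 12:38:07 [DEBUG] Single line message' |
awk '
  /^[0-3][0-9] [A-Za-z][A-Za-z][A-Za-z] 2[0-9][0-9][0-9] / {
    if (buf != "") print buf   # a dated line starts a new record: flush the old one
    buf = $0
    next
  }
  { buf = buf "; " $0 }        # continuation line: append with the separator
  END { if (buf != "") print buf }   # flush the last buffered record
'
```

This prints the three collapsed lines shown in the expected output above. The regex avoids `{3}` interval notation so it also works on awk implementations (such as mawk) that lack interval support.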
Regarding POSSIBLE_DATE='cut -c1-11 <<< $line': unless there is a copy-paste problem, your condition isn't testing what you want it to...
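To illustrate the comment's point: with single quotes, the variable is assigned the literal text of the command rather than its output, so the date comparison would silently compare against a constant string. Command substitution, $( ), is what actually runs cut. A small sketch (the sample line is taken from the log format above):

```shell
line='21 Oct 2017 12:38:03 [DEBUG] Single line message'

literal='cut -c1-11 <<< $line'     # single quotes: the command text itself is assigned
actual=$(cut -c1-11 <<< "$line")   # command substitution: cut actually runs

echo "$literal"   # prints: cut -c1-11 <<< $line
echo "$actual"    # prints: 21 Oct 2017
```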