3
\$\begingroup\$

This script finds the first line in a file whose timestamp is greater than or equal to a given one. Before comparing, it normalizes each number to 13 digits (UNIX_MS, i.e. Unix time in milliseconds). My problem is that it is very slow. I wanted an alternative to grepping for one specific UNIX_MS timestamp, not finding it, and then having to grep several more times.

For the output I wanted the line number in the file (for slicing) as well as the original line (to confirm/inspect).

I'm specifically looking for optimizations, as I'd like this to be as close as possible to the speed of grepping for a single timestamp.

Usage: ./script.sh file UNIX_MS

#! /bin/bash
# return the first number found that's greater than the provided input number
res=
linenum=
count=0
returnline=             # hold onto the line for return
tocheck=$2
tocheck=$(($tocheck*(10**(( ${#tocheck} - 13) * -1))))
inseconds=$(($tocheck/1000))
date=$(date -r $inseconds)
echo "Looking for first timestamp -ge to $date.."
while read line;
  do
    count=$(($count + 1))
    timearray=$(grep -o -E "^(.*?)([0-9]{10,13})?" <<< $line)
    if [ -z "$timearray" ]; then
        echo "PROBLEM"
        echo "grep -o -E '^(.*?)([0-9]{10,13})?' <<< $line"
        exit 1
    fi
    timestamp=$(sed -Ee "s/[0-9]+\://" <<< $timearray)
# normalize timestamp to be 13 digits
    if [ "${#timestamp}" -lt "13" ]; then
        mult=$((10**(( ${#timestamp} - 13) * -1)))
        timestamp=$(($timestamp * $mult))
    fi
    echo "$timestamp >= $tocheck?"
    linenum=$([ "$timestamp" -ge "$tocheck" ] && echo $count)
    if [ -z "$linenum" ]; then
        :;
    else
        returnline=$line
        break;
    fi
done < $1
echo "$linenum:$returnline"
echo ""
\$\endgroup\$
5
  • \$\begingroup\$ Can you show sample input so we can benchmark? \$\endgroup\$ Commented Dec 4, 2015 at 16:53
  • 1
    \$\begingroup\$ Even if you don't show us the whole file for benchmarking, at least show a couple of representative sample lines of the log file. \$\endgroup\$ Commented Dec 4, 2015 at 18:21
  • \$\begingroup\$ Could you explain what the timestamp format is, such that you would want to zero-pad the numbers on the right side? \$\endgroup\$ Commented Dec 4, 2015 at 18:43
  • \$\begingroup\$ It's unix milliseconds. See epochconverter.com. \$\endgroup\$ Commented Dec 4, 2015 at 19:10
  • \$\begingroup\$ That still doesn't explain why you would want to zero-pad numbers on the right rather than on the left. \$\endgroup\$ Commented Dec 4, 2015 at 19:41

2 Answers

4
\$\begingroup\$

I'm having issues with it being very slow.

What makes your script slow is that you read the file yourself in the while loop and run grep (and sed) once for every input line, instead of passing the file itself to grep and letting it do its job. Each of those calls starts a new process.

No matter what you want to search for with grep, you should always first pass your input to it with a single call, and inspect the results afterwards.

For the output I wanted the line number in the file (for slicing) as well as the original line (to confirm/inspect).

grep already has this feature built in (at least according to its documentation):

-n, --line-number
Prefix each line of output with the line number within its input file.

You can simply do what you want using this option.

Thus you can get rid of the while loop and the count variable that you use to track the line number yourself.
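
A minimal sketch of this single-call approach, combined with the numeric comparison asked about in the comments. The question shows no sample input, so this assumes that the first run of 10 to 13 digits on a line is the timestamp; the function name `first_at_or_after` is hypothetical.

```bash
#!/bin/bash
# Sketch only. Assumption: the first run of 10-13 digits on a line is the timestamp.

# first_at_or_after FILE TARGET (hypothetical name)
# Prints "linenum:line" for the first line whose timestamp is >= TARGET.
first_at_or_after() {
    local file=$1 target=$2 n line ts re='[0-9]{10,13}'
    # Right-pad the target with zeros to 13 digits (seconds -> milliseconds).
    while [ "${#target}" -lt 13 ]; do target+=0; done
    # One grep call for the whole file; -n prefixes each line with its number.
    grep -n -E "$re" "$file" |
    while IFS=: read -r n line; do
        [[ $line =~ $re ]] || continue
        ts=${BASH_REMATCH[0]}
        while [ "${#ts}" -lt 13 ]; do ts+=0; done
        if [ "$ts" -ge "$target" ]; then
            printf '%s:%s\n' "$n" "$line"
            break   # stop at the first hit
        fi
    done
}

# Run only when called as: ./script.sh file UNIX_MS
if [ "$#" -eq 2 ]; then
    first_at_or_after "$1" "$2"
fi
```

The comparison loop still runs in bash, but it no longer starts a grep and a sed process for every line. If nothing matches, the function prints nothing.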

\$\endgroup\$
6
  • \$\begingroup\$ How would you accomplish the numerical timestamp comparison, though? \$\endgroup\$ Commented Dec 4, 2015 at 18:12
  • \$\begingroup\$ @200_success From further inspection of the already produced results of a single run of grep? Maybe I misunderstood the question. \$\endgroup\$ Commented Dec 4, 2015 at 18:14
  • \$\begingroup\$ For speed I was attempting to output the first result without having to potentially read-in the entire file. \$\endgroup\$ Commented Dec 4, 2015 at 18:18
  • \$\begingroup\$ @octanepenguin grep will usually go through entire files very fast, you can expect that will be faster than using it multiple times and reading line by line yourself in the script. I hope I made my idea clear enough. \$\endgroup\$ Commented Dec 4, 2015 at 18:23
  • \$\begingroup\$ @πάνταῥεῖ In my case though the files can be upwards of 20+ GB and a result can be in the middle. I get that I'm reading it line by line but how is that any different than grep? Surely it has to seek the file as well? \$\endgroup\$ Commented Dec 4, 2015 at 18:32
3
\$\begingroup\$

bash has built-in regular expression support:

    if [[ $line =~ ^(.*?)([0-9]{10,13})? ]]; then
        timestamp=${BASH_REMATCH[2]}
    else
        echo "PROBLEM!"
    fi

which completely removes the need for grep and sed.
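
As a hedged sketch of how this could look with builtins only: the regex is kept in a variable, and array elements are read with braces, as in `${BASH_REMATCH[0]}`. POSIX extended regular expressions, which `=~` uses, do not define the non-greedy `.*?`, so this sketch matches the digit run directly. It assumes the first run of 10 to 13 digits is the timestamp; `extract_ms` and the global variable `ms` are hypothetical names.

```bash
#!/bin/bash
# Sketch only. Assumption: the first run of 10-13 digits in the string is the timestamp.

# extract_ms STRING (hypothetical helper)
# On success, stores the timestamp, right-padded to 13 digits, in the global variable ms.
extract_ms() {
    local re='[0-9]{10,13}'
    [[ $1 =~ $re ]] || return 1
    ms=${BASH_REMATCH[0]}          # element 0 is the whole match
    while [ "${#ms}" -lt 13 ]; do ms+=0; done
}

if extract_ms "12:1449250001 some message"; then
    echo "$ms"   # prints 1449250001000
else
    echo "PROBLEM!"
fi
```

Setting a variable instead of printing the result avoids a command substitution, and therefore a subshell, for every line.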

\$\endgroup\$
