Extracting timestamps from multiple lines in a log file with unix shell tools

Question

Input is a log file. The process I'm currently interested in logs one line when it starts and one when it ends. The start line always has a certain fixed pattern along with an object ID. The end line also has a fixed pattern, along with the same object ID.

I want the output to contain a single line per object ID, followed by the timestamp of the first line, followed by the timestamp of the second line. This output will be used for further analysis in other tools. Output should be sorted on the timestamp of the start-line; objects without start lines (see obstacles) should be placed at the end.

I'd like to solve this using standard Unix shell tools. At a guess, something with awk should do the trick. If the solution involves a Unix shell script, please use sh as the shell.

Obstacles: I cannot guarantee that the process is strictly sequential, so the start of object1 can be followed by the start of object2 before object1 has been processed fully. Also, I cannot guarantee that the logfile always matches a start with an end, or vice versa. In such cases, the ID should have an empty value for the missing spot.

The input looks, in essence, something like this:

2014-03-11 09:00:01.123 bla bla bla TAG_START ID:1234 bla bla bla
2014-03-11 09:00:11.123 bla bla bla TAG_END ID:1234 bla bla bla
2014-03-11 09:01:01.123 bla bla bla TAG_START ID:2353 bla bla bla
2014-03-11 09:02:01.123 bla bla bla TAG_END ID:2353 bla bla bla
2014-03-11 09:03:01.123 bla bla bla TAG_START ID:3456 bla bla bla
2014-03-11 09:04:01.123 bla bla bla TAG_END ID:4567 bla bla bla

Output:

1234;09:00:01.123;09:00:11.123
2353;09:01:01.123;09:02:01.123
3456;09:03:01.123;
4567;;09:04:01.123

Thanks in advance!

Good catch. Copy-paste mistake. I'll see if I can edit this.
– user3426170, Commented Mar 16, 2014 at 17:24
Preferably either in order by start-tag, or in reverse order by start-tag.
– user3426170, Commented Mar 16, 2014 at 17:26
Wait, to be exact: by time stamp of the start-tag. Leave the empty tags at the end.
– user3426170, Commented Mar 16, 2014 at 17:28

jaypal singh · Accepted Answer · 2014-03-16 20:36:10Z

You can try something like this with GNU awk (using the asorti function for sorted output):

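gawk '
# Return the value after "ID:" on the current record, or "" if there is none.
# i and tmp are declared as extra parameters to keep them local.
function findID(    i, tmp) {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^ID:/) {
            split($i, tmp, /:/)
            return tmp[2]
        }
    return ""
}
/TAG_START/ {
    id = findID()
    lines[id] = $2 ";"
}
/TAG_END/ {
    id = findID()
    lines[id] = ((id in lines) ? lines[id] $2 : ";" $2)
}
END {
    n = asorti(lines, lines_s)   # sorts the IDs (array indices), not the timestamps
    for (i = 1; i <= n; i++) {
        print lines_s[i] ";" lines[lines_s[i]]
    }
}' file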

If you don't have GNU awk then you can pipe the output of regular awk to sort.

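awk '
# Return the value after "ID:" on the current record, or "" if there is none.
# i and tmp are declared as extra parameters to keep them local.
function findID(    i, tmp) {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^ID:/) {
            split($i, tmp, /:/)
            return tmp[2]
        }
    return ""
}
/TAG_START/ {
    id = findID()
    lines[id] = $2 ";"
}
/TAG_END/ {
    id = findID()
    lines[id] = ((id in lines) ? lines[id] $2 : ";" $2)
}
END {
    for (x in lines)
        print x ";" lines[x]
}' file | sort -t";" -nk1,2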

Output:

1234;09:00:01.123;09:00:11.123
2353;09:01:01.123;09:02:01.123
3456;09:03:01.123;
4567;;09:04:01.123

Explanation:

For lines matching /TAG_START/ we call our user-defined function, which iterates over the space-delimited fields. Once it encounters a field that starts with ID, it splits that field on the : delimiter and returns the second part (that is, if the field is ID:1234 we capture 1234).
We use that as the key in our array lines and assign it the second field of the line, which is the timestamp, followed by a ;.
We do the same for lines matching /TAG_END/, the only difference being that we first check whether the key already exists in the array. If it does, we append the second field to the stored value, since it is the end timestamp. If it does not, we prepend a ; and store the result in the array. This is to meet your requirement: "Also, I cannot guarantee that the logfile always matches a start with an end, or vice versa. In such cases, the ID should have an empty value for the missing spot."
For GNU awk we call the asorti function, which sorts the array indices (the IDs), then iterate over the sorted indices and print the lines. For regular awk we print the lines and pipe them to sort. Both variants therefore order the output by ID; for the sample input this happens to coincide with the order of the start timestamps.
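The question also asks for the output to be ordered by the start timestamp, with IDs that have no start line placed at the end. A minimal sketch of one way to get that with POSIX tools follows. It is not part of the original answer: the function name pair_times and the two helper columns (a 0/1 flag and a sort key) are made up for illustration, and it assumes that field 1 is the date, field 2 is the time, and the ID appears as a separate ID:<value> field.

```shell
#!/bin/sh
# Sketch: one line per ID, ordered by start timestamp; IDs without a start line go last.
# Assumptions (not from the original answer): $1 is the date, $2 is the time,
# and the ID is a whitespace-delimited field of the form ID:<value>.
pair_times() {
    awk '
    /TAG_START|TAG_END/ {
        id = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ID:/) { id = substr($i, 4); break }
        if (id == "") next                     # skip lines that carry no ID
        seen[id] = 1
        if (/TAG_START/) { sdate[id] = $1; stime[id] = $2 }
        else             { edate[id] = $1; etime[id] = $2 }
    }
    END {
        for (id in seen) {
            # flag 0 = has a start line, flag 1 = start missing (sorts last)
            if (id in stime) { flag = 0; key = sdate[id] " " stime[id] }
            else             { flag = 1; key = edate[id] " " etime[id] }
            print flag ";" key ";" id ";" stime[id] ";" etime[id]
        }
    }' "$@" |
    LC_ALL=C sort -t ';' -k1,1 -k2,2 |         # order by flag, then by timestamp
    cut -d ';' -f 3-                           # drop the two helper columns
}
```

Calling pair_times file prints the same id;start;end lines as above. The helper columns exist only so that sort can do the ordering; cut removes them again. IDs without a start line are ordered among themselves by their end timestamp, which is a choice made here, not something the question specifies.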

Ed Morton · Accepted Answer · 2014-03-17 15:22:32Z

Output will be in the same order as the ids appear in your input:

awk -v OFS=';' '
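/TAG_(START|END) ID:/ {            # skip log lines that carry neither tag
    time = $2                      # second field is the time of day

    type = (/TAG_START ID:/ ? "s" : "e")

    # Reduce the record to the bare ID: strip everything up to "ID:" and after the ID.
    sub(/.*TAG_(START|END) ID:/,"")
    sub(/ .*$/,"")
    id = $0

    # Remember each ID once, in order of first appearance.
    if (!seen[id]++) {
        ids[++numIds] = id
    }

    times[id,type] = time
}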
END {
    for (idNr=1; idNr<=numIds; idNr++) {
        id = ids[idNr]
        print id, times[id,"s"], times[id,"e"]
    }
}' file

Output:

1234;09:00:01.123;09:00:11.123
2353;09:01:01.123;09:02:01.123
3456;09:03:01.123;
4567;;09:04:01.123

The if statement just keeps track of unique ids in the order they are seen in the input file. The first time an id is seen, seen[id] has the value zero because it is a new unique id, so the condition !seen[id] is true: the counter numIds is pre-incremented and the id is stored in the ids array at the position indexed by the new value of numIds. Since seen[id] was post-incremented in the if, the next time that id is seen seen[id] has the value 1 and so the condition !seen[id] is false.

It's just the idiomatic awk approach to keeping a list of unique keys (ids) in the order they occur in the input, so they can be referenced in that order in the END section rather than in the unspecified order produced by a for (id in array) loop.
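To see the idiom in isolation, here is a small sketch that prints each distinct input line once, in order of first appearance. The function name unique_in_order and the array name keys are made up for this illustration; only the seen counter comes from the answer.

```shell
#!/bin/sh
# Sketch: print each distinct input line once, in the order it first appears.
unique_in_order() {
    awk '
    !seen[$0]++ { keys[++n] = $0 }     # true only the first time a line is seen
    END { for (i = 1; i <= n; i++) print keys[i] }
    ' "$@"
}
```

For example, printf 'b\na\nb\nc\na\n' | unique_in_order prints b, a and c on separate lines, in that order. A plain for (k in seen) loop would visit the same three keys, but in an order that awk does not specify.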

edited Mar 17, 2014 at 15:22

answered Mar 16, 2014 at 19:23


2 Comments

user3426170 Over a year ago

I think I understand most of this. You start with extracting the timestamp, the type of line (start / end) and the ID. Then, there is a bit that I don't quite understand (seen-array). Then, you add the time to a multidimensional array times. The END-block outputs it in order of the ids-array. From the variable-name I can reason that the seen-array block checks if the ID has been encountered before. But I don't see how this actually works. Would you mind elaborating that if-statement?

Ed Morton Over a year ago

I added an explanation to my answer.

BMW · Accepted Answer · 2014-03-18 10:34:30Z

Use arrays of arrays in GNU awk (this needs gawk 4.0 or later).

gawk '{split($7,c,":");a[c[2]][$6]=$2;b[c[2]]}
END{for (i in b) {print i,a[i]["TAG_START"],a[i]["TAG_END"]}}' OFS=";" file

1234;09:00:01.123;09:00:11.123
2353;09:01:01.123;09:02:01.123
3456;09:03:01.123;
4567;;09:04:01.123

Explanation

In the sample input, $7 is ID:1234. It is split into array c, and the value c[2] (the bare ID) is used as the index into array a.
With arrays of arrays, you can print the two values a[i]["TAG_START"] and a[i]["TAG_END"] directly.

New version if the ID position is not fixed.

gawk '{for (i=1;i<=NF;i++) if ($i ~/TAG_(START|END)/) {split($(i+1),c,":");a[c[2]][$i]=$2;b[c[2]]}}
END{for (i in b) {print i,a[i]["TAG_START"],a[i]["TAG_END"]}}' OFS=";" file

edited Mar 18, 2014 at 10:34

answered Mar 17, 2014 at 4:02


4 Comments

Ed Morton Over a year ago

Interesting, it never occurred to me that bla bla bla might literally be 3 words, I just assumed it meant some random text. If it is 3 words then things certainly get a whole lot simpler.

user3426170 Over a year ago

Indeed interesting. It hadn't crossed my mind to treat that portion of text as fixed. It would make things easier. Maybe for next time, I should consider how variable my input actually is or whether or not it's truly a fixed string. Thanks!

Ed Morton Over a year ago

If the ID really does exist at a specific field then @BMW's solution is the way to go but tweaked with an if (!seen[c[2]]++) { b[++numIds] = c[2] } statement to replace b[c[2]] and change to the loop from my solution to preserve ordering, and just change every ][ to , if you don't want it to be gawk-specific since it's not doing anything you need true 2D arrays for.

BMW Over a year ago

It's not big deal if the id position is fixed or not, if you fully understand my code. You can easily adjust it. I updated the new one in my answer.
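The tweak Ed Morton describes in the comments above (fixed field positions, a seen counter to preserve input order, and a comma in place of ][ so that it runs in any POSIX awk) could look roughly like the sketch below. The function name pair_fixed_fields, the array names ids and t, and the guard on $6 are choices made for this illustration; the sketch assumes the tag is always field 6 and the ID is always field 7, as in the sample input.

```shell
#!/bin/sh
# Sketch: fixed-position variant that runs in any POSIX awk and keeps input order.
# Assumes the tag is always $6 and "ID:<value>" is always $7, as in the sample.
pair_fixed_fields() {
    awk -v OFS=';' '
    $6 == "TAG_START" || $6 == "TAG_END" {
        split($7, c, ":")                          # c[2] is the bare ID
        if (!seen[c[2]]++) ids[++numIds] = c[2]    # remember first-seen order
        t[c[2], $6] = $2                           # SUBSEP-joined key instead of a[id][tag]
    }
    END {
        for (n = 1; n <= numIds; n++)
            print ids[n], t[ids[n], "TAG_START"], t[ids[n], "TAG_END"]
    }' "$@"
}
```

Run as pair_fixed_fields file. The output is in the order in which the IDs first appear in the input, so it would still need a sort step like the ones above if ordering by start timestamp is required.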
