1

Input is a log file. The process I'm currently interested in logs a line at the start and at the end of the process. The start line always has a certain fixed pattern along with an object ID. The end line also has a fixed pattern, along with the same object ID.

I want the output to contain a single line per object ID, followed by the timestamp of the first line, followed by the timestamp of the second line. This output will be used for further analysis in other tools. Output should be sorted on the timestamp of the start-line; objects without start lines (see obstacles) should be placed at the end.

I'd like to solve this using standard Unix shell tools. At a guess, something with awk should do the trick. If the solution involves a Unix shell script, please use sh as the shell.

Obstacles: I cannot guarantee that the process is strictly sequential, so the start of object1 can be followed by the start of object2 before object1 has been processed fully. Also, I cannot guarantee that the logfile always matches a start with an end, or vice versa. In such cases, the ID should have an empty value for the missing spot.

Input looks, in essence, something like this:

2014-03-11 09:00:01.123 bla bla bla TAG_START ID:1234 bla bla bla
2014-03-11 09:00:11.123 bla bla bla TAG_END ID:1234 bla bla bla
2014-03-11 09:01:01.123 bla bla bla TAG_START ID:2353 bla bla bla
2014-03-11 09:02:01.123 bla bla bla TAG_END ID:2353 bla bla bla
2014-03-11 09:03:01.123 bla bla bla TAG_START ID:3456 bla bla bla
2014-03-11 09:04:01.123 bla bla bla TAG_END ID:4567 bla bla bla

Output:

1234;09:00:01.123;09:00:11.123
2353;09:01:01.123;09:02:01.123
3456;09:03:01.123;
4567;;09:04:01.123

Thanks in advance!

  • Wouldn't the last line of output start with 4567? Commented Mar 16, 2014 at 17:09
  • Good catch. Copy-paste mistake. I'll see if I can edit this. Commented Mar 16, 2014 at 17:24
  • @user3426170 Does the order of the output matter? Commented Mar 16, 2014 at 17:24
  • Preferably either in order by start-tag, or in reverse order by start-tag. Commented Mar 16, 2014 at 17:26
  • Wait, to be exact: by time stamp of the start-tag. Leave the empty tags at the end. Commented Mar 16, 2014 at 17:28

3 Answers

2

You can try something like this with GNU awk (using the asorti function for sorted output):

gawk '
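function findID(line,    i, tmp) {
    # Scan the fields of the current record for the one starting with "ID:"
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^ID:/) {
            split($i, tmp, /:/)
            return tmp[2]
        }
    }
    return ""   # no ID field on this line
}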
/TAG_START/ {
    id = findID($0)
    lines[id] = $2 ";"
}
/TAG_END/ {
    id = findID($0)
    lines[id] = ((lines[id]) ? lines[id] $2 : ";" $2)
}
END {
    n = asorti(lines, lines_s)
    for (i = 1; i <= n; i++) {
        print lines_s[i] ";" lines[lines_s[i]]
    }
}' file

If you don't have GNU awk then you can pipe the output of regular awk to sort.

awk '
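function findID(line,    i, tmp) {
    # Scan the fields of the current record for the one starting with "ID:"
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^ID:/) {
            split($i, tmp, /:/)
            return tmp[2]
        }
    }
    return ""   # no ID field on this line
}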
/TAG_START/ {
    id = findID($0)
    lines[id] = $2 ";"
}
/TAG_END/ {
    id = findID($0)
    lines[id] = ((lines[id]) ? lines[id] $2 : ";" $2)
}
END {
    for (x in lines)
        print x ";" lines[x]
}' file | sort -t";" -nk1,2

Output:

1234;09:00:01.123;09:00:11.123
2353;09:01:01.123;09:02:01.123
3456;09:03:01.123;
4567;;09:04:01.123

Explanation:

  • For lines matching /TAG_START/ we call a user-defined function that iterates over the space-delimited fields. Once it encounters a field that starts with ID, it splits that field on the : delimiter and captures the second part (that is, for the field ID:1234 we capture 1234).
  • We use that as a key in our array lines and assign it the second field of that line, which is the timestamp, followed by a ;.
  • We do the same for lines matching /TAG_END/, the only difference being that we check whether the key already exists in the array. If it does, we append the second field to the stored value, since it is the end timestamp. If it does not, we prepend a ; and store that as the value. This is to meet your requirement: "Also, I cannot guarantee that the logfile always matches a start with an end, or vice versa. In such cases, the ID should have an empty value for the missing spot."
  • For GNU awk we call the asorti function, which sorts the array indices (the IDs), then iterate over the sorted indices and print the lines. For regular awk we print the lines and pipe them to sort.
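
Note that both variants above order the output by ID, whereas the question asks for ordering by the timestamp of the start line, with IDs that have no start line at the end. A minimal sketch of one way to get that with POSIX awk, sort and cut is shown here. The function name pair_by_start, the array names, and the "0"/"1" sort-key prefix are illustrative assumptions, not part of the answer above.

```shell
#!/bin/sh
# Hypothetical helper: reads log lines from the given files (or stdin) and
# prints "ID;start;end", ordered by start timestamp, start-less IDs last.
pair_by_start() {
    awk '
    # Return the value of the first "ID:<value>" field, or "" if there is none.
    function find_id(    i, parts) {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ID:/) {
                split($i, parts, ":")
                return parts[2]
            }
        return ""
    }
    /TAG_START/ {
        id = find_id()
        # Prefix "0" makes every ID with a start sort before those without.
        if (id != "") { start[id] = $2; key[id] = "0" $1 " " $2 }
    }
    /TAG_END/ {
        id = find_id()
        if (id != "") {
            stop[id] = $2
            # Fall back to the end time as sort key only if no start was seen.
            if (!(id in key)) key[id] = "1" $1 " " $2
        }
    }
    END {
        for (id in key)
            print key[id] ";" id ";" start[id] ";" stop[id]
    }' "$@" |
    LC_ALL=C sort -t';' -k1,1 |   # sort on the helper key only
    cut -d';' -f2-                # then drop the helper key again
}
```

It would be called as pair_by_start file. The date is included in the sort key on the assumption that a log may span more than one day; the printed columns still contain only the time, as in the question.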

1

Output will be in the same order as the ids appear in your input:

awk -v OFS=';' '
{
    time = $2

    type = (/TAG_START ID:/ ? "s" : "e")

    sub(/.*TAG_(START|END) ID:/,"")
    sub(/ .*$/,"")
    id = $0

    if (!seen[id]++) {
        ids[++numIds] = id
    }

    times[id,type] = time
}
END {
    for (idNr=1; idNr<=numIds; idNr++) {
        id = ids[idNr]
        print id, times[id,"s"], times[id,"e"]
    }
}' file
1234;09:00:01.123;09:00:11.123
2353;09:01:01.123;09:02:01.123
3456;09:03:01.123;
4567;;09:04:01.123

The if statement just keeps track of unique ids in the order they are seen in the input file. The first time an id is seen, the array element seen[id] has the value zero because it is a new unique id, so the counter numIds is pre-incremented and the id is stored in the ids array at the position indexed by the new value of numIds. Since seen[id] was post-incremented in the if, the next time that id is seen, seen[id] has the value 1 and so the condition !seen[id] is false.

It's just the idiomatic awk approach for keeping a list of unique keys (ids) in the order they occur in the input, so they can be referenced in that order in the END section rather than in the random order you get from looping with the in operator.
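
The idiom described above can be tried in isolation with a small sketch like this one. The function name unique_in_order and the array name order are made up for the illustration; only the !seen[...]++ test itself comes from the answer.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct input line once, in first-seen order.
unique_in_order() {
    awk '
    # seen[$0] is 0 the first time a line occurs, so !seen[$0]++ is true once.
    !seen[$0]++ { order[++n] = $0 }
    END { for (i = 1; i <= n; i++) print order[i] }
    ' "$@"
}

printf '%s\n' b a b c a | unique_in_order
```

For the five input lines b, a, b, c, a this prints b, a and c, each on its own line: duplicates are dropped and the order of first appearance is kept.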

2 Comments

I think I understand most of this. You start with extracting the timestamp, the type of line (start / end) and the ID. Then, there is a bit that I don't quite understand (seen-array). Then, you add the time to a multidimensional array times. The END-block outputs it in order of the ids-array. From the variable-name I can reason that the seen-array block checks if the ID has been encountered before. But I don't see how this actually works. Would you mind elaborating that if-statement?
I added an explanation to my answer.
1

Use arrays of arrays in GNU awk (this needs gawk 4.0 or later).

awk '{split($7,c,":");a[c[2]][$6]=$2;b[c[2]]}
END{for (i in b) {print i,a[i]["TAG_START"],a[i]["TAG_END"]}}' OFS=";" file

1234;09:00:01.123;09:00:11.123
2353;09:01:01.123;09:02:01.123
3456;09:03:01.123;
4567;;09:04:01.123

Explanation

  • In the sample, $7 is ID:1234. Split it into array c, and use the value c[2] as the index in array a.
  • With arrays of arrays, you can print the two values a[i]["TAG_START"] and a[i]["TAG_END"] directly.

New version if the ID position is not fixed.

awk '{for (i=1;i<=NF;i++) if ($i ~/TAG_(START|END)/) {status=$i;id=$(i+1)};split(id,c,":");a[c[2]][status]=$2;b[c[2]]}
END{for (i in b) {print i,a[i]["TAG_START"],a[i]["TAG_END"]}}' OFS=";" file

4 Comments

Interesting, it never occurred to me that bla bla bla might literally be 3 words, I just assumed it meant some random text. If it is 3 words then things certainly get a whole lot simpler.
Indeed interesting. It hadn't crossed my mind to treat that portion of text as fixed. It would make things easier. Maybe for next time, I should consider how variable my input actually is or whether or not it's truly a fixed string. Thanks!
If the ID really is always in a specific field, then @BMW's solution is the way to go, with two tweaks to preserve ordering: replace b[c[2]] with an if (!seen[c[2]]++) { b[++numIds] = c[2] } statement, and use the loop from my solution. Also, just change every ][ to , if you don't want it to be gawk-specific, since it's not doing anything that needs true 2D arrays.
It's no big deal whether the ID position is fixed or not, if you fully understand my code; you can easily adjust it. I added an updated version to my answer.
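
The tweak suggested in the comments above might look like the following sketch. It assumes, as the first version of this answer does, that the tag is always field 6 and the ID always field 7. The function name pair_fixed_fields and the array names t, ids and seen are illustrative, and the pattern that skips lines without a tag in field 6 is an addition of the sketch.

```shell
#!/bin/sh
# Hypothetical helper: portable variant (no gawk arrays of arrays) that keeps
# IDs in the order they first appear. Assumes tag in $6 and "ID:<value>" in $7.
pair_fixed_fields() {
    awk -v OFS=';' '
    $6 == "TAG_START" || $6 == "TAG_END" {
        split($7, c, ":")
        id = c[2]
        if (!seen[id]++) ids[++n] = id   # remember first-seen order
        t[id, $6] = $2                   # SUBSEP-joined key instead of t[id][$6]
    }
    END {
        for (i = 1; i <= n; i++)
            print ids[i], t[ids[i], "TAG_START"], t[ids[i], "TAG_END"]
    }' "$@"
}
```

Called as pair_fixed_fields file, this relies only on POSIX awk features: t[id, $6] is a single associative array whose key is the two subscripts joined by SUBSEP.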
