3

While investigating logs, I am trying to find the very first time a vulnerability in a workflow was exploited.

The pattern spans multiple lines.

The pattern would be

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC

The problem is that

AAAAAAAAA

or

BBBBBBBBB

or

CCCCCCCCC

can each be found individually anywhere in the log without indicating the vulnerability; only the exact pattern, in this exact order, will help me.

For example

grep -Ei "AAAAAAAAA|BBBBBBBBB|CCCCCCCCC" logfile does not help me, since it prints every line containing an individual occurrence of AAAAAAAAA, BBBBBBBBB, or CCCCCCCCC.

How can I solve this?

2
  • There are quite a number of "multiline match" questions; this one, for example, "Multiline Regexp (grep, sed, awk, perl)", looks close to yours. Commented Apr 3, 2021 at 21:20
  • I don't think a multi-line regular expression is going to work in this context, since it would need to be AAAAA.*BBBBB.*CCCCC, and each candidate AAAAA would force grep to span the rest of the "Big" file. Commented Apr 3, 2021 at 23:16

4 Answers

1

Here's a way you can do it in Python. I added to your example a bit to show that you still get the matches you want even if single lines of AAAAAAAAA, BBBBBBBBB, or CCCCCCCCC are scattered throughout the logfile.

Below are the contents of find_log_vulns.py:

#! /usr/bin/python3

import re

test_string = """1234324
AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
absdfjv4er4
AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
123466666
AAAAAAAAA
ghrhvhhhfh
BBBBBBBBB
fjwjefjsjfjwjf
CCCCCCCCC
24wfsgggg
AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
zzzz"""

matches = re.findall('AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC\n', test_string, re.MULTILINE)

print(matches)

The result I get from running the above:

$ ./find_log_vulns.py
['AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC\n', 'AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC\n', 'AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC\n']

As shown above, each match will be returned as an element in a list.
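If only the first occurrence and its line number are needed, and the log is too large to paste into a string, one option is to search a memory-mapped file instead. The following is a minimal sketch, not part of the original script: the function name first_match_line is hypothetical, and it assumes a non-empty file with Unix line endings in which the three lines appear exactly as written.

```python
import mmap
import re

# Bytes pattern: with re.MULTILINE, ^ and $ anchor at line boundaries,
# so the three lines must be whole, consecutive lines.
PATTERN = re.compile(rb"^AAAAAAAAA\nBBBBBBBBB\nCCCCCCCCC$", re.MULTILINE)


def first_match_line(path):
    """Return the 1-based line number where the first full sequence
    starts, or None if the sequence never occurs.

    Assumes the file is not empty (mmap cannot map an empty file).
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as data:
            match = PATTERN.search(data)
            if match is None:
                return None
            # Count the newlines before the match to get its line number.
            return data[:match.start()].count(b"\n") + 1
```

re.search stops at the first hit, so nothing after the first occurrence is scanned.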

1

Using ripgrep:

rg -U 'A+\nB+\nC+' in
2:AAAAAAAAA
3:BBBBBBBBB
4:CCCCCCCCC
6:AAAAAAAAA
7:BBBBBBBBB
8:CCCCCCCCC
16:AAAAAAAAA
17:BBBBBBBBB
18:CCCCCCCCC

You can get rid of the line numbers (for example with -N), and so on. If you need separators between the matches, you can do this:

rg -U 'A+\nB+\nC+' in | rg --passthru -e '(^A)' -r $'\n'A

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC

AAAAAAAAA
BBBBBBBBB
CCCCCCCCC
1

Using awk:

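awk -v ptrn="AAAAAAAAA\0BBBBBBBBB\0CCCCCCCCC\0" '
# Split the NUL-separated patterns into tmp[]; gsub returns how many there are.
BEGIN{ split(ptrn, tmp, "\0"); lngth=gsub("\0", "", ptrn ) }
# The current line matches the next expected pattern: buffer it as NR:line.
$0 ~ tmp[++fieldNr]{ buf=(buf==""?"": buf OFS) NR":"$0 ;
                     if ( fieldNr == lngth ) { print buf; exit }
                     next
                   }
# Sequence broken: reset, then re-test this line as a possible new start.
{ fieldNr=0; buf=""; if ( $0 ~ tmp[1] ) { fieldNr=1; buf=NR":"$0 } }' infile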

This prints each matched line prefixed with its line number, and because of the exit it stops at the first complete occurrence. Here we used a "Partial Regexp Match" of each pattern in "ptrn" against the lines; see "How do I find the text that matches a pattern?" for other matching options.

We used the NUL character \0 to separate the patterns.


Sample input:

AAAAAAAAA
BBBBBBBBB

CCCCCCCCC
AAAAAAAAA
BBBBBBBBB
ccccccccc
123AAAAAAAAA
BBBBBBBBB123
123CCCCCCCCC3

Output:

8:123AAAAAAAAA 9:BBBBBBBBB123 10:123CCCCCCCCC3
1

Just for fun, with good old awk:

cat file | wc -l
21287021

with more than 3,000,000 matches

time awk 'BEGIN{getline; a=$0; getline; b=$0}
       $0~/^C+$/ && a~/^A+$/ && b~/^B+$/{print "match starting on line "NR-2 }{a=b;b=$0}' file

real    0m12.644s
user    0m7.149s
sys     0m4.314s

Compared with rg on my machine:

time rg -U 'A+\nB+\nC+' file
real    0m40.322s
user    0m16.503s
sys     0m17.246s
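
The awk one-liner keeps the two previous lines in a and b and tests them together with the current line. The same sliding-window idea can be sketched in Python for any number of patterns. This is only an illustration: the function name first_sequence and the sample data are hypothetical, and it returns at the first hit rather than printing every match.

```python
import re
from collections import deque


def first_sequence(lines, patterns):
    """Return the 1-based number of the line that starts the first run of
    consecutive lines matching `patterns` in order, or None."""
    compiled = [re.compile(p) for p in patterns]
    # The deque holds only the most recent len(patterns) lines.
    window = deque(maxlen=len(compiled))
    for number, line in enumerate(lines, start=1):
        window.append(line.rstrip("\n"))
        if len(window) == len(compiled) and all(
            rx.search(text) for rx, text in zip(compiled, window)
        ):
            return number - len(compiled) + 1
    return None


sample = ["x", "AAA", "zzz", "BBB", "AAA", "BBB", "CCC"]
print(first_sequence(sample, [r"^A+$", r"^B+$", r"^C+$"]))  # prints 5
```

An open file object can be passed as `lines`, so the log is read one line at a time and never loaded into memory as a whole.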