Matching multiple patterns in the same line using unix utilities

Question

I am trying to match several patterns on each line and display only the first match of each. One of them, the fourth field, can match either of two patterns, i.e. A,BCD.EF or AB.CD. For example:

Example 1:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00

The expected output would be

Expected Result 1:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00

I have gotten this far with my limited knowledge of grep and Stack Overflow:

< test_data.txt grep -one "[0-9]/[0-9][0-9]\|[0-9]*,[0-9]*.[0-9][0-9]\|[0-9]*.[0-9][0-9]" | awk -F ":" '$1 == y { sub(/[^:]:/,""); r = (r ? r OFS : "") $0; next } x { print x, r; r="" } { x=$0; y=$1; sub(/[^:]:/,"",x) } END { print x, r }'

Any ideas on how to make this simpler or cleaner and get the complete functionality?

Update 1: A few other examples:

Example 2:
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00

There could be more fields in some lines.
The order of the fields is not necessarily preserved either. I could work around this by handling files with a different order separately, or by transforming them into this order somehow, so this condition can be relaxed.

Update 2: It seems my question was not clear. One way to look at it: find the first "time" on a line, the first alphanumeric string, and the first decimal value (with or without a comma in it), and print all of them on the same output line. More generally: given an input line, print the first occurrence of pattern 1, the first occurrence of pattern 2, and the first occurrence of pattern 3 (which is itself an "or" of two patterns) on one output line. The output must be stable, i.e. it must preserve the order in which the matches appeared in the input. Sorry that the example is a little complicated; I am also trying to learn whether this is the point at which to leave Unix utilities for a full language like Perl or Python. Here are the expected results for the second set of examples.

Expected Result 2:
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00

How is this different from just displaying fields 1, 3, and 4? Can you show a more complex example of input data, so we can see what needs to be excluded?
– Barmar, Commented Sep 19, 2013 at 22:19
@sumodds But now there's another type of column before it. What would the second update look like? Is the condition for NN:NN similar to the one for NN.00 and N,NNN.NN? Please be more specific and accurate, as people don't want to keep revising their solutions because of unexpected updates.
– konsolebox, Commented Sep 19, 2013 at 23:00

konsolebox · Accepted Answer · 2013-09-20 00:19:45Z

3

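#!/usr/bin/awk -f

# Patterns are tested against whole fields. Interval expressions such
# as {2} are POSIX, but some older awk implementations do not support
# them by default.
BEGIN {
    p[0] = "^[0-9]+:[0-9]{2}$"              # time, e.g. 12:23
    p[1] = "^[[:alpha:]][[:alnum:]]*$"      # word starting with a letter
    p[2] = "^[0-9]+[0-9,]*[.][0-9]{2}$"     # amount, e.g. 22.00 or 1,232.00
}

{
    i = 0    # number of fields kept so far
    for (j = 1; j <= NF; ++j) {
        for (k = 0; k in p; ++k) {
            # Keep field j only on the first match of pattern k,
            # moving it into the next free output slot.
            if ($j ~ p[k] && !q[k]++ && j > ++i) {
                $i = $j
            }
        }
    }
    q[0] = q[1] = q[2] = 0    # reset the "seen" counters for the next line
    NF = i                    # drop the fields that were not kept
    print
}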

Input:

12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00 
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00

Output:

12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
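
For comparison with a "full language", here is a rough Python sketch of the same field-by-field idea. It is not part of the original answer: the names PATTERNS and first_matches are made up for illustration, and the ASCII character classes only approximate awk's locale-aware [[:alpha:]] and [[:alnum:]].

```python
import re

# Patterns mirror p[0], p[1] and p[2] from the awk script (ASCII-only here).
PATTERNS = [
    re.compile(r"[0-9]+:[0-9]{2}"),          # time, e.g. 12:23
    re.compile(r"[A-Za-z][A-Za-z0-9]*"),     # word starting with a letter
    re.compile(r"[0-9]+[0-9,]*\.[0-9]{2}"),  # amount, e.g. 22.00 or 1,232.00
]


def first_matches(line):
    """Return the first field matching each pattern, kept in input order."""
    seen = set()      # indexes of patterns that already matched on this line
    picked = []       # fields kept so far
    for field in line.split():
        for k, pattern in enumerate(PATTERNS):
            # fullmatch plays the role of the ^...$ anchors in the awk version
            if k not in seen and pattern.fullmatch(field):
                seen.add(k)
                picked.append(field)
                break
    return " ".join(picked)


if __name__ == "__main__":
    for sample in (
        "12:23 23:23 ASDFGH 1,232.00 22.00",
        "22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00",
    ):
        print(first_matches(sample))
```

Adding a pattern means adding one entry to the list, which is the same property that makes the awk array version easy to extend.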

edited Sep 20, 2013 at 0:19

answered Sep 20, 2013 at 0:09

konsolebox


2 Comments

sumodds Over a year ago

Nice. Thanks. I was also writing something similar using match instead. But I am still not near complete. Thanks again.

Erracity Over a year ago

This is a fine solution to an obnoxious problem I run into when parsing information from logs, thanks.

robert.r · Answer · 2013-09-19 23:28:48Z

1

A Perl-style regex should solve the problem:

(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))

It captures the following data (processing each line you provided separately):

RESULT$VAR1 = [
          '12:23',
          'ASDFGH',
          '1,232.00'
        ];
RESULT$VAR1 = [
          '21:22',
          'ASDSDS',
          '22.00'
        ];
RESULT$VAR1 = [
          '12:21',
          'ASADSS',
          '11.00'
        ];
RESULT$VAR1 = [
          '22:22',
          'BASDASD',
          '1,231.00'
        ];

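Example Perl script, script.pl:

#!/usr/bin/perl
use strict;
use warnings;

# The input file is the first command-line argument.
my $file = shift @ARGV or die "usage: $0 FILE\n";
open my $F, '<', $file or die "Cannot open $file: $!\n";

# A time, then the first run of letters, then the first N,NNN.NN or NN.NN amount.
my $qr = qr/(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))/;

while (my $string = <$F>) {
    chomp $string;
    next unless length $string;          # skip blank lines
    my @tab = $string =~ $qr or next;    # skip lines that do not match
    print join(" ", @tab) . "\n";
}
close $F;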

Run as:

perl script.pl test_data.txt
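
To experiment with the same expression outside Perl, a minimal sketch using Python's re module could look like the following. The names LINE_RE and capture are made up for illustration; the pattern itself is copied unchanged from the answer.

```python
import re

# Same expression as in the Perl script: a time, the first run of letters,
# then the first amount shaped like N,NNN.NN or NN.NN.
LINE_RE = re.compile(
    r"(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))"
)


def capture(line):
    """Return [time, word, amount], or None if the line does not match."""
    m = LINE_RE.search(line)
    return list(m.groups()) if m else None


if __name__ == "__main__":
    print(capture("12:23 23:23 ASDFGH 1,232.00 22.00"))
    print(capture("22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00"))
```

Because the expression hard-codes the digit counts and is not anchored to field boundaries, inputs shaped differently from the samples behave differently: a line such as "9:05 ABC 5.00" does not match at all, and an amount like 123.45 is captured only partially, as 23.45.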

Cheers!

edited Sep 19, 2013 at 23:28

answered Sep 19, 2013 at 22:51

robert.r


1 Comment

sumodds Over a year ago

I am trying to resist using a full blown language in order to learn the sweet spot for using Unix utilities alone. Though awk definitely smears the boundary :).
