
I am trying to match several patterns on each line and display only the first match of each, all on the same output line. One of the fields (the fourth in the example below) can match either of two patterns, i.e. A,BCD.EF or AB.CD. For example:

Example 1:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00 

The expected output would be

Expected Result 1:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00

This is how far I have got with my limited knowledge of grep and Stack Overflow:

< test_data.txt grep -one "[0-9]/[0-9][0-9]\|[0-9]*,[0-9]*.[0-9][0-9]\|[0-9]*.[0-9][0-9]" | awk -F ":" '$1 == y { sub(/[^:]:/,""); r = (r ? r OFS : "") $0; next } x { print x, r; r="" } { x=$0; y=$1; sub(/[^:]:/,"",x) } END { print x, r }'

Any ideas on how to make this simpler or cleaner, and how to achieve the complete functionality?

Update 1: A few other examples:

Example 2:
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00
  1. There could be more fields in some lines.
  2. The order of fields is not necessarily preserved either. I could get around this by treating the files that have a different order separately, or by transforming them to this order somehow. So this condition can be relaxed.

Update 2: It seems my question was not clear. One way to look at it: find the first "time" on a line, the first alphanumeric string, and the first decimal value (with or without a comma in it), and print all of them on the same output line. More generically: given an input line, print the first occurrence of pattern 1, the first occurrence of pattern 2, and the first occurrence of pattern 3 (which is itself an "or" of two patterns) on one output line. The output must be stable, i.e. it must preserve the order in which the matches appeared in the input. Sorry that the example is a little complicated; I am also trying to learn whether this is the sweet spot at which to leave Unix utilities for a full language like Perl/Python. Here are the expected results for the second set of examples.

Expected Result 2:
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
  • How is this different from just displaying fields 1, 3, and 4? Can you show a more complex example of input data, so we can see what needs to be excluded? Commented Sep 19, 2013 at 22:19
  • How was the second column omitted? Commented Sep 19, 2013 at 22:21
  • @konsolebox : I need it to be omitted. Commented Sep 19, 2013 at 22:33
  • @sumodds But now there's another type of column before it. What would the second update look like? Is the condition about NN:NN similar to NN.00 and N,NNN.NN? Please be more specific and accurate, as people wouldn't want to keep revising their solutions due to unexpected updates. Commented Sep 19, 2013 at 23:00
  • @konsolebox : Sorry, hope it's clearer now. Commented Sep 19, 2013 at 23:24

2 Answers

#!/usr/bin/awk -f

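BEGIN {
    # One pattern per wanted value; each is matched against a whole field.
    p[0] = "^[0-9]+:[0-9]{2}$"            # time, e.g. 12:23
    p[1] = "^[[:alpha:]][[:alnum:]]*$"    # word starting with a letter
    p[2] = "^[0-9]+[0-9,]*[.][0-9]{2}$"   # decimal, with or without commas
}

{
    i = 0    # number of fields kept so far on this line
    for (j = 1; j <= NF; ++j) {
        for (k = 0; k in p; ++k) {
            # Keep $j only the first time pattern k matches on this line
            # (q[k] counts the matches); copy it down to slot i unless it
            # is already in that slot.
            if ($j ~ p[k] && !q[k]++ && j > ++i) {
                $i = $j
            }
        }
    }
    q[0] = q[1] = q[2] = 0    # reset the per-line match counters
    NF = i                    # drop everything after the kept fields
    print
}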

Input:

12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00 
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00

Output:

12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
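
For readers who want to see the field-by-field logic outside awk, here is a minimal Python sketch of the same idea. It is not part of the original answer: the names `FIELD_PATTERNS` and `pick_fields` are made up for illustration, and the ASCII classes `[A-Za-z]` and `[A-Za-z0-9]` are an assumption standing in for awk's locale-aware `[[:alpha:]]` and `[[:alnum:]]`.

```python
import re

# The three field patterns from the awk script. fullmatch() plays the
# role of the ^...$ anchors. ASCII letter classes are an assumption.
FIELD_PATTERNS = [
    re.compile(r"[0-9]+:[0-9]{2}"),           # time, e.g. 12:23
    re.compile(r"[A-Za-z][A-Za-z0-9]*"),      # word starting with a letter
    re.compile(r"[0-9]+[0-9,]*\.[0-9]{2}"),   # decimal, with or without commas
]


def pick_fields(line):
    """Keep the first field matching each pattern, in input order."""
    seen = set()   # indexes of patterns already matched (awk's q[] array)
    kept = []      # fields kept so far (awk's $1..$i)
    for field in line.split():
        for k, pattern in enumerate(FIELD_PATTERNS):
            if k not in seen and pattern.fullmatch(field):
                seen.add(k)
                kept.append(field)
                break
    return " ".join(kept)


if __name__ == "__main__":
    print(pick_fields("22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00"))
```

Because the fields are visited left to right and appended as they are found, the output keeps the input order without any sorting step, which is the "stable" requirement from the question.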

2 Comments

Nice. Thanks. I was also writing something similar using match instead. But I am still not near complete. Thanks again.
This is a fine solution to an obnoxious problem I run into when parsing information from logs, thanks.

A Perl-style regex should solve the problem:

(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))

It captures the following data (processing each line you provided separately):

RESULT$VAR1 = [
          '12:23',
          'ASDFGH',
          '1,232.00'
        ];
RESULT$VAR1 = [
          '21:22',
          'ASDSDS',
          '22.00'
        ];
RESULT$VAR1 = [
          '12:21',
          'ASADSS',
          '11.00'
        ];
RESULT$VAR1 = [
          '22:22',
          'BASDASD',
          '1,231.00'
        ];

Example Perl script, script.pl:

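#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Read the file named on the command line.
my $file = shift @ARGV or die "Usage: $0 FILE\n";
open my $F, '<', $file or die "Cannot open $file: $!\n";

my @strings = <$F>;
close $F;

# Time, then the first run of letters, then the first amount (N,NNN.NN or NN.NN).
my $qr = qr/(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))/;

foreach my $string (@strings) {
    chomp $string;
    next if not $string;          # skip empty lines
    my @tab = $string =~ $qr;     # the three captures, or an empty list if no match
    print join(" ", @tab) . "\n";
}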

Run as:

perl script.pl test_data.txt

Cheers!

1 Comment

I am trying to resist using a full blown language in order to learn the sweet spot for using Unix utilities alone. Though awk definitely smears the boundary :).
