Regular expression help in Perl

Question

I have the following text pattern:

(2222) First Last (ab-cd/ABC1), <[email protected]> 1224: efadsfadsfdsf

(3333) First Last (abcd/ABC12), <[email protected]> 1234, 4657: efadsfadsfdsf

I want the numbers that follow the > in the text above: 1224 from the first line, and 1234, 4657 from the second.

I have this \((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+: which matches the text up to the colon. But I want the part between the email address and the colon.

Is there an easy regular expression to do this, or should I use split instead?

Thanks

Edit: The whole text is returned by a command line tool.

(3333) First Last (abcd/ABC12), <[email protected]> 1234, 4657: efadsfadsfdsf

(3333) - Unique ID

First Last - First and last names

<[email protected]> - Email address in format [email protected]

1234, 4567 - database primary Keys

: xxxx - Headline

What I have to do is process the above, get the database IDs (in the example, 1234 and 4567 are two separate IDs), and query the tables.

The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.

My idea was to use a regular expression to get the database IDs. I guess a regular expression should be able to do this.

There are no e-mail addresses in your "following text pattern". Please provide examples which are closer to your real-world data.
– zgpmax, Commented Feb 13, 2012 at 16:49
@hochgurgler: Yes there is, OP just didn't have it posted formatted correctly, so SO thought it was an HTML tag. (You could have seen it if you viewed the source by hitting edit)
– derobert, Commented Feb 13, 2012 at 17:02
It is not clear what you want out of the search space. You have used the string "1234" four times in the search space, so when you say you're looking for 1234, we can't tell which one you mean. Also you say you want 1234 or 1234, 4657, which is a bit odd; I suspect you can get whatever you want, but you need to be clear about it.
– zgpmax, Commented Feb 13, 2012 at 17:09
Do you even understand what regular expressions are for? If you want a full fledged tested implementation nobody will help you. Try to improve the question or it's just a hit and miss.
– AlfredoVR, Commented Feb 14, 2012 at 17:59

Scott Weaver · Accepted Answer · 2012-02-13 17:08:40Z

1

You can fudge the parts you don't care about to make the expression easier: just 'glob' the text between the parentheticals (and the email delimiters) using non-greedy quantifiers:

/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/   (not tested!)

There are only two capture groups: the ID in the leading parentheses, e.g. (1234), and the number list, e.g. (1234, 4657). From your pattern I can only assume the second one means "a digit string, followed by zero or more comma-separated digit strings".
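A minimal sketch of how that pattern might be applied to the two sample lines. The email address is a made-up placeholder (the real addresses are not visible in the question), and the names %ids_for, $unique_id and $list are only illustrative.

```perl
use strict;
use warnings;

# Sample lines modeled on the question; the email address is a hypothetical placeholder.
my @lines = (
    '(2222) First Last (ab-cd/ABC1), <first.last@domain.com> 1224: efadsfadsfdsf',
    '(3333) First Last (abcd/ABC12), <first.last@domain.com> 1234, 4657: efadsfadsfdsf',
);

my %ids_for;    # unique ID => array ref of database IDs
for my $line (@lines) {
    # $1 = unique ID in the leading parentheses, $2 = number list before the colon
    if ($line =~ /(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/) {
        my ($unique_id, $list) = ($1, $2);
        # Break "1234, 4657" into separate IDs
        $ids_for{$unique_id} = [ split /,\s*/, $list ];
    }
}

print "$_: @{ $ids_for{$_} }\n" for sort keys %ids_for;
```

If only the database IDs matter, the first group can simply be ignored, or its parentheses dropped so the number list becomes $1.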

edited Feb 13, 2012 at 17:08

answered Feb 13, 2012 at 16:57

Scott Weaver


3 Comments

KK99 Over a year ago

This also picks up the first digits in (), which I don't need. I need the numbers between > and :

Scott Weaver Over a year ago

Your question wasn't quite clear; it does indeed match the first (). Still, even with that regex you can use the second capture group ($2) and ignore the first ($1).

KK99 Over a year ago

I modified and used this /.*?\(.*?\),*\s*<.*?>\s*(\d+(?:,\s*\d+)*):/

TLP · Accepted Answer · 2012-02-13 17:58:19Z

1

Well, a simple fix is to just allow all the possible characters in a character class. That is, change the final \d to [\d, ] to allow digits, commas and spaces.

Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
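Both points can be checked with a few one-line matches. This is only an illustrative sketch: the anchored sub-patterns and the variable names are assumptions made for the demonstration, not part of the original regex.

```perl
use strict;
use warnings;

my $group = 'ab-cd/ABC1';    # the parenthesized part of the first sample line

# \w does not match "-", so the original sub-pattern fails on this input.
my $strict_ok = $group =~ /^\w*\/\w+\d*$/ ? 1 : 0;

# Adding "-" to a character class accepts it.
my $loose_ok = $group =~ /^[\w-]+\/\w+$/ ? 1 : 0;

# "*" is satisfied by nothing at all; "+" requires at least one character.
my $star_ok = '/' =~ /^\w*\/\w*$/ ? 1 : 0;
my $plus_ok = '/' =~ /^\w+\/\w+$/ ? 1 : 0;

print "strict=$strict_ok loose=$loose_ok star=$star_ok plus=$plus_ok\n";
# prints: strict=0 loose=1 star=1 plus=0
```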

You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:

sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;   # return the extracted digits in a list
        # return $1;           # just return the string as-is
    }
    return;   # no match: empty list in list context, undef in scalar context
}

This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
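A short usage sketch, assuming the sub is called in list context. The sub is repeated (with a bare return on failure) so the snippet runs on its own; the sample line's email address is a hypothetical placeholder.

```perl
use strict;
use warnings;

# Same regex as above; the capture is copied to a lexical before matching again.
sub extract_nums {
    my $string = shift;
    return unless $string =~ /<[^>]*> *([\d, ]+):/;
    my $list = $1;
    return $list =~ /\d+/g;    # list context: every run of digits
}

# Hypothetical sample line; the email address is a placeholder.
my $line = '(3333) First Last (abcd/ABC12), <first.last@domain.com> 1234, 4657: efadsfadsfdsf';

my @ids   = extract_nums($line);              # one element per database ID
my $as_is = join ', ', @ids;                  # rebuild a comma-separated string
my @none  = extract_nums('no match here');    # empty list when nothing matches

print "ids: @ids\n";
print "joined: $as_is\n";
print "no IDs found\n" unless @none;
```

Note that calling the sub in scalar context would return only a true/false match result rather than the IDs, so assigning to an array (or a parenthesized list) matters here.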

answered Feb 13, 2012 at 17:58

TLP


2 Comments

KK99 Over a year ago

Thanks. This seems good but guess I need to tweak a bit to get as separate numbers or even better with comma separated as-is

TLP Over a year ago

"to get as separate numbers or even better with comma separated as-is"? That is what the code above does. Didn't you try it out?

Feysal · Accepted Answer · 2012-02-13 16:50:15Z

0

There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?

(1234) First Last (ab-cd/ABC1), <[email protected]> 1224: efadsfadsfdsf

(1234) First Last (abcd/ABC12), <[email protected]> 1234, 4657: efadsfadsfdsf

If so, this should work:

\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain\.com>\s\d+(?:,\s(\d+))?:

answered Feb 13, 2012 at 16:50

Feysal


1 Comment

derobert Over a year ago

I fixed the markdown syntax in OP's question, you can see the email addresses now.

AlfredoVR · Accepted Answer · 2012-02-14 00:10:02Z

0

$string =~ /.*>\s*(.+):.+/;
$numbers = $1;

That's it. Tested.

With number catching:

$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;

Not tested but you get the idea.
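The difference between a catch-all capture and a digit-restricted one shows up as soon as the headline itself contains a colon. A small sketch, assuming a made-up sample line (placeholder email and headline) and a character class of digits, commas and spaces for the restricted variant:

```perl
use strict;
use warnings;

# Hypothetical line whose headline contains a colon; the email is a placeholder.
my $line = '(3333) First Last (abcd/ABC12), <first.last@domain.com> 1234, 4657: Fix: crash on start';

# Greedy (.+) runs up to the last colon that still has text after it.
my ($loose) = $line =~ /.*>\s*(.+):.+/;

# Restricting the capture to digits, commas and spaces stops at the first colon.
my ($tight) = $line =~ /.*>\s*([0-9, ]+):.+/;

print "loose: $loose\n";    # loose: 1234, 4657: Fix
print "tight: $tight\n";    # tight: 1234, 4657
```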

edited Feb 14, 2012 at 0:10

answered Feb 13, 2012 at 17:18

AlfredoVR


3 Comments

Scott Weaver Over a year ago

But it also matches about 50% of the time just by banging on the keyboard at random: "asdkfja;dkfja;df>;al:kdjsakjdfa:akjhsdfjah", and your "numbers" are now ";al:kdjsakjdfa"? I think you need better than this; the goal should be to find the right balance between flexibility and rigidity.

AlfredoVR Over a year ago

I'll add a number catcher for your viewing pleasure.

TLP Over a year ago

@sweaver2112 A "faceroll" regex, very creative. :)
