0

I have the following text pattern:

(2222) First Last (ab-cd/ABC1), <[email protected]> 1224: efadsfadsfdsf

(3333) First Last (abcd/ABC12), <[email protected]> 1234, 4657: efadsfadsfdsf

I want the numbers 1224 or 1234, 4657 from the text above, that is, the numbers that come after the >.

I have this \((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+: which matches everything up to the colon, but I want only the part after the email address, up to the colon.

Is there an easy regular expression to do this, or should I use split instead?

Thanks

Edit: The whole text is returned by a command line tool.

(3333) First Last (abcd/ABC12), <[email protected]> 1234, 4657: efadsfadsfdsf

(3333) - Unique ID

First Last - First and last names

<[email protected]> - Email address in format [email protected]

1234, 4567 - database primary Keys

: xxxx - Headline

What I have to do is process the above, get the database IDs (in the example, 1234 and 4567 are two separate IDs) and query the tables.

The above is the output from the tool, which I call from my Perl script; I will get many entries like this.

My idea was to use a regular expression to get the database IDs.

4
  • 1
    There are no e-mail addresses in your "following text pattern". Please provide examples which are closer to your real-world data. Commented Feb 13, 2012 at 16:49
  • 1
    @hochgurgler: Yes there is, OP just didn't have it posted formatted correctly, so SO thought it was an HTML tag. (You could have seen it if you viewed the source by hitting edit) Commented Feb 13, 2012 at 17:02
  • 2
    It is not clear what you want out of the search space. You have used the string "1234" four times in the search space, so when you say you're looking for 1234, we can't tell which one you mean. Also you say you want 1234 or 1234, 4657, which is a bit odd; I suspect you can get whatever you want, but you need to be clear about it. Commented Feb 13, 2012 at 17:09
  • Do you even understand what regular expressions are for? If you want a full fledged tested implementation nobody will help you. Try to improve the question or it's just a hit and miss. Commented Feb 14, 2012 at 17:59

4 Answers

1

You can fudge the parts you don't care about to make the expression easier: just 'glob' the text between the parentheticals (and the email delimiters) using non-greedy quantifiers:

/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/   (not tested!)

There are only two capture groups: the leading ID in parentheses (1234) and the number list (1234, 4657). From your pattern I can only assume the second one means "a digit string, followed by zero or more comma-separated digit strings".
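As a rough sketch of how the two groups could be used from Perl: the sample lines below are hypothetical (the email address is a made-up placeholder), and parse_line is an illustrative name, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the email address is a made-up placeholder.
my @lines = (
    '(2222) First Last (ab-cd/ABC1), <first.last@sub.domain.com> 1224: efadsfadsfdsf',
    '(3333) First Last (abcd/ABC12), <first.last@sub.domain.com> 1234, 4657: efadsfadsfdsf',
);

# Returns (unique id, array ref of database ids), or an empty list on no match.
sub parse_line {
    my ($line) = @_;
    my ($uid, $id_list) = $line =~ /(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/
        or return;
    # Split the captured "1234, 4657" string into individual ids.
    return ($uid, [ split /,\s*/, $id_list ]);
}

for my $line (@lines) {
    my ($uid, $ids) = parse_line($line) or next;
    print "$uid => @$ids\n";
}
```

If only the database IDs matter, the first capture can simply be ignored.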


3 Comments

This also picks up the first digits in (), which I don't need. I need the numbers between > and :
Your question wasn't quite clear; it does indeed match the first (). Still, even with that regex you can use \2 and ignore \1.
I modified it and used this: /.*?\(.*?\),*\s*<.*?>\s*(\d+(?:,\s*\d+)*):/
1

Well, a simple fix is to allow all the possible characters in a character class. That is, change \d to [\d, ] to allow digits, commas and spaces.

Your regex as it is, though, does not match the first sample line, because that line has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*). Also, it is not a good idea to rely too heavily on the * quantifier, because it matches zero or more times and so also matches the empty string; it should only be used for things that are truly optional. Otherwise use +, which matches one or more times.
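A small sketch of the difference between the two quantifiers, and of the dash problem, using made-up inputs anchored with ^ and $ for clarity:

```perl
use strict;
use warnings;

# \w* can match nothing at all, so this pattern accepts empty parentheses.
print "star matches '()'\n" if '()' =~ /^\(\w*\)$/;

# \w+ requires at least one word character, so the same input is rejected.
print "plus rejects '()'\n" unless '()' =~ /^\(\w+\)$/;

# A dash is not a word character, so \w* stops at it and the match fails.
print "dash breaks the strict pattern\n"
    unless '(ab-cd/ABC1)' =~ /^\(\w*\/\w+\d*\)$/;

# Allowing the dash explicitly in a character class fixes that case.
print "class with dash matches\n"
    if '(ab-cd/ABC1)' =~ /^\([\w-]+\/\w+\)$/;
```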

You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:

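sub extract_nums {
    my $string = shift;
    # Capture the run of digits, commas and spaces between "<...>" and ":"
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;   # return the extracted digits in a list
        # return $1;           # just return the string as-is
    }
    return;                    # no match: empty list (undef in scalar context)
}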

This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
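A usage sketch, with the function repeated so the snippet runs on its own. It copies $1 into a lexical before matching again and uses a bare return on failure; the sample line and its email address are hypothetical placeholders.

```perl
use strict;
use warnings;

# Same approach as extract_nums above, restated so this snippet is self-contained.
sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        my $list = $1;            # copy before running another match
        return $list =~ /\d+/g;   # every digit run, as a list
    }
    return;                       # empty list when the line does not match
}

# Hypothetical input line; the email address is a placeholder.
my $line = '(3333) First Last (abcd/ABC12), <first.last@sub.domain.com> 1234, 4657: efadsfadsfdsf';

my @ids = extract_nums($line);
print "id: $_\n" for @ids;           # one database id per line
print join(', ', @ids), "\n";        # or joined back into one string
```

Calling the function in list context matters here: in scalar context the /g match only reports whether a digit run was found.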

2 Comments

Thanks. This seems good, but I guess I need to tweak it a bit to get separate numbers or, even better, the comma-separated string as-is.
"to get as separate numbers or even better with comma separated as-is"? That is what the code above does. Didn't you try it out?
0

There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?

(1234) First Last (ab-cd/ABC1), <[email protected]> 1224: efadsfadsfdsf

(1234) First Last (abcd/ABC12), <[email protected]> 1234, 4657: efadsfadsfdsf

If so, this should work:

\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain\.com>\s\d+(?:,\s(\d+))?:

1 Comment

I fixed the markdown syntax in OP's question, you can see the email addresses now.
0
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;

That's it. Tested.

With number catching:

$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;

Not tested but you get the idea.
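A minimal sketch of the number-catching idea, written with a plain character class of digits, commas and spaces. The helper name id_string and the sample line (including its email address) are assumptions for illustration only.

```perl
use strict;
use warnings;

# Returns the raw id string such as "1234, 4657", or undef if none is found.
# Intended to be called in scalar context.
sub id_string {
    my ($string) = @_;
    return $string =~ /.*>\s*([0-9, ]+):.+/ ? $1 : undef;
}

# Hypothetical input line; the email address is a placeholder.
my $line = '(3333) First Last (abcd/ABC12), <first.last@sub.domain.com> 1234, 4657: efadsfadsfdsf';
my $numbers = id_string($line);
print defined $numbers ? "found: $numbers\n" : "no ids found\n";

# Random text containing > and : no longer yields a bogus capture.
my $noise = id_string('asdkfja;dkfja;df>;al:kdjsakjdfa:akjhsdfjah');
print defined $noise ? "found: $noise\n" : "no ids found\n";
```

Restricting the capture to the expected characters is what keeps the loose pattern from accepting arbitrary text between > and :.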

3 Comments

But it also matches about 50% of the time just by banging on the keyboard at random: "asdkfja;dkfja;df>;al:kdjsakjdfa:akjhsdfjah", and your "numbers" are now ";al". I think you need better than this; the goal should be to find the right balance between flexibility and rigidity.
I'll add a number catcher for your viewing pleasure.
@sweaver2112 A "faceroll" regex, very creative. :)
