0

I've seen many posts on this, but I still can't get it to work and I have no idea why.

What I have is a relatively simple string with some floating-point and integer numbers in it, e.g. '2 1.000000000000000 1 1 0'. I want to extract only the integers from it; in this example only 2, 1, 1, 0 (not the 1 that's followed by 0s).

I know I have to use lookbehind and lookahead to rule out numbers that are preceded or followed by a period (.). I can successfully find the numbers that are preceded by a period, in this case the zeros after the decimal point:

import re
IntegerPattern = re.compile(r'-?(?<=\.)\d+(?!\.)')
a = '2   1.000000000000000       1   1 0'
IntegerPattern.findall(a)

returns ['000000000000000'], exactly as I want. But when I try to find numbers that are not preceded by a period, it doesn't work:

import re
IntegerPattern = re.compile(r'-?(?<!\.)\d+(?!\.)')
a = '2   1.000000000000000       1   1 0'
IntegerPattern.findall(a)

returns ['2', '00000000000000', '1', '1', '0']. Any ideas why? I'm completely new to regular expressions in general and this just eludes me. It should work but it does not. Any help would be appreciated.

0

4 Answers

3

Use the regex

(?:\s|^)\d+(?:\s|$)

The code can be

>>> import re
>>> n = '2 1.000000000000000 1 1 0'
>>> re.findall(r'(?:\s|^)\d+(?:\s|$)', n)
['2 ', ' 1 ', ' 0']

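(?:\s|^) matches a whitespace character or the start of the string

\d+ matches one or more digits

(?:\s|$) matches a whitespace character or the end of the string

NOTE: \b cannot be used to delimit \d+, because \b also matches between a digit and a ., so the digits on either side of a decimal point would still match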
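To illustrate the note about \b, here is a small sketch (the variable name with_word_boundary is only illustrative). It shows that a word boundary exists on both sides of the decimal point, so both halves of the float are returned:

```python
import re

n = '2 1.000000000000000 1 1 0'

# \b is satisfied between a digit and '.', so '1' and the run of zeros
# are both matched as if they were standalone integers.
with_word_boundary = re.findall(r'\b\d+\b', n)
print(with_word_boundary)
```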

Example http://regex101.com/r/gP1nK0/1

EDIT

Why doesn't the regex (?<!\.)\d+(?!\.) work?

The problem is that a negative lookaround only rules out a . at one specific position; the engine is still free to start the match somewhere else where the assertion holds.

When you write (?<!\.), the regex engine looks for any position where that assertion succeeds.

For example, in 1.000000 the first 0 is rejected because it is preceded by a ., but the second 0 is preceded by a 0, not a ., so the match starts there and takes the remaining 00000.

To get a clearer view, check this link:

http://regex101.com/r/gP1nK0/2

As you can see, for 1.000000000000000 the match starts at the second 0, which is why it succeeds.
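The same behaviour can be checked locally with re.finditer, which reports where each match starts. This is only a sketch; the variable names are illustrative and the input is the string from the question:

```python
import re

a = '2   1.000000000000000       1   1 0'
pattern = re.compile(r'(?<!\.)\d+(?!\.)')

# Collect (start offset, matched text) for every match.
matches = [(m.start(), m.group()) for m in pattern.finditer(a)]
for start, text in matches:
    print(start, repr(text))

# The zero directly after the '.' fails (?<!\.), so the engine retries one
# character later, where the preceding character is '0' and not '.'.
dot = a.index('.')
zero_run_start = matches[1][0]
print(zero_run_start - dot)  # 2
```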

EDIT 1

A better regex would be

(?:(?<=^)|(?<=\s))\d+(?=\s|$)

>>> n
'1 2 3 4.5'
>>> re.findall(r'(?:(?<=^)|(?<=\s))\d+(?=\s|$)', n)
['1', '2', '3']
>>> n='1 2 3 4'
>>> re.findall(r'(?:(?<=^)|(?<=\s))\d+(?=\s|$)', n)
['1', '2', '3', '4']

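Thank you sln for pointing out that the first version consumes the separating whitespace and therefore skips some integers (e.g. 2 and 4 in '1 2 3 4').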
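A possibly simpler way to express the same whitespace-delimited idea, offered as a sketch rather than as part of the original answer, is to use (?<!\S) and (?!\S), meaning "not preceded or followed by a non-whitespace character". The optional -? for negative numbers and the names INTEGER_TOKEN and find_integers are assumptions added for illustration:

```python
import re

# (?<!\S): no non-whitespace character immediately before the match
# (?!\S):  no non-whitespace character immediately after the match
INTEGER_TOKEN = re.compile(r'(?<!\S)-?\d+(?!\S)')

def find_integers(text):
    """Return whitespace-delimited integer tokens as ints."""
    return [int(tok) for tok in INTEGER_TOKEN.findall(text)]

print(find_integers('2   1.000000000000000       1   1 0'))  # [2, 1, 1, 0]
print(find_integers('-2 4.5 3'))                              # [-2, 3]
```

Because both assertions are zero-width, the separating whitespace is not consumed, so adjacent integers such as '1 2 3 4' are all found.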


7 Comments

OK, thanks, but this is partially a learning experience. If I had a string without preceding whitespaces, what should I do? And why doesn't my expression work?
@AleksanderLidtke, to answer why your regex doesn't work: if you count the number of 0s, you'll notice there is one fewer. From the second 0 on, each digit isn't immediately preceded by a \. so it passes.
@smerny OK that explains it, thanks. Didn't actually think of counting the 0s.
@AleksanderLidtke I have added an edit on why it doesn't work. Hope it helps you :)
Doesn't match 2 and 4 in 1 2 3 4
2

I wouldn't bother with regexes:

s = '2   1.000000000000000       1   1 0'

print([int(part) for part in s.split() if "." not in part])

It's often much simpler to work with basic string manipulation. As the old saying goes: "I had a problem I tried to solve with regexes. Then I had two problems."
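A variation on the same split-based idea, sketched under the assumption that anything int() accepts should count as an integer, is to let int() do the validation. The function name split_integers is hypothetical. Note that int() also accepts forms such as '+3' and, in Python 3.6 and later, '1_000', which may or may not be wanted:

```python
def split_integers(s):
    """Return the whitespace-separated tokens of s that parse as ints."""
    result = []
    for part in s.split():
        try:
            result.append(int(part))
        except ValueError:
            # Not an integer literal (for example '1.000'), so skip it.
            pass
    return result

print(split_integers('-2   1.000000000000000       1   1 0'))  # [-2, 1, 1, 0]
```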


1
a = '-2   1.000000000000000       1   1 0'
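# Strip one leading '-' only; x[1:].isdigit() alone would also accept tokens like '.5'
print([x for x in a.split() if (x[1:] if x.startswith('-') else x).isdigit()])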
['-2', '1', '1', '0']

If you want the digits before the . also:

a = '2   1.000000000000000       1   1 0'


print([x if x.isdigit() else x.split(".")[0] for x in a.split()])
['2', '1', '1', '1', '0']

2 Comments

@JoranBeasley, cheers, I was thinking that it would not work for negative nums but the OP does not seem to be checking for it in their regex so I guess it should work :P
@smerny, it does get the first 1
0

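The engine adjusts its starting position in order to find a match.
It skips one \d on the left (the digit right after the .), then matches from the next digit.

Adding \d to the lookbehind ensures no digits can be skipped on the left: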

 # (?<![.\d])\d+(?!\.)

 (?<! [.\d] )
 \d+ 
 (?! \. )

Just a note: in your first pattern, -?(?<=\.)\d+(?!\.), the -? can never actually match a dash.
The lookbehind that follows it requires the preceding character to be a ., and a dash is not a .
As a rule, never point an assertion in a direction that directly contains a literal
unless that literal is included in the assertion. Here the two are in the wrong order anyway,
so the -? is entirely useless.
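One caveat worth checking, shown here as a sketch and not as part of the answer: with (?<![.\d])\d+(?!\.), a float whose integer part has more than one digit (for example 12.5) can still yield a partial match, because \d+ backtracks to a shorter run that is no longer followed by a . The variant below is an assumption: it also rejects a following digit, and it places an optional -? after the lookbehind so the sign can match. The names answer_pattern and stricter_pattern are illustrative:

```python
import re

answer_pattern = re.compile(r'(?<![.\d])\d+(?!\.)')
# Assumed variant: the lookahead also rejects a following digit, and the
# optional sign sits after the lookbehind so it can actually match.
stricter_pattern = re.compile(r'(?<![.\d])-?\d+(?![.\d])')

a = '2   1.000000000000000       1   1 0'
print(answer_pattern.findall(a))              # ['2', '1', '1', '0']
print(answer_pattern.findall('12.5 7'))       # ['1', '7']
print(stricter_pattern.findall('12.5 7 -3'))  # ['7', '-3']
```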

2 Comments

When I run your pattern: pat=re.compile('(?<! [.\d] )\d+(?! \. )') with my original example a = '2 1.000000000000000 1 1 0' I still get ['2', '1', '000000000000000', '1', '1']. And well pointed out with the -?, it was just a leftover from the original pattern.
I'm afraid that is quite impossible ! And pat=re.compile('(?<! [.\d] )\d+(?! \. )') isn't my pattern.
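A likely explanation for the disagreement in these two comments, offered as a sketch: the spaced-out form of the pattern is only equivalent to (?<![.\d])\d+(?!\.) when it is compiled with re.VERBOSE. Without that flag, Python's re module treats the spaces inside the lookarounds as literal characters. The variable names below are illustrative:

```python
import re

a = '2   1.000000000000000       1   1 0'
spaced = r'(?<! [.\d] )\d+(?! \. )'

# Without re.VERBOSE the spaces are literal characters inside the lookarounds.
literal_result = re.findall(spaced, a)
print(literal_result)

# With re.VERBOSE the spaces are ignored, giving (?<![.\d])\d+(?!\.).
verbose_result = re.findall(spaced, a, re.VERBOSE)
print(verbose_result)  # ['2', '1', '1', '0']
```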
