Fuzzy matching a number in Python

Question

I have strings containing numbers that may include dots or commas, for example:

text = 'ην Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ, διαιρούμενο σε δέκα χιλιάδες διακόσια (10.200) εταιρικά μερίδια, ονομαστικής αξίας ενός (1) ευρώ το καθένα, το οποίο καλύφθηκε ολοσχερώς'

Then I have the number without any symbols, e.g. '10200'.

I would like to find the location of the substring '10.200' within the string.

I guess one way would be to create a method that would insert dots in the number.
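A minimal sketch of that first idea, assuming the dots are always thousands separators. The helper name with_thousands_dots is hypothetical (not from any library), and the sample text is shortened from the string above:

```python
def with_thousands_dots(number: str) -> str:
    """Format a digit string with '.' as the thousands separator.

    Assumption: the number is a plain non-negative integer; int() would
    drop any leading zeros.
    """
    # Python's ',' format option groups digits in threes; swap ',' for '.'.
    return f"{int(number):,}".replace(",", ".")


text = "ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ"
needle = with_thousands_dots("10200")

start = text.find(needle)  # str.find returns -1 when the substring is absent
if start != -1:
    print(needle, (start, start + len(needle)))
```

This only finds numbers written in exactly that grouping, which is why a more tolerant match may still be worth having.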

But another way would be to perform some form of fuzzy matching.

To that end, I experimented with the regex module, but without success:

import regex
regex.search('(10200){i}', f'{text}' )

Returns:

<regex.Match object; span=(1, 154), match='ν Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό \nτων δέκα χιλιάδων διακόσια (10.200', fuzzy_counts=(0, 148, 0)>

So, it does not match 10.200 as I had hoped.

What would you suggest?

Hi, did my answer help?

– Wiktor Stribiżew, Commented Dec 9, 2020 at 12:20

Wiktor Stribiżew · Accepted Answer · 2020-07-02 20:18:42Z

If you want the closest match when doing fuzzy matching with the PyPI regex module, use the regex.ENHANCEMATCH flag or its inline form, (?e):

import regex

text = "ην Θεσσαλονίκη και κατοικεί στην Καλαμαριά Θεσσαλονίκης, (οδός Επανομής 32)Το κεφάλαιο της εταιρείας ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ, διαιρούμενο σε δέκα χιλιάδες διακόσια (10.200) εταιρικά μερίδια, ονομαστικής αξίας ενός (1) ευρώ το καθένα, το οποίο καλύφθηκε ολοσχερώς"
m = regex.search('(?e)(?:10200){i}', text)
if m:
    print(m.group())

This prints 10.200.

Moreover, since you know the only difference is a dot somewhere between the digits, you can tell the regex engine to allow at most one insertion with the {i<=1} constraint:

m2 = regex.search('(?:10200){i<=1}', text)
if m2:
    print(m2.group())

Now, even without the ENHANCEMATCH option, you get the expected output.

See the Python demo online.

The best solution, though, is to tell the PyPI regex engine that the only character it may insert is a dot, using the {i<=1:[.]} constraint:

regex.search(r'(?:10200){i<=1:[.]}', text)

The (?:10200){i<=1:[.]} pattern matches 10200 with potentially one single insertion of a dot somewhere in between 1, 0, 2, 0 and 0.
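For comparison, the same "at most one inserted dot" idea can be approximated with the standard re module by listing every variant explicitly. This is a sketch of a different technique, not the fuzzy engine itself: the helper name single_dot_pattern is hypothetical, and it assumes the dot may only appear between two digits, never at the very start or end.

```python
import re


def single_dot_pattern(digits: str) -> re.Pattern:
    """Compile an alternation of `digits` with zero or one '.' inserted
    at an interior position (assumption: never leading or trailing)."""
    # One variant per interior gap: '1.0200', '10.200', '102.00', '1020.0'
    variants = [digits[:i] + "." + digits[i:] for i in range(1, len(digits))]
    variants.append(digits)  # the zero-insertion case
    # re.escape keeps the '.' literal instead of "any character"
    return re.compile("|".join(re.escape(v) for v in variants))


text = "ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ"
m = single_dot_pattern("10200").search(text)
if m:
    print(m.group(), m.span())
```

The alternation grows linearly with the number of digits, so it stays small for ordinary numbers; the fuzzy constraint is more compact when the regex module is available.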

score 0 · Accepted Answer · 2020-07-04 17:12:15Z


It's a little unclear what you mean by fuzzy. My guess is that you want to match a fixed digit string, 10200 in this case, with a dot somewhere inside it.

You could build the regex like this:

(Edit update: fixed a typo)

(?<![\d.])(?=\d+\.\d+(?![\d.]))1\.?0\.?2\.?0\.?0(?![\d.])

see https://regex101.com/r/QM5W0m/1

The lookaround assertions restrict the match to a number containing exactly one dot, located after the first digit and before the last digit.
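To reuse this for other numbers, the pattern can be generated from the digit string. The following is a sketch using the standard re module; the helper name dotted_number_pattern is hypothetical, and the sample text is shortened from the question.

```python
import re


def dotted_number_pattern(digits: str) -> re.Pattern:
    """Build the pattern above for an arbitrary digit string."""
    # An optional literal dot between every pair of adjacent digits
    body = r"\.?".join(re.escape(d) for d in digits)
    return re.compile(
        r"(?<![\d.])"                # not preceded by a digit or a dot
        r"(?=\d+\.\d+(?![\d.]))"     # digits, exactly one dot, digits
        + body
        + r"(?![\d.])"               # not followed by a digit or a dot
    )


text = "ορίζεται στο ποσό των δέκα χιλιάδων διακόσια (10.200) ευρώ"
m = dotted_number_pattern("10200").search(text)
if m:
    print(m.group(), m.span())  # m.span() gives the location in the string
```

Because the lookahead requires a dot, a plain 10200 with no separator is not matched by this pattern.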

edited Jul 4, 2020 at 17:12

answered Jul 2, 2020 at 18:10

user13843220
