
I need to extract the volume with a regular expression from strings like "Candy BAR 350G" (volume = 350G),

"Gin Barrister 0.9ml" (volume = 0.9ml),

"BAXTER DRY Gin 40% 0.5 ml" (volume = 0.5 ml),

"SWEET CORN 340G/425ML GLOBUS" (volume = 340G/425ML)

I tried using '\d+\S*[gGMmLl]'

and it worked well, until I ran into strings like "Candies 2x150G" (the volume I need is 150G, but I get 2x150G) or

"FOOD DYES 3COL.9G" (I need 9G, but I get 3COL.9G).

I don't know what else to add to the regular expression.

  • You want '\s', not '\S'. Commented Jul 25, 2024 at 12:34

1 Answer


Let's start with the full code, then break it down into smaller pieces:

import re

fluids = [
    "Candy BAR 350G",
    "Gin Barrister 0.9ml",
    "BAXTER DRY Gin 40% 0.5 ml",
    "SWEET CORN 340G/425ML GLOBUS",
    "Candies 2x150G",
    "FOOD DYES 3COL.9G"
]

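# Group 1: a digit followed by any run of digits or dots (the quantity).
# Then optional whitespace, then group 2: the unit (ml or g).
pattern = r"(\d[\d.]{0,})\s?(ml|g)"

for fluid in fluids:
    # IGNORECASE lets "ml|g" also match "ML", "G", and so on.
    print(re.findall(pattern, fluid, flags=re.IGNORECASE))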

which produces

[('350', 'G')]
[('0.9', 'ml')]
[('0.5', 'ml')]
[('340', 'G'), ('425', 'ML')]
[('150', 'G')]
[('9', 'G')]

Note first that we make our lives simpler by passing the flag re.IGNORECASE, so the lowercase ml|g in the pattern also matches G, ML, and so on. The pattern is also written as a raw string, r"...", so that Python passes the backslashes in \d and \s through to the regex engine unchanged instead of trying to interpret them as string escape sequences.
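As a small illustration of both points, here is a sketch (the \b pattern and the sample text are made up for the demonstration, not taken from the question) showing what happens to a backslash without the r prefix, and how re.IGNORECASE widens the match:

```python
import re

# In a raw string the backslashes reach the regex engine untouched.
raw = r"\bml\b"
# Without the r prefix, Python turns each "\b" into a backspace character.
plain = "\bml\b"

print(len(raw))    # 6
print(len(plain))  # 4

# With re.IGNORECASE, the lowercase "ml" also matches "ML" and "mL".
matches = re.findall(r"\d+\s?ml", "425ML and 10 mL", flags=re.IGNORECASE)
print(matches)     # ['425ML', '10 mL']
```

Sequences such as \d and \s happen to survive in a normal string today because Python does not recognise them as escapes, but relying on that is fragile, so the raw string is the safer habit.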

Anything wrapped in plain (...) parentheses, that is, without a leading ?:, ?=, ?! or similar, is a capturing group. Capturing groups tell the regex engine which parts of the match you want returned. We use them here so that the optional whitespace (matched by \s?) is left out of the result, and only the quantity (\d[\d.]{0,}) and the unit (ml|g) are kept. Because the pattern contains two capturing groups, re.findall returns each match as a (quantity, unit) tuple.
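To see the effect of the groups, this sketch runs the same pattern once with a non-capturing unit group and no quantity group, and once with the two capturing groups from the answer. The variable names are my own, and joining the tuples back into strings is just one possible way to get output in the "350G" form the question asks for:

```python
import re

text = "SWEET CORN 340G/425ML GLOBUS"

# No capturing groups: findall returns each whole match as a string.
no_groups = re.findall(r"\d[\d.]*\s?(?:ml|g)", text, flags=re.IGNORECASE)

# Two capturing groups: findall returns one (quantity, unit) tuple per match.
two_groups = re.findall(r"(\d[\d.]*)\s?(ml|g)", text, flags=re.IGNORECASE)

print(no_groups)   # ['340G', '425ML']
print(two_groups)  # [('340', 'G'), ('425', 'ML')]

# Joining each tuple gives a volume string with any whitespace dropped.
volumes = ["".join(pair) for pair in two_groups]
print(volumes)     # ['340G', '425ML']
```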

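The quantity is captured with \d[\d.]{0,}: the match must start with a digit (\d), followed by zero or more characters from the set [\d.] (any digit or a full stop). The quantifier {0,} means "zero or more" and is equivalent to *. Because only digits and full stops are allowed between the first digit and the unit, prefixes such as the "2x" in "2x150G" or the "3COL." in "3COL.9G" are no longer swallowed, as they were by \S* in the original pattern.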
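The sketch below checks that {0,} and * behave the same on the strings from the question. It also compares a stricter quantity pattern, \d+(?:\.\d+)?, which is my own suggestion rather than part of the answer; the input "ITEM 12.G" is a made-up string used only to show where the two differ:

```python
import re

flags = re.IGNORECASE
braces = re.compile(r"(\d[\d.]{0,})\s?(ml|g)", flags)    # as in the answer
star = re.compile(r"(\d[\d.]*)\s?(ml|g)", flags)         # same meaning, shorter
strict = re.compile(r"(\d+(?:\.\d+)?)\s?(ml|g)", flags)  # digits, optional .digits

samples = ["Gin Barrister 0.9ml", "Candies 2x150G", "FOOD DYES 3COL.9G"]
for s in samples:
    # All three patterns agree on the strings from the question.
    print(s, braces.findall(s), star.findall(s), strict.findall(s))

# Hypothetical input: the loose pattern keeps the trailing dot in the quantity,
# while the strict one finds no match at all.
print(braces.findall("ITEM 12.G"))  # [('12.', 'G')]
print(strict.findall("ITEM 12.G"))  # []
```

Whether the stricter form is worth it depends on how messy the real data is; for the examples given in the question both behave identically.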

The unit is captured with ml|g, which tells the regex engine to match either ml or g in the second capturing group.
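One thing to watch for: ml|g only checks that the unit starts at that position, so a number followed by any word beginning with "g" would also match. If that can occur in your data (an assumption on my part, the string "SWEET CORN 340 GLOBUS" below is invented), a word boundary \b after the unit group is one possible guard:

```python
import re

original = re.compile(r"(\d[\d.]*)\s?(ml|g)", re.IGNORECASE)
bounded = re.compile(r"(\d[\d.]*)\s?(ml|g)\b", re.IGNORECASE)

# Hypothetical string: a number followed by a word that starts with "G".
tricky = "SWEET CORN 340 GLOBUS"
print(original.findall(tricky))  # [('340', 'G')]
print(bounded.findall(tricky))   # []

# The strings from the question still match with the boundary in place.
print(bounded.findall("SWEET CORN 340G/425ML GLOBUS"))  # [('340', 'G'), ('425', 'ML')]
print(bounded.findall("Candies 2x150G"))                # [('150', 'G')]
```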

Hope this helps.
