1

I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:

Text
Dimensions: 23"/60
Dimensions: 23" / 60
Dimensions: 48"
Dimensions: 22.5X8.25
Dimensions: 80IN
Dimensions: 567 S
Dimensions: 22.5X8.25
Dimensions: 26INNP
Dimensions: 24" x 55" with pipe 16 x 7

I am using regex and is as follows: regex = r"(\d([^\s]*)\s.\s\d*[^\s])|(\d([^\s])*)"

I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with '_'

df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))

Can someone tell me what I am doing wrong?

Edit:

Specifying rules to the pattern to solidify the regex:

  • Sentences containing words with numbers should be extracted (for sentence short A20-6014 complete, A20-6014 should be extracted)
  • Sentences beginning with words with numbers should be extracted (for sentence A20-6014 short complete, A20-6014 should be extracted)
  • Sentences ending with words with numbers should be extracted (for sentence short complete A20-6014, A20-6014 should be extracted)
  • Two letter words in between words containing numbers should be considered (ie. for sentence ELBOW- 2.5 IN 90 SIL, 2.5 IN 90 should be extracted)
  • Greater than 2 letter words in between words containing numbers should extracted separately and joined with a _. This also applies for alphanumeric words occurring at two different places in a sentence (ie. for sentence ELBOW- 2.5 IND 90 SIL, 2.5_90 should be extracted. )
  • Another Test Case: for a sentence 90 CONN 3/4 ST-#8 FL, 90_3/4_ST-#8 should be extracted
0

1 Answer 1

3

You can use

regex = r'\d+(?:\.\d+)?"?(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?(?:\s*[A-Z]+\b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')

See the regex demo.

Details:

  • \d+(?:\.\d+)? - an int or float value
  • "? - an optional " char
  • (?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)? - an optional sequence of
    • \s*[Xx/.]\s* - X, x, / or . enclosed with zero or more whitespaces
    • \d+(?:\.\d+)? - an int or float value
    • "? - an optional " char
  • (?:\s*[A-Z]+\b)? - an optional sequence of
    • \s* - zero or more whitespaces
    • [A-Z]+ - one or more ASCII uppercase letters
    • \b - a word boundary.

Here is a Pandas test:

import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'\d+(?:\.\d+)?"?(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?(?:\s*[A-Z]+\b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')

Output:

>>> df
                                     Text      Numeric_Info
0                            No data here                  
1                      Dimensions: 23"/60            23"/60
2                    Dimensions: 23" / 60          23" / 60
3                         Dimensions: 48"               48"
4                   Dimensions: 22.5X8.25         22.5X8.25
5                        Dimensions: 80IN              80IN
6                       Dimensions: 567 S             567 S
7                   Dimensions: 22.5X8.25         22.5X8.25
8                      Dimensions: 26INNP            26INNP
9  Dimensions: 24" x 55" with pipe 16 x 7  24" x 55"_16 x 7
Sign up to request clarification or add additional context in comments.

5 Comments

@Parsh I know, I follow your post. The last test case is not in line with the other rules, so I do not think I can help more than coming up with this pattern.
could you help as to why this does not capture words like 1st, 2nd, 3rd and so on. Aren't they similar to the other ones?
@Parsh Please provide a sample string.
Sure. For instance, Dimensions: 80th flag. Should it not capture 80th
@Parsh you may use this regex then.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.