I have some text data in one column (Column 1), and was wondering if I could extract a specific sequence from each row of that column and add it to a new column.

For example:

  (column1)
Coke Can 300ml
Bottle 800ml
Cup
Bucket 2000ml

Turns into:

(column1)          (column2)
 Coke Can            300ml
 Bottle              800ml
 Cup                 N/A
 Bucket              2000ml

Basically, I want to extract every phrase with "xxml" and insert that into a new column. Thank you for the help!

4 Answers

Use pandas' str.extract to capture a run of digits followed by 'ml':

  df['(column2)'] = df.iloc[:,0].str.extract(r'(\d+ml)')

    (column1)      (column2)
0   Coke Can 300ml  300ml
1   Bottle 800ml    800ml
2   Cup             NaN
3   Bucket 2000ml   2000ml
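To get the literal "N/A" the question asks for instead of NaN, something like the following may work. This is a minimal sketch: the column names `column1` and `column2` are only illustrative, and `expand=False` is used so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {"column1": ["Coke Can 300ml", "Bottle 800ml", "Cup", "Bucket 2000ml"]}
)

# Capture digits followed by 'ml'; rows without a match become NaN,
# which is then replaced by the string "N/A".
df["column2"] = (
    df["column1"].str.extract(r"(\d+ml)", expand=False).fillna("N/A")
)

print(df)
```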

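If you want to remove the 'ml' after the digits, you can use a regex lookbehind assertion: it matches 'ml' only when it directly follows a digit, and the match is replaced with an empty string. Note that pandas 2.0 and later treat the pattern as a literal string unless regex=True is passed.

df.iloc[:,0] = df.iloc[:,0].str.replace(r'(?<=\d)ml', '', regex=True)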

    (column1)   (column2)
0   Coke Can 300    300ml
1   Bottle 800      800ml
2   Cup             NaN
3   Bucket 2000     2000ml
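If the aim is instead to drop the whole volume (digits and 'ml') from the original column once it has been copied out, a sketch along these lines might do it. The column names are assumptions, and the pattern also removes any whitespace immediately before the volume.

```python
import pandas as pd

df = pd.DataFrame(
    {"column1": ["Coke Can 300ml", "Bottle 800ml", "Cup", "Bucket 2000ml"]}
)

# First copy the volume into its own column.
df["column2"] = df["column1"].str.extract(r"(\d+ml)", expand=False)

# Then strip the volume, plus the whitespace in front of it, from the
# original column. regex=True is needed on pandas >= 2.0.
df["column1"] = df["column1"].str.replace(r"\s*\d+ml", "", regex=True)

print(df)
```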

2 Comments

Thanks! This worked. Now, is there a way to remove the ml values in the original column?
added a solution to cover removing 'ml' values after the digit

Use Series.str.extractall to split each value into several columns (name, number, unit).

import pandas as pd
df = pd.DataFrame(dict(
    col1 = ['Coke Can 300ml', 'Bottle 800ml', 'Cup', 'Bucket 2000ml']))
print(df.to_markdown())
|    | col1           |
|---:|:---------------|
|  0 | Coke Can 300ml |
|  1 | Bottle 800ml   |
|  2 | Cup            |
|  3 | Bucket 2000ml  |

import re
# groups: name (letters and spaces), optional digits, optional unit
df = df['col1'].str.extractall(r'([a-z ]+)(\d+)?([a-z]+)?', flags=re.I)
print(df.to_markdown())

|        | 0        |    1 | 2   |
|:-------|:---------|-----:|:----|
| (0, 0) | Coke Can |  300 | ml  |
| (1, 0) | Bottle   |  800 | ml  |
| (2, 0) | Cup      |  nan | nan |
| (3, 0) | Bucket   | 2000 | ml  |
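Since `extractall` returns a MultiIndex of (row, match number), `str.extract` with named groups may be more convenient when each row holds at most one volume, because it keeps the original index. The following is only a sketch: the group names `name`, `amount` and `unit` are made up for illustration, and the pattern assumes the volume, if present, sits at the end of the string.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["Coke Can 300ml", "Bottle 800ml", "Cup", "Bucket 2000ml"]}
)

# name: everything before the optional trailing volume (lazy match)
# amount / unit: digits and the literal 'ml', both optional as a pair
pattern = r"^(?P<name>.*?)\s*(?:(?P<amount>\d+)(?P<unit>ml))?$"

parts = df["col1"].str.extract(pattern)
print(parts)
```

The named groups become the column names of `parts`, and rows without a volume get missing values in `amount` and `unit`.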

You might want to try this.

df['new_column'] = df['column'].apply(lambda x: x.split()[-1] if len(x.split()) > 1 else None) 
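One caveat is that taking the last word also picks up non-volume words, for example "Can" from a plain "Coke Can". A slightly more defensive sketch is shown here; the helper name `last_token_if_volume` and the extra "Coke Can" row are hypothetical and only there to illustrate the check.

```python
import pandas as pd

df = pd.DataFrame(
    {"column": ["Coke Can 300ml", "Bottle 800ml", "Cup", "Coke Can"]}
)

def last_token_if_volume(text):
    """Return the last word only if it looks like '<digits>ml'."""
    tokens = text.split()
    last = tokens[-1] if tokens else ""
    # Everything before the trailing 'ml' must be digits.
    if last.endswith("ml") and last[:-2].isdigit():
        return last
    return None

df["new_column"] = df["column"].apply(last_token_if_volume)
print(df)
```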

Given

df = pd.DataFrame(dict(
    col1 = ['Coke Can 300ml', 'Bottle 800ml', 'Cup', 'Bucket 2000ml'])
)

the following might be what you're after here:

In [13]: df.col1.str.split(' ', expand=True, n = 1)
Out[13]:
        0          1
0    Coke  Can 300ml
1  Bottle      800ml
2     Cup       None
3  Bucket     2000ml

However, this splits on the first whitespace from the left of each value, so 'Coke Can 300ml' becomes 'Coke' and 'Can 300ml' instead of separating out the volume.

For this, the answer you have from @sammywemmy seems best; I'm simply putting this here as it might be of interest.
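If a split-based approach is still preferred, splitting once from the right with `str.rsplit` may get closer to the desired result. This is a sketch that assumes the volume is always the last space-separated word when it is present.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["Coke Can 300ml", "Bottle 800ml", "Cup", "Bucket 2000ml"]}
)

# n=1 performs a single split, starting from the right-hand side,
# so multi-word names such as "Coke Can" stay together.
parts = df["col1"].str.rsplit(" ", n=1, expand=True)
print(parts)
```

Note that a value such as "Coke Can" with no volume would still be split into "Coke" and "Can", so the regex approach remains the safer choice.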
