Extracting text from elements in Pandas column, writing to new column

Question

I've got some data in a column (COL_NAME) of a Pandas DataFrame. I'd like to extract some text between '(' and ')' (this data either exists, or the parens don't exist at all, although there may be more than one set of parens in the data). I'd then like to write the data in the paren to another column, and then remove the '(XXX)' from the original string.

I.e.

COL_NAME
========
(info) text (yay!)
I love text
Text is fun
(more info) more text
lotsa text (boo!)

turns in to:

COL_NAME          NEW_COL
========          =======
text (yay!)       info
i love text       None
Text is fun       None
more text         more info
lots text (boo!)  None

I can do this by isolating the column, iterating through its elements, splitting on the (, creating two new lists and then adding them to the DataFrame, but there's assuredly a way more Pythonic/Pandic way of doing this, right?

Thanks!

Can you show us your code when you iterate through the elements? I don't want to write the split function (and have to guess what it does). — IanS
– IanS, Commented Oct 29, 2015 at 8:30
I highly doubt that he coded lotsa text (boo!) to lots text (boo!) — WoodChopper
– WoodChopper, Commented Oct 29, 2015 at 9:14
Because the text I'm tearing has some stuff in parens at the beginning of the string that I'm interested in, but may also have some other stuff in parens at the end of the string that I want to leave along. — user770901
– user770901, Commented Nov 5, 2015 at 6:44

jezrael · Accepted Answer · 2015-10-29 13:45:30Z

7

It isn't clear, why second parentheses doesn't match. Maybe because char !.

Then you can use extract with regular expression.

Regular expression \(([A-Za-z0-9 _]+)\) means:

\( matches a literal ( character
( begins a new group
[A-Za-z0-9 _] is a character set matching any letter (capital or lower case), digit or underscore and space
+ matches the preceding element (the character set) one or more times.
) ends the group
\) matches a literal ) character

Second parentheses isn't matched, because regex exclude character ! - it isn't in brackets [A-Za-z0-9 _].

import pandas as pd
import numpy as np
import io

temp=u"""(info) text (yay!)
I love text
Text is fun
(more info) more text
lotsa text (boo!)"""

df = pd.read_csv(io.StringIO(temp), header=None, names=['original'])
print df
#                  original
#0       (info) text (yay!)
#1              I love text
#2              Text is fun
#3  (more info) more text
#4        lotsa text (boo!)

df['col1'] = df['original'].str.extract(r"\(([A-Za-z0-9 _]+)\)")
df['col2'] = df['original'].str.replace(r"\(([A-Za-z0-9 _]+)\)", "")
print df
#                original       col1               col2
#0     (info) text (yay!)       info        text (yay!)
#1            I love text        NaN        I love text
#2            Text is fun        NaN        Text is fun
#3  (more info) more text  more info          more text
#4      lotsa text (boo!)        NaN  lotsa text (boo!)

answered Oct 29, 2015 at 13:45

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

user770901 Over a year ago

This works great! Thank you so much. One note: @jezrael, is that it leaves a leading space on the "leftover" text in col2. Now there's the issue of applying lstrip to an entire column of a dataframe...But thank you for the well explained answer!

Collectives™ on Stack Overflow

Extracting text from elements in Pandas column, writing to new column

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related