string operation on pandas df

Question

I have a pandas df with 11 columns. I need to modify the first 3 columns using regex, add a new column built from the modified columns, and use it for downstream concatenation. The elements of these columns should be kept as they are and combined into one unique string, something like this:

column1 column2 column3 column4 ...column 11

I need to build new_col = column1:column2-column3(column4)

and add it as a new column:

column1 column2 column3 newcol column4 ...column 11

I could do this with a simple one-line expression in plain Python, but I am not sure what the syntax is for pandas:

l = cols[0] + ":" + cols[1] + "-" + cols[2] + "(" + cols[5] + ")"

Your example code will work fine if cols[0], cols[1], cols[2] and cols[5] are strings. If not, you need to convert them to strings before combining them. In standard Python code you would do this with str(cols[0]). With pandas columns, you can do this with cols[0].astype(str).
– Matthias Fripp, Commented Apr 19, 2017 at 19:27
Agreed, but I still would not have known how to add a new column to an existing df.
– sbradbio, Commented Apr 19, 2017 at 19:38
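
A minimal sketch of the two points raised in these comments: casting to strings and adding the result to an existing DataFrame at a chosen position. The frame and its values are hypothetical stand-ins for the 11-column one in the question; `DataFrame.insert` places the new column right after column3, as the question asks.

```python
import pandas as pd

# Hypothetical stand-in for the first columns of the frame in the question.
df = pd.DataFrame({
    "column1": ["chr1", "chr2"],
    "column2": [100, 200],
    "column3": [150, 250],
    "column4": ["+", "-"],
})

# Cast each piece to str so numeric columns concatenate cleanly.
new_col = (
    df["column1"].astype(str) + ":"
    + df["column2"].astype(str) + "-"
    + df["column3"].astype(str)
    + "(" + df["column4"].astype(str) + ")"
)

# Insert at position 3 (0-based), i.e. directly after column3.
# Note that insert modifies df in place.
df.insert(3, "new_col", new_col)
print(df.columns.tolist())
```

A plain `df['new_col'] = new_col` would also work, but it appends the column at the end instead of after column3.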

Grr · Accepted Answer · 2017-04-19 18:33:15Z

3

You should just be able to do it with the same syntax you posted as long as all of the columns contain strings.

You can also use the Series.str.cat method.

df['new_col'] = cols[0].str.cat(':' + cols[1] + '-' + cols[2] + '(' + cols[5]+ ')')

answered Apr 19, 2017 at 18:33


3 Comments

sbradbio Over a year ago

df1['unique_col'] = df1['chrom'].str.cat(':' + df1['start'] + '-' + df1['end'] + '(' + df1['strand'] + ')') gives me AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Grr Over a year ago

@sbradbio as I said, "as long as all of the columns contain strings". If they do not, you will need to cast them to strings, as piRSquared did with .astype(str).

sbradbio Over a year ago

gotcha! need some more coffee missed the string part thanks.
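
The error discussed in this thread can be reproduced and avoided roughly as follows. This is only a sketch: the column names come from the comment above, while the values are hypothetical, with start and end assumed to be integers.

```python
import pandas as pd

# Hypothetical data; only the column names come from the comment above.
df1 = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [100, 200],   # integers, not strings
    "end": [150, 250],
    "strand": ["+", "-"],
})

# The .str accessor is only available on string-like columns.
try:
    df1["start"].str.cat(df1["chrom"])
except AttributeError as err:
    print("AttributeError:", err)

# Casting everything to str first lets str.cat join the pieces.
s = df1.astype(str)
df1["unique_col"] = s["chrom"].str.cat(
    ":" + s["start"] + "-" + s["end"] + "(" + s["strand"] + ")"
)
print(df1["unique_col"].tolist())
```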

piRSquared · Accepted Answer · 2017-04-19 19:14:28Z

2

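Consider the dataframe df

import numpy as np
import pandas as pd
a = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')  # assumed; `a` is not defined in the original post
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(a, (5, 10))).add_prefix('col ')

print(df)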

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9
0     Q     L     C     K     P     X     N     L     N     T
1     I     X     A     W     Y     M     W     A     C     A
2     U     Z     H     T     N     S     M     E     D     T
3     N     W     H     X     N     U     F     D     X     F
4     Z     L     Y     H     M     G     E     H     W     S

Construct a custom format function

f = lambda row: '{col 1}:{col 2}-{col 3}({col 4})'.format(**row)

And apply to df

df.astype(str).apply(f, 1)

0    L:C-K(P)
1    X:A-W(Y)
2    Z:H-T(N)
3    W:H-X(N)
4    L:Y-H(M)
dtype: object

Add a new column with assign

df.assign(New=df.astype(str).apply(f, 1))
# assign in place with
# df['New'] = df.astype(str).apply(f, 1)

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9       New
0     Q     L     C     K     P     X     N     L     N     T  L:C-K(P)
1     I     X     A     W     Y     M     W     A     C     A  X:A-W(Y)
2     U     Z     H     T     N     S     M     E     D     T  Z:H-T(N)
3     N     W     H     X     N     U     F     D     X     F  W:H-X(N)
4     Z     L     Y     H     M     G     E     H     W     S  L:Y-H(M)

Or you can wrap this up into another function that operates on pd.Series. This requires that you pass the columns in the correct order.

def u(a, b, c, d):
    return a + ':' + b + '-' + c + '(' + d + ')'

df.assign(New=u(df['col 1'], df['col 2'], df['col 3'], df['col 4']))
# assign in place with
# df['New'] = u(df['col 1'], df['col 2'], df['col 3'], df['col 4'])

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9       New
0     Q     L     C     K     P     X     N     L     N     T  L:C-K(P)
1     I     X     A     W     Y     M     W     A     C     A  X:A-W(Y)
2     U     Z     H     T     N     S     M     E     D     T  Z:H-T(N)
3     N     W     H     X     N     U     F     D     X     F  W:H-X(N)
4     Z     L     Y     H     M     G     E     H     W     S  L:Y-H(M)

edited Apr 19, 2017 at 19:14

answered Apr 19, 2017 at 18:48


4 Comments

Matthias Fripp Over a year ago

It's not clear what should be in a in the second line of the first code block.

sbradbio Over a year ago

@piRSquared worked like charm many thanks!!!! could you please explain what you just did in the second block of code (lambda) and assign?

piRSquared Over a year ago

I use assign because it creates a copy of the dataframe, and I typically don't want to clobber your dataframe by writing over it. However, very often you see answers that assign to a new column in the same dataframe. That's perfectly fine, just not how I usually do it.

piRSquared Over a year ago

In the second block of code... honestly, it's the same as what @Grr did, except I wrapped it in a function that is more readable. By operating on the series as a whole, we avoid the inherent loop executed by apply.
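
To illustrate the two points made in the comments above, here is a small sketch on a hypothetical two-column frame (the names a and b are made up for this example). It shows that assign leaves the original frame untouched, and that the vectorized concatenation and the row-wise apply produce the same strings.

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
df = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"]})

# assign returns a new DataFrame; df itself is not modified.
out = df.assign(New=df["a"] + ":" + df["b"])
print(list(df.columns))    # still only the original columns
print(list(out.columns))   # original columns plus "New"

# Row-wise apply builds the same strings, but calls the function once per row.
via_apply = df.apply(lambda row: "{a}:{b}".format(**row), axis=1)
print(out["New"].tolist() == via_apply.tolist())
```

On small frames the difference is negligible; the whole-series version tends to matter more as the number of rows grows.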

Matthias Fripp · Accepted Answer · 2017-04-19 19:15:41Z

1

Based on an answer that was recently deleted, this works fine:

df1 = pd.DataFrame({
    'chrom': ['a', 'b', 'c'], 
    'start': ['d', 'e', 'f'], 
    'end': ['g', 'h', 'i'], 
    'strand': ['j', 'k', 'l']}
)
df1['unique_col'] = df1.chrom + ':' + df1.start + '-' + df1.end + '(' + df1.strand + ')'

It sounds like your original dataframe may not contain strings. If it contains numbers, you need something like this:

df1 = pd.DataFrame({
    'chrom': [1.0, 2.0], 
    'start': [3.0, 4.0], 
    'end': [5.0, 6.0], 
    'strand': [7.0, 8.0]}
)
df1['unique_col'] = (
    df1.chrom.astype(str) + ':' 
    + df1.start.astype(str) + '-' + df1.end.astype(str)
    + '(' + df1.strand.astype(str) + ')'
)
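
One thing to watch with float columns: astype(str) keeps the decimal point, so 3.0 becomes "3.0". If the coordinates are really whole numbers, casting to int first is one possible workaround. The sketch below uses hypothetical values and assumes the float columns contain no NaN (astype(int) would fail on those).

```python
import pandas as pd

# Hypothetical values; start and end are stored as floats.
df1 = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [3.0, 4.0],
    "end": [5.0, 6.0],
    "strand": ["+", "-"],
})

# astype(str) on floats keeps the trailing ".0".
print(df1["start"].astype(str).tolist())

# Casting to int first drops it (assumes whole numbers and no NaN).
start = df1["start"].astype(int).astype(str)
end = df1["end"].astype(int).astype(str)
df1["unique_col"] = df1["chrom"] + ":" + start + "-" + end + "(" + df1["strand"] + ")"
print(df1["unique_col"].tolist())
```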

answered Apr 19, 2017 at 19:15
