0

pandas df with 11 columns need to modify first 3 columns using regex and add a new column with this modified column and us this for downstream concatenation, something like this I need to keep the element as is of these columns and make it a unique string

column1 column2 column3 column4 ...column 11

need to do this new_col = column1:column2-column3(column4)

and make this new column,

column1 column2 column3 newcol column4 ...column 11

I could do this using simple python one line, not sure what is the syntax for pandas

l = cols[0] + ":" + cols[1] + "-" + cols[2] + "(" + cols[5] + ")"
2
  • Your example code will work fine if cols[0], cols[1], cols[2] and cols[5] are strings. If not, you need to convert them to strings before combining them. In standard python code you would do this with str(cols[0]). With pandas columns, you can do this with cols[0].astype(str). Commented Apr 19, 2017 at 19:27
  • Agree, but i would have still not known how to add a new column to existing df Commented Apr 19, 2017 at 19:38

3 Answers 3

3

You should just be able to do it with the same syntax you posted as long as all of the columns contain strings.

You can also use the Series.str.cat method.

df['new_col'] = cols[0].str.cat(':' + cols[1] + '-' + cols[2] + '(' + cols[5]+ ')')
Sign up to request clarification or add additional context in comments.

3 Comments

df1['unique_col'] = df1['chrom'].str.cat(':' + df1['start'] + '-' + df1['end'] + '(' + df1['strand'] + ')') gives me AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
@sbradbio as I said "as long as all of the columns contain strings", if not you will need to cast the as strings as piRsquared did with .astype(str)
gotcha! need some more coffee missed the string part thanks.
2

consider the dataframe df

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(a, (5, 10))).add_prefix('col ')

print(df)

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9
0     Q     L     C     K     P     X     N     L     N     T
1     I     X     A     W     Y     M     W     A     C     A
2     U     Z     H     T     N     S     M     E     D     T
3     N     W     H     X     N     U     F     D     X     F
4     Z     L     Y     H     M     G     E     H     W     S

Construct a custom format function

f = lambda row: '{col 1}:{col 2}-{col 3}({col 4})'.format(**row)

And apply to df

df.astype(str).apply(f, 1)

0    L:C-K(P)
1    W:A-C(A)
2    W:H-X(N)
3    E:H-W(S)
4    Y:E-P(N)
dtype: object

Add a new column with assign

df.assign(New=df.astype(str).apply(f, 1))
# assign in place with
# df['New'] = df.astype(str).apply(f, 1)

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9       New
0     Q     L     C     K     P     X     N     L     N     T  L:C-K(P)
1     I     X     A     W     Y     M     W     A     C     A  X:A-W(Y)
2     U     Z     H     T     N     S     M     E     D     T  Z:H-T(N)
3     N     W     H     X     N     U     F     D     X     F  W:H-X(N)
4     Z     L     Y     H     M     G     E     H     W     S  L:Y-H(M)

Or you can wrap this up into another function that operates on pd.Series. This requires that you pass the columns in the correct order.

def u(a, b, c, d):
    return a + ':' + b + '-' + c + '(' + d + ')'

df.assign(New=u(df['col 1'], df['col 2'], df['col 3'], df['col 4']))
# assign in place with
# df['New'] = u(df['col 1'], df['col 2'], df['col 3'], df['col 4'])

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9       New
0     Q     L     C     K     P     X     N     L     N     T  L:C-K(P)
1     I     X     A     W     Y     M     W     A     C     A  X:A-W(Y)
2     U     Z     H     T     N     S     M     E     D     T  Z:H-T(N)
3     N     W     H     X     N     U     F     D     X     F  W:H-X(N)
4     Z     L     Y     H     M     G     E     H     W     S  L:Y-H(M)

4 Comments

It's not clear what should be in a in the second line of the first code block.
@piRSquared worked like charm many thanks!!!! could you please explain what you just did in the second block of code (lambda) and assign?
I use assign because it creates a copy of the dataframe and I typically don't want to clobber your dataframe by writing over it. So I use assign. However, very ofter, you see answers assign to a new column in the same dataframe. That's perfectly fine. Just not how I usually do it.
In the second block of code... honestly, its the same as what @Grr did, except I wrapped it in a function that is more readable. By operating on the series as a whole, we avoid the inherent loop executed by apply.
1

Based on an answer that was recently deleted, this works fine:

df1 = pd.DataFrame({
    'chrom': ['a', 'b', 'c'], 
    'start': ['d', 'e', 'f'], 
    'end': ['g', 'h', 'i'], 
    'strand': ['j', 'k', 'l']}
)
df1['unique_col'] = df1.chrom + ':' + df1.start + '-' + df1.end + '(' + df1.strand + ')'

It sounds like your original dataframe may not contain strings. If it contains numbers, you need something like this:

df1 = pd.DataFrame({
    'chrom': [1.0, 2.0], 
    'start': [3.0, 4.0], 
    'end': [5.0, 6.0], 
    'strand': [7.0, 8.0]}
)
df1['unique_col'] = (
    df1.chrom.astype(str) + ':' 
    + df1.start.astype(str) + '-' + df1.end.astype(str)
    + '(' + df1.strand.astype(str) + ')'
)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.