I have the following data frame (df):

mut   gene   pvalue    chrom
1:23456_A>G  0.005     chr1  
2:28484_A>G  0.0001    chr2
4:47629_A>G  0.05      chr4
3:88382_A>G  0.00001   chr3
10:88273_A>G 0.005    chr10

[30 rows x 4 columns]

I am trying to create four labeled columns from the "mut" column of df and assign them to a newly created df_new that looks like this:

chr    st    ref   alt 
1     23456   A     G  
2     28484   A     G  
4     47629   A     G

The resulting data frame (df_new) is an extraction of the mut column from df, with each string separated into its parts, i.e. split(":"), then split("_"), and finally split(">"). This yields the four parts of the original field (1, 23456, A, G), each placed in its own column.

Here is my attempt:

df_new["chr"], df_new["st"], df_new["ref"], df_new["alt"] = df.mut.str.split("[:_>]")

but I get the following error message:

ValueError: too many values to unpack (expected 4)

A simple print statement shows the result of this line of code:

 df.mut.str.split("[:_>]")

as:

0   [1, 23456, A, G]  
1   [2, 28484, A, G]
        .
        .
        .

Is there a way in pandas to create a new data frame by splitting the string field into 4 columns, with their column labels included?

1 Answer

Let's try .str.split(expand=True):

# expand=True returns a DataFrame with one column per split piece
df2 = df.mut.str.split('[:_>]', expand=True)
df2.columns = ['chr', 'st', 'ref', 'alt']



  chr     st ref alt
0   1  23456   A   G
1   2  28484   A   G
2   4  47629   A   G
3   3  88382   A   G
4  10  88273   A   G
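
For anyone who wants to run this end to end, here is a minimal sketch. The sample frame is rebuilt from the rows shown in the question (the gene column is left out because its values are not shown), and the names as_lists and df_alt are just illustrative. It also shows why the original attempt failed, converts st to an integer, and, as an alternative that is not part of the answer above, uses str.extract with named groups so the column labels come from the pattern itself.

```python
import pandas as pd

# Sample data modelled on the rows shown in the question (assumption:
# the gene column is omitted because its values are not shown).
df = pd.DataFrame({
    "mut": ["1:23456_A>G", "2:28484_A>G", "4:47629_A>G",
            "3:88382_A>G", "10:88273_A>G"],
    "pvalue": [0.005, 0.0001, 0.05, 0.00001, 0.005],
    "chrom": ["chr1", "chr2", "chr4", "chr3", "chr10"],
})

# Without expand=True the result is a Series holding one list per row.
# Unpacking it into four names iterates over the rows, not the pieces,
# which is where the "too many values to unpack" error comes from.
as_lists = df["mut"].str.split("[:_>]")

# With expand=True the result is a DataFrame with one column per piece.
df_new = df["mut"].str.split("[:_>]", expand=True)
df_new.columns = ["chr", "st", "ref", "alt"]

# The pieces are strings; convert the position to an integer if needed.
df_new["st"] = df_new["st"].astype(int)

# Alternative: named groups in str.extract label the columns directly.
pattern = r"(?P<chr>[^:]+):(?P<st>\d+)_(?P<ref>[^>]+)>(?P<alt>.+)"
df_alt = df["mut"].str.extract(pattern)

print(df_new)
```

The chr column is kept as a string here on the assumption that non-numeric chromosome names could appear in real data.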