I want to split data stored in a single column into two separate columns, based on the text each row contains. Here is the data:

3C-assembly|contig_93
ptg000037l  
3C-assembly|contig_94
ptg000039l  
3C-assembly|contig_95
ptg000043l  
3C-assembly|contig_96
ptg000196l  
ptg000060l  
3C-assembly|contig_97
ptg000083l  
ptg000083l  
3C-assembly|contig_98
ptg000117l  
ptg000005l  
3C-assembly|contig_99
ptg000123l  
ptg000123l  
ptg0001232  
ptg0001233  
    

I need every 3C-assembly|contig_ identifier in the first column and each corresponding ptg000 entry in the second column:

3C-assembly|contig_93 ptg000037l
3C-assembly|contig_94 ptg000039l
3C-assembly|contig_95 ptg000043l
3C-assembly|contig_96 ptg000196l
3C-assembly|contig_96 ptg000060l
3C-assembly|contig_97 ptg000083l
3C-assembly|contig_97 ptg000083l
3C-assembly|contig_98 ptg000117l
3C-assembly|contig_98 ptg000005l
3C-assembly|contig_99 ptg000123l
3C-assembly|contig_99 ptg000123l
3C-assembly|contig_99 ptg0001232
3C-assembly|contig_99 ptg0001233
...........
  • What is the data type? Which language are you using? Is the data stored in some data structure? Commented May 19, 2021 at 16:23
  • Hi Karim, welcome to Stack Overflow. Please see Ido's questions and add a bit more detail to your question. Please also share anything you've tried or started, as that is often a helpful way to get clear suggestions. Thanks! Commented May 19, 2021 at 16:27
  • I noticed you had a trailing "l" (el) instead of a 1 (one). It doesn't affect the problem as stated, but I wondered if it affects your data integrity? Commented May 19, 2021 at 16:48

2 Answers

Here's an R answer. Create a grouping vector by applying cumsum to the presence of "3C" (or some other identifier for your groups, perhaps the "|" character). You can then split on that vector and pair the first item of each group with its remaining items; R's recycling convention for data frame definitions repeats the single first item as often as needed:

dat <- read.table(text = txt)  # txt holds the data copied from the question
dat <- cbind(dat, grp = cumsum(grepl("3C", dat$V1)))
# the grepl pattern could have been "assembly" if that were more general

do.call(rbind, lapply(split(dat, dat$grp),
        function(x) data.frame(
                        group = x[1, 1],     # first item gets recycled
                        item  = x[-1, 1])))  # the remaining items
                    group       item
1   3C-assembly|contig_93 ptg000037l
2   3C-assembly|contig_94 ptg000039l
3   3C-assembly|contig_95 ptg000043l
4.1 3C-assembly|contig_96 ptg000196l
4.2 3C-assembly|contig_96 ptg000060l
5.1 3C-assembly|contig_97 ptg000083l
5.2 3C-assembly|contig_97 ptg000083l
6.1 3C-assembly|contig_98 ptg000117l
6.2 3C-assembly|contig_98 ptg000005l
7.1 3C-assembly|contig_99 ptg000123l
7.2 3C-assembly|contig_99 ptg000123l
7.3 3C-assembly|contig_99 ptg0001232
7.4 3C-assembly|contig_99 ptg0001233
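
For readers working in Python, the same cumsum-then-split idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: the rows are already in a list of strings, the first row is a header, and header rows start with "3C". The names `rows`, `group_ids` and `pairs` are made up for the example.

```python
from itertools import accumulate, groupby

# A small excerpt of the question's data (assumed to be loaded as a list).
rows = [
    "3C-assembly|contig_95",
    "ptg000043l",
    "3C-assembly|contig_96",
    "ptg000196l",
    "ptg000060l",
]

# Running count of header rows: the analogue of cumsum(grepl("3C", ...)).
is_header = [row.startswith("3C") for row in rows]
group_ids = list(accumulate(int(flag) for flag in is_header))

# Split the rows by group id; the first row of each group is the header,
# which is paired with every remaining row of that group.
pairs = []
for _, members in groupby(zip(group_ids, rows), key=lambda pair: pair[0]):
    header, *items = [row for _, row in members]
    pairs.extend((header, item) for item in items)

for group, item in pairs:
    print(group, item)
```

A header with no following items simply contributes no pairs, which matches what the R version does with an empty `x[-1, 1]`.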

1 Comment

Excellent. Thank you. Worked well.

In Python:

# Assuming the data is in a pandas DataFrame; I just created one here:

import pandas as pd
a=[
"3C-assembly|contig_93 ptg000037l",
"3C-assembly|contig_94 ptg000039l",
"3C-assembly|contig_95 ptg000043l",
"3C-assembly|contig_96 ptg000196l",
"3C-assembly|contig_96 ptg000060l",
"3C-assembly|contig_97 ptg000083l",
"3C-assembly|contig_97 ptg000083l",
"3C-assembly|contig_98 ptg000117l",
"3C-assembly|contig_98 ptg000005l",
"3C-assembly|contig_99 ptg000123l",
"3C-assembly|contig_99 ptg000123l",
"3C-assembly|contig_99 ptg0001232",
"3C-assembly|contig_99 ptg0001233"]

a=pd.DataFrame(a, columns=["data"])

# Define a function to split each row and extract the two identifiers
def ExtractContig(Name):
    # Split on the space between the two columns
    splitgroup = Name.strip().split(' ')
    contigselect = splitgroup[0]
    ptgselect = splitgroup[1]

    # Split on the underscore and keep the contig number
    contig = contigselect.strip().split('_')[-1]

    # Split on the "g" of ptgxxxxxx and keep what follows
    ptg = ptgselect.strip().split('g')[-1]
    return contig, ptg

# Apply the function to each row and collect the results
a['data'].apply(ExtractContig)

You can store the result and perform further analysis. The output in this case is:

0     (93, 000037l)
1     (94, 000039l)
2     (95, 000043l)
3     (96, 000196l)
4     (96, 000060l)
5     (97, 000083l)
6     (97, 000083l)
7     (98, 000117l)
8     (98, 000005l)
9     (99, 000123l)
10    (99, 000123l)
11    (99, 0001232)
12    (99, 0001233)
Name: data, dtype: object
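
Note that the list above already has both values on each row, whereas the question starts from a single column. One possible way to get from the raw column to that paired form is a single pass that remembers the most recent header row. This is a minimal sketch, assuming the raw lines are available as a list of strings; `pair_rows` and `header_prefix` are hypothetical names introduced for the example.

```python
def pair_rows(lines, header_prefix="3C-assembly"):
    """Pair each non-header row with the most recent header row."""
    current = None  # last header seen so far
    paired = []
    for line in lines:
        value = line.strip()  # the raw data has trailing spaces
        if not value:
            continue  # skip blank lines
        if value.startswith(header_prefix):
            current = value
        elif current is not None:
            paired.append(f"{current} {value}")
    return paired


# A small excerpt of the question's data, including trailing whitespace.
raw = [
    "3C-assembly|contig_98",
    "ptg000117l  ",
    "ptg000005l  ",
    "3C-assembly|contig_99",
    "ptg000123l  ",
    "",
]
paired = pair_rows(raw)
print(paired)
```

The resulting list has the same "contig ptg" shape as the list `a` in this answer, so it could be passed to `pd.DataFrame(paired, columns=["data"])` in the same way.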
