2

I am trying to create a matrix from a data frame and a list. The list and column 1 of the data frame contain the same strings, however, not all of the strings in the list are in the column 1 and are not in the same order (see example below). I would like to search through the data frame, and print the data in the second column if the string in column 1 matches a string in the list, else print the string in seqList and 0, NaN or missing etc. I thought that pandas would be good for this as I can compare columns in a data frame using df.equals, but it reports false even when the strings are present and should match.

I think this may be because I have more strings in the seqList than in the data frame and they're not in the same order. I therefore, tried to index the data frame, but my data in column 2 is lost/replaced with NaN.


List

seqList = ['Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1', 'Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1', 'Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1', 'Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1', 'Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1']

Infile for dataframe

#Infile2:

Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1   1
Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1    3
Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1    3

I want to create a new matrix which contains all of the sequences in the list (seqList) and the number of occurrences identified in infile2.

Desired output

#outfile:
sequence    hits
Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1    3
Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1    1
Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1    NaN
Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1    3
Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1    NaN

I have loaded infile2 as a dataframe and named columns:

#Create the dataframe from the sequnce hits in the genomes (identified in the occurances file).
Occurences=pd.read_csv(infile2, delimiter='\t', index_col=False)    #Read the input file as a tab separated dataframe.
pd.set_option("display.max_colwidth", None) #Ensure that the sequence names are not cut off.
Occurences.rename(columns = {list(Occurences)[0]: 'sequence'}, inplace = True) #Name the sequences column
Occurences.rename(columns = {list(Occurences)[1]: 'hits'}, inplace = True) #Name the occurences column

I have tried to convert seqList to a data frame and then use .equals (as shown here) but this still reports the match as false:

SeqDataFrame= pd.DataFrame (seqList, columns = ['sequence']) #Load seqList as df
result = SeqDataFrame['sequence'].equals(Occurences['sequence'])  #Use .equals to compare the sequence columns and report matching
print(result)
False

I think that the issue is that the order of strings in the sequence column in the occurrences df is not in the same order as seqList. I have therefore tried to index the occurrences data frame using seqList, but this seems to lose all of the data in the hits column.

Occurences.set_index('sequence', inplace=True)
Occurences = Occurences.reindex(seqList)
print(Occurences)
                                                                             
    hits
sequence                                                                                  
Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1                                                      NaN
Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1                                                     NaN
Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1                                                     NaN
Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1                                                  NaN
Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1                                                  NaN

I have looked for similar questions, but none seem to have an issue with the order of the columns not matching. And if it is a question specifically about columns not matching, they have reindexed as I have and haven't lost data. How do I create my desired matrix which contains all of the sequences in seqList and the number of hits identified in the Occurences data frame?

Many thanks in advance


n.b. I have also tried to use pd.merge to merge the list and data frame, but for some reason this creates an empty data frame:

MergedFrames = pd.merge(SeqDataFrame, Occurences, left_on=["sequence"], right_on=['sequence'])
print("MergedFrames")
print(MergedFrames)

MergedFrames
Empty DataFrame
Columns: [sequence, hits]
Index: []
2
  • Why is the number of hits 4 for sequence Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1 in the output? Commented May 24, 2022 at 13:38
  • 1
    @user2246849 - apologies it is a typo. I will amend it Commented May 24, 2022 at 13:54

2 Answers 2

1

You can use DataFrame.reindex:

Occurences.set_index('sequence').reindex(seqList).reset_index()
                                sequence  hits
0      Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1   3.0
1     Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1   1.0
2     Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1   NaN
3  Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1   3.0
4  Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1   NaN

If your list can have duplicates just use list(set(seqList)).

Sign up to request clarification or add additional context in comments.

3 Comments

Thank you for your suggestion, I have tried this, but I also get NaN for each row in 'hits'. Any ideas as to why? Occurances = Occurances.set_index('sequence').reindex(seqList).reset_index()
@Jpike are you sure the sequence values in Occurrences exactly match those in the list? Maybe you have trailing spaces? Try Occurrences['sequence'] = Occurrences['sequence'].str.strip() before this solution. Otherwise, check what is the output of Occurrences[Occurrences['sequence'].isin(seqList)]
Occurrences['sequence'] = Occurrences['sequence'].str.strip() fixed it, there must have been trailing spaces. I will go back and look at the script i used to create the occurrences infile, I must have added a space somewhere. Many thanks for your help : )
0

Supposing an element can appear several times in seqList:

seqDF = pd.DataFrame({'results': seqList})
df = pd.DataFrame({'diag': ['Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1', 'Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1','Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1'],
                 'occ': [1, 3, 3]})
mergeDF = seqDF.merge(df, how='left', left_on='results', right_on='diag')
mergeDF[['results', 'occ']].groupby('results')[['occ']].sum()

gives:

Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1   3.0
Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1  1.0
Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1  0.0
Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1   3.0
Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1   0.0

Since you want the number of occurrences, I have assumed that 0.0 is more coherent than NaN

3 Comments

Thank you for your solution, I have tried this, replacing the data in diag and occ seqDF = pd.DataFrame({'results': seqList}) df = pd.DataFrame({'diag': Occurences['sequence'], 'occ': Occurences['hits']}) mergeDF = seqDF.merge(df, how='left', left_on='results', right_on='diag') mergeDF[['results', 'occ']].groupby('results')[['occ']].sum() But I still get NaN for each value. Will referring to the Occurences df here not work?
interesting that it doesn't even output 0.0. It is NaN. Does this mean that the strings don't match at all? Perhaps there is a hidden character?
I guess there are hidden characters, because, on my side, it works. Are you aware that user2246849's solution won't work with repeated values in seqList ?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.