Pandas: Create table from data frame matching columns to a list

Question

I am trying to create a matrix from a data frame and a list. The list and column 1 of the data frame contain the same strings; however, not all of the strings in the list appear in column 1, and they are not in the same order (see the example below). I would like to search through the data frame and print the data in the second column if the string in column 1 matches a string in the list; otherwise, print the string from seqList with 0, NaN or a missing value. I thought pandas would be good for this, as I can compare columns in a data frame using df.equals, but it reports False even when the strings are present and should match.

I think this may be because I have more strings in seqList than in the data frame and they are not in the same order. I therefore tried to index the data frame, but my data in column 2 is lost and replaced with NaN.

List

seqList = ['Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1', 'Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1', 'Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1', 'Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1', 'Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1']

Infile for dataframe

#Infile2:

Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1   1
Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1    3
Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1    3

I want to create a new matrix which contains all of the sequences in the list (seqList) and the number of occurrences identified in infile2.

Desired output

#outfile:
sequence    hits
Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1    3
Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1    1
Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1    NaN
Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1    3
Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1    NaN

I have loaded infile2 as a dataframe and named columns:

#Create the dataframe from the sequence hits in the genomes (identified in the occurrences file).
Occurences=pd.read_csv(infile2, delimiter='\t', index_col=False)    #Read the input file as a tab-separated dataframe.
pd.set_option("display.max_colwidth", None) #Ensure that the sequence names are not cut off.
Occurences.rename(columns = {list(Occurences)[0]: 'sequence'}, inplace = True) #Name the sequences column
Occurences.rename(columns = {list(Occurences)[1]: 'hits'}, inplace = True) #Name the occurrences column

I have tried to convert seqList to a data frame and then use .equals, but this still reports the match as False:

SeqDataFrame= pd.DataFrame (seqList, columns = ['sequence']) #Load seqList as df
result = SeqDataFrame['sequence'].equals(Occurences['sequence'])  #Use .equals to compare the sequence columns and report matching
print(result)
False

I think that the issue is that the order of strings in the sequence column in the occurrences df is not in the same order as seqList. I have therefore tried to index the occurrences data frame using seqList, but this seems to lose all of the data in the hits column.

Occurences.set_index('sequence', inplace=True)
Occurences = Occurences.reindex(seqList)
print(Occurences)
                                                                             
                                       hits
sequence
Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1       NaN
Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1      NaN
Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1      NaN
Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1   NaN
Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1   NaN

I have looked for similar questions, but none seem to have an issue with the order of the columns not matching. And if it is a question specifically about columns not matching, they have reindexed as I have and haven't lost data. How do I create my desired matrix which contains all of the sequences in seqList and the number of hits identified in the Occurences data frame?

Many thanks in advance

n.b. I have also tried to use pd.merge to merge the list and data frame, but for some reason this creates an empty data frame:

MergedFrames = pd.merge(SeqDataFrame, Occurences, left_on=["sequence"], right_on=['sequence'])
print("MergedFrames")
print(MergedFrames)

MergedFrames
Empty DataFrame
Columns: [sequence, hits]
Index: []

Why is the number of hits 4 for sequence Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1 in the output?
– user2246849, Commented May 24, 2022 at 13:38

user2246849 · Accepted Answer · 2022-05-24 13:53:05Z

1

You can use DataFrame.reindex:

Occurences.set_index('sequence').reindex(seqList).reset_index()

                                sequence  hits
0      Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1   3.0
1     Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1   1.0
2     Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1   NaN
3  Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1   3.0
4  Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1   NaN

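If your list can contain duplicates, deduplicate it first, for example with list(set(seqList)). Note that a set does not preserve order; list(dict.fromkeys(seqList)) removes duplicates while keeping the original order.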
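If you would rather have 0 than NaN for sequences with no hits, reindex also accepts a fill_value argument. The following is only a minimal sketch: the shortened sequence names and the small `occurrences` frame are made-up stand-ins for the data in the question.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the names in the question.
seqList = ["seq_1", "seq_2", "seq_3"]
occurrences = pd.DataFrame({"sequence": ["seq_2", "seq_1"], "hits": [1, 3]})

table = (
    occurrences.set_index("sequence")
    .reindex(seqList, fill_value=0)   # sequences missing from the frame get 0
    .rename_axis("sequence")          # keep the column name explicit
    .reset_index()
)
print(table)
```

Because no NaN is introduced, the hits column can stay an integer column instead of being converted to float.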

edited May 24, 2022 at 13:53

answered May 24, 2022 at 13:39

user2246849


3 Comments

Jpike Over a year ago

Thank you for your suggestion. I have tried this, but I also get NaN for each row in 'hits'. Any ideas as to why? Occurences = Occurences.set_index('sequence').reindex(seqList).reset_index()

user2246849 Over a year ago

@Jpike are you sure the sequence values in Occurences exactly match those in the list? Maybe you have trailing spaces? Try Occurences['sequence'] = Occurences['sequence'].str.strip() before this solution. Otherwise, check the output of Occurences[Occurences['sequence'].isin(seqList)]

Jpike Over a year ago

Occurences['sequence'] = Occurences['sequence'].str.strip() fixed it; there must have been trailing spaces. I will go back and look at the script I used to create the occurrences infile, as I must have added a space somewhere. Many thanks for your help : )
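For later readers, the behaviour discussed in these comments can be reproduced with a small sketch. The shortened names and the trailing spaces are invented for illustration; the real input file may be padded differently.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the names in the question.
seqList = ["seq_1", "seq_2", "seq_3"]

# Assumed problem: each name in the frame carries a trailing space.
occurrences = pd.DataFrame({"sequence": ["seq_2 ", "seq_1 "], "hits": [1, 3]})

# No label matches the list exactly, so every row of 'hits' is NaN.
before = occurrences.set_index("sequence").reindex(seqList)
print(before)

# Strip leading and trailing whitespace, then reindex again.
occurrences["sequence"] = occurrences["sequence"].str.strip()
after = occurrences.set_index("sequence").reindex(seqList).reset_index()
print(after)
```

After stripping, only the sequence that is genuinely absent from the frame is left as NaN.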

M. Page · Accepted Answer · 2022-05-24 13:47:59Z

0

Supposing an element can appear several times in seqList:

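import pandas as pd

# seqList is the list defined in the question.
seqDF = pd.DataFrame({'results': seqList})
df = pd.DataFrame({'diag': ['Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1', 'Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1','Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1'],
                   'occ': [1, 3, 3]})
# A left merge keeps every entry of seqList; unmatched rows get NaN in 'occ'.
mergeDF = seqDF.merge(df, how='left', left_on='results', right_on='diag')
# Summing per sequence turns the all-NaN groups into 0.0.
mergeDF[['results', 'occ']].groupby('results')[['occ']].sum()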

gives:

Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1   3.0
Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1  1.0
Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1  0.0
Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1   3.0
Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1   0.0

Since you want the number of occurrences, I have assumed that 0.0 is more coherent than NaN.
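A related sketch, under the same assumptions: pd.merge performs an inner join by default, so if no keys match exactly it returns an empty frame, as in the question. With how='left' and indicator=True, every entry of the list is kept and the extra _merge column shows which ones found no match. The shortened names and the `occurrences` frame are hypothetical.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the names in the question.
seqList = ["seq_1", "seq_2", "seq_3"]
seqDF = pd.DataFrame({"sequence": seqList})
occurrences = pd.DataFrame({"sequence": ["seq_2", "seq_1"], "hits": [1, 3]})

# Left join keeps all rows of seqDF; '_merge' records where each row came from.
merged = seqDF.merge(occurrences, on="sequence", how="left", indicator=True)

# Sequences from the list with no matching row in the frame.
unmatched = merged.loc[merged["_merge"] == "left_only", "sequence"].tolist()
print(unmatched)

# Final table with 0 instead of NaN for unmatched sequences.
result = merged.drop(columns="_merge")
result["hits"] = result["hits"].fillna(0).astype(int)
print(result)
```

If every sequence shows up as unmatched, the keys themselves probably differ (whitespace or hidden characters) rather than the join logic being wrong.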

answered May 24, 2022 at 13:47

M. Page


3 Comments

Jpike Over a year ago

Thank you for your solution. I have tried this, replacing the data in diag and occ:

seqDF = pd.DataFrame({'results': seqList})
df = pd.DataFrame({'diag': Occurences['sequence'],
                   'occ': Occurences['hits']})
mergeDF = seqDF.merge(df, how='left', left_on='results', right_on='diag')
mergeDF[['results', 'occ']].groupby('results')[['occ']].sum()

But I still get NaN for each value. Will referring to the Occurences df here not work?

Jpike Over a year ago

Interesting that it doesn't even output 0.0; it is NaN. Does this mean that the strings don't match at all? Perhaps there is a hidden character?

M. Page Over a year ago

I guess there are hidden characters because, on my side, it works. Are you aware that user2246849's solution won't work with repeated values in seqList?
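One possible way to check for hidden characters is sketched below. The example values, including the "\r" and the trailing space, are invented; the actual file may contain something different.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the names in the question.
seqList = ["seq_1", "seq_2"]
occurrences = pd.DataFrame({"sequence": ["seq_1\r", "seq_2 "], "hits": [3, 1]})

# repr() makes invisible characters such as '\r' or trailing spaces visible.
for name in occurrences["sequence"]:
    print(repr(name))

# Rows whose name is not found verbatim in the list.
suspect = occurrences[~occurrences["sequence"].isin(seqList)]
print(suspect)

# A length that changes after stripping also flags padded values.
lengths = occurrences["sequence"].str.len()
padded = lengths != occurrences["sequence"].str.strip().str.len()
print(padded)
```

If padded is True for a row, str.strip() on that column should be enough; characters in the middle of a name would need a different clean-up.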
