Creating a new dataframe in pandas based on matching row data

Question

So I have two rather large Excel files that I have converted into two dataframes (df for the current week and df2 for the previous week). There are a total of 128 rows that are identical in both of the dataframes, so I've created a new variable:

onlyWon = df.loc[df['Sales stage'] == "Won"]

Thereafter, I am trying to create a new dataframe that only contains the values in df2 that match the Sales number in the onlyWon dataframe. For example, if I were to do this with only one item the code would be:

df2.loc[df2['Sales No'] == "B3M-RB-03"]

This works for one item, but when I try, for example, to iterate over the onlyWon dataframe and append the data to a new dataframe, I run into errors.

Examples on how I want it to work:

DF2:

+------------------+----------+-------------+-----------+
|     Customer     | Sales No | Sales Stage | Deal Size |
+------------------+----------+-------------+-----------+
| Stackoverflow    | A1       | Identified  |       100 |
| Guido van Rossum | B2       | Lost        |      1000 |
+------------------+----------+-------------+-----------+

OnlyWon:

+---------------+----------+-------------+-----------+
|   Customer    | Sales No | Sales Stage | Deal Size |
+---------------+----------+-------------+-----------+
| Stackoverflow | A1       | WON         |       100 |
+---------------+----------+-------------+-----------+

New dataframe:

+---------------+----------+-------------+-----------+
|   Customer    | Sales No | Sales Stage | Deal Size |
+---------------+----------+-------------+-----------+
| Stackoverflow | A1       | Identified  |       100 |
+---------------+----------+-------------+-----------+

What I tried to do

Declaring a new empty dataframe (df3) that has all the same headers.

Creating a list out of all the 'Sales No':

onlyWonSales = []
for salesNo in onlyWon['Sales No']:
    onlyWonSales.append(salesNo)

Then looping over the list and appending to the new dataframe:

for item in onlyWonSales:
    df3 = df3.append(df2.loc[df2['Sales No'] == item])

This adds a lot of duplicates and doesn't work, even though it doesn't raise any errors (the onlyWonSales list has around 1000 entries, while df3 ends up with around 4000 rows).

@komatiraju032, what I tried to do was to create a list of all the sales numbers in the onlyWon dataframe by doing: `onlyWonSales = []`, then `for SalesNo in onlyWon['Sales No']: onlyWonSales.append(SalesNo)`. This works and adds all the sales numbers to a list (I get 1000 when doing len(onlyWonSales)). Then I try to do: `for item in onlyWonSales: df3 = df3.append(df2.loc[df2['Sales No'] == item])`, which causes a lot of duplicates to be added (around 4000).
– vetle101, Commented Apr 23, 2020 at 19:54
@komatiraju032 I've updated my post to include what I did with better formatting.
– vetle101, Commented Apr 23, 2020 at 20:07

Mayank Porwal · Accepted Answer · 2020-04-23 20:55:45Z

1

Like this:

In [150]: new = pd.merge(df2, onlyWon, on=['Sales No'], suffixes=('', '_y'))

In [153]: new.drop(list(new.filter(regex='_y$')), axis=1, inplace=True)

In [154]: new
Out[154]:
        Customer Sales No Sales Stage  Deal Size
0  Stackoverflow       A1  Identified        100
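A variation on the merge above, as a minimal sketch: if only the key column of onlyWon is merged in, no `_y` columns are created, so nothing has to be dropped afterwards. The sample frames below are rebuilt from the tables in the question and are illustrative only; the name `keys` is hypothetical.

```python
import pandas as pd

# Sample data modelled on the tables in the question (illustrative values).
df2 = pd.DataFrame({
    "Customer": ["Stackoverflow", "Guido van Rossum"],
    "Sales No": ["A1", "B2"],
    "Sales Stage": ["Identified", "Lost"],
    "Deal Size": [100, 1000],
})
onlyWon = pd.DataFrame({
    "Customer": ["Stackoverflow"],
    "Sales No": ["A1"],
    "Sales Stage": ["WON"],
    "Deal Size": [100],
})

# Keep only the key column; drop_duplicates() stops a repeated key in
# onlyWon from multiplying the matching rows of df2.
keys = onlyWon[["Sales No"]].drop_duplicates()

# Inner merge: rows of df2 whose 'Sales No' also appears in onlyWon.
new = df2.merge(keys, on="Sales No", how="inner")
print(new)
```

Note that an inner merge still returns every matching row of df2, so repeated 'Sales No' values in df2 itself would still show up in the result.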

edited Apr 23, 2020 at 20:55

answered Apr 23, 2020 at 20:19

Mayank Porwal


1 Comment

Mayank Porwal Over a year ago

Just run new = new.drop_duplicates() and check the shape.
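As a rough illustration of that check: the sketch below assumes df2 contains repeated 'Sales No' values, which the question does not confirm but which would explain about 1000 keys producing about 4000 rows. The data and the names `exact` and `per_sale` are hypothetical.

```python
import pandas as pd

# Hypothetical df2 in which the same sale appears more than once.
df2 = pd.DataFrame({
    "Customer": ["Stackoverflow", "Stackoverflow", "Stackoverflow", "Guido van Rossum"],
    "Sales No": ["A1", "A1", "A1", "B2"],
    "Sales Stage": ["Identified", "Identified", "Qualified", "Lost"],
    "Deal Size": [100, 100, 100, 1000],
})

# Count how many rows repeat an earlier 'Sales No'.
print(df2["Sales No"].duplicated().sum())

# Drop rows that are identical in every column.
exact = df2.drop_duplicates()

# Keep only the first row per 'Sales No', even if other columns differ.
per_sale = df2.drop_duplicates(subset="Sales No", keep="first")

print(exact.shape, per_sale.shape)
```

Plain `drop_duplicates()` only removes rows that match in every column; if one row per sale is wanted, the `subset` form is the one to compare against.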

jcaliz · Accepted Answer · 2020-04-23 21:18:17Z

0

Keep onlyWon as it is, then filter df2 with a query:

onlyWon = df.loc[df['Sales stage'] == "Won"]

sales_no_won = onlyWon['Sales No']
results = df2.query('`Sales No` in @sales_no_won').copy()
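A self-contained sketch of this approach, with the equivalent `isin` filter alongside it. The sample data is illustrative, and the names `via_query` and `via_isin` are hypothetical. It assumes a pandas version that supports backtick-quoted column names in `query` (0.25 or later).

```python
import pandas as pd

# Current week (df) and previous week (df2), modelled on the question.
df = pd.DataFrame({
    "Customer": ["Stackoverflow", "Guido van Rossum"],
    "Sales No": ["A1", "B2"],
    "Sales stage": ["Won", "Lost"],
    "Deal Size": [100, 1000],
})
df2 = pd.DataFrame({
    "Customer": ["Stackoverflow", "Guido van Rossum"],
    "Sales No": ["A1", "B2"],
    "Sales stage": ["Identified", "Lost"],
    "Deal Size": [100, 1000],
})

onlyWon = df.loc[df["Sales stage"] == "Won"]
sales_no_won = onlyWon["Sales No"].tolist()  # plain list of won sales numbers

# Backticks quote a column name that contains a space;
# '@' refers to a Python variable from the surrounding scope.
via_query = df2.query("`Sales No` in @sales_no_won").copy()

# The same filter written with a boolean mask.
via_isin = df2.loc[df2["Sales No"].isin(sales_no_won)].copy()

print(via_query)
```

Both forms only filter df2 and never add rows, so the result can have at most as many rows as df2. Any duplicates left in it were already present in df2.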

edited Apr 23, 2020 at 21:18

answered Apr 23, 2020 at 20:16

jcaliz

4 Comments

vetle101 Over a year ago

That produces a key error: raise KeyError(f"{not_found} not in index")

jcaliz Over a year ago

Right, it was missing the suffixes, sorry

vetle101 Over a year ago

Unfortunately that didn't work... There are still a lot of duplicated 'Sales No' values in the Results dataframe. The OnlyWon dataframe has only unique values in the Sales No column, so it should only capture the rows in DF2 that have a matching Sales No.

jcaliz Over a year ago

So sorry to read that. Try the edit that uses the query method; I think it is cleaner for this case.
