
I have two dataframes:

import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})
>>> data.head(15)
      A    B   C
0   1.0  0.0   0
1   1.0  0.0   1
2   1.0  0.0   2
3   1.0  0.0   3
4   1.0  0.0   4
5   1.0  0.0   5
6   1.0  0.0   6
7   1.0  0.0   7
8   1.0  0.0   8
9   1.0  0.0   9
10  1.0  1.0  10
11  1.0  1.0  11
12  1.0  1.0  12
13  1.0  1.0  13
14  1.0  1.0  14
>>> bad_data
     A    B  points
0  1.0  0.0  [0, 1]
1  2.0  3.0     [0]
2  7.0  0.0     [1]
3  9.0  2.0  [0, 1]

For each row of bad_data, I want to drop the rows of data that have the same A and B and whose position within that (A, B) group is listed in points. For example, the first row of bad_data tells me that I need to drop the first two rows of data:

      A    B   C
0   1.0  0.0   0
1   1.0  0.0   1

How can I do that? I was able to cook up this horror, but it's quite ugly to read. Can you help me write a more Pythonic/readable solution?

rows_to_remove = []
for A, B in zip(bad_data['A'], bad_data['B']):
    rows_in_data = (data['A'] == A) & (data['B'] == B)
    rows_in_bad_data = (bad_data['A'] == A) & (bad_data['B'] == B)
    bad_points = bad_data.loc[rows_in_bad_data, 'points'].values[0]
    indices = data[rows_in_data].index.values[bad_points]
    rows_to_remove.extend(indices)
    print(rows_to_remove)
data.drop(data.index[rows_to_remove], inplace=True)
Comments:

  • Do the values in "points" refer to the absolute index? Relative to the group? The value in "C"? Commented Mar 18, 2022 at 15:33
  • @mozway They refer to the index relative to the group defined by the current values of A and B. Maybe, to avoid confusion with the values of C, I could define C as "C": np.random.random(500); that way it's obvious that [0, 1] cannot be a subset of the values of C. What do you think? Would that make the question more readable? Commented Mar 18, 2022 at 15:39
  • @DeltaV thanks, this is clear, I provided an answer ;) Commented Mar 18, 2022 at 15:53

2 Answers


IIUC, you could perform a reverse merge on the exploded bad_data:

data2 = (data
    .assign(points=data.groupby(['A', 'B']).cumcount())  # per-group counter (= points)
    .merge(bad_data.explode('points'), on=['A', 'B', 'points'],
           indicator=True, how='outer')                  # outer merge with indicator
    .loc[lambda d: d['_merge'].eq('left_only')]          # keep the rows unique to the left
    .drop(columns=['points', '_merge'])                  # remove helper columns
)
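To see what the merge operates on, here is a minimal sketch of the two helper pieces, using the question's data: the exploded bad_data (one row per point to remove) and the per-group counter that plays the role of points on the data side.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

# explode turns each list element into its own row, so every
# (A, B, point) triple to remove becomes one merge key
exploded = bad_data.explode('points')
print(exploded)
#      A    B points
# 0  1.0  0.0      0
# 0  1.0  0.0      1
# 1  2.0  3.0      0
# 2  7.0  0.0      1
# 3  9.0  2.0      0
# 3  9.0  2.0      1

# cumcount numbers the rows inside each (A, B) group, matching
# the relative "points" indices
counter = data.groupby(['A', 'B']).cumcount()
print(counter.head(3).tolist())  # [0, 1, 2]
```

With these two in hand, the outer merge with `indicator=True` marks every (A, B, point) that occurs in both frames as `'both'`, and everything else from data as `'left_only'`.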

Another option is to use GroupBy.apply:

# craft a Series of lists of points indexed by A/B
s = bad_data.set_index(['A', 'B'])['points']

# group by A/B
data2 = (data
     .groupby(['A', 'B'], as_index=False, group_keys=False)
     # if the group key is present in s, drop the rows at the listed
     # relative positions; otherwise leave the group unchanged
     .apply(lambda g: g.drop(g.index[s.loc[g.name]]) if g.name in s else g)
)

Both approaches yield the same dataframe as your custom code.

output shape:

data2.shape
# (494, 3)

Details on second solution:

  • craft a Series s to be like:
A    B  
1.0  0.0    [0, 1]
2.0  3.0       [0]
7.0  0.0       [1]
9.0  2.0    [0, 1]
Name: points, dtype: object
  • make groups by A/B
  • for each group, if it is present in the index of s (the key is g.name), fetch the values s.loc[g.name], get the matching indices from the relative position in the group: g.index[s.loc[g.name]], feed this to drop. If the A/B index is absent, return the group unchanged.
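The key step is translating the relative positions in `s` into absolute index labels via `g.index[...]`. The sketch below (using the question's data) does this by hand for one group, A=7.0 / B=0.0, which spans labels 300 to 309, so relative position 1 maps to label 301:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

# the lookup Series, as in the answer
s = bad_data.set_index(['A', 'B'])['points']

# select the (7.0, 0.0) group by hand; its rows sit at labels 300..309
g = data[(data['A'] == 7.0) & (data['B'] == 0.0)]

# relative position [1] becomes absolute label 301
print(g.index[s.loc[(7.0, 0.0)]].tolist())  # [301]
```

For the first group (1.0, 0.0) the relative and absolute indices happen to coincide ([0, 1]), which is why a solution that treats points as absolute labels can look correct on the first group and still fail on the others.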

Comments

  • @DeltaIV yes, it should be data, my bad, the habit of using df; let me have a look at your update
  • @DeltaIV I see the "issue": my solutions are not in place, you need to assign the output. I made this explicit ;)
  • This is useful. Would it be possible for you to also show, optionally, how to make the modifications in place? Or would "inplacing" require too many modifications to your code? Is "inplacing" frowned upon as bad practice? It would seem very convenient to me.
  • Most "inplace" functions are actually not in place and use copies. There is an active discussion about whether or not to deprecate the inplace parameters in pandas. If you really need to modify "in place", you can modify my code to yield the indices to drop, using .loc[lambda d: d['_merge'].ne('left_only')].index in place of the last two lines and assigning it to an idx variable, then data.drop(idx, inplace=True).
  • @DeltaIV I added details, hope it helps ;)
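Spelled out, the in-place variant suggested in the comment above might look like the sketch below. Note the `astype` call is my addition: `explode` leaves the points column with object dtype, and the merge key dtypes need to line up with the int64 counter from `cumcount`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

# same merge as before, but keep the rows that *should* be dropped
# and collect their index labels instead of the surviving rows
idx = (data
       .assign(points=data.groupby(['A', 'B']).cumcount())
       .merge(bad_data.explode('points').astype({'points': 'int64'}),
              on=['A', 'B', 'points'], indicator=True, how='outer')
       .loc[lambda d: d['_merge'].ne('left_only')]
       .index)

data.drop(idx, inplace=True)
print(data.shape)  # (494, 3)
```

This relies on data having its default RangeIndex, so that row positions in the merged frame line up with the original labels.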

I'm not 100% sure whether I understood this correctly or whether my attempt is the most elegant way, so let me know if this works for you:

bad_indexes = []
labels = ['A', 'B']

for _, s in bad_data.iterrows():
    # select the rows of the matching (A, B) group first...
    group = data[data[labels].eq(s[labels]).all(1)]
    # ...then translate the relative positions in "points"
    # into that group's absolute index labels
    bad_indexes.extend(group.index[s['points']])

result = data.loc[data.index.difference(bad_indexes)]

I assumed that the index of data has unique values.

