
I have two dataframes:

import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})
>>> data.head(15)
      A    B   C
0   1.0  0.0   0
1   1.0  0.0   1
2   1.0  0.0   2
3   1.0  0.0   3
4   1.0  0.0   4
5   1.0  0.0   5
6   1.0  0.0   6
7   1.0  0.0   7
8   1.0  0.0   8
9   1.0  0.0   9
10  1.0  1.0  10
11  1.0  1.0  11
12  1.0  1.0  12
13  1.0  1.0  13
14  1.0  1.0  14
>>> bad_data
     A    B  points
0  1.0  0.0  [0, 1]
1  2.0  3.0     [0]
2  7.0  0.0     [1]
3  9.0  2.0  [0, 1]

For each row of bad_data, I want to drop the rows of data that have the same A and B and whose position within that (A, B) group is listed in points. For example, the first row of bad_data tells me that I need to drop the first two rows of data:

      A    B   C
0   1.0  0.0   0
1   1.0  0.0   1

How can I do that? I was able to cook up this horror, but it's quite ugly to read. Can you help me write a more Pythonic/readable solution?

rows_to_remove = []
for A, B in zip(bad_data['A'], bad_data['B']):
    rows_in_data = (data['A'] == A) & (data['B'] == B)
    rows_in_bad_data = (bad_data['A'] == A) & (bad_data['B'] == B)
    bad_points = bad_data.loc[rows_in_bad_data, 'points'].values[0]
    indices = data[rows_in_data].index.values[bad_points]
    rows_to_remove.extend(indices)
    print(rows_to_remove)
data.drop(data.index[rows_to_remove], inplace=True)
Comments:

  • Do the values in "points" refer to the absolute index? Relative to the group? The value in "C"? Commented Mar 18, 2022 at 15:33
  • @mozway They refer to the index relative to the group defined by the current values of A and B. Maybe, to avoid confusion with the values of C, I could define C as "C": np.random.random(500); that way it's obvious that [0, 1] cannot be a subset of the values of C. What do you think? Would that make the question more readable? Commented Mar 18, 2022 at 15:39
  • @DeltaV thanks, this is clear, I provided an answer ;) Commented Mar 18, 2022 at 15:53

2 Answers


IIUC, you could perform a reverse merge on the exploded bad_data:

data2 = (data
    .assign(points=data.groupby(['A', 'B']).cumcount())  # per-group counter (= points)
    .merge(bad_data.explode('points'), on=['A', 'B', 'points'],
           indicator=True, how='outer')                  # outer merge with indicator
    .loc[lambda d: d['_merge'].eq('left_only')]          # keep the rows unique to the left
    .drop(columns=['points', '_merge'])                  # remove helper columns
)
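To see what the merge operates on, here is a minimal sketch of the two helper pieces, using the question's data: the exploded bad_data (one row per point to remove) and the per-group counter that plays the role of points on the data side.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

# explode turns each list element into its own row, so every
# (A, B, point) triple to remove becomes one merge key
exploded = bad_data.explode('points')
print(exploded)
#      A    B points
# 0  1.0  0.0      0
# 0  1.0  0.0      1
# 1  2.0  3.0      0
# 2  7.0  0.0      1
# 3  9.0  2.0      0
# 3  9.0  2.0      1

# cumcount numbers the rows inside each (A, B) group, matching
# the relative "points" indices
counter = data.groupby(['A', 'B']).cumcount()
print(counter.head(3).tolist())  # [0, 1, 2]
```

With these two in hand, the outer merge with `indicator=True` marks every (A, B, point) that occurs in both frames as `'both'`, and everything else from data as `'left_only'`.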

Another option is to use GroupBy.apply:

# craft a Series of lists of points indexed by A/B
s = bad_data.set_index(['A', 'B'])['points']

# group by A/B
data2 = (data
     .groupby(['A', 'B'], as_index=False, group_keys=False)
     # if the group key is present in s, drop the rows at the listed
     # relative positions; otherwise leave the group unchanged
     .apply(lambda g: g.drop(g.index[s.loc[g.name]]) if g.name in s else g)
)

Both approaches yield the same dataframe as your custom code.

output shape:

data2.shape
# (494, 3)

Details on second solution:

  • craft a Series s to be like:
A    B  
1.0  0.0    [0, 1]
2.0  3.0       [0]
7.0  0.0       [1]
9.0  2.0    [0, 1]
Name: points, dtype: object
  • make groups by A/B
  • for each group, if it is present in the index of s (the key is g.name), fetch the values s.loc[g.name], get the matching indices from the relative position in the group: g.index[s.loc[g.name]], feed this to drop. If the A/B index is absent, return the group unchanged.
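The key step is translating the relative positions in `s` into absolute index labels via `g.index[...]`. The sketch below (using the question's data) does this by hand for one group, A=7.0 / B=0.0, which spans labels 300 to 309, so relative position 1 maps to label 301:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

# the lookup Series, as in the answer
s = bad_data.set_index(['A', 'B'])['points']

# select the (7.0, 0.0) group by hand; its rows sit at labels 300..309
g = data[(data['A'] == 7.0) & (data['B'] == 0.0)]

# relative position [1] becomes absolute label 301
print(g.index[s.loc[(7.0, 0.0)]].tolist())  # [301]
```

For the first group (1.0, 0.0) the relative and absolute indices happen to coincide ([0, 1]), which is why a solution that treats points as absolute labels can look correct on the first group and still fail on the others.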

Comments

  • @DeltaIV yes, it should be data, my bad, the habit of using df; let me have a look at your update
  • @DeltaIV I see the "issue": my solutions are not in place, you need to assign the output. I made this explicit ;)
  • This is useful. Would it be possible for you to also show, optionally, how to make the modifications in place? Or would "inplacing" require too many modifications to your code? Is "inplacing" frowned upon as bad practice? It would seem very convenient to me.
  • Most "inplace" functions are actually not in place and use copies. There is an active discussion about whether or not to deprecate the inplace parameters in pandas. If you really need to modify "in place", you can modify my code to yield the indices to drop, using .loc[lambda d: d['_merge'].ne('left_only')].index in place of the last two lines and assigning it to an idx variable, then data.drop(idx, inplace=True).
  • @DeltaIV I added details, hope it helps ;)
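Spelled out, the in-place variant suggested in the comment above might look like the sketch below. Note the `astype` call is my addition: `explode` leaves the points column with object dtype, and the merge key dtypes need to line up with the int64 counter from `cumcount`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

# same merge as before, but keep the rows that *should* be dropped
# and collect their index labels instead of the surviving rows
idx = (data
       .assign(points=data.groupby(['A', 'B']).cumcount())
       .merge(bad_data.explode('points').astype({'points': 'int64'}),
              on=['A', 'B', 'points'], indicator=True, how='outer')
       .loc[lambda d: d['_merge'].ne('left_only')]
       .index)

data.drop(idx, inplace=True)
print(data.shape)  # (494, 3)
```

This relies on data having its default RangeIndex, so that row positions in the merged frame line up with the original labels.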

I'm not 100% sure whether I understood this correctly or whether my attempt is the most elegant way, so let me know if this works for you:

bad_indexes = []
labels = ['A', 'B']

for _, s in bad_data.iterrows():
    # select the rows of the matching (A, B) group first...
    group = data[data[labels].eq(s[labels]).all(1)]
    # ...then translate the relative positions in "points"
    # into that group's absolute index labels
    bad_indexes.extend(group.index[s['points']])

result = data.loc[data.index.difference(bad_indexes)]

I assumed that the index of data has unique values.

