I have two dataframes:
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})
data.head(15)
bad_data
>>> data.head(15)
      A    B   C
0   1.0  0.0   0
1   1.0  0.0   1
2   1.0  0.0   2
3   1.0  0.0   3
4   1.0  0.0   4
5   1.0  0.0   5
6   1.0  0.0   6
7   1.0  0.0   7
8   1.0  0.0   8
9   1.0  0.0   9
10  1.0  1.0  10
11  1.0  1.0  11
12  1.0  1.0  12
13  1.0  1.0  13
14  1.0  1.0  14
>>> bad_data
     A    B  points
0  1.0  0.0  [0, 1]
1  2.0  3.0     [0]
2  7.0  0.0     [1]
3  9.0  2.0  [0, 1]
For each row of bad_data, I want to drop the rows of data that have the same A and B, selected by position within that (A, B) group using the values of points. For example, the first row of bad_data tells me that I need to drop the first two rows of data:
     A    B  C
0  1.0  0.0  0
1  1.0  0.0  1
How can I do that? I was able to cook up this horror, but it's quite ugly to read. Can you help me write a more Pythonic/readable solution?
rows_to_remove = []
for A, B in zip(bad_data['A'], bad_data['B']):
    # Boolean masks for the rows of data and bad_data sharing this (A, B) pair
    rows_in_data = (data['A'] == A) & (data['B'] == B)
    rows_in_bad_data = (bad_data['A'] == A) & (bad_data['B'] == B)
    # Positions to drop within the matching group of data
    bad_points = bad_data.loc[rows_in_bad_data, 'points'].values[0]
    indices = data[rows_in_data].index.values[bad_points]
    rows_to_remove.extend(indices)
print(rows_to_remove)
# rows_to_remove already holds index labels, so they can be dropped directly
data.drop(rows_to_remove, inplace=True)
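For comparison, here is a minimal sketch of one possible loop-free formulation. It assumes the values in points are positions within each (A, B) group of data, and that A and B can be compared exactly. The names pos, bad, keyed, matched, to_drop and cleaned are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

# Position of every row inside its (A, B) group: 0, 1, 2, ...
pos = data.groupby(["A", "B"]).cumcount()

# One row per (A, B, point) to remove; explode leaves an object column,
# so cast it back to integers before merging.
bad = (bad_data.explode("points")
               .astype({"points": "int64"})
               .drop_duplicates())

# Left merge on (A, B, position); the indicator column marks the matches.
keyed = data[["A", "B"]].assign(points=pos)
merged = keyed.merge(bad, on=["A", "B", "points"], how="left", indicator=True)
matched = (merged["_merge"] == "both").to_numpy()

to_drop = data.index[matched]
cleaned = data.drop(to_drop)
print(list(to_drop))
```

The left merge keeps the rows of data in their original order, which is what allows the boolean mask to be applied back to data.index. Unlike the loop, this returns a new dataframe instead of modifying data in place.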
A and B. Maybe, to avoid confusion with the values of C, I can define C as "C": np.random.random(500). This way, it's obvious that [0, 1] cannot be a subset of values of C. What do you think? Would that make the question more readable?