Dataframe failed to mask rows due to string values

Question

I wanted to use column values in one csv file to mask rows in another csv, as in:

import pandas as pd

df6 = pd.read_csv('py_all1a.csv')     # file with multiple columns
df7 = pd.read_csv('artexclude1.csv')  # file with multiple columns

# Column 1 of df6 has the same header and data type as column 8 of df7.
# I want to mask rows in df6 whose column value matches any value
# in df7. The data in each column is a text value (single word).
mask = df6.iloc[:, 1].isin(df7.iloc[:, 8])

df6[~mask].to_csv('py_all1b.csv', index=False)

On that last line, I tried the mask both with the tilde (df6[~mask]), which wrote df6 to py_all1b.csv unchanged, and without the tilde (df6[mask]), which produced a file containing only the column headers.

The answer below, which uses a specific data set, did not work on my files at first because the text values were inconsistent: one entry had a space while its counterpart did not.

The answer below is correct, and I have added a paragraph to it showing how the whitespace issue can also be resolved.

Can you reduce this to a minimal reproducible example, please?
– cs95, Commented Feb 19, 2018 at 3:04
I would like to, but I cannot. As I stated in the answer comments below, the posted answer is correct (I checked it manually and it worked when entered that way), so I will mark it as correct. However, for some reason the solution did not actually fix the problem for the files I am using. I have no idea why.
– user3447273, Commented Feb 19, 2018 at 3:54
I found the solution to my problem and posted an edit to the answer below.
– user3447273, Commented Feb 19, 2018 at 4:33

user3447273 · Accepted Answer · 2018-02-19 09:08:14Z

1

Try converting to a set first:

mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))

This ensures the comparison is made against the unique values in the df7 column.

Example

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
#     0   1   2
# 0   1   2   3
# 1   4   5   6
# 2   7   8   9
# 3  10  11  12

df2 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
#    0  1  2
# 0  1  2  3
# 1  1  2  3
# 2  1  2  3
# 3  1  2  3

mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))

df1[mask]
#    0  1  2
# 0  1  2  3

With strings

It still works:

df1 = pd.DataFrame([['a', 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df2 = pd.DataFrame([['a', 2, 3], ['a', 2, 3], ['a', 2, 3], ['a', 2, 3]])

mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))

df1[mask]

#    0  1  2
# 0  a  2  3

When you are dealing with string data, whitespace differences can cause matches to be missed. As described in this answer, you may need to pass skipinitialspace=True when reading the files:

df6 = pd.read_csv('py_all1a.csv', skipinitialspace=True) # file with multiple columns
df7 = pd.read_csv('artexclude1.csv', skipinitialspace=True) # file with multiple columns
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
df6[~mask].to_csv('py_all1b.csv', index=False)
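
Note that skipinitialspace=True only skips spaces that follow the delimiter, so a trailing space in a value would still prevent a match. The sketch below illustrates this and one possible fix, stripping both columns with .str.strip() before comparing. The in-memory CSV text and the names df_all, df_excl and the "word" column are made up for illustration and are not the files from the question.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two CSV files (not the real data).
all_csv = io.StringIO("id,word\n1,apple\n2,pear \n3, plum\n4,fig\n")
exclude_csv = io.StringIO("word\npear\nplum\n")

df_all = pd.read_csv(all_csv, skipinitialspace=True)
df_excl = pd.read_csv(exclude_csv, skipinitialspace=True)

# skipinitialspace=True drops the space after the comma in " plum",
# but the trailing space in "pear " survives, so that row is missed.
naive_mask = df_all["word"].isin(set(df_excl["word"]))

# Strip leading and trailing whitespace on both sides before comparing.
left = df_all["word"].str.strip()
right = set(df_excl["word"].str.strip())
clean_mask = left.isin(right)

kept = df_all[~clean_mask]
print(kept["word"].tolist())  # ['apple', 'fig']
```

If matches are still being missed, printing a few values with repr() (for example, repr(df_all["word"].iloc[1])) makes hidden whitespace visible.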

edited Feb 19, 2018 at 9:08

user3447273

361 reputation, 1 gold badge, 4 silver badges, 15 bronze badges

answered Feb 19, 2018 at 2:46

jpp

166k reputation, 37 gold badges, 301 silver badges, 363 bronze badges


5 Comments

user3447273 Over a year ago

Thanks, but that did not work either. I manually checked the files to verify it is not correctly masking.

user3447273 Over a year ago

My example has string values rather than integers. Might that be part of the issue?

jpp Over a year ago

@user3447273. No, strings should be fine. Feel free to change my example to see for yourself.

user3447273 Over a year ago

Your example does work, and one I did with a small set of manually entered values also worked. For some unknown reason, when I went back to my actual (larger) files, it did not work. The actual files have 40 different values to compare against a list of 550 values. After running it, some of the 40 (df7) were still found in the list of 550 (df6).

jpp Over a year ago

It does work. I've added an extra example. If you still believe it "doesn't work", please provide a minimal reproducible example.
