
I wanted to use column values in one csv file to mask rows in another csv, as in:

import pandas as pd

df6 = pd.read_csv('py_all1a.csv')  # file with multiple columns
df7 = pd.read_csv('artexclude1.csv')  # file with multiple columns

# Column 1 of df6 has the same header and data type as column 8 of df7.
# I want to mask the rows in df6 whose column value matches any value
# in df7. The data in each column is a text value (a single word).
mask = df6.iloc[:, 1].isin(df7.iloc[:, 8])

df6[~mask].to_csv('py_all1b.csv', index=False)

On that last line, I tried the mask with the tilde, which left the output file (py_all1b.csv) identical to df6, and without the tilde, which produced a file containing only the column headers.

The answer below demonstrated the approach on a specific data set, but it did not work on my files because the text values were inconsistent: one entry had a space where the matching entry did not.
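A minimal sketch of how such a mismatch shows up, using made-up values rather than the actual files (the names `left`, `right`, and `unmatched` are hypothetical). `isin` compares strings exactly, so a value with a stray space does not match its clean counterpart, and printing the `repr()` of the unmatched values makes the extra whitespace visible:

```python
import pandas as pd

# Hypothetical data: 'beta ' carries a trailing space, 'beta' does not.
left = pd.Series(['alpha', 'beta ', 'gamma'])
right = pd.Series(['beta', 'gamma'])

# Exact string comparison: 'beta ' is not equal to 'beta'.
mask = left.isin(right)
print(mask.tolist())  # [False, False, True]

# repr() keeps the quotes, so the stray space becomes visible.
unmatched = [repr(v) for v in left[~mask]]
print(unmatched)  # ["'alpha'", "'beta '"]
```

Checking the unmatched values this way is one quick way to tell a real non-match ('alpha') from a formatting problem ('beta ').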

The below answer is correct, and I have added a paragraph to show how the text issue can also be resolved.

  • Can you reduce this to a minimal reproducible example please? Commented Feb 19, 2018 at 3:04
  • I would like to, but no, I cannot. As I stated in the answer comments below, the answer posted below is correct (I checked it manually and it worked when entered that way), so I will mark it correct. However, for some reason the solution did not actually fix the problem for the files I am using. I have no idea why. Commented Feb 19, 2018 at 3:54
  • I found the solution to my problem and posted an edit to the below answer. Commented Feb 19, 2018 at 4:33

1 Answer

Try converting to a set first:

mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))

This makes it explicit that the comparison is against the column's values.

Example

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
#     0   1   2
# 0   1   2   3
# 1   4   5   6
# 2   7   8   9
# 3  10  11  12

df2 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
#    0  1  2
# 0  1  2  3
# 1  1  2  3
# 2  1  2  3
# 3  1  2  3

mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))

df1[mask]
#    0  1  2
# 0  1  2  3

With strings

It still works:

df1 = pd.DataFrame([['a', 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df2 = pd.DataFrame([['a', 2, 3], ['a', 2, 3], ['a', 2, 3], ['a', 2, 3]])

mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))

df1[mask]

#    0  1  2
# 0  a  2  3

When you are dealing with string data, there may be problems with whitespace that can cause matches to be missed. As described in this answer, you may need to instead use:

df6 = pd.read_csv('py_all1a.csv', skipinitialspace=True) # file with multiple columns
df7 = pd.read_csv('artexclude1.csv', skipinitialspace=True) # file with multiple columns
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
df6[~mask].to_csv('py_all1b.csv', index=False)
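Note that `skipinitialspace=True` only skips spaces that follow a delimiter, so trailing spaces inside a field can still prevent a match. A minimal sketch of stripping whitespace on both sides before comparing, assuming the key columns contain only strings; the in-memory CSV text and the names `df_all`, `df_exclude`, `keys`, and `kept` are hypothetical stand-ins for the real files:

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files; "beta " has a trailing space.
all_csv = io.StringIO("id,word\n1,alpha\n2,beta \n3,gamma\n")
exclude_csv = io.StringIO("word\nbeta\ngamma\n")

df_all = pd.read_csv(all_csv)
df_exclude = pd.read_csv(exclude_csv)

# Strip leading and trailing whitespace on both sides before comparing.
keys = df_all.iloc[:, 1].str.strip()
exclude = set(df_exclude.iloc[:, 0].str.strip())

# Rows whose cleaned key appears in the exclusion set are masked out.
mask = keys.isin(exclude)
kept = df_all[~mask]
print(kept)
```

The original column in `df_all` is left untouched here; only the comparison uses the cleaned values, so the rows written out keep their original text.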

5 Comments

Thanks, but that did not work either. I manually checked the files to verify it is not correctly masking.
My example has string values rather than integers. Might that be part of the issue?
@user3447273. No, strings should be fine. Feel free to change my example to see for yourself.
Your example does work, and one I did with a small set of values manually entered, also worked. For some unknown reason, when I went back to my actual (larger) files, it did not work. The actual file has 40 different values to compare against a list with 550 values. After running it, some of the 40 (df7) were still found in the list of 550 (df6).
It does work. I've added an extra example. If you still believe it "doesn't work", please provide a minimal reproducible example.
