How can I test whether any values are shared across multiple columns of a pandas DataFrame? For example, this is OK:

      A    B    C
0   aaa  fff  lll
1   bbb  ggg  mmm
2   ccc  hhh  nnn
3   ddd  iii  ooo
4   eee  jjj  ppp

but this is not:

      A    B    C
0   aaa  fff  lll
1   bbb  ggg  mmm
2   ccc  hhh  nnn
3   ddd  iii  bbb
4   eee  jjj  ppp

because bbb exists in multiple columns (A and C).

  • Are all your columns of the same obj data type? And are you expecting something like a boolean array that corresponds to each unique value with some sort of "True/False this value is represented in 2+ columns" indication? Commented Feb 1, 2018 at 16:01
  • Yeah, consistent datatypes (integers). Just used strings here to make it easier to look at. I don't need a fancy boolean map of offending cells, but I would like to get out which values are offenders, and which columns they exist in. Commented Feb 1, 2018 at 16:04

1 Answer

First get the intersection of values for every pair of columns, then convert the result to a NumPy array, cast it to boolean, and test whether at least one entry is True:

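from itertools import combinations
import numpy as np
# Values shared by each pair of columns (one set per pair)
a = [set(df[c1]) & set(df[c2]) for c1, c2 in combinations(df.columns, 2)]
# Non-empty sets are truthy, so this is True if any pair shares a value
b = np.array(a).astype(bool).any()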

For first df:

print (a)
[set(), set(), set()]

print (b)
False

For second df:

print (a)
[set(), {'bbb'}, set()]

print (b)
True

For more detail, such as which pair of columns shares which values, you could use something like this (untested):

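import pandas as pd
# Map each pair of columns to the set of values they share
d = {i: set(df[i[0]]) & set(df[i[1]]) for i in combinations(df.columns, 2)}
s = pd.Series(d)
# Keep only the pairs whose shared set is non-empty
s = s[s.astype(bool)]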
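
Since the comments ask for the offending values and the columns they appear in, here is a minimal sketch of another way to get that mapping directly. It is not part of the approach above: the helper name shared_values and the sample frame are assumptions made for illustration, and it only relies on the standard DataFrame.columns and Series.unique calls.

```python
from collections import defaultdict

import pandas as pd

# Sample data mirroring the second frame in the question
df = pd.DataFrame({
    "A": ["aaa", "bbb", "ccc", "ddd", "eee"],
    "B": ["fff", "ggg", "hhh", "iii", "jjj"],
    "C": ["lll", "mmm", "nnn", "bbb", "ppp"],
})


def shared_values(frame):
    """Hypothetical helper: map each value found in 2+ columns to those columns."""
    columns_by_value = defaultdict(list)
    for col in frame.columns:
        # unique() ensures a value repeated within one column is counted once
        for value in frame[col].unique():
            columns_by_value[value].append(col)
    # Keep only the values that occur in more than one column
    return {v: cols for v, cols in columns_by_value.items() if len(cols) > 1}


offenders = shared_values(df)
print(offenders)
```

An empty dict means no value is shared between columns, so bool(offenders) can stand in for the True/False check. Missing values (NaN) are not handled specially here, so you may want to drop them first if your data contains any.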