
I have the following dataframe:

import pandas as pd
df = pd.DataFrame({'TX':['bob','tim','frank'],'IL':['fred','bob','tim'],'NE':['tim','joe','bob']})

I would like to isolate the strings that occur in every column and collect them into a list. The expected result is:

output = ['tim','bob']

The only way I can think of to achieve this is with for loops, which I would like to avoid. Is there a built-in pandas function suited to this?


3 Answers


You can create a mask by counting the values per column, then test which values are non-missing in every row with DataFrame.all:

m = df.apply(pd.value_counts).notna()
print (m)
          TX     IL     NE
bob     True   True   True
frank   True  False  False
fred   False   True  False
joe    False  False   True
tim     True   True   True

L = m.index[m.all(axis=1)].tolist()
print (L)
['bob', 'tim']
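
Note that the top-level pd.value_counts used above is deprecated in recent pandas releases; a minimal self-contained sketch of the same mask using the Series method instead:

import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Build the same count mask via Series.value_counts instead of the
# deprecated top-level pd.value_counts; values missing from a column
# become NaN, so notna() marks presence.
m = df.apply(lambda s: s.value_counts()).notna()

L = m.index[m.all(axis=1)].tolist()
print(L)  # ['bob', 'tim']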


You can achieve this with pandas.DataFrame.apply() and set.intersection(), like this:

cols_set = list(df.apply(lambda col: set(col.values)).values)
output = list(set.intersection(*cols_set))

The result is the following:

>>> print(output)
['tim', 'bob']

Comments

list(set.intersection(*[set(col) for col in df.values])). Summary of the above answer; achieves the same result in less code.
@nishant, thank you for your comment. However, that is not quite right: the code would solve the problem from the question only if it looked like this: list(set.intersection(*[set(col) for col in df.values.T])). The author of the question is interested in values common to every column, not every row! Next time, please read the question carefully.
@Jaroslav, yes, you are correct. I missed df.T, i.e. the transpose of the DataFrame. The actual code would be list(set.intersection(*[set(col) for col in df.T.values]))
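
For reference, the corrected one-liner from this thread as a self-contained sketch (using the DataFrame from the question; note that set order is arbitrary):

import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# df.T.values iterates over the rows of the transpose, i.e. the
# original columns, so each set holds one column's values.
output = list(set.intersection(*[set(col) for col in df.T.values]))
print(output)  # ['tim', 'bob'] (set order may vary)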

IIUC,

you can stack all of the columns vertically and then use value_counts to count the occurrences of each item; we'll store that in a variable called s.

We then want every name whose count equals the maximum number of occurrences, in this instance 3. The names become the index of s thanks to stacking and counting.

s = df.stack().value_counts()
# or if you want to ignore duplicates column wise
#df.stack().groupby(level=1).unique().explode().value_counts()

print(s)

tim      3
bob      3
frank    1
fred     1
joe      1

s1 = s[s.eq(s.max())].index.tolist()

print(s1)

['tim', 'bob']

Comments

Please explain.
This might fail if the same value appears in one column more than once. For example, if bob appeared twice in the first column, df.stack().value_counts() would give 4 for 'bob', and s1 would then only return ['bob'], which is wrong according to the question.
Correct, @nishant, but in the absence of any feedback from the OP it's hard to say what's wrong or right. The above could be corrected with df.stack().groupby(level=1).unique().explode().value_counts()
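
To illustrate that correction, here is a minimal sketch with a duplicated value in one column (a hypothetical variant of the question's DataFrame):

import pandas as pd

# 'bob' appears twice in TX; a plain df.stack().value_counts() would
# count it 4 times, so the max-count filter would drop 'tim'.
df = pd.DataFrame({'TX': ['bob', 'tim', 'bob'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Deduplicate per column first: group the stacked values by the column
# level of the index, keep each column's unique values, then count.
s = df.stack().groupby(level=1).unique().explode().value_counts()

# Compare against the number of columns rather than s.max(), so only
# values present in every column survive.
out = s[s.eq(df.shape[1])].index.tolist()
print(out)  # ['tim', 'bob'] (order may vary)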
