
I have the following dataframe:

import pandas as pd
df = pd.DataFrame({'TX':['bob','tim','frank'],'IL':['fred','bob','tim'],'NE':['tim','joe','bob']})

I would like to isolate the strings that occur in every column and collect them into a list. The expected result is:

output = ['tim','bob']

The only way I can think of to achieve this is with for loops, which I would like to avoid. Is there a built-in pandas function suited to this?


3 Answers


You can create a mask by counting the values per column, then test which values are non-missing in every row with DataFrame.all:

m = df.apply(pd.value_counts).notna()
print (m)
          TX     IL     NE
bob     True   True   True
frank   True  False  False
fred   False   True  False
joe    False  False   True
tim     True   True   True

L = m.index[m.all(axis=1)].tolist()
print (L)
['bob', 'tim']
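
Note that the top-level pd.value_counts used above is deprecated in recent pandas releases; a minimal self-contained sketch of the same mask using the Series method instead:

import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Build the same count mask via Series.value_counts instead of the
# deprecated top-level pd.value_counts; values missing from a column
# become NaN, so notna() marks presence.
m = df.apply(lambda s: s.value_counts()).notna()

L = m.index[m.all(axis=1)].tolist()
print(L)  # ['bob', 'tim']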


You can achieve this with pandas.DataFrame.apply() and set.intersection(), like this:

cols_set = list(df.apply(lambda col: set(col.values)).values)
output = list(set.intersection(*cols_set))

The result is the following:

>>> print(output)
['tim', 'bob']

Comments

list(set.intersection(*[set(col) for col in df.values])). Summary of the above answer; achieves the same result in less code.
@nishant, thank you for your comment. However, that is not quite right: the code would solve the problem from the question only if it looked like this: list(set.intersection(*[set(col) for col in df.values.T])). The author of the question is interested in values common to every column, not every row! Next time, please read the question carefully.
@Jaroslav, yes, you are correct. I missed df.T, i.e. the transpose of the DataFrame. The actual code would be list(set.intersection(*[set(col) for col in df.T.values]))
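
For reference, the corrected one-liner from this thread as a self-contained sketch (using the DataFrame from the question; note that set order is arbitrary):

import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# df.T.values iterates over the rows of the transpose, i.e. the
# original columns, so each set holds one column's values.
output = list(set.intersection(*[set(col) for col in df.T.values]))
print(output)  # ['tim', 'bob'] (set order may vary)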

IIUC,

you can stack all of the columns vertically and then use value_counts to count the occurrences of each item; we'll store that in a variable called s.

We then want every name whose count equals the maximum number of occurrences, in this instance 3. The names become the index of s thanks to stacking and counting.

s = df.stack().value_counts()
# or if you want to ignore duplicates column wise
#df.stack().groupby(level=1).unique().explode().value_counts()

print(s)

tim      3
bob      3
frank    1
fred     1
joe      1

s1 = s[s.eq(s.max())].index.tolist()

print(s1)

['tim', 'bob']

Comments

Please explain.
This might fail if the same value appears in one column more than once. For example, if bob appeared twice in the first column, df.stack().value_counts() would give 4 for 'bob', and s1 would then only return ['bob'], which is wrong according to the question.
Correct, @nishant, but in the absence of any feedback from the OP it's hard to say what's wrong or right. The above could be corrected with df.stack().groupby(level=1).unique().explode().value_counts()
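
To illustrate that correction, here is a minimal sketch with a duplicated value in one column (a hypothetical variant of the question's DataFrame):

import pandas as pd

# 'bob' appears twice in TX; a plain df.stack().value_counts() would
# count it 4 times, so the max-count filter would drop 'tim'.
df = pd.DataFrame({'TX': ['bob', 'tim', 'bob'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Deduplicate per column first: group the stacked values by the column
# level of the index, keep each column's unique values, then count.
s = df.stack().groupby(level=1).unique().explode().value_counts()

# Compare against the number of columns rather than s.max(), so only
# values present in every column survive.
out = s[s.eq(df.shape[1])].index.tolist()
print(out)  # ['tim', 'bob'] (order may vary)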
