
In a data frame, I want to create another column that holds the most frequent value across the other columns in each row.

A      B      C    D
foo    bar    baz  foo
egg    bacon  egg  egg
bacon  egg    foo  baz

The "E" column must contain the most frequent value of each row, like

E
foo
egg

How can I do it in Python?

  • Mode with axis=1? Commented Dec 4, 2020 at 15:27
  • Didn't understand the output. Commented Dec 4, 2020 at 15:43

1 Answer

Recreating your problem with:

import pandas as pd

df = pd.DataFrame(
    {
        'A' : ['foo', 'egg', 'bacon'],
        'B' : ['bar', 'bacon', 'egg'],
        'C' : ['baz', 'egg', 'foo'],
        'D' : ['foo', 'egg', 'baz']
    }
)

And solving the problem with

df['E'] = df.mode(axis=1)[0]

Output:

    A      B       C       D       E
0   foo    bar     baz     foo     foo
1   egg    bacon   egg     egg     egg
2   bacon  egg     foo     baz     bacon

What happens if there is no single most frequent element?

df.mode(axis=1)
    0      1       2       3
0   foo    NaN     NaN     NaN
1   egg    NaN     NaN     NaN
2   bacon  baz     egg     foo

As you can see, when there is a tie for most frequent, mode returns every value in the tied set. If I replace foo with egg in column C and baz with bacon in column D (both in the last row), we get the following result:

    0      1
0   foo    NaN
1   egg    NaN
2   bacon  egg

Now the result has only two columns, which means that the tie in the last row is between bacon and egg.
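To make the tie behaviour easier to inspect, here is a minimal sketch that counts how many values are tied in each row. It reuses the four-column frame from above; the names modes, n_tied and has_tie are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'egg', 'bacon'],
    'B': ['bar', 'bacon', 'egg'],
    'C': ['baz', 'egg', 'foo'],
    'D': ['foo', 'egg', 'baz'],
})

# One column per tied value; rows with fewer tied values are padded with NaN
modes = df.mode(axis=1)

# Count the non-NaN entries per row: 1 means a unique mode, more means a tie
n_tied = modes.notna().sum(axis=1)
has_tie = n_tied > 1

print(n_tied)
print(has_tie)
```

A row where every value is different counts as a tie between all of its values.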

How do I detect ties?

Let us work with the dataset not containing the column D.

df
    A      B       C
0   foo    bar     baz
1   egg    bacon   egg
2   bacon  egg     foo

df_m = df.mode(axis=1)
df_m
    0      1    2
0   bar    baz  foo
1   egg    NaN  NaN
2   bacon  egg  foo

df['D'] = df_m[0]
    A      B       C    D
0   foo    bar     baz  bar
1   egg    bacon   egg  egg
2   bacon  egg     foo  bacon

We can use the notna() method that pandas provides to build a mask of the rows that still contain a non-NaN value, i.e. the rows that are in a tie.

First, we must drop the first column which always has a value.

df_m = df_m.drop(columns=0)

Then we transpose the dataframe with .T and check, for each original row, whether any of the remaining values is not NaN.

df_mask = df_m.T.notna().any()
df_mask
0     True
1    False
2     True
dtype: bool
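The transpose is not strictly required. As a sketch under the same three-column setup, asking any() to work along the rows with axis=1 should produce the same mask; mask_transposed and mask_rowwise are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'egg', 'bacon'],
    'B': ['bar', 'bacon', 'egg'],
    'C': ['baz', 'egg', 'foo'],
})

# Row-wise modes without the first column, which always holds a value
df_m = df.mode(axis=1).drop(columns=0)

# Variant from the answer: transpose, then reduce over what are now columns
mask_transposed = df_m.T.notna().any()

# Same check without transposing: reduce across the columns of each row
mask_rowwise = df_m.notna().any(axis=1)

print(mask_transposed.tolist() == mask_rowwise.tolist())
```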

Now we have a pandas Series of booleans. We can use this mask to overwrite the tied rows of the column from before.

df.loc[df_mask, 'D'] = df.loc[df_mask, 'A']
    A      B       C    D
0   foo    bar     baz  foo
1   egg    bacon   egg  egg
2   bacon  egg     foo  bacon
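The steps above can be folded into one small helper. This is only a sketch: most_frequent_per_row and its fallback_col parameter are hypothetical names, and it uses Series.where in place of the mask assignment.

```python
import pandas as pd

def most_frequent_per_row(df, fallback_col):
    """Row-wise mode; rows with a tie get the value from fallback_col."""
    modes = df.mode(axis=1)
    # More than one non-NaN mode in a row means a tie
    tied = modes.notna().sum(axis=1) > 1
    # Keep the first mode where there is no tie, else take the fallback column
    return modes[0].where(~tied, df[fallback_col])

df = pd.DataFrame({
    'A': ['foo', 'egg', 'bacon'],
    'B': ['bar', 'bacon', 'egg'],
    'C': ['baz', 'egg', 'foo'],
})

df['D'] = most_frequent_per_row(df, 'A')
print(df)
```

Like the mask approach, this returns the fallback column's value for every tied row, even when that value is not one of the tied values. If that matters for your data, the tie handling would need to be more specific.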

5 Comments

If no most frequent value is found, shouldn't it return the value from the first column of that row? In this case, it returns the alphabetically lowest value.
No, mode works by returning a set of values, i.e. if two values tie for most frequent it returns both; this can be seen in the raw output of mode (updated example). I implemented it by simply taking the first element, since you had not specified otherwise.
I have used this formula. When column A has the value "Foo", column B "Bar", and column C "Baz", the formula above outputs "Bar", but it should output "Foo". If every value is unique, the output should be the value from the first column.
According to the documentation, this is not an option. To achieve it, you need to check for NAs in the output data frame.
I updated my answer to fit your problem :)
