61

I have the below case statement in python,

pd_df['difficulty'] = 'Unknown'
pd_df['difficulty'][(pd_df['Time']<30) & (pd_df['Time']>0)] = 'Easy'
pd_df['difficulty'][(pd_df['Time']>=30) & (pd_df['Time']<=60)] = 'Medium'
pd_df['difficulty'][pd_df['Time']>60] = 'Hard'

But when I run the code, it throws an error.

A value is trying to be set on a copy of a slice from a DataFrame

4 Answers 4

115

Option 1
For performance, use a nested np.where condition. For the condition, you can just use pd.Series.between, and the default value will be inserted accordingly.

pd_df['difficulty'] = np.where(
     pd_df['Time'].between(0, 30, inclusive=False), 
    'Easy', 
     np.where(
        pd_df['Time'].between(0, 30, inclusive=False), 'Medium', 'Unknown'
     )
)

Option 2
Similarly, using np.select, this gives more room for adding conditions:

pd_df['difficulty'] = np.select(
    [
        pd_df['Time'].between(0, 30, inclusive=False), 
        pd_df['Time'].between(30, 60, inclusive=True)
    ], 
    [
        'Easy', 
        'Medium'
    ], 
    default='Unknown'
)

Option 3
Another performant solution involves loc:

pd_df['difficulty'] = 'Unknown'
pd_df.loc[pd_df['Time'].between(0, 30, inclusive=False), 'difficulty'] = 'Easy'
pd_df.loc[pd_df['Time'].between(30, 60, inclusive=True), 'difficulty'] = 'Medium'
Sign up to request clarification or add additional context in comments.

Comments

23

case_when

Since pandas 2.2.0, you can use case_when() on a column. Just initialize with the default value and replace values in it using case_when(), which accepts a list of (condition, replacement) tuples. For the example in the OP, we can use the following.

pd_df["difficulty"] = "Unknown"
pd_df["difficulty"] = pd_df["difficulty"].case_when([
    (pd_df.eval("0 < Time < 30"), "Easy"), 
    (pd_df.eval("30 <= Time <= 60"), "Medium"), 
    (pd_df.eval("Time > 60"), "Hard")
])

loc

OP's code only needed loc to correctly call the __setitem__() method via []. In particular, they already have used the proper brackets () to evaluate &-chained conditions individually.

The basic idea of this approach is to initialize a column with some default value (e.g. "Unknown") and update rows depending on condtions (e.g. "Easy" if 0<Time<30), etc.

When I time the options given on this page, for large frames, loc approach is the fastest (4-5 times faster than np.select and nested np.where).1.

pd_df['difficulty'] = 'Unknown'
pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'


1: Code used for benchmark.

def loc(pd_df):
    pd_df['difficulty'] = 'Unknown'
    pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
    pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
    pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
    return pd_df

def np_select(pd_df):
    pd_df['difficulty'] = np.select([pd_df['Time'].between(0, 30, inclusive='neither'), pd_df['Time'].between(30, 60, inclusive='both'), pd_df['Time']>60], ['Easy', 'Medium', 'Hard'], 'Unknown')
    return pd_df

def nested_np_where(pd_df):
    pd_df['difficulty'] = np.where(pd_df['Time'].between(0, 30, inclusive='neither'), 'Easy', np.where(pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')))
    return pd_df


df = pd.DataFrame({'Time': np.random.default_rng().choice(120, size=15_000_000)-30})

%timeit loc(df.copy())
# 891 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_select(df.copy())
# 3.93 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nested_np_where(df.copy())
# 4.82 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comments

6

For the sake of readability "match-case" syntax is introduced since python 3.10.
https://peps.python.org/pep-0622/

import pandas as pd

# Sample data
data = {
    'Time': [15, 45, 70, -5, 0, 30, 60]
    }
pd_df = pd.DataFrame(data)

def classify_difficulty(time):
    match time:
        case _ if 0 < time < 30:
            return 'Easy'
        case _ if 30 <= time <= 60:
            return 'Medium'
        case _ if time > 60:
            return 'Hard'
        case _:
            return 'Unknown'

pd_df['difficulty'] = pd_df['Time'].apply(classify_difficulty)

print(pd_df)
   Time difficulty
0    15       Easy
1    45     Medium
2    70       Hard
3    -5    Unknown
4     0    Unknown
5    30     Medium
6    60     Medium

Comments

3
import numpy as np
import pandas as pd

def case_when(*args):
    return np.select(
        condlist = args[::2],
        choicelist = args[1::2],
        default=pd.NA
    )

df = pd.DataFrame({"cola":["a","b","a","a","c","d","d","e","c"],
                   "colb":range(9)})

df["newcol"] = case_when(df["cola"] == "a","ap",
                         df["colb"] == 0, "x", # Not taken because it's after the first line
                         df["colb"] == 1, "y",
                         True, df["cola"]
                         )

df["newcolb"] = case_when(df["cola"] == "e",1,
                          df["colb"] == 8, 2
                          )

df

#   cola  colb newcol newcolb
# 0    a     0     ap    <NA>
# 1    b     1      y    <NA>
# 2    a     2     ap    <NA>
# 3    a     3     ap    <NA>
# 4    c     4      c    <NA>
# 5    d     5      d    <NA>
# 6    d     6      d    <NA>
# 7    e     7      e       1
# 8    c     8      c       2

For the sake of readability, like in R.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.