1

I've searched some previous answers about this, but I want to know one thing more about it.

This is the test data:

df = pd.DataFrame({"a":[2,3,4,5,6,8],"b":[3,4,np.nan,6,111,22], "c" : [2,3,4,5, 777,1]})

And this is the one I'm working with to check if the value is outliers per column

def check_outliers(df, domain_list):
    
    outlier_column = []
    for domain in domain_list:
        Q1 = df[domain].quantile(0.25)
        Q3 = df[domain].quantile(0.75)
        IQR = Q3 - Q1
        min_v = Q1 - (1.5 * IQR)
        max_v = Q3 + (1.5 * IQR)
    
        df["No_outliers_"+domain] = np.where(np.isnan(df[domain]),"-",np.where((df[domain] >= min_v) & (df[domain] <= max_v), "O",domain))
        outlier_column.append("No_outliers_"+domain)
    
    #df["No_outliers"] = np.where()

    df = df.drop(outlier_column, axis=1)
    return df

df = check_outliers(df,["a","b","c"])

I see that many recommends to use np.where or np.select for this, but what I want to know more is that dealing with multiple columns for the condition. I want to make "No_outliers" column which contains the column names if the value of the column is an outlier. Also "-" for marking np.nan value.

So it should have "No_outliers" : ["","","b","","b, c",""] since 111 in column "b" and 777 in column "c" would be an outlier in each column.

I think I can use .any() here but I couldn't. I must have used it in a wrong way.

Hope you can help me with this.

Thank you!

1 Answer 1

1

We can combine multiple boolean masks using logical OR to create a resulting mask where a True values represent an outlier, then take the dot product of this mask with the columns and assign the result to No_outliers column

mask = df[domain].lt(min_v) | df[domain].gt(max_v) | df[domain].isna()
df['No_outliers'] = (mask @ (mask.columns + ', ')).str[:-2]

Result

print(df)

   a      b    c No_outliers
0  2    3.0    2            
1  3    4.0    3            
2  4    NaN    4           b
3  5    6.0    5            
4  6  111.0  777        b, c
5  8   22.0    1            
Sign up to request clarification or add additional context in comments.

5 Comments

Wow, thanks a lot. This is my first time seeing mask and regex "@" and df.lt/gt ... So many things to study! Thanks for the direction and the code! I shall study with documents of those. Have a wonderful day
Oh however it gives me AttributeError: 'Series' object has no attribute 'columns' for the second line.
Glad to help. Please make sure to use the code snippet out of the for-loop. Here domain is a list of columns: domain_list
Oh I got it now haha. Yeah I should have noticed that. Thanks again for the great help :)
@JeongInKim Happy coding!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.