String merge in Dataframe if condition fits

Question

I've searched previous answers on this topic, but there is one more thing I'd like to know.

This is the test data:

df = pd.DataFrame({"a": [2, 3, 4, 5, 6, 8], "b": [3, 4, np.nan, 6, 111, 22], "c": [2, 3, 4, 5, 777, 1]})

And this is the function I'm working on to check, per column, whether each value is an outlier:

def check_outliers(df, domain_list):

    outlier_column = []
    for domain in domain_list:
        # Interquartile-range (IQR) bounds for this column
        Q1 = df[domain].quantile(0.25)
        Q3 = df[domain].quantile(0.75)
        IQR = Q3 - Q1
        min_v = Q1 - (1.5 * IQR)
        max_v = Q3 + (1.5 * IQR)

        # "-" for NaN, "O" for values within the bounds, the column name for outliers
        df["No_outliers_" + domain] = np.where(
            np.isnan(df[domain]),
            "-",
            np.where((df[domain] >= min_v) & (df[domain] <= max_v), "O", domain),
        )
        outlier_column.append("No_outliers_" + domain)

    # The combined "No_outliers" column should be built here
    #df["No_outliers"] = np.where()

    # Drop the per-column helper columns again
    df = df.drop(outlier_column, axis=1)
    return df

df = check_outliers(df, ["a", "b", "c"])

I see that many people recommend np.where or np.select for this, but what I want to know is how to handle a condition that spans multiple columns. I want to create a "No_outliers" column that contains the names of the columns whose value in that row is an outlier. Also, "-" should mark np.nan values.

So it should have "No_outliers" : ["","","b","","b, c",""], since 111 in column "b" and 777 in column "c" are outliers in their respective columns.

I thought I could use .any() here, but I couldn't get it to work. I must have used it the wrong way.

Hope you can help me with this.

Thank you!

Shubham Sharma · Accepted Answer · 2022-02-21 04:16:00Z

1

We can combine multiple boolean masks using logical OR to create a resulting mask in which True marks an outlier (or a missing value), then take the dot product of this mask with the column names and assign the result to the No_outliers column:

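# Run this outside the for loop: domain is the list of columns (domain_list),
# and min_v / max_v hold the bounds for those columns.
mask = df[domain].lt(min_v) | df[domain].gt(max_v) | df[domain].isna()
df['No_outliers'] = (mask @ (mask.columns + ', ')).str[:-2]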

Result

print(df)

   a      b    c No_outliers
0  2    3.0    2            
1  3    4.0    3            
2  4    NaN    4           b
3  5    6.0    5            
4  6  111.0  777        b, c
5  8   22.0    1
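
For reference, here is a self-contained sketch of the approach above that can be run as-is. It assumes the test data from the question and that the IQR bounds are computed for all columns at once, outside any loop; the names cols, q1, q3 and iqr are only illustrative and do not appear in the original snippet.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [2, 3, 4, 5, 6, 8],
    "b": [3, 4, np.nan, 6, 111, 22],
    "c": [2, 3, 4, 5, 777, 1],
})
cols = ["a", "b", "c"]  # plays the role of domain_list (illustrative name)

# Per-column IQR bounds. Each result is a Series indexed by column name,
# so the comparisons below align with the DataFrame column by column.
q1 = df[cols].quantile(0.25)
q3 = df[cols].quantile(0.75)
iqr = q3 - q1
min_v = q1 - 1.5 * iqr
max_v = q3 + 1.5 * iqr

# True where a value lies outside the bounds or is missing.
mask = df[cols].lt(min_v) | df[cols].gt(max_v) | df[cols].isna()

# Dot product of booleans with strings: True keeps "name, ", False gives "".
# The trailing ", " is then sliced off.
df["No_outliers"] = (mask @ (mask.columns + ", ")).str[:-2]
print(df)
```

Because the bounds are Series indexed by column name, lt and gt compare every column against its own limits in one step, which is why no loop is needed.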

5 Comments

Jeong In Kim Over a year ago

Wow, thanks a lot. This is my first time seeing a mask, the "@" operator and df.lt/gt ... So many things to study! Thanks for the direction and the code! I'll study the documentation for those. Have a wonderful day

Jeong In Kim Over a year ago

Oh, however, the second line gives me AttributeError: 'Series' object has no attribute 'columns'.

Shubham Sharma Over a year ago

Glad to help. Please make sure to use the code snippet outside of the for loop. Here, domain is the list of columns, i.e. domain_list.
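
The error most likely comes from domain being a single column label inside the loop: selecting with one label returns a Series, which has no columns attribute, while selecting with a list of labels returns a DataFrame. A minimal sketch of the difference, assuming the test data from the question (the names single and several are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [2, 3, 4, 5, 6, 8],
    "b": [3, 4, np.nan, 6, 111, 22],
    "c": [2, 3, 4, 5, 777, 1],
})

# One label (what domain is inside the for loop) -> Series
single = df["b"].isna()

# A list of labels (domain_list) -> DataFrame
several = df[["a", "b", "c"]].isna()

print(type(single).__name__, hasattr(single, "columns"))
print(type(several).__name__, list(several.columns))
```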

Jeong In Kim Over a year ago

Oh I got it now haha. Yeah I should have noticed that. Thanks again for the great help :)

Shubham Sharma Over a year ago

@JeongInKim Happy coding!
