I've searched some previous answers about this, but I want to know one thing more about it.
This is the test data:
df = pd.DataFrame({"a":[2,3,4,5,6,8],"b":[3,4,np.nan,6,111,22], "c" : [2,3,4,5, 777,1]})
And this is the one I'm working with to check if the value is outliers per column
def check_outliers(df, domain_list):
outlier_column = []
for domain in domain_list:
Q1 = df[domain].quantile(0.25)
Q3 = df[domain].quantile(0.75)
IQR = Q3 - Q1
min_v = Q1 - (1.5 * IQR)
max_v = Q3 + (1.5 * IQR)
df["No_outliers_"+domain] = np.where(np.isnan(df[domain]),"-",np.where((df[domain] >= min_v) & (df[domain] <= max_v), "O",domain))
outlier_column.append("No_outliers_"+domain)
#df["No_outliers"] = np.where()
df = df.drop(outlier_column, axis=1)
return df
df = check_outliers(df,["a","b","c"])
I see that many recommends to use np.where or np.select for this, but what I want to know more is that dealing with multiple columns for the condition.
I want to make "No_outliers" column which contains the column names if the value of the column is an outlier. Also "-" for marking np.nan value.
So it should have "No_outliers" : ["","","b","","b, c",""] since 111 in column "b" and 777 in column "c" would be an outlier in each column.
I think I can use .any() here but I couldn't. I must have used it in a wrong way.
Hope you can help me with this.
Thank you!