My dataframe looks like this: [image: Data Frame]

I am trying to cluster the data manually with if-else logic on two columns of a data frame, and I want to create a new column from the function's return value.

How should I pass the data to the following custom function:

def cluster(Data):
    if Data.WMCI_range == "Low" and Data.Store_Format == "Small":
        return "Low_Small"

    elif Data.WMCI_range == "Medium" and Data.Store_Format == "Medium":
        return "Medium_Medium"

    elif Data.WMCI_range == "High" and Data.Store_Format == "Large":
        return "High_large"

    elif Data.WMCI_range == "Low" and Data.Store_Format == "Medium":
        return "low_Medium"

    elif Data.WMCI_range == "Low" and Data.Store_Format == "Large":
        return "low_High"

    elif Data.WMCI_range == "Medium" and Data.Store_Format == "Small":
        return "Medium_Small"

    elif Data.WMCI_range == "Medium" and Data.Store_Format == "Large":
        return "Medium_Large"

    elif Data.WMCI_range == "High" and Data.Store_Format == "Small":
        return "High_Small"

    elif Data.WMCI_range == "High" and Data.Store_Format == "Medium":
        return "High_Medium"

I have tried the following ways of passing the data, but they did not work:

Data['Clusters'] = cluster(Data[['WMCI_range', 'Store_Format']])
Data['Clusters'] = [cluster(i) for i in len(Data[['WMCI_range', 'Store_Format']])]

Please help me find a solution.

You can use this code to mock the data as I did:

import random

import pandas as pd

columns = ["Small", "Medium", "Large"]
store_Format = random.choices(columns, weights=[6, 8, 5], k=4500)
# 4500 random WMCI scores between 1 and 9 (inclusive)
WMCI = [random.randint(1, 9) for _ in range(4500)]

df = pd.DataFrame({"Store_Format": store_Format, "WMCI": WMCI})

1 Answer

So, given the following dataframe:

import random

import pandas as pd

columns = ["Small", "Medium", "Large"]
store_Format = random.choices(columns, weights=[6, 8, 5], k=4500)
# 4500 random WMCI scores between 1 and 9 (inclusive)
WMCI = [random.randint(1, 9) for _ in range(4500)]

df = pd.DataFrame({"WMCI": WMCI, "Store_Format": store_Format})
print(df)
# Outputs
      WMCI Store_Format
0        6       Medium
1        1        Large
2        6       Medium
...    ...          ...
4497     6       Medium
4498     1       Medium
4499     7       Medium

Instead of applying a custom helper function row by row, I would suggest a simpler and more efficient way to compute the clusters: select each WMCI range with a boolean mask in df.loc and prepend the range label to Store_Format:

df.loc[df["WMCI"] <= 3, "Clusters"] = (
    "Small_" + df.loc[df["WMCI"] <= 3, "Store_Format"]
)

df.loc[(df["WMCI"] > 3) & (df["WMCI"] <= 6), "Clusters"] = (
    "Medium_" + df.loc[(df["WMCI"] > 3) & (df["WMCI"] <= 6), "Store_Format"]
)

df.loc[(df["WMCI"] > 6) & (df["WMCI"] <= 9), "Clusters"] = (
    "High_" + df.loc[(df["WMCI"] > 6) & (df["WMCI"] <= 9), "Store_Format"]
)

Which gives you:

print(df)
# Outputs
      WMCI Store_Format       Clusters
0        6       Medium  Medium_Medium
1        1        Large    Small_Large
2        6       Medium  Medium_Medium
3        7       Medium    High_Medium
4        9        Large     High_Large
...    ...          ...            ...
4495     9        Large     High_Large
4496     3        Small    Small_Small
4497     6       Medium  Medium_Medium
4498     1       Medium   Small_Medium
4499     7       Medium    High_Medium
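
If you do want to keep a row-wise function like the one in the question, the two attempts fail for different reasons. `cluster(Data[[...]])` passes the whole frame, so `Data.WMCI_range == "Low"` is a Series and cannot be used in an `if`. `len(...)` returns an integer, which cannot be iterated. A minimal sketch of one way around this is below. It assumes bin edges of 1-3, 4-6 and 7-9 for `WMCI_range` and uses a small made-up sample. It also shortens the `if`/`elif` chain with a fallback, so it is an illustration rather than the asker's exact function.

```python
import pandas as pd

# Small hypothetical sample with the same columns as the mock data.
df = pd.DataFrame(
    {
        "WMCI": [1, 5, 9, 3, 7],
        "Store_Format": ["Small", "Medium", "Large", "Large", "Small"],
    }
)

# Assumed bin edges: 1-3 -> Low, 4-6 -> Medium, 7-9 -> High.
df["WMCI_range"] = pd.cut(
    df["WMCI"], bins=[0, 3, 6, 9], labels=["Low", "Medium", "High"]
).astype(str)


def cluster(row):
    # `row` is a single row (a Series), so each comparison is a plain bool.
    if row.WMCI_range == "Low" and row.Store_Format == "Small":
        return "Low_Small"
    elif row.WMCI_range == "Medium" and row.Store_Format == "Medium":
        return "Medium_Medium"
    # Fallback for the remaining combinations, to keep the sketch short.
    return f"{row.WMCI_range}_{row.Store_Format}"


# axis=1 calls `cluster` once per row instead of once on the whole frame.
df["Clusters"] = df.apply(cluster, axis=1)
print(df)
```

On large frames the masked assignments above will usually be faster, because `apply` with `axis=1` runs a Python function for every row.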