\$\begingroup\$

I wrote this code to mark duplicated values in an Excel sheet. It works, but I would like to know whether there is a more efficient way to do it. Thanks.

import tkinter as tk
from tkinter import filedialog
import pandas as pd
import numpy as np
// load excel file
def load_excel():
    global df
    file_path = filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx")])
    if file_path:
        df = pd.read_excel(file_path)
        label_load.pack()
// find duplicates and indicate them by certain number of periods
def process_excel():
    label_load.pack_forget()
    for i in df.columns:
        for j in df[i]:
            same_value_positions = np.where(df.values == j)
            if same_value_positions[0].size >= 2:
                for k in range(1, len(same_value_positions[0])):
                    dot_arr = '.' * k
                    l = [same_value_positions[0][k], same_value_positions[1][k]]
                    df.iat[l[0], l[1]+1] = f'{j}{dot_arr}'
    df.to_excel("output.xlsx", index=False)
    label_done.pack()
// run the script
root = tk.Tk()
root.geometry('700x400')
button = tk.Button(root, text="Load Excel", command=load_excel, font=('sans', 16))
button.pack(pady=20)
label_load = tk.Label(root, text="Excel file loaded", font=('sans', 16))
button = tk.Button(root, text="Process it", command=process_excel, font=('sans', 16))
button.pack(pady=20)
label_done = tk.Label(root, text="Process Done", font=('sans', 16))
root.mainloop()
\$\endgroup\$
  • \$\begingroup\$ Have you attempted to run this? // is not a comment in Python. \$\endgroup\$ Commented Oct 3, 2023 at 15:22
  • \$\begingroup\$ Yeah, right. The comments are something I added just to upload this, as the Stack Exchange engine required it. This is a working script. If there are 3 duplicates of 'A', my script adds A., A.. and A... in the next cell (next column) of each duplicated cell. \$\endgroup\$ Commented Oct 4, 2023 at 1:19
  • \$\begingroup\$ Please post the actual working code for review. There is no reason to show something different to what you have tested and are ready to deploy (certainly not "stackexchange engine required it", whatever that means). \$\endgroup\$ Commented Oct 8, 2023 at 9:07
  • \$\begingroup\$ The code as-is cannot run, because // load excel file results in a SyntaxError. Just change the comments to actual Python ones or remove them entirely. \$\endgroup\$ Commented Oct 13, 2023 at 15:29

2 Answers

\$\begingroup\$

As @suchislife already mentioned, you can use the duplicated method:

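def process_excel():
    label_load.pack_forget()
    for column in df.columns:
        # Second and later occurrences of a value in this column (empty cells ignored)
        repeats = df.loc[df.duplicated(subset=[column]) & df[column].notna(), column]
        if len(repeats) > 0:
            df[column] = df[column].astype(object)  # allow text in numeric columns
            seen = {}  # value -> number of repeats found so far
            for idx, value in repeats.items():
                seen[value] = seen.get(value, 0) + 1
                dot_arr = '.' * seen[value]
                df.at[idx, column] = f'{value}{dot_arr}'
    df.to_excel("output.xlsx", index=False)
    label_done.pack()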
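
For reference, here is a small sketch of what duplicated returns for two settings of its keep parameter. The frame and its column names (name, qty) are made up for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"name": ["A", "B", "A", "A"], "qty": [1, 2, 3, 1]})

# keep=False flags every row whose "name" occurs more than once,
# the first occurrence included.
all_copies = df.duplicated(subset=["name"], keep=False)

# The default keep='first' leaves the first occurrence unflagged
# and flags only the later repeats.
repeats_only = df.duplicated(subset=["name"])

print(all_copies.tolist())    # [True, False, True, True]
print(repeats_only.tolist())  # [False, False, True, True]
```

Which variant fits depends on whether the first occurrence should be marked as well.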
\$\endgroup\$
\$\begingroup\$

The code could be optimized by using pandas functions instead of iterating through each value.

Explanation:

  • groupby(col)[col].transform(): Instead of iterating through each value, we group the dataframe by each value in the current column. Using transform ensures the operation's result has the same shape as the original data.

  • x.duplicated(keep='first').cumsum(): Within each group, we identify duplicated values using the duplicated method. The keep='first' ensures the first occurrence is not considered a duplicate. The cumsum() function then provides an incremental number for each subsequent duplicate which is used to determine the number of dots to append.

  • x + ('.' * <count>): For each value in the group, it appends the calculated number of dots based on the cumulative sum of duplicates.

import tkinter as tk
from tkinter import filedialog
import pandas as pd

def load_excel():
    global df
    file_path = filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx")])
    if file_path:
        df = pd.read_excel(file_path)
        label_load.pack()

def process_excel():
    label_load.pack_forget()
    for col in df.columns:
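        # astype(str) lets dots be appended to non-text cells;
        # map() turns each count into that many dots
        df[col] = df.groupby(col)[col].transform(
            lambda x: x.astype(str) + x.duplicated(keep='first').cumsum().map(lambda n: '.' * n)
        )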
    df.to_excel("output.xlsx", index=False)
    label_done.pack()

root = tk.Tk()
root.geometry('700x400')
button = tk.Button(root, text="Load Excel", command=load_excel, font=('sans', 16))
button.pack(pady=20)
label_load = tk.Label(root, text="Excel file loaded", font=('sans', 16))
button = tk.Button(root, text="Process it", command=process_excel, font=('sans', 16))
button.pack(pady=20)
label_done = tk.Label(root, text="Process Done", font=('sans', 16))
root.mainloop()
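
The grouping and counting steps described above can be tried in isolation, without the tkinter part. This is a minimal sketch on a made-up text column; the column label name and the helper add_dots are hypothetical names chosen for the example.

```python
import pandas as pd

# Made-up sample column; "name" is a hypothetical column label.
df = pd.DataFrame({"name": ["A", "B", "A", "C", "A", "B"]})

def add_dots(group):
    # Every value in a group is identical, so duplicated() is False for the
    # first row and True afterwards; cumsum() turns that into 0, 1, 2, ...
    repeat_number = group.duplicated(keep="first").cumsum()
    # Turn each count into that many dots and append it to the value.
    return group + repeat_number.map(lambda n: "." * n)

marked = df.groupby("name")["name"].transform(add_dots)
print(marked.tolist())  # ['A', 'B', 'A.', 'C', 'A..', 'B.']
```

The first occurrence of each value stays unchanged, and transform returns the results in the original row order.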

Approach #2

• Use duplicated across the entire DataFrame and iterate only where duplicates are found.

import tkinter as tk
from tkinter import filedialog
import pandas as pd

def load_excel():
    global df
    file_path = filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx")])
    if file_path:
        df = pd.read_excel(file_path)
        label_load.pack()

def process_excel():
    label_load.pack_forget()
    # Rows that are exact copies of another row, first occurrences included
    duplicates = df[df.duplicated(keep=False)]
    for col in df.columns:
        # 1 for the first occurrence of a value among those rows, 2 for the second, ...
        occurrence = duplicates.dropna(subset=[col]).groupby(col).cumcount() + 1
        df[col] = df[col].astype(object)  # allow text in numeric columns
        for index, count in occurrence.items():
            df.at[index, col] = f'{duplicates.at[index, col]}' + '.' * count
    df.to_excel("output.xlsx", index=False)
    label_done.pack()

root = tk.Tk()
root.geometry('700x400')
button = tk.Button(root, text="Load Excel", command=load_excel, font=('sans', 16))
button.pack(pady=20)
label_load = tk.Label(root, text="Excel file loaded", font=('sans', 16))
button = tk.Button(root, text="Process it", command=process_excel, font=('sans', 16))
button.pack(pady=20)
label_done = tk.Label(root, text="Process Done", font=('sans', 16))
root.mainloop()

Explanation:

  • df[df.duplicated(keep=False)]: Selects every row that is an exact copy of another row, including the first occurrence. The comparison is made on whole rows, not on individual cells.
  • duplicates.dropna(subset=[col]).groupby(col).cumcount() + 1: Within those rows, numbers the occurrences of each value in the column (1, 2, 3, ...), which gives the number of dots to append. Empty cells are skipped.
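
If duplicated values can appear anywhere in the sheet, rather than within a single column or as whole rows, the same counting idea can be applied to all cells at once. The following is only a sketch under assumptions: the sample frame is made up, and the dots are written into the repeated cell itself rather than into the neighbouring column.

```python
import pandas as pd

# Hypothetical sample sheet, only for illustration.
df = pd.DataFrame({
    "a": ["A", "B", "C"],
    "b": ["A", "D", "B"],
    "c": ["E", "A", "F"],
})

# Flatten the sheet into one Series of cells, row by row.
cells = pd.Series(df.to_numpy(dtype=object).ravel())

# Number the occurrences of each value across the whole sheet: 0, 1, 2, ...
# Empty (NaN) cells are left out so that they stay empty.
filled = cells.dropna()
repeat_number = filled.groupby(filled, sort=False).cumcount()

# Only the second and later occurrences are changed: one dot per earlier occurrence.
repeats = repeat_number[repeat_number > 0]
cells.loc[repeats.index] = (
    filled.loc[repeats.index].astype(str) + repeats.map(lambda n: "." * n)
)

# Restore the original shape and labels.
marked = pd.DataFrame(cells.to_numpy().reshape(df.shape),
                      index=df.index, columns=df.columns)
print(marked)
```

This replaces the nested Python loops with a single grouping pass over the cells; occurrences are numbered in row-major order, so which copy counts as "first" depends on that order.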
\$\endgroup\$
  • \$\begingroup\$ Thanks for your kind comment, S. The duplicated cells appear all through the DataFrame, not just in one column. \$\endgroup\$ Commented Oct 5, 2023 at 2:50
