Optimize a Python code which indicates duplicated values in an excel file [closed]

Question

Closed. This question is off-topic. It is not currently accepting answers.

Code not implemented or not working as intended: Code Review is a community where programmers peer-review your working code to address issues such as security, maintainability, performance, and scalability. We require that the code be working correctly, to the best of the author's knowledge, before proceeding with a review.

Closed 2 years ago.


I wrote this code to mark duplicated values in an Excel file. It works, but I would like to know whether there is a faster or cleaner way to do it. Thanks.

import tkinter as tk
from tkinter import filedialog
import pandas as pd
import numpy as np
// load excel file
def load_excel():
    global df
    file_path = filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx")])
    if file_path:
        df = pd.read_excel(file_path)
        label_load.pack()
// find duplicates and indicate them by certain number of periods
def process_excel():
    label_load.pack_forget()
    for i in df.columns:
        for j in df[i]:
            same_value_positions = np.where(df.values == j)
            if same_value_positions[0].size >= 2:
                for k in range(1, len(same_value_positions[0])):
                    dot_arr = '.' * k
                    l = [same_value_positions[0][k], same_value_positions[1][k]]
                    df.iat[l[0], l[1]+1] = f'{j}{dot_arr}'
    df.to_excel("output.xlsx", index=False)
    label_done.pack()
// run the script
root = tk.Tk()
root.geometry('700x400')
button = tk.Button(root, text="Load Excel", command=load_excel, font=('sans', 16))
button.pack(pady=20)
label_load = tk.Label(root, text="Excel file loaded", font=('sans', 16))
button = tk.Button(root, text="Process it", command=process_excel, font=('sans', 16))
button.pack(pady=20)
label_done = tk.Label(root, text="Process Done", font=('sans', 16))
root.mainloop()

Have you attempted to run this? // is not a comment in Python.
– Reinderien, Commented Oct 3, 2023 at 15:22
Yeah, right. The comments are just something I added to upload this, as the stackexchange engine required it. This is a working script. If there are 3 duplicates of 'A', my script writes A., A.. and A... into the next cell (next column) of each duplicated cell.
– peternish, Commented Oct 4, 2023 at 1:19
Please post the actual working code for review. There is no reason to show something different to what you have tested and are ready to deploy (certainly not "stackexchange engine required it", whatever that means).
– Toby Speight, Commented Oct 8, 2023 at 9:07
The code as-is cannot run, as // load excel file will result in a SyntaxError. Just change the comments to actual Python ones or remove the comments entirely.
– Peilonrayz ♦, Commented Oct 13, 2023 at 15:29

Pavel Nekrasov · Accepted Answer · 2023-10-13 12:01:07Z

As @suchislife already mentioned, you can use the duplicated method:

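def process_excel():
    label_load.pack_forget()
    for column in df.columns:
        # Flag every non-empty cell whose value occurs more than once in this column.
        mask = df.duplicated(subset=[column], keep=False) & df[column].notna()
        if mask.any():
            values = df.loc[mask, column]
            # Number the occurrences of each repeated value: 1, 2, 3, ...
            counts = values.groupby(values).cumcount() + 1
            # Cast to object so strings can be stored in a numeric column.
            df[column] = df[column].astype(object)
            df.loc[mask, column] = [f"{v}{'.' * n}" for v, n in zip(values, counts)]
    df.to_excel("output.xlsx", index=False)
    label_done.pack()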
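
To make the effect of the keep argument concrete, here is a minimal sketch on a made-up Series (the sample values are an assumption, not data from the question). It compares keep=False, which flags every member of a repeated group, with keep='first', which leaves the first occurrence unflagged.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question's workbook.
s = pd.Series(["A", "B", "A", "C", "A"])

# keep=False flags every occurrence of a repeated value.
all_dupes = s.duplicated(keep=False).tolist()

# keep='first' leaves the first occurrence unflagged.
later_dupes = s.duplicated(keep="first").tolist()

print(all_dupes)    # [True, False, True, False, True]
print(later_dupes)  # [False, False, True, False, True]
```

Which variant fits depends on whether the first occurrence should also receive a dot.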

suchislife · Accepted Answer · 2023-10-05 12:47:15Z

The code could be optimized by using pandas functions instead of iterating through each value.

Explanation:

groupby(col)[col].transform(): Instead of iterating through each value, we group the dataframe by each value in the current column. Using transform ensures the operation's result has the same shape as the original data.
x.duplicated(keep='first').cumsum(): Within each group, we identify duplicated values using the duplicated method. The keep='first' ensures the first occurrence is not considered a duplicate. The cumsum() function then provides an incremental number for each subsequent duplicate which is used to determine the number of dots to append.
x + ('.' * <count>): For each value in the group, it appends the calculated number of dots based on the cumulative sum of duplicates.

import tkinter as tk
from tkinter import filedialog
import pandas as pd

def load_excel():
    global df
    file_path = filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx")])
    if file_path:
        df = pd.read_excel(file_path)
        label_load.pack()

def process_excel():
    label_load.pack_forget()
    for col in df.columns:
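        # Cast to str so numeric cells can take dots, then map each count to that many dots.
        df[col] = df.groupby(col)[col].transform(
            lambda x: x.astype(str) + x.duplicated(keep='first').cumsum().map(lambda n: '.' * n)
        )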
    df.to_excel("output.xlsx", index=False)
    label_done.pack()

root = tk.Tk()
root.geometry('700x400')
button = tk.Button(root, text="Load Excel", command=load_excel, font=('sans', 16))
button.pack(pady=20)
label_load = tk.Label(root, text="Excel file loaded", font=('sans', 16))
button = tk.Button(root, text="Process it", command=process_excel, font=('sans', 16))
button.pack(pady=20)
label_done = tk.Label(root, text="Process Done", font=('sans', 16))
root.mainloop()
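
The per-column idea can also be tried without the GUI. The following is a rough sketch under assumptions: mark_column_duplicates and the sample frame are hypothetical names and data, and groupby(...).cumcount() is swapped in for duplicated().cumsum() because it yields the same 0, 1, 2, ... numbering within each group of equal values.

```python
import pandas as pd

def mark_column_duplicates(frame: pd.DataFrame) -> pd.DataFrame:
    """Append dots to repeated values, column by column (hypothetical helper)."""
    out = frame.astype(object)  # object dtype lets numeric columns hold strings
    for col in frame.columns:
        values = frame[col].dropna()
        # 0 for the first occurrence of a value, 1 for the second, and so on.
        extra = values.groupby(values).cumcount()
        repeated = extra > 0
        if repeated.any():
            out.loc[extra.index[repeated], col] = [
                f"{v}{'.' * n}" for v, n in zip(values[repeated], extra[repeated])
            ]
    return out

# Hypothetical sample data.
sample = pd.DataFrame({"name": ["A", "B", "A", "A"], "code": [1, 2, 2, 3]})
marked = mark_column_duplicates(sample)
print(marked["name"].tolist())  # ['A', 'B', 'A.', 'A..']
print(marked["code"].tolist())  # [1, 2, '2.', 3]
```

Unlike the transform version, this sketch only converts the cells that actually receive dots, so untouched numeric cells stay numeric.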

Approach #2

• Use duplicated across the entire DataFrame and iterate only where duplicates are found.

import tkinter as tk
from tkinter import filedialog
import pandas as pd

def load_excel():
    global df
    file_path = filedialog.askopenfilename(filetypes=[("Excel Files", "*.xlsx")])
    if file_path:
        df = pd.read_excel(file_path)
        label_load.pack()

def process_excel():
    label_load.pack_forget()
    result = df.astype(object)
    # Flatten every non-empty cell into one Series indexed by (row, column).
    cells = result.stack().dropna()
    duplicates = cells[cells.duplicated(keep=False)]
    # Occurrence number (1, 2, 3, ...) of each repeated value, in row-major order.
    occurrence = duplicates.groupby(duplicates).cumcount() + 1
    for (index, col), value, n in zip(duplicates.index, duplicates, occurrence):
        result.at[index, col] = f"{value}{'.' * n}"
    result.to_excel("output.xlsx", index=False)
    label_done.pack()

root = tk.Tk()
root.geometry('700x400')
button = tk.Button(root, text="Load Excel", command=load_excel, font=('sans', 16))
button.pack(pady=20)
label_load = tk.Label(root, text="Excel file loaded", font=('sans', 16))
button = tk.Button(root, text="Process it", command=process_excel, font=('sans', 16))
button.pack(pady=20)
label_done = tk.Label(root, text="Process Done", font=('sans', 16))
root.mainloop()

Explanation:

result.stack().dropna(): Flattens every non-empty cell into a single Series indexed by (row, column), so duplicates are found across the whole DataFrame rather than within one column or one row.
cells[cells.duplicated(keep=False)]: Keeps every cell whose value occurs more than once, including the first occurrence.
duplicates.groupby(duplicates).cumcount() + 1: Within the duplicates, counts the cumulative occurrence of each value to determine the number of dots.
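
Since the duplicated cells in the question can sit anywhere in the sheet, the whole-frame idea can be checked on a tiny frame without tkinter or an Excel file. This is only a sketch under assumptions: mark_frame_duplicates and the sample data are hypothetical, and it follows the "A., A.., A..." numbering described in the question's comments.

```python
import pandas as pd

def mark_frame_duplicates(frame: pd.DataFrame) -> pd.DataFrame:
    """Append 1, 2, 3, ... dots to values repeated anywhere in the frame (hypothetical helper)."""
    out = frame.astype(object)
    # One Series of all non-empty cells, indexed by (row label, column label).
    cells = out.stack().dropna()
    dupes = cells[cells.duplicated(keep=False)]
    # Occurrence number of each repeated value, counted in row-major order.
    nth = dupes.groupby(dupes).cumcount() + 1
    for (row, col), value, n in zip(dupes.index, dupes, nth):
        out.at[row, col] = f"{value}{'.' * n}"
    return out

# Hypothetical sample data: 'A' appears three times across two columns.
sample = pd.DataFrame({"x": ["A", "B", "C"], "y": ["D", "A", "A"]})
marked = mark_frame_duplicates(sample)
print(marked.values.tolist())  # [['A.', 'D'], ['B', 'A..'], ['C', 'A...']]
```

Note that this variant overwrites the duplicated cell itself; writing into the neighbouring column, as the original script does, would need an extra bounds check for the last column.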

Thanks for your kind comment, S. The duplicated cells appear all through the dataframe, not in one column.
– peternish, Commented Oct 5, 2023 at 2:50
