
I have the following Python function that exports JSON data to a CSV file. It works fine - the keys (CSV headers) and values (CSV rows) are populated in the CSV - but I'm trying to remove the duplicate rows from the CSV file.

Instead of manually removing them in Excel, how do I remove the duplicate rows in Python?

    import csv

    def toCSV(res):
        with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
            dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
            dict_writer.writeheader()
            for r in res:
                dict_writer.writerow(r)

Thank you

For example, in the CSV the rows with the apmsgfwd.exe information are duplicated.

Duplicate data below:

process_hash    process_name    process_effective_reputation
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['73ca11f2acf1adb7802c2914e1026db899a3c851cd9500378c0045e0']    c:\users\zdr3dds01\documents\sap\sap gui\export.mhtml   NOT_LISTED
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['582f018bc7a732d63f624d6f92b3d143', '66505bcb9975d61af14dd09cddd9ac0d11a3e2b5ae41845c65117e7e2b046d37']    c:\users\jij09\appdata\local\kingsoft\power word 2016\2016.3.3.0368\powerword.exe   ADAPTIVE_WHITE_LIST

JSON data:

[{'device_name': 'fk6sdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b1bvf6e17ee11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a', 'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
  'process_name': 'c:\\program files (x86)\\toh122soft\\thcasdf3\\toho34rce.exe',
  'process_username': ['JOHN\\user1']},
 {'device_name': 'fk6sdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b151f6e17ee11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a', 'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
  'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe',
  'process_username': ['JOHN\\user2']},
 {'device_name': '6asdsdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b151f698e11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a', 'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'],
  'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe',
  'process_username': ['JOHN\\user3']}]
  • Share an example of the CSV and define 'duplicate'. Commented Nov 5, 2020 at 15:14
  • Thanks balderman, I've added a screenshot of the CSV file. Commented Nov 5, 2020 at 15:21
  • Please do not upload images - add the CSV (or a subset of it) as text. Commented Nov 5, 2020 at 15:22
  • Why you shouldn't upload images of text: meta.stackoverflow.com/a/285557/843953 Commented Nov 5, 2020 at 15:24
  • Sorry about that, I've added the CSV data as text. Commented Nov 5, 2020 at 15:28

2 Answers


Is it necessary to use the above approach? If not, I usually use the pandas library for reading CSV files.

import pandas as pd

# Read the CSV produced by toCSV, drop duplicate rows, and write the result back out.
data = pd.read_csv('EnrichedEvents.csv')
data.drop_duplicates(inplace=True)

data.to_csv('output.csv', index=False)
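
If res is still available in memory as the list of dicts from the question, a rough sketch of the same pandas idea applied before writing the CSV might look like this (assuming the three columns from the question; list-valued fields such as process_hash are joined into strings first, since drop_duplicates cannot hash lists):

import pandas as pd

# Sketch only: `res` is assumed to be the list of dicts shown in the question.
df = pd.DataFrame(res)[['process_hash', 'process_name', 'process_effective_reputation']]

# drop_duplicates() cannot hash list values, so flatten the hash list into one string.
df['process_hash'] = df['process_hash'].apply(
    lambda v: ','.join(v) if isinstance(v, list) else v)

df = df.drop_duplicates()
df.to_csv('EnrichedEvents.csv', index=False)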



Below is a standalone example that shows how to filter out duplicates. The idea is to take the values of each dict and convert them into a tuple; using a set, we can then skip rows that have already been written.

import csv

csv_columns = ['No', 'Name', 'Country']
dict_data = [
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 2, 'Name': 'Ben', 'Country': ['USA']},
]
csv_file = "Names.csv"

with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    entries = set()
    for data in dict_data:
        # Build a hashable key: join list values into a string, leave the rest as-is.
        val = tuple(','.join(v) if isinstance(v, list) else v for v in data.values())
        if val not in entries:
            writer.writerow(data)
            entries.add(val)
print('done')

Names.csv

No,Name,Country
1,Alex,['India']
2,Ben,['USA']
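
Applied to the toCSV function from the question, the same idea might look like the sketch below (an untested adaptation, assuming res is the list of dicts shown above; list values such as process_hash are joined into strings so the key tuples stay hashable, which also addresses the "unhashable type" errors mentioned in the comments):

import csv

def toCSV(res):
    fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        seen = set()
        for r in res:
            # Build a hashable key from the three columns; join lists into strings.
            key = tuple(','.join(v) if isinstance(v, list) else v
                        for v in (r.get(f) for f in fieldnames))
            if key not in seen:
                seen.add(key)
                dict_writer.writerow(r)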

6 Comments

Thanks balderman, sorry I should have mentioned earlier: how do I use your approach if the 'res' variable holds a list of dicts (JSON data)?
@user3704597 did you try it?
Hi balderman, I get the following error: if r not in r_set: TypeError: unhashable type: 'dict'
Thanks for helping me out balderman, I tried the new code above, but now I'm getting the error below (I think it's because there are values of list type - for example 'India' is inside ['India']): if val not in entries: TypeError: unhashable type: 'list'
@PranavHosangadi OK - explanations added.
