
I have the following Python function that exports JSON data to a CSV file. It works fine - the keys (CSV headers) and values (CSV rows) are populated in the CSV - but I'm trying to remove the duplicate rows from the CSV file.

Instead of manually removing them in Excel, how do I remove the duplicate rows in Python?

    import csv

    def toCSV(res):
        with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
            dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
            dict_writer.writeheader()
            for r in res:
                dict_writer.writerow(r)

Thank you

For example, in the CSV the rows with the apmsgfwd.exe information are duplicated.

Duplicate data below:

process_hash    process_name    process_effective_reputation
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['73ca11f2acf1adb7802c2914e1026db899a3c851cd9500378c0045e0']    c:\users\zdr3dds01\documents\sap\sap gui\export.mhtml   NOT_LISTED
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['f810a809e9cdf70c3189008e07c83619', '58d44528b60d36b515359fe234c9332ccef6937f5c950472230ce15dca8812e2']    c:\windows\system32\delltpad\apmsgfwd.exe   ADAPTIVE_WHITE_LIST
['582f018bc7a732d63f624d6f92b3d143', '66505bcb9975d61af14dd09cddd9ac0d11a3e2b5ae41845c65117e7e2b046d37']    c:\users\jij09\appdata\local\kingsoft\power word 2016\2016.3.3.0368\powerword.exe   ADAPTIVE_WHITE_LIST

JSON data:

[{'device_name': 'fk6sdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b1bvf6e17ee11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['bfc7dcf5935830f3a9df8e9b6425c37a', 'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
  'process_name': 'c:\\program files (x86)\\toh122soft\\thcasdf3\\toho34rce.exe',
  'process_username': ['JOHN\\user1']},
 {'device_name': 'fk6sdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b151f6e17ee11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['bfc7dcf5935f3a9df8e9b6830425c37a', 'ca9f3a24506cc518fc939a33c100b2d557f96e040f712f6dd4641ad1734e2f19'],
  'process_name': 'c:\\program files (x86)\\oft\\tf3\\tootsice.exe',
  'process_username': ['JOHN\\user2']},
 {'device_name': '6asdsdc2',
  'device_timestamp': '2020-10-27T00:50:46.176Z',
  'event_id': '9b151f698e11eb81b',
  'process_effective_reputation': 'LIST',
  'process_hash': ['9df8ebfc7dcf5935830f3a9b6425c37a', 'ca9f3a24506cc518ff6ddc939a33c100b2d557f96e040f7124641ad1734e2f19'],
  'process_name': 'c:\\program files (x86)\\toht\\th3\\tohce.exe',
  'process_username': ['JOHN\\user3']}]
  • Share an example of the CSV and define 'duplicate'. Commented Nov 5, 2020 at 15:14
  • Thanks balderman, I've added a screenshot of the CSV file. Commented Nov 5, 2020 at 15:21
  • Please do not upload images - add the CSV (or a subset of it) as text. Commented Nov 5, 2020 at 15:22
  • Why you shouldn't upload images of text: meta.stackoverflow.com/a/285557/843953 Commented Nov 5, 2020 at 15:24
  • Sorry about that, I've added the CSV data as text. Commented Nov 5, 2020 at 15:28

2 Answers


Is it necessary to use the above approach? If not, I usually use the pandas library for reading CSV files.

import pandas as pd

# Read the CSV produced by toCSV, drop duplicate rows, and write the result back out.
data = pd.read_csv('EnrichedEvents.csv')
data.drop_duplicates(inplace=True)

data.to_csv('output.csv', index=False)
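
If res is still available in memory as the list of dicts from the question, a rough sketch of the same pandas idea applied before writing the CSV might look like this (assuming the three columns from the question; list-valued fields such as process_hash are joined into strings first, since drop_duplicates cannot hash lists):

import pandas as pd

# Sketch only: `res` is assumed to be the list of dicts shown in the question.
df = pd.DataFrame(res)[['process_hash', 'process_name', 'process_effective_reputation']]

# drop_duplicates() cannot hash list values, so flatten the hash list into one string.
df['process_hash'] = df['process_hash'].apply(
    lambda v: ','.join(v) if isinstance(v, list) else v)

df = df.drop_duplicates()
df.to_csv('EnrichedEvents.csv', index=False)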



Below is a standalone example that shows how to filter out duplicates. The idea is to take the values of each dict and convert them into a tuple; using a set, we can then skip rows that have already been written.

import csv

csv_columns = ['No', 'Name', 'Country']
dict_data = [
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 1, 'Name': 'Alex', 'Country': ['India']},
    {'No': 2, 'Name': 'Ben', 'Country': ['USA']},
]
csv_file = "Names.csv"

with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    entries = set()
    for data in dict_data:
        # Build a hashable key: join list values into a string, leave the rest as-is.
        val = tuple(','.join(v) if isinstance(v, list) else v for v in data.values())
        if val not in entries:
            writer.writerow(data)
            entries.add(val)
print('done')

Names.csv

No,Name,Country
1,Alex,['India']
2,Ben,['USA']
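
Applied to the toCSV function from the question, the same idea might look like the sketch below (an untested adaptation, assuming res is the list of dicts shown above; list values such as process_hash are joined into strings so the key tuples stay hashable, which also addresses the "unhashable type" errors mentioned in the comments):

import csv

def toCSV(res):
    fieldnames = ['process_hash', 'process_name', 'process_effective_reputation']
    with open('EnrichedEvents.csv', 'w', newline='', encoding='utf-8') as csvfile:
        dict_writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        dict_writer.writeheader()
        seen = set()
        for r in res:
            # Build a hashable key from the three columns; join lists into strings.
            key = tuple(','.join(v) if isinstance(v, list) else v
                        for v in (r.get(f) for f in fieldnames))
            if key not in seen:
                seen.add(key)
                dict_writer.writerow(r)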

6 Comments

Thanks balderman, sorry I should have mentioned earlier: how do I use your approach if the 'res' variable holds a list of dicts (JSON data)?
@user3704597 did you try it?
Hi balderman, I get the following error: if r not in r_set: TypeError: unhashable type: 'dict'
Thanks for helping me out balderman, I tried the new code above, but now I'm getting the error below (I think it's because there are values of list type - for example 'India' is inside ['India']): if val not in entries: TypeError: unhashable type: 'list'
@PranavHosangadi OK - explanations added.
