
I'm fairly new to Python and I'm trying to figure out how to find all duplicates within a JSON file. So far I've written this Python script to open, read, and parse the JSON report. I need a way to find all potential duplicate transactions and print a line for each one containing the date, amount, description, and transaction ID. Please let me know if I'm on the right path; any suggestions or pointers would help.

import json

#Opens and parses the formatted JSON file - the report contains items, accounts and transactions.
with open("42525022_formatted-1.json", "r") as file_handle:
    parsed = json.load(file_handle)
transactions = parsed["report"]["items"][0]["accounts"][0]["transactions"]

transactions_by_date ={}

for txn in transactions:
    date = txn["date"]
    description = txn["original_description"]
    if date not in transactions_by_date:
        transactions_by_date[date] = []
    transactions_by_date[date].append(
        {
            "amount": txn["amount"],
            "description": txn["original_description"],
            "transaction_id": txn["transaction_id"]
        }
    )    
#Ignored    
#print(txn["date"] + "\n"  + str(txn["amount"]))
#print(transactions_by_date)

for date in transactions_by_date:
    transactions = transactions_by_date[date]
    print(transactions)
    break

#Objective
#Print all duplicates within a calendar date should have date, amount, description and transactionID 
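You're on the right path with the dictionary approach. One way to finish it is to group on a key of (date, amount, description) instead of date alone, so that any group holding more than one transaction is a potential duplicate. Below is a minimal sketch of that idea; the `find_duplicates` helper name is mine, and the field names assume the keys shown in the example JSON below:

```python
from collections import defaultdict

def find_duplicates(transactions):
    """Return (date, amount, description, transaction_id) tuples for every
    transaction that shares its date, amount, and description with another."""
    groups = defaultdict(list)
    for txn in transactions:
        # Transactions with identical date, amount, and description fall
        # into the same group; the transaction IDs are kept for printing.
        key = (txn["date"], txn["amount"], txn["original_description"])
        groups[key].append(txn["transaction_id"])

    duplicates = []
    for (date, amount, description), ids in groups.items():
        if len(ids) > 1:  # more than one transaction in the group
            for transaction_id in ids:
                duplicates.append((date, amount, description, transaction_id))
    return duplicates
```

With your parsed file you would call `find_duplicates(transactions)` and print each returned tuple on its own line.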

Example JSON File Contents

                "account_id": "zbbbZEdzo4iZbed98AbzHeqr3VX0NztOBQgZe",
                "amount": 0,
                "date": "2022-07-02",
                "iso_currency_code": "USD",
                "original_description": "GOOGLE *ADS598329",
                "pending": true,
                "transaction_id": "1XXX9XbVRKHj8eN66",
                "unofficial_currency_code": null
              },

1 Answer


Would just detecting a duplicate ID be sufficient, or is there a chance there are multiple transactions with the same ID but differing values for the other attributes? I know you asked about achieving this with a Python dictionary, but an additional tool would help here.
I would suggest using a library like pandas. Then you can think of your data as a spreadsheet.

import pandas as pd
df = pd.DataFrame(transactions)
duplicates = df.duplicated()

Check out the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
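To expand on the snippet above: `df.duplicated()` returns a boolean mask, so you still need to select the matching rows to print them. Here is a sketch with hypothetical sample rows (column names mirror the JSON fields from the question; the data itself is made up). Passing `keep=False` marks every member of a duplicated group, not just the second and later occurrences:

```python
import pandas as pd

# Hypothetical sample data shaped like the question's transactions.
transactions = [
    {"date": "2022-07-02", "amount": 0,
     "original_description": "GOOGLE *ADS598329", "transaction_id": "t1"},
    {"date": "2022-07-02", "amount": 0,
     "original_description": "GOOGLE *ADS598329", "transaction_id": "t2"},
    {"date": "2022-07-03", "amount": 12.5,
     "original_description": "COFFEE SHOP", "transaction_id": "t3"},
]
df = pd.DataFrame(transactions)

# Compare only date/amount/description; transaction_id is excluded from the
# subset because it is unique per row. keep=False flags all copies.
mask = df.duplicated(subset=["date", "amount", "original_description"],
                     keep=False)
duplicates = df.loc[mask, ["date", "amount", "original_description",
                           "transaction_id"]]
print(duplicates)
```

In a real script you would build the DataFrame from the parsed JSON (`pd.DataFrame(parsed["report"]["items"][0]["accounts"][0]["transactions"])`) instead of the sample list.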


2 Comments

No, I am trying to detect and print the following fields: date, amount, description, and transaction ID. Is this possible with the pandas library?
Yes. You can specify which columns to compare via the subset argument to the duplicated method, e.g. duplicates = df.duplicated(subset=['date', 'amount', 'original_description']). Note that the names must match the DataFrame's columns (which come from the JSON keys), and transaction_id is best left out of the subset since it is unique per row. This way it will only check whether rows match on those columns.
