
I have a JSON array that may contain duplicate item/location pairs, and I want to keep only the entry with the highest risk_level for each pair (and only one of them):

[{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'Low'
#Other values are omitted
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'Moderate'
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemtwo',
'location': 'locationtwo',
'risk_level': 'Low'
}]

I have tried converting it into a pandas DataFrame, sorting it by risk_level and using drop_duplicates. However, the round trip through pandas mangles other values in the JSON (e.g. None becomes NaN, ints become floats), so I don't think it's feasible:

    # Convert to a DataFrame and drop duplicate insights, keeping the highest severity
    # (note: sorting the string column is alphabetical; 'High' < 'Low' < 'Moderate'
    # only coincidentally puts the highest risk first here)
    dfInsights = pd.DataFrame(response['data'])
    dfInsights = dfInsights.reindex(columns=list(response['data'][0].keys()))
    dfInsights.sort_values(['risk_level'], inplace=True)
    dfInsights.drop_duplicates(['item','location'], keep='first', inplace=True)
    dfToJSON = dfInsights.to_dict(orient='records')
    dfToJSON = dfInsights.to_dict(orient='records')

I would like the result to be:

[{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemtwo',
'location': 'locationtwo',
'risk_level': 'Low'
}]
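One way to avoid the DataFrame round-trip entirely is a single pass over the list with a dict keyed by `(item, location)`, which leaves every record (including its `None` and int values) untouched. A minimal sketch, assuming the three risk names shown above; the `RISK_RANK` mapping and `dedupe_highest_risk` name are illustrative, not part of the original code:

```python
# Rank risk levels so that a higher risk compares greater (assumed ordering).
RISK_RANK = {'Low': 0, 'Moderate': 1, 'High': 2}

def dedupe_highest_risk(records):
    """Keep one record per (item, location), preferring the highest risk_level."""
    best = {}
    for rec in records:
        key = (rec['item'], rec['location'])
        # Keep the record if its key is new or its risk beats the stored one.
        if key not in best or RISK_RANK[rec['risk_level']] > RISK_RANK[best[key]['risk_level']]:
            best[key] = rec
    return list(best.values())

records = [
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'Low'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'High'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'Moderate'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'High'},
    {'item': 'itemtwo', 'location': 'locationtwo', 'risk_level': 'Low'},
]
print(dedupe_highest_risk(records))
```

Ties (two 'High' entries for the same key) keep the first one seen, which matches the desired output.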
  • It would take a few passes. The first pass would create a new list holding information about the original list: the index, location and risk_level (converted to an integer for sorting). The second would sort this new list by location and then by risk_level (integer, descending). The third would iterate through the sorted list, keep track of whether the location is a repeat of the prior entry's, and flag repeats for deletion. The fourth would grab all entries from the original list that are not flagged for deletion. Commented Jul 18, 2019 at 3:32
  • Thanks Timothy, your suggestion helped. It was rather fiddly, but I got the correct output eventually; I will paste the code shortly. Commented Jul 18, 2019 at 4:34
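The passes described in the comment above can be sketched directly (a decorate/sort/flag/collect approach; the `levels` list and variable names are illustrative assumptions, and only the three risk names from the question are handled):

```python
# Assumed ordering: position in the list is the integer risk rank.
levels = ['Low', 'Moderate', 'High']

original = [
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'Low'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'High'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'Moderate'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'High'},
    {'item': 'itemtwo', 'location': 'locationtwo', 'risk_level': 'Low'},
]

# Pass 1: decorate each entry with its index, key and integer risk.
decorated = [(i, (d['item'], d['location']), levels.index(d['risk_level']))
             for i, d in enumerate(original)]
# Pass 2: sort by key, then by risk descending.
decorated.sort(key=lambda t: (t[1], -t[2]))
# Pass 3: keep only the first (highest-risk) entry of each key; repeats are dropped.
keep, prev_key = set(), None
for i, key, _ in decorated:
    if key != prev_key:
        keep.add(i)
    prev_key = key
# Pass 4: collect unflagged entries from the original list, in original order.
result = [d for i, d in enumerate(original) if i in keep]
print(result)
```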

2 Answers


You can utilize itertools.groupby with a custom key function based on weights:

d = [{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'Low'
#Other values are omitted
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'Moderate'
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemtwo',
'location': 'locationtwo',
'risk_level': 'Low'
}]

from itertools import groupby
from operator import itemgetter

f = itemgetter('item', 'location')
# Lower weight = higher risk, so the highest-risk record sorts first in each group
weights = {'Low': 2, 'Moderate': 1, 'High': 0}

out = []
# Sort by (item, location) and then by weight; groupby then yields each
# duplicate group with its highest-risk record first, which next(g) takes
for v, g in groupby(sorted(d, key=lambda k: (f(k), weights[k['risk_level']])), key=f):
    out.append(next(g))

from pprint import pprint
pprint(out, width=30)

Prints:

[{'item': 'itemone',
  'location': 'locationone',
  'risk_level': 'High'},
 {'item': 'itemtwo',
  'location': 'locationtwo',
  'risk_level': 'Low'}]



Below is the solution, thanks to Timothy's help:

import unittest

class TestRemoveDuplicates(unittest.TestCase):
    def setUp(self):
        pass

    def filter_dups(self, curr_doc, filtered_docs):
        # Return False if an equal-or-higher-risk record for the same
        # (item, location) pair has already been kept
        for docs in filtered_docs:
            if (curr_doc['item'] == docs['item'] and curr_doc['location'] == docs['location']):
                if curr_doc['risk_level'] <= docs['risk_level']:
                    return False
        return True

    def test_json(self):
        levels = [None, 'Low', 'Moderate', 'High', 'Critical']

        test_json = [
                    {
                        'item': 'itemone',
                        'location': 'locationone',
                        'risk_level': 'Low'
                        #Other values are omitted
                    },
                    {
                        'item': 'itemone',
                        'location': 'locationone',
                        'risk_level': 'High'
                    },
                    {
                        'item': 'itemone',
                        'location': 'locationone',
                        'risk_level': 'Moderate'
                    },
                    {
                        'item': 'itemone',
                        'location': 'locationone',
                        'risk_level': 'High'
                    },
                    {
                        'item': 'itemtwo',
                        'location': 'locationtwo',
                        'risk_level': 'Low'
                    }
                    ]

        risk_conv_json = []

        for docs in test_json:
            docs['risk_level'] = levels.index(docs['risk_level'])
            risk_conv_json.append(docs)

        sorted_json = (sorted(risk_conv_json, key=lambda x : x['risk_level'], reverse=True))

        filtered_json = []

        for curr_sorted_doc in sorted_json:
            if self.filter_dups(curr_sorted_doc, filtered_json):
                filtered_json.append(curr_sorted_doc)

        output_json = []

        for docs in filtered_json:
            docs['risk_level'] = levels[docs['risk_level']]
            output_json.append(docs)

        print(output_json)

    def tearDown(self):
        pass

