
I have a JSON array that may contain duplicate item/location pairs, and I want to keep only the entry with the highest risk_level for each pair (and only one of them):

[{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'Low'
#Other values are omitted
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'Moderate'
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemtwo',
'location': 'locationtwo',
'risk_level': 'Low'
}]

I have tried converting it into a pandas DataFrame, sorting it by risk_level and using drop_duplicates. However, the round trip through pandas mangles other values in the JSON (e.g. None becomes NaN, ints become floats), so I don't think it's feasible:

    # Convert to a DataFrame and drop duplicate insights, keeping the highest severity
    # (note: sorting the string column is alphabetical; 'High' < 'Low' < 'Moderate'
    # only coincidentally puts the highest risk first here)
    dfInsights = pd.DataFrame(response['data'])
    dfInsights = dfInsights.reindex(columns=list(response['data'][0].keys()))
    dfInsights.sort_values(['risk_level'], inplace=True)
    dfInsights.drop_duplicates(['item','location'], keep='first', inplace=True)
    dfToJSON = dfInsights.to_dict(orient='records')
    dfToJSON = dfInsights.to_dict(orient='records')

I would like the result to be:

[{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemtwo',
'location': 'locationtwo',
'risk_level': 'Low'
}]
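One way to avoid the DataFrame round-trip entirely is a single pass over the list with a dict keyed by `(item, location)`, which leaves every record (including its `None` and int values) untouched. A minimal sketch, assuming the three risk names shown above; the `RISK_RANK` mapping and `dedupe_highest_risk` name are illustrative, not part of the original code:

```python
# Rank risk levels so that a higher risk compares greater (assumed ordering).
RISK_RANK = {'Low': 0, 'Moderate': 1, 'High': 2}

def dedupe_highest_risk(records):
    """Keep one record per (item, location), preferring the highest risk_level."""
    best = {}
    for rec in records:
        key = (rec['item'], rec['location'])
        # Keep the record if its key is new or its risk beats the stored one.
        if key not in best or RISK_RANK[rec['risk_level']] > RISK_RANK[best[key]['risk_level']]:
            best[key] = rec
    return list(best.values())

records = [
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'Low'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'High'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'Moderate'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'High'},
    {'item': 'itemtwo', 'location': 'locationtwo', 'risk_level': 'Low'},
]
print(dedupe_highest_risk(records))
```

Ties (two 'High' entries for the same key) keep the first one seen, which matches the desired output.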
  • It would take a few passes. The first pass would create a new list holding information about the original list: the index, location and risk_level (converted to an integer for sorting). The second would sort this new list by location and then by risk_level (integer, descending). The third would iterate through the sorted list, keep track of whether the location is a repeat of the prior entry's, and flag repeats for deletion. The fourth would grab all entries from the original list that are not flagged for deletion. Commented Jul 18, 2019 at 3:32
  • Thanks Timothy, your suggestion helped. It was rather fiddly, but I got the correct output eventually; I will paste the code shortly. Commented Jul 18, 2019 at 4:34
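The passes described in the comment above can be sketched directly (a decorate/sort/flag/collect approach; the `levels` list and variable names are illustrative assumptions, and only the three risk names from the question are handled):

```python
# Assumed ordering: position in the list is the integer risk rank.
levels = ['Low', 'Moderate', 'High']

original = [
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'Low'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'High'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'Moderate'},
    {'item': 'itemone', 'location': 'locationone', 'risk_level': 'High'},
    {'item': 'itemtwo', 'location': 'locationtwo', 'risk_level': 'Low'},
]

# Pass 1: decorate each entry with its index, key and integer risk.
decorated = [(i, (d['item'], d['location']), levels.index(d['risk_level']))
             for i, d in enumerate(original)]
# Pass 2: sort by key, then by risk descending.
decorated.sort(key=lambda t: (t[1], -t[2]))
# Pass 3: keep only the first (highest-risk) entry of each key; repeats are dropped.
keep, prev_key = set(), None
for i, key, _ in decorated:
    if key != prev_key:
        keep.add(i)
    prev_key = key
# Pass 4: collect unflagged entries from the original list, in original order.
result = [d for i, d in enumerate(original) if i in keep]
print(result)
```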

2 Answers


You can utilize itertools.groupby with a custom key function based on weights:

d = [{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'Low'
#Other values are omitted
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'Moderate'
},
{
'item': 'itemone',
'location': 'locationone',
'risk_level': 'High'
},
{
'item': 'itemtwo',
'location': 'locationtwo',
'risk_level': 'Low'
}]

from itertools import groupby
from operator import itemgetter

f = itemgetter('item', 'location')
# Lower weight = higher risk, so the highest-risk record sorts first in each group
weights = {'Low': 2, 'Moderate': 1, 'High': 0}

out = []
# Sort by (item, location) and then by weight; groupby then yields each
# duplicate group with its highest-risk record first, which next(g) takes
for v, g in groupby(sorted(d, key=lambda k: (f(k), weights[k['risk_level']])), key=f):
    out.append(next(g))

from pprint import pprint
pprint(out, width=30)

Prints:

[{'item': 'itemone',
  'location': 'locationone',
  'risk_level': 'High'},
 {'item': 'itemtwo',
  'location': 'locationtwo',
  'risk_level': 'Low'}]



Below is the solution, thanks to Timothy's help:

import unittest

class TestRemoveDuplicates(unittest.TestCase):
    def setUp(self):
        pass

    def filter_dups(self, curr_doc, filtered_docs):
        # Return False if an equal-or-higher-risk record for the same
        # (item, location) pair has already been kept
        for docs in filtered_docs:
            if (curr_doc['item'] == docs['item'] and curr_doc['location'] == docs['location']):
                if curr_doc['risk_level'] <= docs['risk_level']:
                    return False
        return True

    def test_json(self):
        levels = [None, 'Low', 'Moderate', 'High', 'Critical']

        test_json = [
                    {
                        'item': 'itemone',
                        'location': 'locationone',
                        'risk_level': 'Low'
                        #Other values are omitted
                    },
                    {
                        'item': 'itemone',
                        'location': 'locationone',
                        'risk_level': 'High'
                    },
                    {
                        'item': 'itemone',
                        'location': 'locationone',
                        'risk_level': 'Moderate'
                    },
                    {
                        'item': 'itemone',
                        'location': 'locationone',
                        'risk_level': 'High'
                    },
                    {
                        'item': 'itemtwo',
                        'location': 'locationtwo',
                        'risk_level': 'Low'
                    }
                    ]

        risk_conv_json = []

        for docs in test_json:
            docs['risk_level'] = levels.index(docs['risk_level'])
            risk_conv_json.append(docs)

        sorted_json = (sorted(risk_conv_json, key=lambda x : x['risk_level'], reverse=True))

        filtered_json = []

        for curr_sorted_doc in sorted_json:
            if self.filter_dups(curr_sorted_doc, filtered_json):
                filtered_json.append(curr_sorted_doc)

        output_json = []

        for docs in filtered_json:
            docs['risk_level'] = levels[docs['risk_level']]
            output_json.append(docs)

        print(output_json)

    def tearDown(self):
        pass

