
I know there are a million questions like this, but I can't find an answer that works for me.

I have this:

list1 =   [{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']}, {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']}, {'assembly_id':2,'asym_id_list':['D,C'],'auth_id_list':['C','V']}]

If the assembly_ids are the same, I want to combine the values of the other shared keys in the dicts.

In this example, assembly_id 1 appears twice, so the input above would turn into:

[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}]

In theory there can be n assembly_ids (i.e. assembly 1 could appear in the list 10 or 20 times, not just 2) and there can be up to two other lists to combine (asym_id_list and auth_id_list).

I was looking at this method:

new_dict = {}
assembly_list = [] #to keep track of assemblies already seen
for dict_name in list1: #for each dict in the list
        if dict_name['assembly_id'] not in assembly_list: #if the assembly id is new
                new_dict['assembly_id'] = dict_name #this line is wrong, add the entry to new_dict
                assembly_list.append(new_dict['assembly_id']) #append the id to 'assembly_list'
        else:
                new_dict['assembly_id'].append(dict_name) #else if it's already seen, append the dictionaries together, this is wrong
print(new_dict)

The output is wrong:

{'assembly_id': {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}}

But I think the idea is right, that I should open a new list and dict, and if not seen before, append; whereas if it has been seen before...combine? But it's just the specifics I'm not getting?

  • Reconsider the outermost loop: you want to get back a list of dictionaries. You can append dict_name to that list if dict_name['assembly_id'] was not seen before, and just add to the lists 'asym_id_list' and 'auth_id_list' if it was seen before. Commented Jun 4, 2020 at 15:48
  • I think that you have a couple of typos in your example data. I think that you meant the 'D,C' in 'assembly_id':2,'asym_id_list':['D,C'] to be separate strings like this: 'assembly_id':2,'asym_id_list':['D', 'C']. Also, in the 'assembly_id' values you have a mixture of strings and ints (i.e. '1' and 2). Although that will work, I am guessing that you did not intend the ids to be a mixture of ints and strings. Commented Jun 4, 2020 at 18:53

3 Answers


Your reasoning is on the right track. We can use a dictionary m that maps each assembly_id to its merged dictionary, which also keeps track of the assembly_ids already visited. Whenever a new assembly_id is encountered, we add it to m; otherwise, if m already contains that assembly_id, we just extend the asym_id_list and auth_id_list for it:

def merge(dicts):
    m = {} # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id'] # assembly_id is used as merge/grouping key
        if key in m:
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', []) + d['asym_id_list']
            elif 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', []) + d['auth_id_list']
        else:
            m[key] = d
    return list(m.values())

Result:

# merge(list1)
[
    {
        'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']
    },
    {
        'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']
    }
]

1 Comment

You have a slight error in this code. The elif 'auth_id_list' in d: should not be an elif. The line should be if 'auth_id_list' in d:, since an entry can have both 'asym_id_list' and 'auth_id_list'.
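
To illustrate the point in the comment above, here is a minimal sketch of a list-based variant that checks both fields independently. The name merge_fixed and the LIST_KEYS tuple are hypothetical, and the deep copy is an added assumption so that the input dictionaries are not modified; none of this is the answerer's own code.

```python
from copy import deepcopy

# Assumption: these are the only list-valued fields that need combining.
LIST_KEYS = ('asym_id_list', 'auth_id_list')

def merge_fixed(dicts):
    merged = {}  # maps assembly_id -> merged dictionary
    for d in dicts:
        key = d['assembly_id']
        if key not in merged:
            # First time this id is seen: store a copy so the input stays intact.
            merged[key] = deepcopy(d)
            continue
        # Id already seen: check every list field independently (no elif).
        for name in LIST_KEYS:
            if name in d:
                merged[key].setdefault(name, []).extend(d[name])
    return list(merged.values())

list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]
print(merge_fixed(list1))
```

With this variant, a later entry that carries both 'asym_id_list' and 'auth_id_list' contributes to both merged lists.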

Use a dict keyed on assembly_id to collect all the data for a given key; you can then go back and generate a list of dicts in the original format if needed.

>>> from collections import defaultdict
>>> from typing import Dict, List
>>> id_lists: Dict[str, List[str]] = defaultdict(list)
>>> for d in list1:
...     id_lists[d['assembly_id']].extend(d['asym_id_list'])
...
>>> combined_list = [{
...     'assembly_id': id, 'asym_id_list': id_list
... } for id, id_list in id_lists.items()]
>>> combined_list
[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C']}]
>>>

(edit) didn't see the bit about auth_id_lists because it's hidden in the scroll in the original code -- same strategy applies, just either use two dicts in the first step or make it a dict of some collection of lists (e.g. a dict of dicts of lists, with the outer dict keyed on assembly_id values and the inner dict keyed on the original field name).
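
The "dict of dicts of lists" idea from the edit could look roughly like the sketch below. The function name combine and its id_key parameter are hypothetical, and the sketch assumes that every field other than the id holds a list.

```python
from collections import defaultdict

def combine(dicts, id_key='assembly_id'):
    # Outer dict is keyed on the assembly_id value,
    # inner dict is keyed on the original field name.
    grouped = defaultdict(lambda: defaultdict(list))
    for d in dicts:
        for field, values in d.items():
            if field != id_key:
                # Assumption: every non-id field is a list.
                grouped[d[id_key]][field].extend(values)
    # Convert back to the original list-of-dicts format.
    return [{id_key: key, **fields} for key, fields in grouped.items()]

list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]
print(combine(list1))
```

Because the inner dict is keyed on the field name, this handles 'auth_id_list' (or any further list field) without naming it explicitly.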


@Samwise has provided a good answer to the question you asked and this is not intended to replace that. However, I am going to make a suggestion about the way you keep the data after the merge. I would put this in a comment, but there is no way to keep code formatting in a comment and it is a bit too big as well.

Before that, I think that you have a typo in your example data. I think that you meant the 'D,C' in 'assembly_id':2,'asym_id_list':['D,C'] to be separate strings like this: 'assembly_id':2,'asym_id_list':['D', 'C']. I am going to assume that below, but if not it does not change any of the code or comments.

Instead of the merged structure being a list of dictionaries like this:

merge_l = [
            {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          ]

I would recommend not using a list as the top-level structure, and using a dictionary keyed by the value of the assembly_id. So it would be a dictionary whose values are dictionaries, like this:

merge_d = { '1': {'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            '2': {'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          }

or if you want to keep the 'assembly_id' as well, like this:

merge_d = { '1': {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            '2': {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          }

That last one can be achieved by changing the return from @Samwise's merge() method to just return m instead of converting the dict to a list.

One other comment on @Samwise's code, just so you are aware of it, is that the combined lists can contain duplicates. So if the original data had 'asym_id_list': ['A', 'B'] in one entry and 'asym_id_list': ['B', 'C'] in another, the combined list would be 'asym_id_list': ['A', 'B', 'B', 'C']. That could be what you want, but if you want to avoid it, you could use sets instead of lists as the internal containers for the asym_id and auth_id values.

In @Samwise's answer, change it to something like this:

def merge(dicts):
    m = {} # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id'] # assembly_id is used as merge/grouping key
        if key in m:
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', set()) | set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', set()) | set(d['auth_id_list'])
        else:
            m[key] = {'assembly_id': d['assembly_id']}
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = set(d['auth_id_list'])
    return m

If you go this way, you might want to reconsider the key names 'asym_id_list' and 'auth_id_list' since they are sets not lists. But that may be constrained by the other code around this and what it expects.
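
One thing to keep in mind with sets is that they do not preserve the order in which the ids were first seen. If that order matters, a possible alternative (a sketch only, with the hypothetical helper name dedupe_keep_order) is to keep lists and drop duplicates with dict.fromkeys, which relies on dicts preserving insertion order in Python 3.7+:

```python
def dedupe_keep_order(items):
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this drops repeats while keeping the first occurrence of each item.
    return list(dict.fromkeys(items))

combined = ['A', 'B'] + ['B', 'C']
print(dedupe_keep_order(combined))  # ['A', 'B', 'C']
```

This keeps the key names 'asym_id_list' and 'auth_id_list' accurate, since the values stay lists.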
