How to merge duplicate keys in dictionary-like data using Python

Question

I have data in a dictionary-like format in which the same keys are repeated many times, each with a list of strings as its value. I want to merge all keys with the same name, together with their values. The data is not an actual Python dictionary; I call it a dictionary only because of how it is laid out.

#The data looks like this:

"city":["New York", "Paris", "London"],
"country":["India", "France", "Italy"],
"city":["New Delhi", "Tokio", "Wuhan"],
"organisation":["ITC", "Google", "Facebook"],
"country":["Japan", "South Korea", "Germany"],
"organisation":["TATA", "Amazon", "Ford"]

I have thousands of duplicate keys, with some values repeated and some unique, and I want to merge or append the values by key.

#Expected output

"city":["New York", "Paris", "London", "New Delhi", "Tokio", "Wuhan"],
"country":["India", "France", "Italy", "Japan", "South Korea", "Germany"],
"organisation":["ITC", "Google", "Facebook", "TATA", "Amazon", "Ford"],

Can anyone suggest an approach?

By definition this is not a Python dictionary: a dictionary key is unique. Paste your source "dictionary" into Jupyter and, as expected, you get unique keys, each holding the last instance of that key. What is generating the invalid dictionary / JSON? My suggestion is to fix it at the source.
– Rob Raymond, Commented Aug 12, 2021 at 11:36
It wasn't me who downvoted. I was just checking that the requirement is to parse a certain dict-like format. I have posted an answer.
– Rob Raymond, Commented Aug 12, 2021 at 13:57
For background, see this earlier question: stackoverflow.com/questions/68753099/…
– tripleee, Commented Aug 16, 2021 at 11:47
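
The point in the first comment is quick to check. Below is a minimal sketch using a shortened copy of the question's data (the variable name data is only for illustration). Python accepts a dict literal with repeated keys, but each repeat overwrites the earlier value, so any merging has to happen before the text becomes a dict.

```python
# A dict literal with repeated keys is legal syntax, but only the last
# value bound to each key survives.
data = {
    "city": ["New York", "Paris", "London"],
    "country": ["India", "France", "Italy"],
    "city": ["New Delhi", "Tokio", "Wuhan"],
}

print(data["city"])  # the first "city" list is gone
print(len(data))     # two keys remain, not three
```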

Rob Raymond · Accepted Answer · 2021-08-12 14:01:29Z


It's been established that this is not a dict; the format can be described by an LR(1) grammar that is similar to the JSON grammar.
Taking this approach, parse and tokenise it with a parser; the code below uses Lark.
https://lark-parser.readthedocs.io/en/latest/json_tutorial.html shows how to parse JSON.
It needs a small adaptation so that duplicate keys work (treat a dict as a list of pairs; see the code).
I have used pandas to take the output from the parser and reshape it as you require.

from lark import Lark, Transformer
import pandas as pd

json_parser = Lark(r"""
    ?value: dict
          | list
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    list : "[" [value ("," value)*] "]"

    dict : "{" [pair ("," pair)*] "}"
    pair : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS

    """, start='value')
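
class TreeToJson(Transformer):
    # Turn the parse tree into plain Python objects.
    def string(self, s):
        (s,) = s
        return s[1:-1]  # strip the surrounding quotes

    def number(self, n):
        (n,) = n
        return float(n)

    list = list
    pair = tuple  # each pair becomes a (key, value) tuple
    dict = list   # keep a dict as a list of pairs so repeated keys survive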

    null = lambda self, _: None
    true = lambda self, _: True
    false = lambda self, _: False

js = """{
    "city":["New York", "Paris", "London"],
    "country":["India", "France", "Italy"],
    "city":["New Delhi", "Tokio", "Wuhan"],
    "organisation":["ITC", "Google", "Facebook"],
    "country":["Japan", "South Korea", "Germany"],
    "organisation":["TATA", "Amazon", "Ford"]
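}"""

tree = json_parser.parse(js)

# One row per (key, list) pair; explode to one row per value, then regroup
# by key, keeping each distinct value once in order of first appearance.
result = (
    pd.DataFrame(TreeToJson().transform(tree), columns=["key", "list"])
    .explode("list")
    .groupby("key")
    .agg({"list": lambda s: s.unique().tolist()})
    .to_dict()["list"]
)
result  # a notebook displays this; in a script use print(result)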

output

{'city': ['New York', 'Paris', 'London', 'New Delhi', 'Tokio', 'Wuhan'],
 'country': ['India', 'France', 'Italy', 'Japan', 'South Korea', 'Germany'],
 'organisation': ['ITC', 'Google', 'Facebook', 'TATA', 'Amazon', 'Ford']}
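
If the real data is valid JSON apart from the repeated keys once it is wrapped in braces (an assumption, as with the js string above), the standard library can do the same merge without Lark or pandas. This is a different technique from the one in this answer, offered only as a sketch: json.loads accepts an object_pairs_hook callable that receives each object as a list of (key, value) pairs before duplicates are collapsed. The helper name merge_duplicate_keys is hypothetical, and the sketch assumes every value is a list.

```python
import json
from collections import defaultdict


def merge_duplicate_keys(pairs):
    """Hypothetical helper: merge the list values of repeated keys.

    Assumes every value is a list; repeated values are kept only once.
    """
    merged = defaultdict(list)
    for key, values in pairs:
        for value in values:
            if value not in merged[key]:  # linear scan, fine for a sketch
                merged[key].append(value)
    return dict(merged)


js = """{
    "city":["New York", "Paris", "London"],
    "country":["India", "France", "Italy"],
    "city":["New Delhi", "Tokio", "Wuhan"],
    "organisation":["ITC", "Google", "Facebook"],
    "country":["Japan", "South Korea", "Germany"],
    "organisation":["TATA", "Amazon", "Ford"]
}"""

# The hook sees all six pairs, including the repeated keys.
result = json.loads(js, object_pairs_hook=merge_duplicate_keys)
print(result)
```

Keys come out in order of first appearance here, whereas groupby in the pandas version sorts them; for this sample the two orders happen to coincide.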

edited Aug 12, 2021 at 14:01

answered Aug 12, 2021 at 13:56


2 Comments

Sherlock Over a year ago

I get this error when running the code on a larger data set that also has non-duplicate keys: UnexpectedCharacters: No terminal matches '"' in the current parser context, at line 5 col 2, pointing at "7 KARNES": ["7 KARNES"], (expected one of: RBRACE, COMMA).

Rob Raymond Over a year ago

Can you provide me with a larger data set? I'll take a look. I would expect it to work with any valid "bits" of JSON; if the bits are malformed, that will cause an issue.
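
One way to look for a malformed "bit" before running the parser is sketched below. It uses the standard library's json module as a stand-in for the Lark grammar, which is an assumption: the two may not reject exactly the same inputs. The helper name find_syntax_error and the sample text are hypothetical. An error that expects RBRACE or COMMA would be consistent with a missing comma between two pairs, as in this sample, but that is a guess without seeing the real data.

```python
import json


def find_syntax_error(text):
    """Hypothetical helper: return (line, column, message) for the first
    JSON syntax error in text, or None if the text parses cleanly."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno, err.msg
    return None


# Made-up sample: the comma after the first pair is missing.
bad = """{
    "a": ["x"]
    "b": ["y"]
}"""

print(find_syntax_error(bad))
```

Duplicate keys alone do not trigger this check, because json.loads accepts them, so it only reports genuine syntax problems.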
