I have data in a dictionary-like format in which the same key appears many times, each time with a list of strings as its value. I want to merge all entries that share a key and combine their values. The data is not an actual Python dictionary; I call it a dictionary only because of how it is laid out.

# The data looks like this:

"city":["New York", "Paris", "London"],
"country":["India", "France", "Italy"],
"city":["New Delhi", "Tokio", "Wuhan"],
"organisation":["ITC", "Google", "Facebook"],
"country":["Japan", "South Korea", "Germany"],
"organisation":["TATA", "Amazon", "Ford"]

I have thousands of duplicate keys, with a mix of repeated and unique values, which I want to merge or append by key.

# Expected output

"city":["New York", "Paris", "London", "New Delhi", "Tokio", "Wuhan"],
"country":["India", "France", "Italy", "Japan", "South Korea", "Germany"],
"organisation":["ITC", "Google", "Facebook", "TATA", "Amazon", "Ford"],

Can anyone suggest an approach?

3 Comments

  • By definition this is not a Python dictionary: dictionary keys are unique. Paste your source "dictionary" into Jupyter and you get, as expected, unique keys, each holding the value from the last instance of that key. What is generating the invalid dictionary / JSON? Fixing it at the source is my suggestion. Commented Aug 12, 2021 at 11:36
  • Not me downvoting. I was just checking that the requirement is to parse a certain dict-like format; I have posted an answer. Commented Aug 12, 2021 at 13:57
  • For background, see this earlier question: stackoverflow.com/questions/68753099/… Commented Aug 16, 2021 at 11:47
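
The point made in the first comment can be checked directly. This is a minimal sketch, assuming the data is pasted as a Python dict literal, using two of the entries from the question:

```python
# A dict literal with repeated keys is legal syntax, but each later
# occurrence of a key silently replaces the earlier value.
d = {
    "city": ["New York", "Paris", "London"],
    "city": ["New Delhi", "Tokio", "Wuhan"],
}
print(d)       # {'city': ['New Delhi', 'Tokio', 'Wuhan']}
print(len(d))  # 1
```

So the duplicates have to be collected while the text is being parsed, before it ever becomes a dict.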

1 Answer

  • It's been established that this is not a dict; it is text that follows an LR(1) grammar similar to the JSON grammar.
  • Taking this approach, parse and tokenise it with an LR parser.
  • https://lark-parser.readthedocs.io/en/latest/json_tutorial.html shows how to parse JSON with Lark.
  • The tutorial grammar needs a small adaptation so that duplicate keys work (treat a dict as a list of pairs; see the code).
  • I have used pandas to take the output from the parser and reshape it as you require.
from lark import Transformer
from lark import Lark
import pandas as pd
json_parser = Lark(r"""
    ?value: dict
          | list
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    list : "[" [value ("," value)*] "]"

    dict : "{" [pair ("," pair)*] "}"
    pair : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS

    """, start='value')
class TreeToJson(Transformer):
    def string(self, s):
        (s,) = s
        return s[1:-1]
    def number(self, n):
        (n,) = n
        return float(n)

    list = list    # JSON array -> Python list
    pair = tuple   # "key": value -> (key, value) tuple
    dict = list    # keep every pair as a list item, so repeated keys are not overwritten

    null = lambda self, _: None
    true = lambda self, _: True
    false = lambda self, _: False

js = """{
    "city":["New York", "Paris", "London"],
    "country":["India", "France", "Italy"],
    "city":["New Delhi", "Tokio", "Wuhan"],
    "organisation":["ITC", "Google", "Facebook"],
    "country":["Japan", "South Korea", "Germany"],
    "organisation":["TATA", "Amazon", "Ford"]
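}"""

tree = json_parser.parse(js)

# One row per (key, list) pair; explode the lists into single values,
# then collect the unique values for each key.
result = (
    pd.DataFrame(TreeToJson().transform(tree), columns=["key", "list"])
    .explode("list")
    .groupby("key")
    .agg({"list": lambda s: s.unique().tolist()})
    .to_dict()["list"]
)
print(result)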

Output:

{'city': ['New York', 'Paris', 'London', 'New Delhi', 'Tokio', 'Wuhan'],
 'country': ['India', 'France', 'Italy', 'Japan', 'South Korea', 'Germany'],
 'organisation': ['ITC', 'Google', 'Facebook', 'TATA', 'Amazon', 'Ford']}
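
If the text is otherwise valid JSON, the same "treat a dict as a list of pairs" idea can also be sketched with the standard library alone, without Lark or pandas. This is an alternative technique, not the method used above: it relies on the `object_pairs_hook` argument of `json.loads`, which receives every key/value pair in source order, duplicates included. The function name `merge_duplicate_keys` is hypothetical, and the sketch assumes the top level is a single object whose values are all lists (the question's data would first need wrapping in `{` and `}`).

```python
import json
from collections import defaultdict


def merge_duplicate_keys(text):
    """Merge the value lists of repeated keys in a JSON-like object string."""
    # Returning the pairs unchanged keeps duplicates that a dict would drop.
    pairs = json.loads(text, object_pairs_hook=lambda p: p)
    merged = defaultdict(list)
    for key, values in pairs:
        for value in values:
            # Skip values already seen for this key; keep first-seen order.
            if value not in merged[key]:
                merged[key].append(value)
    return dict(merged)


js = """{
    "city":["New York", "Paris", "London"],
    "country":["India", "France", "Italy"],
    "city":["New Delhi", "Tokio", "Wuhan"],
    "organisation":["ITC", "Google", "Facebook"],
    "country":["Japan", "South Korea", "Germany"],
    "organisation":["TATA", "Amazon", "Ford"]
}"""

result = merge_duplicate_keys(js)
print(result["city"])
# ['New York', 'Paris', 'London', 'New Delhi', 'Tokio', 'Wuhan']
```

The `value not in merged[key]` check is linear in the length of each list, which should be fine for thousands of keys with short lists; for very long lists a per-key set of seen values would be a better fit.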
2 Comments

Getting an error while running the code on a larger data set that also has non-duplicate keys: UnexpectedCharacters: No terminal matches '"' in the current parser context, at line 5 col 2 "7 KARNES": ["7 KARNES"], ^ Expected one of: * RBRACE * COMMA
Can you provide me with a larger data set? I'll take a look. I would expect it to work with any valid "bits" of JSON. If the bits are malformed, that will cause an issue.
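
One way to locate a malformed "bit" in a larger file is to let a strict JSON parser report where it first gives up. The sketch below uses only the standard library; the helper name `first_syntax_error` and both sample strings are hypothetical, and the missing comma is a guess at the kind of problem that makes a parser expect a comma or closing brace, not a diagnosis of the data set in the comment above.

```python
import json


def first_syntax_error(text):
    """Return (line, column, message) for the first JSON syntax error, or None."""
    try:
        # The hook keeps duplicate keys legal to inspect; only syntax matters here.
        json.loads(text, object_pairs_hook=lambda pairs: pairs)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno, err.msg
    return None


# Hypothetical malformed input: the comma after the first pair is missing.
bad = '{\n"a": ["x"]\n"b": ["y"]\n}'
good = '{\n"a": ["x"],\n"b": ["y"]\n}'

print(first_syntax_error(bad))   # a (line, column, message) tuple pointing at line 3
print(first_syntax_error(good))  # None
```

The reported position is where the parser noticed the problem, so the actual mistake is often at the end of the previous line.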
