3

I'm using Python and trying to figure out how to do the following without using a loop.

I have a dataframe with several columns, one of which holds a list of JSON objects. What I'm trying to do is expand that JSON column into separate columns within the dataframe. For example, I have the following dataframe:

name age group
John 35 [{"testid": "001", "marks": 67}, {"testid": "002", "marks": 70}]
Ann 20 [{"testid": "001", "marks": 75}, {"testid": "002", "marks": 80}, {"testid": "003", "marks": 87}]
Emma 25 [{"testid": "001", "marks": 90}, {"testid": "002", "marks": 99}]

I want to get marks for testid = 001 and testid = 002 as follows.

name age test_id1 test_id2
John 35 67 70
Ann 20 75 80
Emma 25 90 99

Here is my dataset

[
   {
      "name":"John",
      "age":35,
      "group":[
         {
            "testid":"001",
            "marks":67
         },
         {
            "testid":"002",
            "marks":70
         }
      ]
   },
   {
      "name":"Ann",
      "age":20,
      "group":[
         {
            "testid":"001",
            "marks":75
         },
         {
            "testid":"002",
            "marks":80
         },
         {
            "testid":"003",
            "marks":87
         }
      ]
   },
   {
      "name":"Emma",
      "age":25,
      "group":[
         {
            "testid":"001",
            "marks":90
         },
         {
            "testid":"002",
            "marks":99
         }
      ]
   }
]

Any idea is highly appreciated. Thank you.

2
  • Kindly share the dataframe as code : df.to_dict('records') Commented May 14, 2021 at 22:56
  • @sammywemmy Thank you for your comment. I have added the data set. Commented May 15, 2021 at 3:25

2 Answers

2

A list comprehension is handy here for pulling the data out. As a side note, if you can, do the extraction before loading the dict-like data into a dataframe (it is more efficient to do so):

outcome = [[item['marks']
            for item in entry
            if item['testid'] in ('001', '002')]
           for entry in df.group]

print(outcome)
[[67, 70], [75, 80], [90, 99]]

Zip the data, and assign to new column names in the dataframe:

test_id1, test_id2 = zip(*outcome)

df.filter(['name', 'age']).assign(test_id1 = test_id1, test_id2 = test_id2)

   name  age  test_id1  test_id2
0  John   35        67        70
1   Ann   20        75        80
2  Emma   25        90        99
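
The side note about extracting before building the dataframe could look roughly like the sketch below. It assumes the raw dataset from the question is available as a Python list of dicts; the names `records` and `wanted` are chosen here for illustration and do not appear in the original post.

```python
import pandas as pd

# Raw dataset from the question, as a list of dicts (assumed variable name)
records = [
    {"name": "John", "age": 35,
     "group": [{"testid": "001", "marks": 67}, {"testid": "002", "marks": 70}]},
    {"name": "Ann", "age": 20,
     "group": [{"testid": "001", "marks": 75}, {"testid": "002", "marks": 80},
               {"testid": "003", "marks": 87}]},
    {"name": "Emma", "age": 25,
     "group": [{"testid": "001", "marks": 90}, {"testid": "002", "marks": 99}]},
]

# Hypothetical mapping from the testids of interest to output column names
wanted = {"001": "test_id1", "002": "test_id2"}

# Flatten each record into one plain dict before pandas ever sees it
rows = [
    {
        "name": rec["name"],
        "age": rec["age"],
        **{wanted[item["testid"]]: item["marks"]
           for item in rec["group"] if item["testid"] in wanted},
    }
    for rec in records
]

flat = pd.DataFrame(rows)
print(flat)
```

Matching on `testid` rather than on list position means a record whose tests arrive in a different order would still land in the right columns.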

2 Comments

Thank you. BTW, it throws me the following error Traceback (most recent call last): File "//HOME-SVR-001/User Directory Folder/Administrator/Desktop/testtest.py", line 27, in <module> for entry in df.group] AttributeError: 'list' object has no attribute 'group' Any idea?
No idea. df is a dataframe right? It seems your code is reading df as a list
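
One possible cause of that AttributeError, assuming `df` was still the raw list from the question rather than a dataframe, is that the list was never passed to `pd.DataFrame`. A minimal sketch of that step follows; the name `records` is an assumption for illustration.

```python
import pandas as pd

# The dataset from the question as a plain Python list (assumed variable name)
records = [
    {"name": "John", "age": 35,
     "group": [{"testid": "001", "marks": 67}, {"testid": "002", "marks": 70}]},
    {"name": "Ann", "age": 20,
     "group": [{"testid": "001", "marks": 75}, {"testid": "002", "marks": 80},
               {"testid": "003", "marks": 87}]},
    {"name": "Emma", "age": 25,
     "group": [{"testid": "001", "marks": 90}, {"testid": "002", "marks": 99}]},
]

# A list has no .group attribute; a DataFrame built from it does
df = pd.DataFrame(records)

# The extraction from the answer, run against the dataframe column
outcome = [[item["marks"]
            for item in entry
            if item["testid"] in ("001", "002")]
           for entry in df["group"]]
```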
1

See comments inline. Using apply() does the iterating for you. You just need to write the function.

import io
import json

import pandas as pd

data='''name|age|group
John|35|[{"testid": "001", "marks": 67}, {"testid": "002", "marks": 70}]
Ann|20|[{"testid": "001", "marks": 75}, {"testid": "002", "marks": 80}, {"testid": "003", "marks": 87}]
Emma|25|[{"testid": "001", "marks": 90}, {"testid": "002", "marks": 99}]'''
df = pd.read_csv(io.StringIO(data), sep='|', engine='python')

# Function for apply(): writes each mark into a test_id<N> column of the main df
def expand_json(xname, x):
    for i, j in enumerate(json.loads(x), 1):
        col = 'test_id' + str(i)
        # Note: rows are matched by name, so this assumes names are unique
        df.loc[df.name == xname, col] = j['marks']

# dftemp is a throwaway so nothing prints to the screen. The function writes to the main df.
dftemp = df.apply(lambda x: expand_json(x['name'], x['group']), axis=1)
print(df)

   name  age                                                                                             group  test_id1  test_id2  test_id3
0  John   35                                  [{"testid": "001", "marks": 67}, {"testid": "002", "marks": 70}]    67.000    70.000       NaN
1   Ann   20  [{"testid": "001", "marks": 75}, {"testid": "002", "marks": 80}, {"testid": "003", "marks": 87}]    75.000    80.000    87.000
2  Emma   25                                  [{"testid": "001", "marks": 90}, {"testid": "002", "marks": 99}]    90.000    99.000       NaN

3 Comments

Thank you very much. It works. Can I get only the name, age, test_id1 and test_id2 data?
Thanks! I can use df.filter() function.
In the for loop inside the function, you can also slice the parsed list with [0:2]. That should limit it to just the first two ids. Or filter, as you mention.
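
The `[0:2]` suggestion could be sketched as follows. This is a variation rather than the answer's exact code: the hypothetical helper `first_two_marks` returns a Series instead of writing into the global `df`, so the result can be joined onto just the name and age columns.

```python
import io
import json

import pandas as pd

data = '''name|age|group
John|35|[{"testid": "001", "marks": 67}, {"testid": "002", "marks": 70}]
Ann|20|[{"testid": "001", "marks": 75}, {"testid": "002", "marks": 80}, {"testid": "003", "marks": 87}]
Emma|25|[{"testid": "001", "marks": 90}, {"testid": "002", "marks": 99}]'''
df = pd.read_csv(io.StringIO(data), sep='|', engine='python')

def first_two_marks(group_json):
    """Parse one JSON string and keep the marks of its first two entries."""
    items = json.loads(group_json)[0:2]  # slice limits to the first two ids
    return pd.Series({f"test_id{i}": item["marks"]
                      for i, item in enumerate(items, 1)})

# Series.apply with a function that returns a Series yields a DataFrame
marks = df["group"].apply(first_two_marks)
result = df[["name", "age"]].join(marks)
print(result)
```

Note that the slice keeps the first two entries by position, which matches testids 001 and 002 only because every row in this data lists them first.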
