2

I have a pandas Series whose values are strings, each holding a JSON list of objects. Below is an example.

import pandas as pd
sr = pd.Series(['[{"fruit": "apple", "box_a": 2}, {"fruit": "grape", "box_b": 4}]', '[{"fruit": "orange", "box_g": 2}]', '[{"fruit": "mango", "box_c": 6}, {"fruit": "grape", "box_e": 3}]'])

My objective is to find an efficient way to convert this Series into a DataFrame with the following structure. As a novice, the only approach I can think of is nested loops that iterate over each row and then each item.

sr_df = pd.DataFrame({'fruit':['apple', 'grape', 'orange', 'mango', 'grape'], 'box':['box_a', 'box_b', 'box_g', 'box_c', 'box_e'], 'count':[2,4,2,6,3]})

I look forward to learning new approaches.

2 Answers

3

You can use:

  • first convert the strings to Python lists of dictionaries with ast.literal_eval
  • in a list comprehension, create a DataFrame from each list and set the fruit column as its index
  • concat the frames and reshape with stack
  • convert the values to integers with astype
  • convert the MultiIndex to columns with reset_index and rename the new column

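import ast

# parse each string into a list of dicts and build one small frame per row of sr
df = (pd.concat([pd.DataFrame(x).set_index('fruit') for x in sr.apply(ast.literal_eval)])
       .stack()
       .dropna()      # guard for pandas versions where stack() keeps NaN; harmless otherwise
       .astype(int)   # values are floats because the wide frame contained NaN
       .reset_index(name='count')
       .rename(columns={'level_1':'box'}))
print(df)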
    fruit    box  count
0   apple  box_a      2
1   grape  box_b      4
2  orange  box_g      2
3   mango  box_c      6
4   grape  box_e      3
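
The reshape above goes through a wide intermediate frame that is mostly NaN. If that frame is not needed, the rows can also be built directly. This is a minimal sketch, assuming (as in the sample data) that every key other than "fruit" is a box name. It uses json.loads because the strings are JSON, and the name `records` is only illustrative:

```python
import json

import pandas as pd

sr = pd.Series([
    '[{"fruit": "apple", "box_a": 2}, {"fruit": "grape", "box_b": 4}]',
    '[{"fruit": "orange", "box_g": 2}]',
    '[{"fruit": "mango", "box_c": 6}, {"fruit": "grape", "box_e": 3}]',
])

# One output row per (object, non-"fruit" key) pair.
# Assumption: every key other than "fruit" is a box name.
records = [
    {"fruit": obj["fruit"], "box": key, "count": value}
    for text in sr                  # each Series value is a JSON string
    for obj in json.loads(text)     # each string holds a list of objects
    for key, value in obj.items()
    if key != "fruit"
]

df = pd.DataFrame(records, columns=["fruit", "box", "count"])
print(df)
```

Because no NaN placeholders are created along the way, the counts stay integers and no astype step is needed.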

1

Using json and itertools.chain, you can first build a wide DataFrame with one column per box:

import itertools
import json
import pandas as pd

data_json = ['[{"fruit": "apple", "box_a": 2}, {"fruit": "grape", "box_b": 4}]', '[{"fruit": "orange", "box_g": 2}]', '[{"fruit": "mango", "box_c": 6}, {"fruit": "grape", "box_e": 3}]']
data = (json.loads(i) for i in data_json)
data = itertools.chain.from_iterable(data)
df = pd.DataFrame.from_records(data)
   box_a  box_b  box_c  box_e  box_g   fruit
0    2.0    NaN    NaN    NaN    NaN   apple
1    NaN    4.0    NaN    NaN    NaN   grape
2    NaN    NaN    NaN    NaN    2.0  orange
3    NaN    NaN    6.0    NaN    NaN   mango
4    NaN    NaN    NaN    3.0    NaN   grape

Then you can set fruit as the index and stack; the result is a Series indexed by fruit and box name:

# dropna() guards against pandas versions where stack() keeps NaN; harmless otherwise
result = df.set_index('fruit').stack().dropna().astype(int)
apple   box_a    2
grape   box_b    4
orange  box_g    2
mango   box_c    6
grape   box_e    3
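
The stacked result is a Series with a two-level index, not yet the three-column frame from the question. One way to finish is sketched below, assuming the column names fruit, box and count from the question are wanted. The names `objects` and `wide` are illustrative, and the dropna() call is a precaution for pandas versions where stack() keeps NaN:

```python
import itertools
import json

import pandas as pd

data_json = [
    '[{"fruit": "apple", "box_a": 2}, {"fruit": "grape", "box_b": 4}]',
    '[{"fruit": "orange", "box_g": 2}]',
    '[{"fruit": "mango", "box_c": 6}, {"fruit": "grape", "box_e": 3}]',
]

# Flatten the per-string lists into one list of dicts.
objects = list(itertools.chain.from_iterable(json.loads(s) for s in data_json))
wide = pd.DataFrame.from_records(objects)

result = (
    wide.set_index("fruit")
        .stack()
        .dropna()                       # precaution: some pandas versions keep NaN here
        .astype(int)
        .rename_axis(["fruit", "box"])  # name both index levels
        .reset_index(name="count")      # index levels become ordinary columns
)
print(result)
```

Naming the index levels before reset_index avoids the separate rename of a level_1 column.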
