I have a data frame that has two columns in JSON format, like this:

   author          biblio                                 series                             
Mehrdad Vahabi    {'volume': 68, 'month': 'January',      {'handle':'RePEc:aka:aoecon', 'name': 'Oeconomica'}
                   'name': 'János Kornai', 
                   'issue': 's', 'handle': 
                   'n:v:68:y:2018:i', 
                   'year': '2018', 
                   'pages': '27-52', 'doi': ''}
Michael Bailey     {'c_date': '2017', 'number':           {'handle': '', 'name': ''}
                    '23608', 'handle': 'RePEc:nbr:
                    nberwo:23608', 'name': 'Measuring'}

I want my data frame to look like this:

author          biblio.volume  biblio.month  biblio.name   biblio.issue  biblio.handle    biblio.year  biblio.pages  biblio.doi  biblio.c_date  biblio.number  series.handle     series.name
Mehrdad Vahabi  68             January       János Kornai  s             n:v:68:y:2018:i  2018         27-52                     NA             NA             RePEc:aka:aoecon  Oeconomica
Michael Bailey  NA             NA            Measuring     NA            nberwo:23608     NA           NA            NA          2017           23608

I tried the answers to this question, but none of them work for me.

How can I do it?

[EDIT] Here is a sample of the data

[EDIT]

Following the @jezrael solution I get this:

df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns

df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)


Traceback (most recent call last):
  File "dfs.py", line 8, in <module>
    df1.columns = 'bibliographic.'+ df1.columns
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/range.py", line 583, in _evaluate_numeric_binop
    other = self._validate_for_numeric_binop(other, op, opstr)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3961, in _validate_for_numeric_binop
    raise TypeError("can only perform ops with scalar values")
TypeError: can only perform ops with scalar values

And with json_normalize:

Traceback (most recent call last):
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dfs.py", line 7, in <module>
    df = json_normalize(d)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/io/json/normalize.py", line 192, in json_normalize
    if any([isinstance(x, dict) for x in compat.itervalues(data[0])]):
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

Following the @Jhon H solution, I get this:

Traceback (most recent call last):
  File "dfs.py", line 7, in <module>
    jsonSeries = df[['bibliographic']].tolist()
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'tolist'
  • Is the DataFrame built from JSON input? If so, it may be possible to create the DataFrame directly from the JSON, without dict columns. Commented Mar 3, 2018 at 6:14

2 Answers

16

Create a new DataFrame from each dict column with the DataFrame constructor, then concat them all together:

df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns

df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)
           author biblio.c_date biblio.doi           biblio.handle  \
0  Mehrdad Vahabi           NaN                    n:v:68:y:2018:i   
1  Michael Bailey          2017        NaN  RePEc:nbr:nberwo:23608   

  biblio.issue biblio.month   biblio.name biblio.number biblio.pages  \
0            s      January  Janos Kornai           NaN        27-52   
1          NaN          NaN     Measuring         23608          NaN   

   biblio.volume biblio.year     series.handle series.name  
0           68.0        2018  RePEc:aka:aoecon  Oeconomica  
1            NaN         NaN                           
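
The same steps can be wrapped in a loop so they work for any number of dict columns. This is only a minimal sketch, not part of the original answer: the helper name expand_dict_columns is hypothetical, the sample data is a shortened version of the data in the question, and it assumes the cells already hold real dict objects (not strings).

```python
import pandas as pd

# Shortened sample data; the cells are real dict objects here.
df = pd.DataFrame({
    "author": ["Mehrdad Vahabi", "Michael Bailey"],
    "biblio": [{"volume": 68, "month": "January"},
               {"c_date": "2017", "number": "23608"}],
    "series": [{"handle": "RePEc:aka:aoecon", "name": "Oeconomica"},
               {"handle": "", "name": ""}],
})


def expand_dict_columns(frame, columns):
    """Hypothetical helper: expand each dict column into prefixed columns."""
    parts = [frame.drop(columns=columns)]
    for col in columns:
        # One row per dict; keys missing from a row become NaN.
        expanded = pd.DataFrame(frame[col].tolist(), index=frame.index)
        parts.append(expanded.add_prefix(col + "."))
    return pd.concat(parts, axis=1)


flat = expand_dict_columns(df, ["biblio", "series"])
print(flat.columns.tolist())
```

Passing index=frame.index keeps the rows aligned in concat even when the original frame does not have a default 0..n-1 index.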

EDIT:

If the input is JSON, it is possible to use json_normalize:

from pandas.io.json import json_normalize

d = [{"author":"Mehrdad Vahabi","biblio":{"volume":68,"month":"January","name":"Janos Kornai","issue":"s","handle":"n:v:68:y:2018:i","year":"2018","pages":"27-52","doi":""},"series":{"handle":"RePEc:aka:aoecon","name":"Oeconomica"}},{"author":"Michael Bailey","biblio":{"c_date":"2017","number":"23608","handle":"RePEc:nbr:nberwo:23608","name":"Measuring"},"series":{"handle":"","name":""}}]

df = json_normalize(d)
print (df)
           author biblio.c_date biblio.doi           biblio.handle  \
0  Mehrdad Vahabi           NaN                    n:v:68:y:2018:i   
1  Michael Bailey          2017        NaN  RePEc:nbr:nberwo:23608   

  biblio.issue biblio.month   biblio.name biblio.number biblio.pages  \
0            s      January  Janos Kornai           NaN        27-52   
1          NaN          NaN     Measuring         23608          NaN   

   biblio.volume biblio.year     series.handle series.name  
0           68.0        2018  RePEc:aka:aoecon  Oeconomica  
1            NaN         NaN                                
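
Note that json_normalize expects a dict or a list of dicts, not a DataFrame. Passing a DataFrame is a likely cause of the KeyError: 0 shown in the question, since the traceback fails at data[0]. If you already have a DataFrame whose cells are dicts, one option is to convert it back to records first. This is a sketch under assumptions: it uses the top-level pd.json_normalize, which I believe is available from pandas 1.0 onwards, and a shortened version of the sample data.

```python
import pandas as pd

# Shortened sample data; the cells are real dict objects here.
df = pd.DataFrame({
    "author": ["Mehrdad Vahabi", "Michael Bailey"],
    "biblio": [{"volume": 68, "month": "January"},
               {"c_date": "2017", "number": "23608"}],
    "series": [{"handle": "RePEc:aka:aoecon", "name": "Oeconomica"},
               {"handle": "", "name": ""}],
})

# Turn the frame back into a list of nested dicts, one per row,
# then let json_normalize flatten the nested keys with "." as separator.
records = df.to_dict(orient="records")
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```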

EDIT: The problem is that your dictionaries are stored as strings, so you first need to convert them with ast.literal_eval:

import ast
import pandas as pd

df = pd.read_csv('probe.csv')
#print (df)

df1 = pd.DataFrame(df['bibliographic'].apply(ast.literal_eval).values.tolist())
df1.columns = 'bibliographic.'+ df1.columns

df2 = pd.DataFrame(df['series'].apply(ast.literal_eval).values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['bibliographic','series'])
df = pd.concat([df[col], df1, df2],axis=1)
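
To see why the conversion is needed, here is a small self-contained sketch. It does not use the real probe.csv; instead it assumes a made-up two-row frame and round-trips it through an in-memory CSV, which turns the dicts into strings in the same way.

```python
import ast
import io

import pandas as pd

# Made-up stand-in for the real file: write dicts to CSV, then read them back.
original = pd.DataFrame({
    "author": ["Mehrdad Vahabi", "Michael Bailey"],
    "bibliographic": [{"volume": 68, "name": "Janos Kornai"},
                      {"c_date": "2017", "name": "Measuring"}],
})
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
df = pd.read_csv(buffer)

# After the round trip each cell is a str, not a dict.
print(type(df.loc[0, "bibliographic"]))

# Built from strings, the frame gets a single column labelled 0 (a RangeIndex),
# which is why 'bibliographic.' + df1.columns raised the TypeError.
raw_columns = pd.DataFrame(df["bibliographic"].tolist()).columns

# Parse the strings back into dicts, then expand as before.
parsed = df["bibliographic"].apply(ast.literal_eval)
flat = pd.DataFrame(parsed.tolist()).add_prefix("bibliographic.")
print(flat)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for data read from a file.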

3 Comments

I have edited and updated the question with the errors your solution gives me.
I have updated the question with a data sample in Dropbox.
With df1.columns = 'bibliographic.'+ df1.columns.astype(str) and df2.columns = 'series.'+ df2.columns.astype(str), only the position of the bibliographic and series columns changes.
5

You need to process the columns individually and then join them all together to get the format that you need. Here is a simple example that you could follow:

import pandas as pd
records = [{'col1':'v1','col2':{'a1':1,'b1':1},'col3':{'c1':1,'d1':1}},
           {'col1':'v2','col2':{'a1':2,'b1':2},'col3':{'c1':2,'d1':2}}]
sample_df = pd.DataFrame(records)
sample_df


    col1      col2                col3
0   v1  {'a1': 1, 'b1': 1}  {'c1': 1, 'd1': 1}
1   v2  {'a1': 2, 'b1': 2}  {'c1': 2, 'd1': 2}


col2_expanded = sample_df.col2.apply(lambda x:pd.Series(x))
col2_expanded.columns = ['{}.{}'.format('col2',i) for i in col2_expanded]
col2_expanded


    col2.a1     col2.b1
0   1             1
1   2             2

col3_expanded = sample_df.col3.apply(lambda x:pd.Series(x))
col3_expanded.columns = ['{}.{}'.format('col3',i) for i in col3_expanded]
col3_expanded


    col3.c1     col3.d1
0   1            1
1   2            2

final = pd.concat([sample_df[['col1']],col2_expanded,col3_expanded],axis=1)
final

    col1    col2.a1     col2.b1     col3.c1     col3.d1
0   v1       1           1            1         1
1   v2       2           2            2         2
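
As a side note, apply(lambda x: pd.Series(x)) builds one Series per row, which tends to be slow on large frames. A sketch of an alternative, assuming the same sample_df as above, is to build the expanded frame in one call from the list of dicts; the check below compares the two results for col2.

```python
import pandas as pd

records = [{'col1': 'v1', 'col2': {'a1': 1, 'b1': 1}, 'col3': {'c1': 1, 'd1': 1}},
           {'col1': 'v2', 'col2': {'a1': 2, 'b1': 2}, 'col3': {'c1': 2, 'd1': 2}}]
sample_df = pd.DataFrame(records)

# Row-by-row expansion, as in the example above.
via_apply = sample_df['col2'].apply(pd.Series).add_prefix('col2.')

# Single constructor call on the list of dicts.
via_constructor = pd.DataFrame(sample_df['col2'].tolist(),
                               index=sample_df.index).add_prefix('col2.')

# Raises AssertionError if the two frames differ (dtypes are not compared).
pd.testing.assert_frame_equal(via_apply, via_constructor, check_dtype=False)
print(via_constructor)
```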
