
I'm trying to reshape a DataFrame that looks like this:

test1   test2                 test3
test    [test1, test2]        [testbelongsto1, testbelongsto2]

into something like this:

test1   test2                 test3
test    test1                 testbelongsto1
test    test2                 testbelongsto2

I found this answer: https://stackoverflow.com/a/38652414. It looks like exactly what I need, and there are a lot of questions whose answers cover this.

However, whatever I try, I'm stuck with this error:

TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

with this function (see the link above):

def explode(self, df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)

Important note: the data comes from the read_csv function. The columns I need to explode are stored as strings, so I wrote this piece of code to convert them to lists:

   df['users'] = df['users'].apply(literal_eval)

I have tried everything from converting the dtype to saving the data in other formats, but nothing solves the issue.

Please help

UPDATE: A 'real' example of a few rows is shown below. 'test2' corresponds to 'users' and 'test3' to 'interests'; the lists in the two columns are the same size.

{'index': [0, 1, 2, 3, 4], 'Unnamed: 0': [0, 1, 4, 5, 6], 'users': ['[1, 1, 28, 28, 68]', '[1, 1, 16]', '[32, 37, 66, 67, 54, 117]', '[31, 37, 66, 67, 100, 113, 117]', '[32, 37, 66, 67, 54, 117]'], 'interests': ['[set(), set(), set(), set(), set()]', '[set(), set(), set()]', '[set(), set(), set(), set(), {1535, 1542, 1527}, set()]', '[set(), set(), set(), set(), set(), set(), set()]', '[set(), set(), set(), set(), {1535, 1542, 1527}, set()]']}

UPDATE 2: OK, this is exactly what I'm trying to get. The data I have now:

index       lift        confidence         interests         users
0                                          {333, 333}        1
0                                          set()             22
0                                          set()             77
0           0           0.75               set()             88
4                                          set()             33
4           3           0.50               set()             44

So it seems that only the last row of each iteration gets the values. This is what I want:

index       lift        confidence         interests         users
0           88          0.33               344,              1
0           88          0.33               333               1
0           88          0.33               set()             22
0           88          0.33               set()             77
0           88          0.33               set()             88
4           38          0.50               set()             33
4           38          0.50               set()             44

So what I want is for each data row (Series) to be repeated per user, and for the interests of each user to be repeated as well.

  • Can you try upgrading pandas/numpy to the latest versions? It seems like a bug... Commented Jul 22, 2017 at 17:35
  • Please post df.reset_index().head().to_dict('list') so we can see an unambiguous representation of a few rows of your DataFrame. Maybe then we'll be able to reproduce the error you are seeing. Commented Jul 22, 2017 at 17:55
  • @jezrael I tried that just now and I'm still getting the error. Commented Jul 22, 2017 at 18:17
  • It was only a suggestion, sorry. Does MaxU's answer work? Commented Jul 22, 2017 at 18:18
  • @unutbu Added a 'real' dataset to my original question. Commented Jul 22, 2017 at 18:21

1 Answer

If you can trust that your data does not contain malicious strings, you could convert the strings into Python objects using eval. Be very wary, though: eval'ing malicious strings can run arbitrary code on your computer!
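If the strings only ever contain plain literals, ast.literal_eval may be a safer substitute for eval. The sketch below is not part of the original approach; parse_cell is a hypothetical helper name, and the sample strings are modelled on the question's data. It assumes Python 3.9 or newer, because older versions of literal_eval reject the 'set()' spelling of an empty set.

```python
import ast


def parse_cell(text):
    """Parse a literal such as '[1, 1, 28]' without executing arbitrary code."""
    return ast.literal_eval(text)


users = parse_cell('[1, 1, 28, 28, 68]')
# Assumption: Python 3.9+, where literal_eval accepts 'set()' as an empty set.
interests = parse_cell('[set(), {1535, 1542, 1527}]')

# Anything that is not a plain literal is rejected instead of being run.
try:
    parse_cell("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

On an older Python the second call would raise ValueError, which is one reason eval is used in the code that follows.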

Having highlighted the danger of eval, you could parse and reshape your DataFrame using the apply(pd.Series) trick:

import pandas as pd

df = pd.DataFrame({'test': [0, 1, 4, 5, 6], 'test2': [0, 10, 40, 50, 60], 'users': ['[1, 1, 28, 28, 68]', '[1, 1, 16]', '[32, 37, 66, 67, 54, 117]', '[31, 37, 66, 67, 100, 113, 117]', '[32, 37, 66, 67, 54, 117]'], 'interests': ['[set(), set(), set(), set(), set()]', '[set(), set(), set()]', '[set(), set(), set(), set(), {1535, 1542, 1527}, set()]', '[set(), set(), set(), set(), set(), set(), set()]', '[set(), set(), set(), set(), {1535, 1542, 1527}, set()]']})

for col in df.columns.difference(['test', 'test2']):
    df[col] = df[col].apply(eval)

interests = df['interests'].apply(pd.Series)
interests = interests.stack().apply(lambda x: pd.Series(list(x)))
users = df['users'].apply(pd.Series)
users = users.stack()

result = pd.concat({'users': users, 'interests':interests}, axis=1)
result = result.stack() 
result['users'] = result['users'].ffill()
result.index = result.index.droplevel(level=[1,2])
result = df.drop(['interests','users'], axis=1).join(result)
print(result)

yields

   test  test2  interests  users
0     0      0        NaN    1.0
0     0      0        NaN    1.0
0     0      0        NaN   28.0
0     0      0        NaN   28.0
0     0      0        NaN   68.0
1     1     10        NaN    1.0
1     1     10        NaN    1.0
1     1     10        NaN   16.0
2     4     40        NaN   32.0
2     4     40        NaN   37.0
2     4     40        NaN   66.0
2     4     40        NaN   67.0
2     4     40     1535.0   54.0
2     4     40     1542.0   54.0
2     4     40     1527.0   54.0
2     4     40        NaN  117.0
3     5     50        NaN   31.0
3     5     50        NaN   37.0
3     5     50        NaN   66.0
3     5     50        NaN   67.0
3     5     50        NaN  100.0
3     5     50        NaN  113.0
3     5     50        NaN  117.0
4     6     60        NaN   32.0
4     6     60        NaN   37.0
4     6     60        NaN   66.0
4     6     60        NaN   67.0
4     6     60     1535.0   54.0
4     6     60     1542.0   54.0
4     6     60     1527.0   54.0
4     6     60        NaN  117.0

The main idea is to use apply(pd.Series) to "explode" the lists into columns:

In [572]: interests = df['interests'].apply(pd.Series); interests
Out[572]: 
    0   1   2    3                   4    5    6
0  {}  {}  {}   {}                  {}  NaN  NaN
1  {}  {}  {}  NaN                 NaN  NaN  NaN
2  {}  {}  {}   {}  {1535, 1542, 1527}   {}  NaN
3  {}  {}  {}   {}                  {}   {}   {}
4  {}  {}  {}   {}  {1535, 1542, 1527}   {}  NaN
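To see the trick in isolation, here is a minimal sketch on made-up data (the values are illustrative, not taken from the question). Each list becomes one row of a wide DataFrame, padded with NaN, and stack followed by dropna turns that into one entry per list element:

```python
import pandas as pd

# Hypothetical ragged lists, one per original row.
s = pd.Series([[1, 2, 3], [4], [5, 6]])

# Each list is turned into a row; shorter lists are padded with NaN.
wide = s.apply(pd.Series)

# stack() moves the column labels into an inner index level.
# dropna() is explicit here because whether stack() discards the
# NaN padding by itself depends on the pandas version.
long = wide.stack().dropna()
```

The outer index level of long still points at the original row, which is what makes the later join back onto df possible.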

Since you wish to "explode" the sets as well, apply the pd.Series trick a second time:

In [573]: interests = interests.stack().apply(lambda x: pd.Series(list(x))); interests
Out[573]: 
          0       1       2
0 0     NaN     NaN     NaN
  1     NaN     NaN     NaN
  2     NaN     NaN     NaN
  3     NaN     NaN     NaN
  4     NaN     NaN     NaN
1 0     NaN     NaN     NaN
  1     NaN     NaN     NaN
  2     NaN     NaN     NaN
2 0     NaN     NaN     NaN
  1     NaN     NaN     NaN
  2     NaN     NaN     NaN
  3     NaN     NaN     NaN
  4  1535.0  1542.0  1527.0
  ...

After doing the same for the users column, combine both DataFrames into one:

result = pd.concat({'users': users, 'interests':interests}, axis=1)

Move the inner column index level to the index, and forward-fill the users column to propagate the users values when a user has multiple interests:

result = result.stack() 
result['users'] = result['users'].ffill()
#        interests  users
# 0 0 0        NaN    1.0
#   1 0        NaN    1.0
#   2 0        NaN   28.0
#   3 0        NaN   28.0
#   4 0        NaN   68.0
# 1 0 0        NaN    1.0
#   1 0        NaN    1.0
#   2 0        NaN   16.0
# 2 0 0        NaN   32.0
#   1 0        NaN   37.0
#   2 0        NaN   66.0
#   3 0        NaN   67.0
#   4 0     1535.0   54.0
#     1     1542.0   54.0
#     2     1527.0   54.0
# ...
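The forward-fill step can be illustrated on its own. In this small sketch the values are made up to mimic a user (54) with three interests, where only the first of the three rows carries the user id after stacking:

```python
import numpy as np
import pandas as pd

# Hypothetical users column after stacking: NaN marks the extra
# interest rows that belong to the user on the row above.
users = pd.Series([67.0, 54.0, np.nan, np.nan, 117.0])

# ffill copies the last seen value downwards into the gaps.
filled = users.ffill()
```

This relies on the first row of every user holding a real value, so that nothing leaks in from the previous user.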

Finally, drop the 2 inner-most index levels and join the result back into df:

result.index = result.index.droplevel(level=[1,2])
result = df.drop(['interests','users'], axis=1).join(result)
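On newer pandas there may be a shorter route. The sketch below swaps in DataFrame.explode, which is a different technique from the apply(pd.Series) steps above, and assumes pandas 1.3 or later (the first version that accepts a list of columns). The tiny DataFrame is made up, and the columns are assumed to already hold parsed lists and sets:

```python
import pandas as pd

# Hypothetical, already-parsed data: one list of users and one list of
# interest sets per row, with matching lengths.
df = pd.DataFrame({
    'test': [0, 1],
    'users': [[1, 28], [16]],
    'interests': [[set(), {1535, 1542}], [set()]],
})

# Turn each set into a sorted list so the output order is predictable.
df['interests'] = df['interests'].apply(lambda sets: [sorted(s) for s in sets])

# First explode: one row per user, with users and interests kept in step.
# Second explode: one row per interest; an empty list becomes a single NaN row.
result = df.explode(['users', 'interests']).explode('interests')
```

The other columns (here test) are repeated automatically, and the original row label is kept in the index, much like the joined result above.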

8 Comments

Thank you for your answer. What is the column 'test'? It has the same values as 'Unnamed: 0', but I can't access that column ;) @unutbu
Solved that issue with df['index'] = df['Unnamed: 0']. I'll try your code now.
I get the following error: TypeError: eval() arg 1 must be a string, bytes or code object. @unutbu I also have other data besides these two columns. Is it possible to repeat those columns? When I try that now, they are just empty.
Ah, other columns that you don't want eval'd should be treated like test. I'll modify the example above to show what I mean.
I edited my question again; I tried to make it as clear as possible what I want :) @unutbu