How to create matrix from same column with relation between previous element in Pandas?

Question

I have a dataframe like this,

>>> import pandas as pd
>>> data = {
    'user_id': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    'movie_id': [0, 1, 2, 0, 1, 2, 3, 2, 3, 4]
}
>>> df = pd.DataFrame(data)
>>> df
   user_id  movie_id
0        1         0
1        1         1
2        1         2
3        2         0
4        2         1
5        3         2
6        3         3
7        4         2
8        4         3
9        4         4

I wonder how many people liked the second movie after they liked the first movie. Or liked the third movie after you liked the second movie. Etc. Here is my expected output,

[[0., 2., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 2., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0.]]

For instance, movie_id=1 liked two times after they liked movie_id=0, so matrix[0][1]=2 and matrix[1][0]=2. OK, how I found this result? user_id=1 liked movie_id=0, movie_id=1 and movie_id=2 by respectively. Also, user_id=2 liked movie_id=0 and movie_id=1 by respectively. So, matrix[0][1]=2

I tried this one, that returns incorrect output and very slow working in big dataframe.

import numpy as np

item = dict()

def cross(a):
    for i in a:
        for j in a:
            if i == j:
                continue
            if (i, j) in item.keys():
                item[(i, j)] += 1
            else:
                item[(i, j)] = 1

_ = df.groupby('user_id')['movie_id'].apply(cross)

length = df['movie_id'].nunique()
res = np.zeros([length, length])

for k, v in item.items():
    res[k] = v

Any idea? Thanks in advance.

Yeap, but I couldn't find what I was looking for. Do you know any part that can help? — E. Zeytinci
– E. Zeytinci, Commented Jan 7, 2020 at 21:35
what do you mean by big dataframe? 10 different movies and millions of users, 1000 different movies and 1000 users? what is the maximum number of films for one user? this may change the way to solve your problem — Ben.T
– Ben.T, Commented Jan 7, 2020 at 21:53
Data contains 47,396 unique movies 162,414 unique users and 15,630,129 rows. And the maximum number of votes cast by a user is 12,802 and the average number of votes cast is 96. — E. Zeytinci
– E. Zeytinci, Commented Jan 7, 2020 at 21:59
but movie_id == 0 was NOT liked AFTER movie_id == 1 so why is matrix[1][0] == 2? Or is the word AFTER not being used appropriately here? — piRSquared
– piRSquared, Commented Jan 7, 2020 at 22:06

Stepan · Accepted Answer · 2020-01-07 23:06:15Z

2

You can do the following:

# add row_numbers as a column
df.reset_index(inplace=True) 

# merge df on itself
df2 = df.merge(df, how='inner', on='user_id')

# remove some entries, keep only pairs where movie_id_x was liked before movie_id_y
df2 = df2[df2['index_x']<df2['index_y']].drop(['index_x','index_y'], axis=1)

# use pivot table to make matrix
df3 = df2.pivot_table(index='movie_id_x',columns='movie_id_y', values='user_id', aggfunc='count')

# UPD: add empty rows for movies which were removed
ids = df['movie_id'].unique()
df3 = df3.reindex(ids)
df3 = df3.reindex(ids, axis=1)
df3 = df3.fillna(0)

# convert result from dataframe to array if necessary
res = np.array(df3)

Result:

print(res)
[[0 2 1 0 0]
 [0 0 1 0 0]
 [0 0 0 2 1]
 [0 0 0 0 1]
 [0 0 0 0 0]]

Faster version (with sparse matrix)

The idea is that your matrix is actually sparse and it takes a lot of memory to store it in dense form (especially in form of pandas dataframe). So it is reasonable to store it like sparse matrix. Approach was found here.

# add row_numbers as a column
df.reset_index(inplace=True) 

# merge df on itself
df2 = df.merge(df, how='inner', on='user_id')

# remove some entries, keep only pairs where movie_id_x was liked before movie_id_y
df2 = df2[df2['index_x']<df2['index_y']].drop(['index_x','index_y'], axis=1)

# use groupby to count movie pairs
df2 = df2.groupby(['movie_id_x','movie_id_y'])['user_id'].count().reset_index()

# create pivot as sparse matrix
movies_t = CategoricalDtype(sorted(df['movie_id'].unique()), ordered=True)
row = df2['movie_id_x'].astype(movies_t).cat.codes
col = df2['movie_id_y'].astype(movies_t).cat.codes
sparse_matrix = csr_matrix((df2["user_id"], (row, col)), \
                       shape=(movies_t.categories.size, movies_t.categories.size))

# convert sparse to dense if needed
res = sparse_matrix.todense()

edited Jan 7, 2020 at 23:06

answered Jan 7, 2020 at 21:34

Stepan

1,0541 gold badge7 silver badges10 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

E. Zeytinci Over a year ago

Thanks for answer. I updated my question on @piRSquared warning. Would you like to look again? I would love to.

Stepan Over a year ago

Added new (more faster) version

Scott Boston · Accepted Answer · 2020-01-07 22:38:24Z

2

Update for changed question per comment below:

Using merge and pd.crosstab with np.triu:

df_m = df.merge(df, on='user_id').query('movie_id_x <= movie_id_y')
arr = pd.crosstab(df_m['movie_id_x'], df_m['movie_id_y']).to_numpy()
arr = np.triu(arr, 1)
arr

Output:

array([[0, 2, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 2, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0]], dtype=int64)

Using itertools and combinations:

from itertools import combinations

s = df.groupby('user_id')['movie_id'].apply(lambda x: pd.Series(combinations(x,2)))\
      .value_counts()

arr = pd.DataFrame(s, index=pd.MultiIndex.from_tuples(s.index))['movie_id'].unstack()\
        .reindex(index=[0,1,2,3,4], columns=[0,1,2,3,4]).fillna(0).to_numpy()

arr

Output:

array([[0., 2., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 2., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.]])

Update using @Stepan approach, using merge, and pd.crosstab:

df_m = df.merge(df, on='user_id').query('movie_id_x != movie_id_y')
pd.crosstab(df_m['movie_id_x'], df_m['movie_id_y']).to_numpy()

Output:

array([[0, 2, 1, 0, 0],
       [2, 0, 1, 0, 0],
       [1, 1, 0, 2, 1],
       [0, 0, 2, 0, 1],
       [0, 0, 1, 1, 0]], dtype=int64)

You can try this:

from itertools import permutations

s = df.groupby('user_id')['movie_id'].apply(lambda x: pd.Series(permutations(x,2)))\
      .value_counts()

arr = pd.DataFrame(s, index=pd.MultiIndex.from_tuples(s.index))['movie_id'].unstack()\
        .reindex(index=[0,1,2,3,4], columns=[0,1,2,3,4]).fillna(0).to_numpy()

arr

Output:

array([[0., 2., 1., 0., 0.],
       [2., 0., 1., 0., 0.],
       [1., 1., 0., 2., 1.],
       [0., 0., 2., 0., 1.],
       [0., 0., 1., 1., 0.]])

edited Jan 7, 2020 at 22:38

answered Jan 7, 2020 at 21:55

Scott Boston

154k15 gold badges160 silver badges207 bronze badges

2 Comments

E. Zeytinci Over a year ago

Thanks for answer. I updated my question on @piRSquared warning. Would you like to look again? I would love to.

Scott Boston Over a year ago

@E.Zeytinci Updated.

Collectives™ on Stack Overflow

How to create matrix from same column with relation between previous element in Pandas?

2 Answers 2

Faster version (with sparse matrix)

2 Comments

Update for changed question per comment below:

Using merge and pd.crosstab with np.triu:

Using itertools and combinations:

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Faster version (with sparse matrix)

2 Comments

Update for changed question per comment below:

Using merge and pd.crosstab with np.triu:

Using itertools and combinations:

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related