2

I have a dataframe like this,

>>> import pandas as pd
>>> data = {
    'user_id': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    'movie_id': [0, 1, 2, 0, 1, 2, 3, 2, 3, 4]
}
>>> df = pd.DataFrame(data)
>>> df
   user_id  movie_id
0        1         0
1        1         1
2        1         2
3        2         0
4        2         1
5        3         2
6        3         3
7        4         2
8        4         3
9        4         4

I wonder how many people liked the second movie after they liked the first movie. Or liked the third movie after you liked the second movie. Etc. Here is my expected output,

[[0., 2., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 2., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0.]]

For instance, movie_id=1 liked two times after they liked movie_id=0, so matrix[0][1]=2 and matrix[1][0]=2. OK, how I found this result? user_id=1 liked movie_id=0, movie_id=1 and movie_id=2 by respectively. Also, user_id=2 liked movie_id=0 and movie_id=1 by respectively. So, matrix[0][1]=2

I tried this one, that returns incorrect output and very slow working in big dataframe.

import numpy as np

item = dict()

def cross(a):
    for i in a:
        for j in a:
            if i == j:
                continue
            if (i, j) in item.keys():
                item[(i, j)] += 1
            else:
                item[(i, j)] = 1

_ = df.groupby('user_id')['movie_id'].apply(cross)

length = df['movie_id'].nunique()
res = np.zeros([length, length])

for k, v in item.items():
    res[k] = v

Any idea? Thanks in advance.

6
  • Have you read the Pandas docs? Commented Jan 7, 2020 at 21:32
  • Yeap, but I couldn't find what I was looking for. Do you know any part that can help? Commented Jan 7, 2020 at 21:35
  • what do you mean by big dataframe? 10 different movies and millions of users, 1000 different movies and 1000 users? what is the maximum number of films for one user? this may change the way to solve your problem Commented Jan 7, 2020 at 21:53
  • Data contains 47,396 unique movies 162,414 unique users and 15,630,129 rows. And the maximum number of votes cast by a user is 12,802 and the average number of votes cast is 96. Commented Jan 7, 2020 at 21:59
  • but movie_id == 0 was NOT liked AFTER movie_id == 1 so why is matrix[1][0] == 2? Or is the word AFTER not being used appropriately here? Commented Jan 7, 2020 at 22:06

2 Answers 2

2

You can do the following:

# add row_numbers as a column
df.reset_index(inplace=True) 

# merge df on itself
df2 = df.merge(df, how='inner', on='user_id')

# remove some entries, keep only pairs where movie_id_x was liked before movie_id_y
df2 = df2[df2['index_x']<df2['index_y']].drop(['index_x','index_y'], axis=1)

# use pivot table to make matrix
df3 = df2.pivot_table(index='movie_id_x',columns='movie_id_y', values='user_id', aggfunc='count')

# UPD: add empty rows for movies which were removed
ids = df['movie_id'].unique()
df3 = df3.reindex(ids)
df3 = df3.reindex(ids, axis=1)
df3 = df3.fillna(0)

# convert result from dataframe to array if necessary
res = np.array(df3)

Result:

print(res)
[[0 2 1 0 0]
 [0 0 1 0 0]
 [0 0 0 2 1]
 [0 0 0 0 1]
 [0 0 0 0 0]]

Faster version (with sparse matrix)

The idea is that your matrix is actually sparse and it takes a lot of memory to store it in dense form (especially in form of pandas dataframe). So it is reasonable to store it like sparse matrix. Approach was found here.

# add row_numbers as a column
df.reset_index(inplace=True) 

# merge df on itself
df2 = df.merge(df, how='inner', on='user_id')

# remove some entries, keep only pairs where movie_id_x was liked before movie_id_y
df2 = df2[df2['index_x']<df2['index_y']].drop(['index_x','index_y'], axis=1)

# use groupby to count movie pairs
df2 = df2.groupby(['movie_id_x','movie_id_y'])['user_id'].count().reset_index()

# create pivot as sparse matrix
movies_t = CategoricalDtype(sorted(df['movie_id'].unique()), ordered=True)
row = df2['movie_id_x'].astype(movies_t).cat.codes
col = df2['movie_id_y'].astype(movies_t).cat.codes
sparse_matrix = csr_matrix((df2["user_id"], (row, col)), \
                       shape=(movies_t.categories.size, movies_t.categories.size))

# convert sparse to dense if needed
res = sparse_matrix.todense()
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for answer. I updated my question on @piRSquared warning. Would you like to look again? I would love to.
Added new (more faster) version
2

Update for changed question per comment below:

Using merge and pd.crosstab with np.triu:

df_m = df.merge(df, on='user_id').query('movie_id_x <= movie_id_y')
arr = pd.crosstab(df_m['movie_id_x'], df_m['movie_id_y']).to_numpy()
arr = np.triu(arr, 1)
arr

Output:

array([[0, 2, 1, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 2, 1],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0]], dtype=int64)

Using itertools and combinations:

from itertools import combinations

s = df.groupby('user_id')['movie_id'].apply(lambda x: pd.Series(combinations(x,2)))\
      .value_counts()

arr = pd.DataFrame(s, index=pd.MultiIndex.from_tuples(s.index))['movie_id'].unstack()\
        .reindex(index=[0,1,2,3,4], columns=[0,1,2,3,4]).fillna(0).to_numpy()

arr

Output:

array([[0., 2., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 2., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.]])

Update using @Stepan approach, using merge, and pd.crosstab:

df_m = df.merge(df, on='user_id').query('movie_id_x != movie_id_y')
pd.crosstab(df_m['movie_id_x'], df_m['movie_id_y']).to_numpy()

Output:

array([[0, 2, 1, 0, 0],
       [2, 0, 1, 0, 0],
       [1, 1, 0, 2, 1],
       [0, 0, 2, 0, 1],
       [0, 0, 1, 1, 0]], dtype=int64)

You can try this:

from itertools import permutations

s = df.groupby('user_id')['movie_id'].apply(lambda x: pd.Series(permutations(x,2)))\
      .value_counts()

arr = pd.DataFrame(s, index=pd.MultiIndex.from_tuples(s.index))['movie_id'].unstack()\
        .reindex(index=[0,1,2,3,4], columns=[0,1,2,3,4]).fillna(0).to_numpy()

arr

Output:

array([[0., 2., 1., 0., 0.],
       [2., 0., 1., 0., 0.],
       [1., 1., 0., 2., 1.],
       [0., 0., 2., 0., 1.],
       [0., 0., 1., 1., 0.]])

2 Comments

Thanks for answer. I updated my question on @piRSquared warning. Would you like to look again? I would love to.
@E.Zeytinci Updated.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.