
I have a pandas DataFrame that looks like this:

user    items
1       ["product1", "product2", "product3"]
2       ["product5", "product7", "product2"]
3       ["product1", "product4", "product5"]

I have 2 million users, each with a list of 100 products. I need to transform my DataFrame like this:

user    item_1        item_2        item_3
1       "product1"    "product2"    "product3"
2       "product5"    "product7"    "product2"
3       "product1"    "product4"    "product5"

Does anyone have a "pythonic", quick way to do this? I don't want to use for loops because they take too much time.

Thank you

2 Answers


You can rebuild the item columns from df['items'].values.tolist() and join them back onto the rest of the frame.
I went this direction because it's faster than apply.

Considering the size of your data, you'll want this rather than an apply-based approach.

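# Expand each list into its own columns, then label them item_1, item_2, ...
# (drop takes columns= as a keyword; the positional axis argument was removed in pandas 2.0)
df.drop(columns='items').join(
    pd.DataFrame(df['items'].values.tolist(), df.index).rename(
        columns=lambda x: 'item_{}'.format(x + 1)
    )
)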

   user    item_1    item_2    item_3
0     1  product1  product2  product3
1     2  product5  product7  product2
2     3  product1  product4  product5
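For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea on the three sample rows from the question. The variable names `wide` and `result` are mine, and it assumes every list has the same length.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "user": [1, 2, 3],
    "items": [
        ["product1", "product2", "product3"],
        ["product5", "product7", "product2"],
        ["product1", "product4", "product5"],
    ],
})

# Turn the column of lists into a list of lists and let the DataFrame
# constructor spread it over columns, reusing the original index
wide = pd.DataFrame(df["items"].tolist(), index=df.index)

# Rename the default 0, 1, 2 labels to item_1, item_2, item_3
wide.columns = [f"item_{i + 1}" for i in range(wide.shape[1])]

# Drop the list column and attach the new columns by index
result = df.drop(columns="items").join(wide)
print(result)
```

Passing `index=df.index` matters when the frame does not have a default 0..n-1 index, because `join` aligns on the index.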

We can shave a bit more time off by building the result directly in NumPy:

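# 2-D array of items: one row per user, one column per item
items_array = np.array(df['items'].values.tolist())
# Column labels item_1 ... item_n (np.char.add is the public name for np.core.defchararray.add)
cols = np.char.add(
    'item_', np.arange(1, items_array.shape[1] + 1).astype(str)
)
# column_stack yields a single-dtype array, so 'user' no longer stays an integer column
pd.DataFrame(
    np.column_stack([df['user'].values, items_array]),
    columns=np.append('user', cols)
)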
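One thing to keep in mind with the NumPy version: an array holds a single dtype, so stacking the user ids next to the product strings cannot keep the ids as integers. Below is a rough sketch of a variant that keeps `user` numeric by building the item columns first and inserting `user` afterwards. It assumes every list has the same length; `out` and the sample frame are only illustrative names, and I have not timed this variant.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with the same shape as the question's data
df = pd.DataFrame({
    "user": [1, 2, 3],
    "items": [
        ["product1", "product2", "product3"],
        ["product5", "product7", "product2"],
        ["product1", "product4", "product5"],
    ],
})

# 2-D string array: one row per user, one column per item
items_array = np.array(df["items"].tolist())

# Column labels item_1 ... item_n
cols = [f"item_{i}" for i in range(1, items_array.shape[1] + 1)]

# Build the item columns alone, then insert 'user' so its dtype is untouched
out = pd.DataFrame(items_array, columns=cols, index=df.index)
out.insert(0, "user", df["user"].to_numpy())
print(out.dtypes)
```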

Timing

%timeit df[['user']].join(df['items'].apply(pd.Series).add_prefix('item_'))
%timeit df.drop(columns='items').join(pd.DataFrame(df['items'].values.tolist(), df.index).rename(columns=lambda x: 'item_{}'.format(x + 1)))

1000 loops, best of 3: 1.8 ms per loop
1000 loops, best of 3: 1.34 ms per loop

%%timeit
items_array = np.array(df['items'].values.tolist())
cols = np.char.add(
    'item_', np.arange(1, items_array.shape[1] + 1).astype(str)
)
pd.DataFrame(
    np.column_stack([df['user'].values, items_array]),
    columns=np.append('user', cols)
)

10000 loops, best of 3: 188 µs per loop

Larger data

n = 20000
items = ['A%s' % i for i in range(1000)]
df = pd.DataFrame(dict(
        user=np.arange(n),
        items=np.random.choice(items, (n, 100)).tolist()
    ))

%timeit df[['user']].join(df['items'].apply(pd.Series).add_prefix('item_'))
%timeit df.drop(columns='items').join(pd.DataFrame(df['items'].values.tolist(), df.index).rename(columns=lambda x: 'item_{}'.format(x + 1)))

1 loop, best of 3: 3.22 s per loop
1 loop, best of 3: 492 ms per loop

%%timeit
items_array = np.array(df['items'].values.tolist())
cols = np.char.add(
    'item_', np.arange(1, items_array.shape[1] + 1).astype(str)
)
pd.DataFrame(
    np.column_stack([df['user'].values, items_array]),
    columns=np.append('user', cols)
)

1 loop, best of 3: 389 ms per loop
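These numbers depend on hardware and library versions, so it may be worth re-measuring on your own machine. Here is a rough sketch that does so outside IPython with the standard `timeit` module and also checks that both approaches produce the same values. The helper names `via_apply` and `via_tolist` are hypothetical, and `n` is kept small here just so it runs quickly.

```python
import timeit

import numpy as np
import pandas as pd

n = 500  # small illustrative size; raise it to approximate your own data
rng = np.random.default_rng(0)
labels = np.array([f"A{i}" for i in range(1000)])
df = pd.DataFrame({
    "user": np.arange(n),
    "items": rng.choice(labels, size=(n, 100)).tolist(),
})


def via_apply(frame):
    # Row-by-row expansion: builds one Series per list
    wide = frame["items"].apply(pd.Series).add_prefix("item_")
    return frame[["user"]].join(wide)


def via_tolist(frame):
    # Single constructor call on a list of lists
    wide = pd.DataFrame(frame["items"].tolist(), index=frame.index).add_prefix("item_")
    return frame[["user"]].join(wide)


a, b = via_apply(df), via_tolist(df)
same = (
    list(a.columns) == list(b.columns)
    and a.to_numpy().tolist() == b.to_numpy().tolist()
)

# Best of three single runs for each approach
t_apply = min(timeit.repeat(lambda: via_apply(df), number=1, repeat=3))
t_tolist = min(timeit.repeat(lambda: via_tolist(df), number=1, repeat=3))

print(f"same result: {same}")
print(f"apply:  {t_apply:.4f} s")
print(f"tolist: {t_tolist:.4f} s")
```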

2 Comments

I tried it on 200 lines and it works. Both methods took too much time and I needed to go, so I'll run this tomorrow and come back to tell you the run time. Btw, I actually have 100 products, not 30.
Well, it took 2.25 seconds. Thank you very much! :)

You can try:

df[['user']].join(df['items'].apply(pd.Series).add_prefix('item_'))

Should yield:

#    user    item_0    item_1    item_2
# 0     1  product1  product2  product3
# 1     2  product5  product7  product2
# 2     3  product1  product4  product5
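Note that `add_prefix` keeps the default 0-based column labels, so the names start at `item_0`. If you need `item_1`, `item_2`, ... as in the question, one possible sketch is to relabel the expanded columns before joining; `wide` and `result` are just illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 2, 3],
    "items": [
        ["product1", "product2", "product3"],
        ["product5", "product7", "product2"],
        ["product1", "product4", "product5"],
    ],
})

# Expand each list into a row of columns labelled 0, 1, 2
wide = df["items"].apply(pd.Series)

# Shift the labels by one and add the prefix: item_1, item_2, item_3
wide.columns = [f"item_{i + 1}" for i in wide.columns]

result = df[["user"]].join(wide)
print(result)
```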

I hope this helps.

1 Comment

Thanks, Abdou! :)
