
I have some boolean variables in a pandas DataFrame and I need to get all unique tuples. My idea was to create a new column combining the values of my variables, then use pandas.Series.unique() to get all unique tuples.

To build that column, my idea was to use binary expansion. For instance, for the dataframe:

import pandas as pd
df = pd.DataFrame({'v1':[0,1,0,0,1],'v2':[0,0,0,1,1], 'v3':[0,1,1,0,1], 'v4':[0,1,1,1,1]})

I could create a column like this:

df['added'] = df['v1'] + df['v2']*2 + df['v3']*4 + df['v4']*8
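
For illustration, the rest of the plan would then be (the decoding step below is only my own sketch of how each code maps back to a tuple):

codes = df['added'].unique()  # one integer per unique tuple
# recover (v1, v2, v3, v4) from each code by shifting and masking
unique_tuples = [tuple(int((c >> i) & 1) for i in range(4)) for c in codes]
print(unique_tuples)
# [(0, 0, 0, 0), (1, 0, 1, 1), (0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 1, 1)]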

My plan was to iterate over the list of variables like this (note that in my real problem I do not know the number of columns in advance):

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var] << ind

This, however, throws an error: TypeError: unsupported operand type(s) for <<: 'Series' and 'int'.
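
(Side note: even if << accepted a Series, the line above would also have a precedence problem, because + binds tighter than << in Python, so the whole running sum gets shifted. A quick sketch on plain NumPy arrays, just to show the difference:)

import numpy as np

a = np.array([1, 1, 1])
b = np.array([1, 0, 1])
print(a + b << 1)    # [4 2 4] -- parsed as (a + b) << 1
print(a + (b << 1))  # [3 1 3] -- the intended per-column shift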

I can work around the error with pandas.Series.apply(), like this:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var].apply(lambda x: x << ind)

However, apply is (typically) slow. How can I do things more efficiently?


2 Comments
  • There's something odd about your code: df['var'] should be df[var], no? Commented Apr 2, 2019 at 13:08
  • Yes, of course, thanks (fixed this)! Commented Apr 2, 2019 at 13:15

3 Answers


Use a dot product with powers of two, as in this solution, only simplified, because the ordering here already matches the column order:

import numpy as np

df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
print(df)
   v1  v2  v3  v4  new
0   0   0   0   0    0
1   1   0   1   1   13
2   0   0   1   1   12
3   0   1   0   1   10
4   1   1   1   1   15
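
To spell out what the dot product does (my own breakdown, not part of the original answer): 1 << np.arange(4) builds the bit weights [1, 2, 4, 8], and each 0/1 row is multiplied by those weights and summed:

import numpy as np

weights = 1 << np.arange(4)   # array([1, 2, 4, 8])
row = np.array([1, 0, 1, 1])  # second row of the example frame
print(row.dot(weights))       # 1*1 + 0*2 + 1*4 + 1*8 = 13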

Performance with 1000 rows and 4 columns:

np.random.seed(2019)

N= 1000
df = pd.DataFrame(np.random.choice([0,1], size=(N, 4)))
df.columns = [f'v{x+1}' for x in df.columns]

In [60]: %%timeit
    ...: df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
113 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Yuca's solution:

In [65]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:     df['added'] = df['added'] + [x<<ind for x in df[var]]
    ...: 
1.82 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

In [66]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:    df['added'] = df['added'] + df[var].apply(lambda x : x << ind )
    ...: 
3.14 ms ± 8.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

4 Comments

+1 for your solution. However, it either consumes memory (storing the array a) or repeats the df.values operation twice, so I am not sure this is the best that can be done...
@MatinaG - I don't understand, why twice? In my opinion this should be the most optimal solution, because it works on all the data at once instead of looping over each column separately.
You either store df.values in a separate variable (a) or, if you don't, you have to write something like df['new'] = df.values.dot(1 << np.arange(df.values.shape[-1])). Thanks for the suggestion.
@MatinaG - I found a better solution that avoids the double df.values, and also added timings to compare performance.
1

Getting unique rows is the same operation as drop_duplicates: dropping all duplicate rows leaves only the unique ones.

df[["v2","v3","v4"]].drop_duplicates()

2 Comments

Sure, thanks a lot. I gave you +1, but since the question was mostly about binary shifts on pandas Series, I will have to accept as correct an answer about that.
That makes sense to me

To answer your question about a more efficient alternative: a list comprehension does help a bit:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    %timeit df['added'] = df['added'] + [x << ind for x in df[var]]

308 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
322 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
316 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So roughly 315 µs, versus:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    %timeit df['added'] = df['added'] + df[var].apply(lambda x: x << ind)

500 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
503 µs ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
481 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As a disclaimer, I don't agree with the value of the sum: enumerate(variables[1:]) starts ind at 0, so v2 gets shifted by 0 and ends up with the same weight as v1. But that's a different topic :)
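
A minimal sketch of the fix, shifting by ind + 1 so the weights come out as 1, 2, 4, 8:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    # v2 -> << 1, v3 -> << 2, v4 -> << 3, matching v1 + 2*v2 + 4*v3 + 8*v4
    df['added'] = df['added'] + [x << (ind + 1) for x in df[var]]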

