
I have some boolean variables in a pandas DataFrame and I need to get all unique tuples. My idea was to create a new column combining the values of my variables, then use pandas.Series.unique() to get all unique tuples.

To build that column, my idea was to use binary expansion. For instance, for the dataframe:

import pandas as pd
df = pd.DataFrame({'v1':[0,1,0,0,1],'v2':[0,0,0,1,1], 'v3':[0,1,1,0,1], 'v4':[0,1,1,1,1]})

I could create a column like this:

df['added'] = df['v1'] + df['v2']*2 + df['v3']*4 + df['v4']*8
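
For illustration, the rest of the plan would then be (the decoding step below is only my own sketch of how each code maps back to a tuple):

codes = df['added'].unique()  # one integer per unique tuple
# recover (v1, v2, v3, v4) from each code by shifting and masking
unique_tuples = [tuple(int((c >> i) & 1) for i in range(4)) for c in codes]
print(unique_tuples)
# [(0, 0, 0, 0), (1, 0, 1, 1), (0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 1, 1)]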

My plan was to iterate over the list of variables like this (note that in my real problem I do not know the number of columns in advance):

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var] << ind

This, however, throws an error: TypeError: unsupported operand type(s) for <<: 'Series' and 'int'.
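
(Side note: even if << accepted a Series, the line above would also have a precedence problem, because + binds tighter than << in Python, so the whole running sum gets shifted. A quick sketch on plain NumPy arrays, just to show the difference:)

import numpy as np

a = np.array([1, 1, 1])
b = np.array([1, 0, 1])
print(a + b << 1)    # [4 2 4] -- parsed as (a + b) << 1
print(a + (b << 1))  # [3 1 3] -- the intended per-column shift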

I can work around the error with pandas.Series.apply(), like this:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    df['added'] = df['added'] + df[var].apply(lambda x: x << ind)

However, apply is (typically) slow. How can I do things more efficiently?


2 Comments
  • There's something odd about your code: df['var'] should be df[var], no? Commented Apr 2, 2019 at 13:08
  • Yes, of course, thanks (fixed this)! Commented Apr 2, 2019 at 13:15

3 Answers


Use a dot product with powers of two, as in this solution, only simplified, because the ordering here already matches the column order:

import numpy as np

df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
print(df)
   v1  v2  v3  v4  new
0   0   0   0   0    0
1   1   0   1   1   13
2   0   0   1   1   12
3   0   1   0   1   10
4   1   1   1   1   15
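
To spell out what the dot product does (my own breakdown, not part of the original answer): 1 << np.arange(4) builds the bit weights [1, 2, 4, 8], and each 0/1 row is multiplied by those weights and summed:

import numpy as np

weights = 1 << np.arange(4)   # array([1, 2, 4, 8])
row = np.array([1, 0, 1, 1])  # second row of the example frame
print(row.dot(weights))       # 1*1 + 0*2 + 1*4 + 1*8 = 13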

Performance with 1000 rows and 4 columns:

np.random.seed(2019)

N= 1000
df = pd.DataFrame(np.random.choice([0,1], size=(N, 4)))
df.columns = [f'v{x+1}' for x in df.columns]

In [60]: %%timeit
    ...: df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
113 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Yuca's solution:

In [65]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:     df['added'] = df['added'] + [x<<ind for x in df[var]]
    ...: 
1.82 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

In [66]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:    df['added'] = df['added'] + df[var].apply(lambda x : x << ind )
    ...: 
3.14 ms ± 8.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

4 Comments

+1 for your solution. However, it either consumes memory (storing the array a) or repeats the df.values operation twice, so I am not sure this is the best that can be done...
@MatinaG - I don't understand, why twice? In my opinion this should be the most optimal solution, because it works on all the data at once instead of looping over each column separately.
You either store df.values in a separate variable (a) or, if you don't, you have to write something like df['new'] = df.values.dot(1 << np.arange(df.values.shape[-1])). Thanks for the suggestion.
@MatinaG - I found a better solution that avoids the double df.values, and also added timings to compare performance.
1

Getting unique rows is the same operation as drop_duplicates: dropping all duplicate rows leaves only the unique ones.

df[["v2","v3","v4"]].drop_duplicates()

2 Comments

Sure, thanks a lot. I gave you +1, but since the question was mostly about binary shifts on pandas Series, I will have to accept as correct an answer about that.
That makes sense to me

To answer your question about a more efficient alternative: a list comprehension does help a bit:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    %timeit df['added'] = df['added'] + [x << ind for x in df[var]]

308 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
322 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
316 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So roughly 315 µs, versus:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    %timeit df['added'] = df['added'] + df[var].apply(lambda x: x << ind)

500 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
503 µs ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
481 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As a disclaimer, I don't agree with the value of the sum: enumerate(variables[1:]) starts ind at 0, so v2 gets shifted by 0 and ends up with the same weight as v1. But that's a different topic :)
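
A minimal sketch of the fix, shifting by ind + 1 so the weights come out as 1, 2, 4, 8:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]):
    # v2 -> << 1, v3 -> << 2, v4 -> << 3, matching v1 + 2*v2 + 4*v3 + 8*v4
    df['added'] = df['added'] + [x << (ind + 1) for x in df[var]]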

