Pandas DataFrame manipulation from numerical into binary

Question

I have a dataframe looking like this:

columns: a, b
entries: [[1,2],[3,2],[1,3]]

I want to transform it into a dataframe with max(a+b) columns such that, in each row, every column from a to a+b (inclusive) is a 1 and every other one is a 0. For the example, the result would look like this:

columns: 1, 2, 3, 4, 5
entries: [[1,1,1,0,0],[0,0,1,1,1],[1,1,1,1,0]]

Is there an easy way to do this in Python, preferably with pandas? I can do it row by row with a for loop, but that is slow and ugly.
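
For reference, here is a sketch of the example input and of the row-by-row baseline described above. The helper name `to_binary_loop` is made up for illustration, and the range is taken as inclusive (a through a+b), which is what the expected output implies.

```python
import pandas as pd

# Example input from the question.
df = pd.DataFrame([[1, 2], [3, 2], [1, 3]], columns=["a", "b"])

def to_binary_loop(frame):
    """Row-by-row baseline: 1 in columns a..a+b (inclusive), 0 elsewhere."""
    n = int((frame["a"] + frame["b"]).max())  # number of output columns
    rows = []
    for a, b in zip(frame["a"], frame["b"]):
        rows.append([1 if a <= col <= a + b else 0 for col in range(1, n + 1)])
    return pd.DataFrame(rows, columns=range(1, n + 1), index=frame.index)

expected = to_binary_loop(df)
print(expected)
```

A baseline like this is slow on large frames, but it is handy for checking that a vectorized version produces the same table.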

Andy L. · Accepted Answer · 2020-09-09 10:16:30Z

3

Construct a new dataframe whose rows each hold the column labels 1..n, using np.repeat and np.arange, then compare it row-wise against a and a+b.

import numpy as np
import pandas as pd

n = df.sum(1).max()  # number of output columns: max(a + b)
df_out = pd.DataFrame(np.repeat([np.arange(1,n+1)], len(df), axis=0), columns=np.arange(1,n+1))
df_out = (df_out.ge(df.a, axis=0) & df_out.le(df.sum(1), axis=0)).astype(int)

Out[233]:
   1  2  3  4  5
0  1  1  1  0  0
1  0  0  1  1  1
2  1  1  1  1  0
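
The `axis=0` argument is what lines the comparison up per row: each value of `df.a` (or of the row sum) is compared against every column of the row with the same index. A minimal sketch on the question's data, with the intermediate names `grid`, `lower` and `upper` chosen here purely for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 2], [1, 3]], columns=["a", "b"])

n = df.sum(axis=1).max()                      # max(a + b)
cols = np.arange(1, n + 1)
# Every row holds the column labels 1..n.
grid = pd.DataFrame(np.repeat([cols], len(df), axis=0), columns=cols)

lower = grid.ge(df["a"], axis=0)              # column label >= a, row by row
upper = grid.le(df.sum(axis=1), axis=0)       # column label <= a + b, row by row
result = (lower & upper).astype(int)
print(result)
```

Without `axis=0`, pandas would try to align the Series index with the column labels instead of the rows, which is not the comparison wanted here.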

Timing:

Surprisingly, this is faster than the get_dummies approach on a dataframe with a large number of rows.

Sample:

df = pd.concat([df]*10000, ignore_index=True)

In [190]: %timeit df.apply(lambda x: '|'.join(map(str, range(x['a'], x['a'] + x['b'] + 1))), axis=1).str.get_dummies()
845 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [244]: %%timeit
 ...: n = df.sum(1).max()
 ...: df_out = pd.DataFrame(np.repeat([np.arange(1,n+1)], len(df), axis=0), columns=np.arange(1,n+1))
 ...: (df_out.ge(df.a, axis=0) & df_out.le(df.sum(1), axis=0)).astype(int)
 ...:
 ...:
3.35 ms ± 5.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Shubham's solution:

In [240]: %%timeit
     ...: m = np.arange(1, df.max().sum())
     ...: a = np.tile(m, (len(df), 1))
     ...: pd.DataFrame((df.to_numpy()[:, 0, None] <= a) &
     ...:              (a <= df.sum(1).to_numpy()[:, None]), dtype='int', columns=m)
     ...:
1.79 ms ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
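
Outside IPython, a comparison along these lines could be reproduced with the standard `timeit` module. This is only a sketch: the function names are made up, `broadcast_compare` is a variation on the broadcasting idea rather than the exact code timed above, and the numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame([[1, 2], [3, 2], [1, 3]], columns=["a", "b"])
big = pd.concat([base] * 10000, ignore_index=True)   # 30,000 rows, as in the sample

def repeat_compare(frame):
    n = frame.sum(axis=1).max()
    cols = np.arange(1, n + 1)
    grid = pd.DataFrame(np.repeat([cols], len(frame), axis=0), columns=cols)
    return (grid.ge(frame["a"], axis=0) & grid.le(frame.sum(axis=1), axis=0)).astype(int)

def broadcast_compare(frame):
    total = frame.sum(axis=1).to_numpy()
    cols = np.arange(1, total.max() + 1)
    mask = (frame["a"].to_numpy()[:, None] <= cols) & (cols <= total[:, None])
    return pd.DataFrame(mask.astype(int), columns=cols, index=frame.index)

# Both variants should agree before their timings are compared.
same = repeat_compare(big).equals(broadcast_compare(big))

for fn in (repeat_compare, broadcast_compare):
    seconds = min(timeit.repeat(lambda: fn(big), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```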

edited Sep 9, 2020 at 10:16

answered Sep 9, 2020 at 9:50

2 Comments

yatu Over a year ago

Why not vectorize that df_out.agg(lambda x: (x >= df.a.values) & (x <= df.sum(1))).astype(int)?

Shubham Sharma Over a year ago

Nice idea @Andy ..Can you please test my solution :)

Shubham Sharma · Accepted Answer · 2020-09-09 10:05:10Z

2

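We can try NumPy broadcasting:

import numpy as np
import pandas as pd

# Column labels 1..max(a+b); df.max().sum() is max(a)+max(b), which only happens to fit the sample data
m = np.arange(1, df.sum(1).max() + 1)
a = np.tile(m, (len(df), 1))

a = pd.DataFrame((df.to_numpy()[:, 0, None] <= a) &
                 (a <= df.sum(1).to_numpy()[:, None]), dtype='int', columns=m)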

Result:

   1  2  3  4  5
0  1  1  1  0  0
1  0  0  1  1  1
2  1  1  1  1  0
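
Broadcasting can also do the row expansion implicitly, so `np.tile` is not strictly needed. Here is a sketch on the question's data, assuming the column count should be max(a+b). The names `start`, `total`, `cols` and `mask` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 2], [1, 3]], columns=["a", "b"])

start = df["a"].to_numpy()[:, None]           # shape (rows, 1)
total = df.sum(axis=1).to_numpy()[:, None]    # a + b, shape (rows, 1)
cols = np.arange(1, total.max() + 1)          # shape (n,), labels 1..max(a+b)

# (rows, 1) compared with (n,) broadcasts to (rows, n).
mask = (start <= cols) & (cols <= total)
out = pd.DataFrame(mask.astype(int), columns=cols, index=df.index)
print(out)
```

The only full-size array built here is the boolean mask itself, so no tiled copy of the column labels is created.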

answered Sep 9, 2020 at 10:05

2 Comments

Andy L. Over a year ago

Yours is the fastest :) +1

Shubham Sharma Over a year ago

Thanks for testing @AndyL ...liked your answer already upvoted ;)

jezrael · Accepted Answer · 2020-09-09 09:31:39Z

1

Use Series.str.get_dummies:

df = df.apply(lambda x: '|'.join(map(str, range(x['a'], x['a'] + x['b'] + 1))), axis=1).str.get_dummies()
print (df)
   1  2  3  4  5
0  1  1  1  0  0
1  0  0  1  1  1
2  1  1  1  1  0

answered Sep 9, 2020 at 9:31

1 Comment

munichmath Over a year ago

Thanks. It seems to work, the only problem is that the columns are named 1, 10, 11,..., 2, 21,... Any solutions for that or should I do it by hand?
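
On the column order: `str.get_dummies` returns string labels sorted lexicographically, which is where 1, 10, 11, ..., 2 comes from. One possible fix is to cast the labels to integers and reindex over the full 1..n range, which also fills in any column that never occurs. The sketch below uses made-up data wide enough to need more than nine columns, and the names `dummies` and `out` are illustrative.

```python
import pandas as pd

# Illustrative data (not from the question) that needs columns up to 12.
df = pd.DataFrame([[1, 2], [9, 3], [4, 1]], columns=["a", "b"])

dummies = df.apply(
    lambda x: "|".join(map(str, range(x["a"], x["a"] + x["b"] + 1))), axis=1
).str.get_dummies()

dummies.columns = dummies.columns.astype(int)      # '10' -> 10, etc.
n = int(dummies.columns.max())
# Numeric order; labels that never appeared become all-zero columns.
out = dummies.reindex(columns=range(1, n + 1), fill_value=0)
print(out)
```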
