1

I have a dataframe looking like this:

columns: a, b
entries: [[1,2],[3,2],[1,3]]

I want to transform it into a dataframe with max(a+b) columns, such that every column in the inclusive range from a to a+b holds a 1 and every other column holds a 0. For the example, the result would look like this:

columns: 1, 2, 3, 4, 5
entries: [[1,1,1,0,0],[0,0,1,1,1],[1,1,1,1,0]]

Is there an easy way to do this in Python, preferably with pandas? I can do it row by row with a for loop, but that is very time-consuming and ugly.
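For reference, the input and the row-by-row loop described above could be sketched as follows. The names `n`, `rows` and `expected` are only illustrative, and the inclusive range a to a+b is read off the example output rather than stated outright in the question.

```python
import pandas as pd

# Example input from the question.
df = pd.DataFrame([[1, 2], [3, 2], [1, 3]], columns=["a", "b"])

# Number of output columns: the largest a + b over all rows.
n = int((df["a"] + df["b"]).max())

# Row-by-row baseline: put a 1 in columns a .. a + b (inclusive), 0 elsewhere.
rows = []
for a, b in zip(df["a"], df["b"]):
    rows.append([1 if a <= col <= a + b else 0 for col in range(1, n + 1)])

expected = pd.DataFrame(rows, columns=range(1, n + 1))
print(expected)
```

This is the slow baseline; it is mainly useful as something to compare a vectorized result against.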

0

3 Answers

3

Construct a new dataframe using np.repeat and np.arange.

n = df.sum(1).max()
df_out = pd.DataFrame(np.repeat([np.arange(1,n+1)], len(df), axis=0), columns=np.arange(1,n+1))
df_out = (df_out.ge(df.a, axis=0) & df_out.le(df.sum(1), axis=0)).astype(int)

Out[233]:
   1  2  3  4  5
0  1  1  1  0  0
1  0  0  1  1  1
2  1  1  1  1  0
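To see why this works, it may help to look at the two comparisons separately. The sketch below rebuilds the same steps on the sample data; the intermediate names `grid`, `lower` and `upper` are not part of the answer and are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 2], [1, 3]], columns=["a", "b"])

n = df.sum(axis=1).max()          # largest a + b, here 5
cols = np.arange(1, n + 1)        # column labels 1 .. n

# Every row of the grid holds the column numbers 1 .. n.
grid = pd.DataFrame(np.repeat([cols], len(df), axis=0), columns=cols)

# axis=0 lines the Series up with the rows, so each row is compared
# against its own a (lower bound) and its own a + b (upper bound).
lower = grid.ge(df["a"], axis=0)
upper = grid.le(df.sum(axis=1), axis=0)

result = (lower & upper).astype(int)
print(result)
```

A cell ends up as 1 only where both masks are True, which is exactly the inclusive range from a to a+b.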

Timing:

Surprisingly, this is much faster than get_dummies on a dataframe with a large number of rows.

Sample:

df = pd.concat([df]*10000, ignore_index=True)

In [190]: %timeit df.apply(lambda x: '|'.join(map(str, range(x['a'], x['a'] + x['b'] + 1))), axis=1).str.get_dummies()
845 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [244]: %%timeit
 ...: n = df.sum(1).max()
 ...: df_out = pd.DataFrame(np.repeat([np.arange(1,n+1)], len(df), axis=0), columns=np.arange(1,n+1))
 ...: (df_out.ge(df.a, axis=0) & df_out.le(df.sum(1), axis=0)).astype(int)
 ...:
 ...:
3.35 ms ± 5.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Shubham's solution:

In [240]: %%timeit
     ...: m = np.arange(1, df.max().sum())
     ...: a = np.tile(m, (len(df), 1))
     ...: pd.DataFrame((df.to_numpy()[:, 0, None] <= a) &
     ...:              (a <= df.sum(1).to_numpy()[:, None]), dtype='int', columns=m)
     ...:
1.79 ms ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2 Comments

Why not vectorize that df_out.agg(lambda x: (x >= df.a.values) & (x <= df.sum(1))).astype(int)?
Nice idea @Andy ..Can you please test my solution :)
2

We can try numpy broadcasting:

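# Columns 1 .. max(a + b). Note that df.max().sum() is max(a) + max(b),
# which need not equal the row-wise maximum of a + b.
m = np.arange(1, df.sum(1).max() + 1)
a = np.tile(m, (len(df), 1))

# A cell is 1 where a <= column <= a + b for that row.
a = pd.DataFrame((df.to_numpy()[:, 0, None] <= a) &
                 (a <= df.sum(1).to_numpy()[:, None]), dtype='int', columns=m)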

Result:

   1  2  3  4  5
0  1  1  1  0  0
1  0  0  1  1  1
2  1  1  1  1  0
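As a variation on the same idea, the np.tile step can probably be dropped, since a column vector of shape (rows, 1) already broadcasts against a 1-D array of column numbers. The sketch below is one way to write that; `interval_indicator` is a hypothetical helper name, and it takes the column count from the largest row-wise a + b.

```python
import numpy as np
import pandas as pd


def interval_indicator(df):
    """Return a 0/1 frame with a 1 in columns a .. a + b (inclusive) per row."""
    start = df["a"].to_numpy()[:, None]               # shape (rows, 1)
    stop = (df["a"] + df["b"]).to_numpy()[:, None]    # shape (rows, 1)
    cols = np.arange(1, stop.max() + 1)               # shape (n_cols,)

    # Broadcasting compares every row bound with every column number.
    mask = (start <= cols) & (cols <= stop)
    return pd.DataFrame(mask.astype(int), columns=cols, index=df.index)


df = pd.DataFrame([[1, 2], [3, 2], [1, 3]], columns=["a", "b"])
result = interval_indicator(df)
print(result)
```

No timing is claimed for this version; it is only meant to show the broadcasting shapes explicitly.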

2 Comments

Yours is the fastest :) +1
Thanks for testing @AndyL ...liked your answer already upvoted ;)
1

Use Series.str.get_dummies:

df = df.apply(lambda x: '|'.join(map(str, range(x['a'], x['a'] + x['b'] + 1))), axis=1).str.get_dummies()
print(df)
   1  2  3  4  5
0  1  1  1  0  0
1  0  0  1  1  1
2  1  1  1  1  0
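One caveat: the column labels that str.get_dummies returns are strings, so once there are ten or more columns they come out in text order (1, 10, 11, ..., 2), and columns that no row covers are missing altogether. A minimal sketch of one possible fix, assuming integer labels 1 .. max(a+b) are wanted, is to convert the labels and reindex. The sample data and the names `joined`, `dummies` and `n` are made up for illustration.

```python
import pandas as pd

# Made-up data that produces two-digit column labels and a gap (4 .. 7).
df = pd.DataFrame({"a": [1, 8], "b": [2, 3]})

joined = df.apply(
    lambda x: "|".join(map(str, range(x["a"], x["a"] + x["b"] + 1))), axis=1
)
dummies = joined.str.get_dummies()

# Turn the string labels into integers ...
dummies.columns = [int(c) for c in dummies.columns]

# ... then reindex to 1 .. max(a + b): this puts the columns in numeric
# order and fills columns that never occurred with 0.
n = int((df["a"] + df["b"]).max())
dummies = dummies.reindex(columns=range(1, n + 1), fill_value=0)
print(dummies)
```

If only the ordering matters and gaps are acceptable, dummies.sort_index(axis=1) after the label conversion should be enough.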

1 Comment

Thanks. It seems to work, the only problem is that the columns are named 1, 10, 11,..., 2, 21,... Any solutions for that or should I do it by hand?
