
I am working on a task that seems to me a little like one-hot encoding, but notably different. What I want to do is take a row of integers from a Pandas DataFrame and produce a binary column with 1's at the index locations specified by the integers and 0's everywhere else. If possible, I would like to do this for many rows at the same time. A trivial example would be taking

index  A  B  C
0      1  4  7
1      2  5  8
2      3  6  9

and producing

index  .0  .1  .2
0       0   0   0
1       1   0   0
2       0   1   0
3       0   0   1
4       1   0   0
5       0   1   0
6       0   0   1
7       1   0   0
8       0   1   0
9       0   0   1

I have tried

new_df = pandas.DataFrame(range(old_df.max(axis=None)+1)).isin(list(old_df.iloc[0]))

which works for a single row (the first row of the old_df in this case), but doesn't seem easily scalable to an arbitrary number of rows. Is there a built in function that does something similar to this?

Comments:

  • maybe you have to use a for-loop or use .apply() to run it on every row. (Aug 5 at 1:34)
  • No; don't do that. Use numeric indexing. (Aug 5 at 2:29)
  • new_df['index']? But new_df isn't defined yet... If that's just a typo, please edit to fix it. (Aug 5 at 3:12)
  • what is the logic behind your desired output? (Aug 5 at 6:05)

5 Answers


Another possible solution:

pd.get_dummies(df.stack()).groupby(level=0).sum().T

First, it uses stack to pivot the data from wide to long format, creating a Series with a MultiIndex that pairs each original row position with its column. This long-format Series is then fed into get_dummies to perform the one-hot encoding, converting the integer values into binary columns. Next, groupby(level=0).sum() aggregates the binary indicators for each of the original rows (the first level of the MultiIndex). Finally, .T transposes the result so that the integer values become the row index.

Output:

   0  1  2
1  1  0  0
2  0  1  0
3  0  0  1
4  1  0  0
5  0  1  0
6  0  0  1
7  1  0  0
8  0  1  0
9  0  0  1

Intermediates:


# (df.stack(), 
#  pd.get_dummies(df.stack()),  
#  [g for g in pd.get_dummies(df.stack()).groupby(level=0)], 
#  pd.get_dummies(df.stack()).groupby(level=0).sum())

(0  A    1
    B    4
    C    7
 1  A    2
    B    5
    C    8
 2  A    3
    B    6
    C    9
 dtype: int64,
          1      2      3      4      5      6      7      8      9
 0 A   True  False  False  False  False  False  False  False  False
   B  False  False  False   True  False  False  False  False  False
   C  False  False  False  False  False  False   True  False  False
 1 A  False   True  False  False  False  False  False  False  False
   B  False  False  False  False   True  False  False  False  False
   C  False  False  False  False  False  False  False   True  False
 2 A  False  False   True  False  False  False  False  False  False
   B  False  False  False  False  False   True  False  False  False
   C  False  False  False  False  False  False  False  False   True,
 [(0,
            1      2      3      4      5      6      7      8      9
   0 A   True  False  False  False  False  False  False  False  False
     B  False  False  False   True  False  False  False  False  False
     C  False  False  False  False  False  False   True  False  False),
  (1,
            1      2      3      4      5      6      7      8      9
   1 A  False   True  False  False  False  False  False  False  False
     B  False  False  False  False   True  False  False  False  False
     C  False  False  False  False  False  False  False   True  False),
  (2,
            1      2      3      4      5      6      7      8      9
   2 A  False  False   True  False  False  False  False  False  False
     B  False  False  False  False  False   True  False  False  False
     C  False  False  False  False  False  False  False  False   True)],
    1  2  3  4  5  6  7  8  9
 0  1  0  0  1  0  0  1  0  0
 1  0  1  0  0  1  0  0  1  0
 2  0  0  1  0  0  1  0  0  1)
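For reference, here is a self-contained version of the above, assuming the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Stack to long format, one-hot encode, aggregate per original row, transpose
out = pd.get_dummies(df.stack()).groupby(level=0).sum().T
print(out)
```

Note that the result is indexed only by the integers that actually occur (1 through 9 here), so index 0 from the desired output is absent.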


This post is intended to simply add a bit of extra detail to the answer supplied by PaulS. His solution is great, but for my use case, I need to include rows for every integer in a given range beginning from zero (not just which integers happen to appear as elements in the given DataFrame old_df). To address this, simply set a value of max_col_index and perform:

dummies = pandas.get_dummies(old_df.stack())
missing_indices = set(range(max_col_index)) - set(dummies.columns)

new_df = pandas.concat([
    dummies.groupby(level=0).sum(),
    pandas.DataFrame(dict.fromkeys(missing_indices, 0),
                     index=dummies.groupby(level=0).sum().index)
], axis=1).T.sort_index()

I'm not sure if there is a more concise way to do this using the DataFrame index property (that is, fill a DataFrame with null rows on missing index values up to a given maximum value), but this works well enough for me.

Edit:

It turns out I was correct in my above intuition. A slightly slicker (more "pythonic") version can be given as:

new_df = pandas.get_dummies(old_df.stack()).groupby(level=0).sum().T.reindex(range(max_col_index), fill_value=0)
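As a quick sanity check, assuming the question's sample data and max_col_index = 10 (so the output covers indices 0 through 9):

```python
import pandas as pd

old_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
max_col_index = 10  # one past the largest integer we want a row for

# One-hot encode the stacked values, aggregate per original row,
# transpose, and reindex so every integer in range gets a row
new_df = (pd.get_dummies(old_df.stack())
            .groupby(level=0).sum()
            .T
            .reindex(range(max_col_index), fill_value=0))
print(new_df)
```

The reindex fills in an all-zero row for index 0, which never appears as a value in the data.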


Assuming a range index in the input DataFrame (if not, run df = df.reset_index(drop=True)).

You could melt and pivot:

out = (df.rename(lambda x: x/10) # optional, just to get the 0.0/0.1/0.2
         .reset_index().melt('index').assign(x=1)
         .pivot_table(index='value', columns='index', values='x', fill_value=0)
         .convert_dtypes().rename_axis(index=None, columns=None)
      )

Or melt and join+pd.get_dummies:

tmp = (
    df.rename(lambda x: x / 10) # optional, just to get the 0.0/0.1/0.2
    .reset_index()
    .melt('index').drop(columns='variable')
)
out = (pd.get_dummies(tmp['index'], dtype='int')
         .set_axis(tmp['value'].rename(None))
      )

Or a more minimal version with unstack and pd.get_dummies:

tmp = df.unstack()
out = (pd.get_dummies(tmp.index.get_level_values(1)/10, dtype='int')
         .set_axis(tmp)
      )

Or with unstack and pd.crosstab:

tmp = df.unstack()
out = (pd.crosstab(tmp, tmp.index.get_level_values(1)/10)
         .rename_axis(index=None, columns=None)
      )

Output:

   0.0  0.1  0.2
1    1    0    0
2    0    1    0
3    0    0    1
4    1    0    0
5    0    1    0
6    0    0    1
7    1    0    0
8    0    1    0
9    0    0    1
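For instance, the crosstab variant can be checked end to end with the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

tmp = df.unstack()  # Series with a (column, row) MultiIndex
# Cross-tabulate each value against its original row position (scaled to 0.0/0.1/0.2)
out = (pd.crosstab(tmp, tmp.index.get_level_values(1) / 10)
         .rename_axis(index=None, columns=None))
print(out)
```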


Sklearn's OneHotEncoder supports this type of one-hot encoding directly when dense 2D output is requested (sparse_output=False):

enc = OneHotEncoder(sparse_output=False)
e = enc.fit_transform(df)

# [[1. 0. 0. 1. 0. 0. 1. 0. 0.]
#  [0. 1. 0. 0. 1. 0. 0. 1. 0.]
#  [0. 0. 1. 0. 0. 1. 0. 0. 1.]]

This can be turned into the desired DataFrame by transposing the output, assigning the categories_ learned during fit_transform as the index, and then reindexing to match the desired output range (0 to the max value, inclusive):

out_df = pd.DataFrame(
    e.T,
    index=np.hstack(enc.categories_),
    dtype='int'
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)

Output:

   0  1  2
0  0  0  0
1  1  0  0
2  0  1  0
3  0  0  1
4  1  0  0
5  0  1  0
6  0  0  1
7  1  0  0
8  0  1  0
9  0  0  1

Full runnable example:

import numpy as np  # version 2.3.2
import pandas as pd  # version 2.3.1
from sklearn.preprocessing import OneHotEncoder  # scikit-learn version 1.7.1

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

enc = OneHotEncoder(sparse_output=False)

e = enc.fit_transform(df)

out_df = pd.DataFrame(
    e.T,
    index=np.hstack(enc.categories_),
    dtype='int'
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)

print(out_df)

Note that the same values in different columns will be considered different features. The number 4 in column A and the number 4 in column B will be split into two rows with index 4 (representing the features A_4 and B_4).

Reindexing will not work if there are duplicate index labels. It is possible to get a count with .groupby(level=0).sum(), or to keep a 0/1 indicator with .groupby(level=0).max(), before reindexing.
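The duplicate-collapsing step can be sketched in plain pandas, using a hypothetical two-row frame that shares the index label 4 (as would happen if 4 were a category in two source columns):

```python
import pandas as pd

# Hypothetical one-hot rows for two features that both encode the value 4:
# the first says 4 appears in original row 0, the second in rows 0 and 1
dup = pd.DataFrame([[1, 0, 0], [1, 1, 0]], index=[4, 4])

counts = dup.groupby(level=0).sum()      # how many times 4 appears per row
indicator = dup.groupby(level=0).max()   # whether 4 appears per row at all

# With a unique index, reindexing over the full range is now safe
full = indicator.reindex(range(5), fill_value=0)
```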


Alternatively, if just checking if the value is present in each row (without considering which source column it came from) then use MultiLabelBinarizer on the DataFrame's underlying numpy array.

For the given (modified) sample DataFrame:

A  B  C
1  4  4
4  5  8
3  6  9

enc = MultiLabelBinarizer()
e = enc.fit_transform(df.to_numpy())
# [[1 0 1 0 0 0 0]
#  [0 0 1 1 0 1 0]
#  [0 1 0 0 1 0 1]]

Similar to OneHotEncoder, this can be turned into the desired DataFrame by transposing the output, assigning the classes_ learned during fit_transform as the index, and reindexing to include all values in the specified range:

out_df = pd.DataFrame(
    e.T,
    index=enc.classes_
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)

Output:

   0  1  2
0  0  0  0
1  1  0  0
2  0  0  0
3  0  0  1
4  1  1  0
5  0  1  0
6  0  0  1
7  0  0  0
8  0  1  0
9  0  0  1

Full runnable example:

import pandas as pd  # version 2.3.1
from sklearn.preprocessing import MultiLabelBinarizer  # scikit-learn version 1.7.1

df = pd.DataFrame({'A': [1, 4, 3], 'B': [4, 5, 6], 'C': [4, 8, 9]})

enc = MultiLabelBinarizer()
e = enc.fit_transform(df.to_numpy())

out_df = pd.DataFrame(
    e.T,
    index=enc.classes_
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)

print(out_df)

Comments:

Just wanted to note a couple of things. Firstly, if you have an integer that appears in more than one row (say the integer 4 appears in multiple rows), the encoder will create two rows with the index 4 for each non-unique integer, which is not my desired behavior. To address this, you simply group the new DataFrame by index and sum. Secondly, this method is the fastest of the ones I have tried for large DataFrames.
Thanks for the feedback @lanerogers, glad you were able to get it working. I also updated my answer to reflect your suggestion and provided an alternative sklearn transformation based on what you're looking for.

First, create a new dataframe whose index contains the values found in the old dataframe's columns. The columns of the new dataframe should come from the old dataframe's index.

import pandas as pd
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], "C":[7,8,9], 'D':[10, 11,12]})

new_df = df.melt().set_index('value').drop('variable', axis=1)
new_df[df.index.astype(str)] = 0
# Fill everything with zeros for now; the necessary cells are set to 1 below.

Now we match each index value of new_df against the values in the entire df and take the row locations where it appears. We then look up that row number among new_df's (string) columns and assign 1 there. The rest is already 0.

for value in new_df.index:
    original_indices = df.isin([value]).any(axis=1)
    for idx in original_indices[original_indices].index:
        new_df.loc[value, str(idx)] = 1

Note this code only works if the values in df do not repeat (contain no duplicates).
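A runnable version of the above for reference, with a sort_index() appended at the end, since melt leaves the index in column order rather than sorted:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

# Values become the index; original row labels become string columns of zeros
new_df = df.melt().set_index('value').drop('variable', axis=1)
new_df[df.index.astype(str)] = 0

# For each value, flag the original rows in which it appears
for value in new_df.index:
    original_indices = df.isin([value]).any(axis=1)
    for idx in original_indices[original_indices].index:
        new_df.loc[value, str(idx)] = 1

new_df = new_df.sort_index()
print(new_df)
```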
