scikit-learn's OneHotEncoder supports this type of one-hot encoding directly when dense 2D output is requested (sparse_output=False):
enc = OneHotEncoder(sparse_output=False)
e = enc.fit_transform(df)
# [[1. 0. 0. 1. 0. 0. 1. 0. 0.]
#  [0. 1. 0. 0. 1. 0. 0. 1. 0.]
#  [0. 0. 1. 0. 0. 1. 0. 0. 1.]]
This can be turned into the desired DataFrame in three steps: transpose the output, use the categories_ learned during fit_transform as the index, then reindex to cover the desired output range (0 to the maximum value, inclusive):
out_df = pd.DataFrame(
    e.T,
    index=np.hstack(enc.categories_),
    dtype='int'
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)
   0  1  2
0  0  0  0
1  1  0  0
2  0  1  0
3  0  0  1
4  1  0  0
5  0  1  0
6  0  0  1
7  1  0  0
8  0  1  0
9  0  0  1
Complete working example:

import numpy as np  # version 2.3.2
import pandas as pd  # version 2.3.1
from sklearn.preprocessing import OneHotEncoder  # scikit-learn version 1.7.1

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Dense output: one column per (source column, value) pair
enc = OneHotEncoder(sparse_output=False)
e = enc.fit_transform(df)

# Transpose so each value becomes a row, then fill in the missing values of the range
out_df = pd.DataFrame(
    e.T,
    index=np.hstack(enc.categories_),
    dtype='int'
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)
print(out_df)
Note that equal values in different columns are treated as different features. For example, a 4 in column A and a 4 in column B produce two separate rows that both have index 4 (representing the features A_4 and B_4).
Reindexing fails when the index contains duplicates. To handle this, collapse the duplicates before reindexing: .groupby(level=0).sum() gives a count per value, while .groupby(level=0).max() keeps a 0/1 indicator.
Alternatively, if the goal is only to check whether a value is present in each row (regardless of which source column it came from), use MultiLabelBinarizer on the DataFrame's underlying NumPy array.
For the modified sample DataFrame (with the value 4 repeated across columns A, B, and C):
enc = MultiLabelBinarizer()
e = enc.fit_transform(df.to_numpy())
# [[1 0 1 0 0 0 0]
#  [0 0 1 1 0 1 0]
#  [0 1 0 0 1 0 1]]
As with OneHotEncoder, this can be turned into the desired DataFrame by transposing the output, using the classes_ learned during fit_transform as the index, and reindexing to include every value in the specified range:
out_df = pd.DataFrame(
    e.T,
    index=enc.classes_
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)
   0  1  2
0  0  0  0
1  1  0  0
2  0  0  0
3  0  0  1
4  1  1  0
5  0  1  0
6  0  0  1
7  0  0  0
8  0  1  0
9  0  0  1
Complete working example:

import pandas as pd  # version 2.3.1
from sklearn.preprocessing import MultiLabelBinarizer  # scikit-learn version 1.7.1

# Modified sample: the value 4 appears in columns A, B and C
df = pd.DataFrame({'A': [1, 4, 3], 'B': [4, 5, 6], 'C': [4, 8, 9]})

# Each row of the array is treated as a set of labels
enc = MultiLabelBinarizer()
e = enc.fit_transform(df.to_numpy())

out_df = pd.DataFrame(
    e.T,
    index=enc.classes_
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)
print(out_df)