
I am working on a task that seems to me a little like one-hot encoding, but notably different. What I want to do is take a row of integers from a Pandas DataFrame and produce a binary column with 1's at the index locations specified by the integers and 0's everywhere else. If possible, I would like to do this for many rows at the same time. A trivial example would be taking

index  A  B  C
0      1  4  7
1      2  5  8
2      3  6  9

and producing

index  .0  .1  .2
0       0   0   0
1       1   0   0
2       0   1   0
3       0   0   1
4       1   0   0
5       0   1   0
6       0   0   1
7       1   0   0
8       0   1   0
9       0   0   1

I have tried

new_df = pandas.DataFrame(range(old_df.max(axis=None)+1)).isin(list(old_df.iloc[0]))

which works for a single row (the first row of the old_df in this case), but doesn't seem easily scalable to an arbitrary number of rows. Is there a built in function that does something similar to this?

Comments:

  • maybe you have to use a for-loop or use .apply() to run it on every row. (Aug 5 at 1:34)
  • No; don't do that. Use numeric indexing. (Aug 5 at 2:29)
  • new_df['index']? But new_df isn't defined yet... If that's just a typo, please edit to fix it. (Aug 5 at 3:12)
  • what is the logic behind your desired output? (Aug 5 at 6:05)

5 Answers


Another possible solution:

pd.get_dummies(df.stack()).groupby(level=0).sum().T

First, it uses stack to pivot the data from wide to long format, creating a Series with a MultiIndex that pairs each original row position with its column. This long-format Series is then fed into get_dummies to perform the one-hot encoding, converting the integer values into binary columns. Next, groupby(level=0).sum() aggregates the binary indicators for each of the original rows (the first level of the MultiIndex). Finally, .T transposes the result so that the integer values become the row index.

Output:

   0  1  2
1  1  0  0
2  0  1  0
3  0  0  1
4  1  0  0
5  0  1  0
6  0  0  1
7  1  0  0
8  0  1  0
9  0  0  1

Intermediates:


# (df.stack(), 
#  pd.get_dummies(df.stack()),  
#  [g for g in pd.get_dummies(df.stack()).groupby(level=0)], 
#  pd.get_dummies(df.stack()).groupby(level=0).sum())

(0  A    1
    B    4
    C    7
 1  A    2
    B    5
    C    8
 2  A    3
    B    6
    C    9
 dtype: int64,
          1      2      3      4      5      6      7      8      9
 0 A   True  False  False  False  False  False  False  False  False
   B  False  False  False   True  False  False  False  False  False
   C  False  False  False  False  False  False   True  False  False
 1 A  False   True  False  False  False  False  False  False  False
   B  False  False  False  False   True  False  False  False  False
   C  False  False  False  False  False  False  False   True  False
 2 A  False  False   True  False  False  False  False  False  False
   B  False  False  False  False  False   True  False  False  False
   C  False  False  False  False  False  False  False  False   True,
 [(0,
            1      2      3      4      5      6      7      8      9
   0 A   True  False  False  False  False  False  False  False  False
     B  False  False  False   True  False  False  False  False  False
     C  False  False  False  False  False  False   True  False  False),
  (1,
            1      2      3      4      5      6      7      8      9
   1 A  False   True  False  False  False  False  False  False  False
     B  False  False  False  False   True  False  False  False  False
     C  False  False  False  False  False  False  False   True  False),
  (2,
            1      2      3      4      5      6      7      8      9
   2 A  False  False   True  False  False  False  False  False  False
     B  False  False  False  False  False   True  False  False  False
     C  False  False  False  False  False  False  False  False   True)],
    1  2  3  4  5  6  7  8  9
 0  1  0  0  1  0  0  1  0  0
 1  0  1  0  0  1  0  0  1  0
 2  0  0  1  0  0  1  0  0  1)
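For reference, here is a self-contained version of the above, assuming the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Stack to long format, one-hot encode, aggregate per original row, transpose
out = pd.get_dummies(df.stack()).groupby(level=0).sum().T
print(out)
```

Note that the result is indexed only by the integers that actually occur (1 through 9 here), so index 0 from the desired output is absent.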


This post is intended to simply add a bit of extra detail to the answer supplied by PaulS. His solution is great, but for my use case, I need to include rows for every integer in a given range beginning from zero (not just which integers happen to appear as elements in the given DataFrame old_df). To address this, simply set a value of max_col_index and perform:

dummies = pandas.get_dummies(old_df.stack())
missing_indices = set(range(max_col_index)) - set(dummies.columns)

new_df = pandas.concat([
    dummies.groupby(level=0).sum(),
    pandas.DataFrame(dict.fromkeys(missing_indices, 0),
                     index=dummies.groupby(level=0).sum().index)
], axis=1).T.sort_index()

I'm not sure if there is a more concise way to do this using the DataFrame index property (that is, fill a DataFrame with null rows on missing index values up to a given maximum value), but this works well enough for me.

Edit:

It turns out I was correct in my above intuition. A slightly slicker (more "pythonic") version can be given as:

new_df = pandas.get_dummies(old_df.stack()).groupby(level=0).sum().T.reindex(range(max_col_index), fill_value=0)
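As a quick sanity check, assuming the question's sample data and max_col_index = 10 (so the output covers indices 0 through 9):

```python
import pandas as pd

old_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
max_col_index = 10  # one past the largest integer we want a row for

# One-hot encode the stacked values, aggregate per original row,
# transpose, and reindex so every integer in range gets a row
new_df = (pd.get_dummies(old_df.stack())
            .groupby(level=0).sum()
            .T
            .reindex(range(max_col_index), fill_value=0))
print(new_df)
```

The reindex fills in an all-zero row for index 0, which never appears as a value in the data.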


Assuming a range index in the input DataFrame (if not, run df = df.reset_index(drop=True)).

You could melt and pivot:

out = (df.rename(lambda x: x/10) # optional, just to get the 0.0/0.1/0.2
         .reset_index().melt('index').assign(x=1)
         .pivot_table(index='value', columns='index', values='x', fill_value=0)
         .convert_dtypes().rename_axis(index=None, columns=None)
      )

Or melt and join+pd.get_dummies:

tmp = (
    df.rename(lambda x: x / 10) # optional, just to get the 0.0/0.1/0.2
    .reset_index()
    .melt('index').drop(columns='variable')
)
out = (pd.get_dummies(tmp['index'], dtype='int')
         .set_axis(tmp['value'].rename(None))
      )

Or a more minimal version with unstack and pd.get_dummies:

tmp = df.unstack()
out = (pd.get_dummies(tmp.index.get_level_values(1)/10, dtype='int')
         .set_axis(tmp)
      )

Or with unstack and pd.crosstab:

tmp = df.unstack()
out = (pd.crosstab(tmp, tmp.index.get_level_values(1)/10)
         .rename_axis(index=None, columns=None)
      )

Output:

   0.0  0.1  0.2
1    1    0    0
2    0    1    0
3    0    0    1
4    1    0    0
5    0    1    0
6    0    0    1
7    1    0    0
8    0    1    0
9    0    0    1
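For instance, the crosstab variant can be checked end to end with the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

tmp = df.unstack()  # Series with a (column, row) MultiIndex
# Cross-tabulate each value against its original row position (scaled to 0.0/0.1/0.2)
out = (pd.crosstab(tmp, tmp.index.get_level_values(1) / 10)
         .rename_axis(index=None, columns=None))
print(out)
```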


Sklearn's OneHotEncoder supports this type of one-hot encoding directly when dense 2D output is requested (sparse_output=False):

enc = OneHotEncoder(sparse_output=False)
e = enc.fit_transform(df)

# [[1. 0. 0. 1. 0. 0. 1. 0. 0.]
#  [0. 1. 0. 0. 1. 0. 0. 1. 0.]
#  [0. 0. 1. 0. 0. 1. 0. 0. 1.]]

This can be turned into the desired DataFrame by transposing the output, assigning the categories_ learned during fit_transform as the index, and then reindexing to match the desired output range (0 to the max value, inclusive):

out_df = pd.DataFrame(
    e.T,
    index=np.hstack(enc.categories_),
    dtype='int'
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)

Output:

   0  1  2
0  0  0  0
1  1  0  0
2  0  1  0
3  0  0  1
4  1  0  0
5  0  1  0
6  0  0  1
7  1  0  0
8  0  1  0
9  0  0  1

Full runnable example:

import numpy as np  # version 2.3.2
import pandas as pd  # version 2.3.1
from sklearn.preprocessing import OneHotEncoder  # scikit-learn version 1.7.1

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

enc = OneHotEncoder(sparse_output=False)

e = enc.fit_transform(df)

out_df = pd.DataFrame(
    e.T,
    index=np.hstack(enc.categories_),
    dtype='int'
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)

print(out_df)

Note that the same values in different columns will be considered different features. The number 4 in column A and the number 4 in column B will be split into two rows with index 4 (representing the features A_4 and B_4).

Reindexing will not work if there are duplicate index labels. It is possible to get a count with .groupby(level=0).sum(), or to keep a 0/1 indicator with .groupby(level=0).max(), before reindexing.
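The duplicate-collapsing step can be sketched in plain pandas, using a hypothetical two-row frame that shares the index label 4 (as would happen if 4 were a category in two source columns):

```python
import pandas as pd

# Hypothetical one-hot rows for two features that both encode the value 4:
# the first says 4 appears in original row 0, the second in rows 0 and 1
dup = pd.DataFrame([[1, 0, 0], [1, 1, 0]], index=[4, 4])

counts = dup.groupby(level=0).sum()      # how many times 4 appears per row
indicator = dup.groupby(level=0).max()   # whether 4 appears per row at all

# With a unique index, reindexing over the full range is now safe
full = indicator.reindex(range(5), fill_value=0)
```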


Alternatively, if just checking if the value is present in each row (without considering which source column it came from) then use MultiLabelBinarizer on the DataFrame's underlying numpy array.

For the given (modified) sample DataFrame:

A  B  C
1  4  4
4  5  8
3  6  9

enc = MultiLabelBinarizer()
e = enc.fit_transform(df.to_numpy())
# [[1 0 1 0 0 0 0]
#  [0 0 1 1 0 1 0]
#  [0 1 0 0 1 0 1]]

Similar to OneHotEncoder, this can be turned into the desired DataFrame by transposing the output, assigning the classes_ learned during fit_transform as the index, and reindexing to include all values in the specified range:

out_df = pd.DataFrame(
    e.T,
    index=enc.classes_
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)

Output:

   0  1  2
0  0  0  0
1  1  0  0
2  0  0  0
3  0  0  1
4  1  1  0
5  0  1  0
6  0  0  1
7  0  0  0
8  0  1  0
9  0  0  1

Full runnable example:

import pandas as pd  # version 2.3.1
from sklearn.preprocessing import MultiLabelBinarizer  # scikit-learn version 1.7.1

df = pd.DataFrame({'A': [1, 4, 3], 'B': [4, 5, 6], 'C': [4, 8, 9]})

enc = MultiLabelBinarizer()
e = enc.fit_transform(df.to_numpy())

out_df = pd.DataFrame(
    e.T,
    index=enc.classes_
).reindex(pd.RangeIndex(start=0, stop=df.max(axis=None) + 1), fill_value=0)

print(out_df)

Comments:

Just wanted to note a couple of things. Firstly, if you have an integer that appears in more than one row (say the integer 4 appears in multiple rows), the encoder will create two rows with the index 4 for each non-unique integer, which is not my desired behavior. To address this, you simply group the new DataFrame by index and sum. Secondly, this method is the fastest of the ones I have tried for large DataFrames.
Thanks for the feedback @lanerogers, glad you were able to get it working. I also updated my answer to reflect your suggestion and provided an alternative sklearn transformation based on what you're looking for.

First, create a new dataframe whose index contains the values found in the old dataframe's columns. The columns of the new dataframe should come from the old dataframe's index.

import pandas as pd
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], "C":[7,8,9], 'D':[10, 11,12]})

new_df = df.melt().set_index('value').drop('variable', axis=1)
new_df[df.index.astype(str)] = 0
# Fill everything with zeros for now; the necessary cells are set to 1 below.

Now we match each index value of new_df against the values in the entire df and take the row locations where it appears. We then look up that row number among new_df's (string) columns and assign 1 there. The rest is already 0.

for value in new_df.index:
    original_indices = df.isin([value]).any(axis=1)
    for idx in original_indices[original_indices].index:
        new_df.loc[value, str(idx)] = 1

Note this code only works if the values in df do not repeat (contain no duplicates).
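A runnable version of the above for reference, with a sort_index() appended at the end, since melt leaves the index in column order rather than sorted:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

# Values become the index; original row labels become string columns of zeros
new_df = df.melt().set_index('value').drop('variable', axis=1)
new_df[df.index.astype(str)] = 0

# For each value, flag the original rows in which it appears
for value in new_df.index:
    original_indices = df.isin([value]).any(axis=1)
    for idx in original_indices[original_indices].index:
        new_df.loc[value, str(idx)] = 1

new_df = new_df.sort_index()
print(new_df)
```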
