Pandas: select value from random column on each row

Question

Suppose I have the following Pandas DataFrame:

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9]
})

    a   b   c
0   1   4   7
1   2   5   8
2   3   6   9

I want to generate a new pandas.Series so that the values of this series are selected, row by row, from a random column in the DataFrame. So, a possible output for that would be the series:

0    7
1    2
2    9
dtype: int64

(where in row 0 it randomly chose 'c', in row 1 it randomly chose 'a' and in row 2 it randomly chose 'c' again).

I know this can be done by iterating over the rows and using random.choice to choose each row, but iterating over the rows not only has bad performance but also is "unpandonic", so to speak. Also, df.sample(axis=1) would choose whole columns, so all of them would be chosen from the same column, which is not what I want. Is there a better way to do this with vectorized pandas methods?

anky · Accepted Answer · 2019-07-25 12:30:31Z

5

Maybe something like:

pd.Series([np.random.choice(i,1)[0] for i in df.values])

answered Jul 25, 2019 at 12:30

anky

75.3k reputation, 11 gold badges, 46 silver badges, 76 bronze badges

1 Comment

Joseph Garvin Over a year ago

If you need to make many columns like this, I found this to be much more efficient: newdf = df.apply(lambda row : row.sample(n=10000, replace=True).values, axis=1, result_type="expand")
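For the many-draws case described in this comment, one possible vectorized sketch is to sample a whole matrix of column positions at once and gather the values with np.take_along_axis. This swaps in a different technique from the apply-based line above; the seed and the names rng, n_draws, cols and picks are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
n_draws = 5                     # hypothetical; the comment above used 10000

values = df.to_numpy()
# One random column position per (row, draw), sampled with replacement.
cols = rng.integers(0, values.shape[1], size=(len(df), n_draws))
# For each row i and draw j, pick values[i, cols[i, j]].
picks = pd.DataFrame(np.take_along_axis(values, cols, axis=1), index=df.index)
```

Each row of picks holds n_draws values taken from the matching row of df.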

sjw · Accepted Answer · 2019-07-25 12:35:57Z

5

Here is a fully vectorized solution. Note however that it does not use Pandas methods, but rather involves operations on the underlying numpy array.

import numpy as np

indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)

Example output is [1, 2, 1] which corresponds to ['b', 'c', 'b'].

Then use this to slice the numpy array:

df['random'] = df.to_numpy()[np.arange(len(df)), indices]

Results (the values vary from run to run; this particular draw picked columns ['c', 'b', 'c']):

   a  b  c  random
0  1  4  7       7
1  2  5  8       5
2  3  6  9       9
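A self-contained sketch of the same idea that returns a Series, as the question asks, rather than adding a column. It uses NumPy's Generator API (np.random.default_rng) in place of np.random.choice; the seed and the names rng, col_idx, picked and chosen_cols are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
# One random column position per row.
col_idx = rng.integers(0, df.shape[1], size=len(df))
row_idx = np.arange(len(df))

# Pair row i with column col_idx[i] and keep the original index.
picked = pd.Series(df.to_numpy()[row_idx, col_idx], index=df.index)
# Column labels that were chosen, useful for checking the result.
chosen_cols = df.columns[col_idx]
```

Note that to_numpy() on a DataFrame with mixed column types produces an object array, so this approach is best suited to frames whose columns share a dtype.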

edited Jul 25, 2019 at 12:35

answered Jul 25, 2019 at 12:30

sjw

6,551 reputation, 2 gold badges, 30 silver badges, 41 bronze badges

4 Comments

Joseph Garvin Over a year ago

Could you explain what this part of the syntax is doing [np.arange(len(df)), indices] ? I'm having trouble googling it.

Joseph Garvin Over a year ago

Also your answer gives very different distributions on my data than the answer from @jfaccioni, in ways that make me think something is wrong with this answer (I get drastically smaller means with your answer). It does run much faster though.

sjw Over a year ago

@JosephGarvin - that syntax is slicing the 2D array, using two sequences of the same length representing row indices and column indices. Here, np.arange(len(df)) is the row indices - in the example above it is just [0, 1, 2]. The randomly selected column indices are in the indices variable ([1, 2, 1] in the example). Slicing with [0, 1, 2] and [1, 2, 1] returns the values at (0, 1), (1, 2) and (2, 1).
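The NumPy documentation calls this "integer array indexing" (also known as advanced or fancy indexing), which may be an easier term to search for. A minimal sketch using the example's numbers, with a plain array standing in for df.to_numpy():

```python
import numpy as np

arr = np.array([[1, 4, 7],
                [2, 5, 8],
                [3, 6, 9]])
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 1])

# The two index arrays are paired element-wise:
# (0, 1), (1, 2), (2, 1) -> [4, 8, 6]
paired = arr[rows, cols]

# Contrast: chained indexing selects rows, then columns,
# giving a full 3x3 block instead of one value per row.
block = arr[rows][:, cols]
```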

sjw Over a year ago

@JosephGarvin - I've looped over my solution a few million times and am not seeing anything unusual in the results, at least using the small example data set in this question. If you can share a dataset (or let me know how I can generate one) that shows my solution giving different results from the looped solutions I'll be interested to see.
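One way to run this kind of sanity check is sketched below. It draws column positions for many rows, checks that each column is chosen about equally often, and compares the mean of the picked values with the mean of all values. The synthetic normal data, the seed and all variable names are assumptions; this is not the dataset from the comments.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_rows, n_cols = 100_000, 3

# Synthetic data standing in for a real DataFrame's values.
values = rng.normal(loc=10.0, scale=1.0, size=(n_rows, n_cols))

col_idx = rng.integers(0, n_cols, size=n_rows)
picked = values[np.arange(n_rows), col_idx]

# Fraction of rows in which each column was chosen; expected near 1/3 each.
freq = np.bincount(col_idx, minlength=n_cols) / n_rows
# With uniform selection the two means should be close.
mean_gap = abs(picked.mean() - values.mean())
```

If the columns have very different scales, a small sample can still show a noticeably different mean by chance, so it helps to compare on a large number of rows.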

Valentino · Accepted Answer · 2019-07-25 12:31:41Z

2

This does the job (using the built-in module random):

ddf = df.apply(lambda row : random.choice(row.tolist()), axis=1)

or using pandas sample:

ddf = df.apply(lambda row : row.sample().iloc[0], axis=1)

Both have the same behaviour. ddf is your Series.

answered Jul 25, 2019 at 12:31

Valentino

7,371 reputation, 6 gold badges, 22 silver badges, 36 bronze badges

1 Comment

tdy Over a year ago

note that #1 can be simplified to df.apply(random.choice, axis=1)

mujjiga · Accepted Answer · 2019-07-25 12:37:35Z

1

pd.DataFrame(
    df.values[range(df.shape[0]),
              np.random.randint(0, df.shape[1], size=df.shape[0])])

answered Jul 25, 2019 at 12:37

mujjiga

17.1k reputation, 2 gold badges, 37 silver badges, 54 bronze badges

jfaccioni · Accepted Answer · 2019-07-25 13:35:45Z

1

You're probably still going to need to iterate through each row while selecting a random value in each row - whether you do it explicitly with a for loop or implicitly with whatever function you decide to call.

You can, however, simplify the code to a single line using a list comprehension, if it suits your style:

result = pd.Series([random.choice(df.iloc[i].tolist()) for i in range(len(df))])

edited Jul 25, 2019 at 13:35

answered Jul 25, 2019 at 12:38

jfaccioni

7,559 reputation, 1 gold badge, 11 silver badges, 27 bronze badges
