Suppose we have a data frame:

In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

In [2]: df
Out[2]:
     A   B   C   D
0   45  88  44  92
1   62  34   2  86
2   85  65  11  31
3   74  43  42  56
4   90  38  34  93
5    0  94  45  10
..  ..  ..  ..  ..

How can I randomly replace x% of all entries with a value, such as None?

In [3]: something(df, percent=25)
Out[3]:
       A     B     C     D
0     45    88  None    92
1     62    34     2    86
2   None  None    11    31
3     74    43  None    56
4     90    38    34  None
5   None    94    45    10
..    ..    ..    ..    ..

I've found information about sampling particular axes, and I can imagine a way of randomly generating integers within the dimensions of my data frame and setting those equal to None, but that doesn't feel very Pythonic.

1 Answer

You could combine DataFrame.where and np.random.uniform:

In [37]: df
Out[37]: 
   A  B  C  D
0  1  0  2  2
1  2  2  0  3
2  3  0  0  3
3  0  2  3  1

In [38]: df.where(np.random.uniform(size=df.shape) > 0.3, None)
Out[38]: 
      A  B     C     D
0     1  0     2  None
1     2  2     0     3
2     3  0  None  None
3  None  2     3  None

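It's not the most concise, but it gets the job done. The comparison keeps an entry when its random draw is above 0.3, so each entry is replaced with probability 0.3 and roughly 30% of them end up as None. Change the threshold to change the percentage.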
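To get closer to the `something(df, percent=25)` call in the question, the same idea could be wrapped in a small helper. This is only a sketch: `replace_random` and `replace_exact` are made-up names, not pandas functions, and using NaN as the default replacement is my assumption. The first version picks each cell independently, so the replaced share is only approximately `percent`. The second picks a fixed number of cells without replacement, so the share is exact up to rounding.

```python
import numpy as np
import pandas as pd


def replace_random(df, percent, value=np.nan, seed=None):
    """Replace each entry independently with probability percent/100."""
    rng = np.random.default_rng(seed)
    # True marks a cell to replace; the realised share only approximates `percent`.
    hide = rng.random(df.shape) < percent / 100
    return df.mask(hide, value)


def replace_exact(df, percent, value=np.nan, seed=None):
    """Replace exactly round(percent/100 * df.size) entries."""
    rng = np.random.default_rng(seed)
    n_hide = round(df.size * percent / 100)
    # Draw distinct flat positions, then turn them into a 2-D boolean mask.
    flat = rng.choice(df.size, size=n_hide, replace=False)
    hide = np.zeros(df.size, dtype=bool)
    hide[flat] = True
    return df.mask(hide.reshape(df.shape), value)


df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
approx = replace_random(df, percent=25, seed=0)
exact = replace_exact(df, percent=25, seed=0)
print(approx.isna().to_numpy().mean())  # near 0.25, varies with the seed
print(exact.isna().to_numpy().sum())    # number of replaced cells
```

`DataFrame.mask` is the mirror image of `DataFrame.where`: it replaces entries where the condition is True, which reads a little more naturally when the mask says "replace this".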

Note, though, that you should ask yourself whether you really want to do this if you still have computations to run. If you put None in a column, pandas has to fall back to the slow object dtype instead of something fast like int64 or float64.
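If the goal is just "missing", one way to stay off the object dtype might be to use a numeric missing-value marker instead of None. The sketch below shows two options, assuming a reasonably recent pandas: NaN, which turns integer columns into float64, and `pd.NA` with the nullable `Int64` dtype, which keeps them as integers. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

# True marks a cell to blank out (about 25% of cells).
hide = rng.random(df.shape) < 0.25

# Option 1: NaN as the marker; integer columns are upcast to float64.
as_float = df.mask(hide)

# Option 2: nullable integers; the marker is pd.NA and the columns stay integer.
as_nullable = df.astype("Int64").mask(hide, pd.NA)

print(as_float.dtypes)
print(as_nullable.dtypes)
```

Either way, checking `.dtypes` after the replacement is a quick way to see whether a column has silently become object.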

3 Comments

I like random choice with true and false
Thank you, I'm not actually interested in substituting None, I'm interested in substituting a random set of nucleotides (A, C, G, or T) in a series of DNA sequences with 'N'. I'm not sure if this would change your answer. I ran a few tests trying to do this by randomly selecting integers within the dimensions and it rapidly becomes inefficient.
@IanGilman You are using a DataFrame of individual letters representing nucleotide sequences? You should probably make sure you aren't using the object dtype already. That will be inefficient in both space and time. If you know each value is a single letter, you can reduce your memory usage to a fraction of what it takes with object dtype.
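For the nucleotide case in these comments, a rough sketch in plain NumPy might look like this. The data is made up (5 random sequences of length 20, one base per cell), and the layout is my assumption about how the sequences are stored. The mask is built with a random choice between True and False, as the first comment suggests, and fixed-width string dtypes stand in for object dtype.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one base per cell, stored as fixed-width strings (dtype '<U1').
seqs = rng.choice(np.array(list("ACGT")), size=(5, 20))

# True marks a base to replace; each cell is True with probability 0.25.
hide = rng.choice([True, False], size=seqs.shape, p=[0.25, 0.75])

# Swap the marked bases for 'N' and leave the rest untouched.
masked = np.where(hide, "N", seqs)

# Single ASCII letters also fit in one byte each with the 'S1' dtype.
compact = masked.astype("S1")

print(["".join(row) for row in masked])
print(masked.nbytes, compact.nbytes)
```

Because the whole mask is generated in one call, this avoids looping over randomly drawn index pairs, which is the part that tends to get slow as the data grows.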
