R - Create missingness in DataFrame for testing

Question

I need to test some imputation evaluation software I'm creating and am struggling to get benchmark datasets.

Does anyone know of a way to delete a certain amount of data from a data frame?

As an example of what I need:

You have a dataset and you want a random 20% of the rows to have a random number of variables in that row removed (i.e. set to NA).

Or: Something that can turn

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Into:

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           NA   6 160.0  NA 3.90 2.620    NA  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8  NA 108.0  93   NA    NA 18.61 NA  1   NA    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

I have tried a couple of methods that manipulate the columns but these have some fundamental flaws in them which render them useless.

This is my first ever question on here; if I have missed anything or done something wrong, please do let me know.

All the best

Welcome to StackOverflow. Please have a look here and try to give a minimal reproducible example that may help others to help you.
– Ronak Shah, Commented Oct 5, 2016 at 10:47
Hello Ronak, is that what you were thinking about in terms of reproducible example? m-dz, thank you for the link, I am looking through it now to see if I can use that.
– abdnChap, Commented Oct 5, 2016 at 11:06
I think you should also think about the mechanism of missingness in your data. Are you suggesting data missing at random, missing completely at random or missing not at random?
– Wietze314, Commented Oct 5, 2016 at 11:24

tobiasegli_te · Accepted Answer · 2016-10-05 11:19:08Z

This should do it:

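# apply() converts the data frame to a matrix first; fine for the all-numeric mtcars
df_new <- as.data.frame(apply(mtcars, 2, function(x) {
    # set a random 20% of the values in this column to NA
    x[sample(seq_along(x), round(length(x) * 0.2))] <- NA
    x
}))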

apply() goes through the columns, and in each column sample() randomly selects 20% of the values to be set to NA.
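
To check what the column-wise approach produces, here is a small sketch. It assumes you want to keep the column types, so it swaps apply() for lapply() (apply() converts the data frame to a matrix, which would turn every column into character if the data had mixed types). The name df_cols and the seed value are only for illustration.

```r
set.seed(1)  # arbitrary seed, only so the run can be repeated

df_cols <- mtcars
# df_cols[] <- lapply(...) replaces the columns but keeps the data frame
# structure (row names, column names, column types)
df_cols[] <- lapply(df_cols, function(x) {
    x[sample(seq_along(x), round(length(x) * 0.2))] <- NA
    x
})

colSums(is.na(df_cols))         # round(32 * 0.2) = 6 NAs in every column
mean(!complete.cases(df_cols))  # share of rows with at least one NA
```

Because every column draws its own rows, the share of incomplete rows is usually much higher than 20%.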

New answer after comment:

This sets a random number of values to NA in a random 20% of the rows (change 0.2 to 0.1 for 10% of the rows).

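df <- mtcars
# pick a random 20% of the row indices
random_rows <- sample(seq_len(nrow(df)), round(nrow(df) * 0.2))
for (i_row in random_rows) {
    # choose how many columns to blank (1 to ncol), then which ones
    n_na <- sample(seq_len(ncol(df)), 1)
    df[i_row, sample(seq_len(ncol(df)), n_na)] <- NA
}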
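
If you need this for several benchmark datasets, the loop above could be wrapped in a function. This is only a sketch: the function name add_row_missingness and its arguments row_frac and seed are made up for illustration and are not part of any package.

```r
# Hypothetical helper: set a random number of cells to NA in a random
# fraction of the rows of a data frame.
add_row_missingness <- function(df, row_frac = 0.2, seed = NULL) {
    stopifnot(is.data.frame(df), row_frac >= 0, row_frac <= 1)
    if (!is.null(seed)) set.seed(seed)

    n_rows <- round(nrow(df) * row_frac)
    rows <- sample(seq_len(nrow(df)), n_rows)
    for (i in rows) {
        n_cols <- sample(seq_len(ncol(df)), 1)     # how many cells to blank: 1..ncol
        cols <- sample(seq_len(ncol(df)), n_cols)  # which cells to blank
        df[i, cols] <- NA
    }
    df
}

holed <- add_row_missingness(mtcars, row_frac = 0.2, seed = 42)
sum(!complete.cases(holed))  # 6 rows contain NA, i.e. round(32 * 0.2)
```

Every selected row gets at least one NA, so the number of incomplete rows matches the requested fraction exactly (as long as the input had no NA to begin with).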

edited Oct 5, 2016 at 11:19

answered Oct 5, 2016 at 11:07

tobiasegli_te


6 Comments

abdnChap Over a year ago

Hello! Thank you for your answer, but I don't think this will work for what I need. I already created a function that can remove a random percentage from all columns, but what I need is something that changes 10% of the rows. If you remove 10% randomly from each column, you change more than 10% of the rows, and if you set this to 40% or more, then all rows are changed.

tobiasegli_te Over a year ago

So you mean that 10% of the rows contain exactly one NA or at least one NA?

abdnChap Over a year ago

A random 10% of the rows should have a random amount of NA in them.

abdnChap Over a year ago

I'm just going to play with your suggestion for a little bit before accepting it, but by the looks of it, it does exactly what I needed. I wasn't aware that one could use sample to change the dataframe it came from.

tobiasegli_te Over a year ago

Glad to help. By the way, if you want to make your results reproducible, add set.seed(42) before running sample(). If you re-run the code later, the random number generator will then draw the same random numbers.
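
A quick way to see the effect of set.seed() is to draw the same sample twice. The seed value 42 and the variable names are arbitrary.

```r
set.seed(42)
first_draw <- sample(seq_len(nrow(mtcars)), 6)

set.seed(42)  # resetting the seed restarts the random number stream
second_draw <- sample(seq_len(nrow(mtcars)), 6)

identical(first_draw, second_draw)  # TRUE
```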
