I need to test some imputation evaluation software I'm creating and am struggling to get benchmark datasets.

Does anyone know of a way to delete a certain amount of data from a data frame?

As an example of what I need:

You have a dataset and you want a random 20% of the rows to have a random number of variables in that row removed (i.e. set to NA).

Or: Something that can turn

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Into:

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           NA   6 160.0  NA 3.90 2.620    NA  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8  NA 108.0  93   NA    NA 18.61 NA  1   NA    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

I have tried a couple of methods that manipulate the columns but these have some fundamental flaws in them which render them useless.

This is my first ever question on here. If I have missed anything out or done something wrong, please do let me know.

All the best

5 Comments
  • Welcome to StackOverflow. Please have a look here and try to give a minimal reproducible example that may help others to help you. Commented Oct 5, 2016 at 10:47
  • Please see stats.stackexchange.com/questions/184741/… Commented Oct 5, 2016 at 10:50
  • Hello Ronak, is that what you were thinking about in terms of reproducible example? m-dz, thank you for the link, I am looking through it now to see if I can use that. Commented Oct 5, 2016 at 11:06
  • @abdnChap yes, perfect! Commented Oct 5, 2016 at 11:12
  • I think you should also think about the mechanism of missingness in your data. Are you suggesting data missing at random, missing completely at random or missing not at random? Commented Oct 5, 2016 at 11:24

1 Answer

This should do it:

df_new <- as.data.frame(apply(mtcars, 2, function(x) {
    # Set a random 20% of this column's values to NA.
    x[sample(seq_along(x), round(length(x) * 0.2))] <- NA
    x
}))

apply() goes through the columns, and in each column sample() is used to randomly select 20% of the values to be set to NA.
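A quick way to see what this per-column approach does at the row level is to count the rows that end up with at least one NA. The sketch below is only a check, not part of the method; the seed value and the names col_share and row_share are arbitrary choices for illustration. Since each of the 11 columns blanks 6 of the 32 values independently, a given row stays untouched with probability (26/32)^11, so roughly 90% of the rows can be expected to contain an NA.

```r
set.seed(42)  # assumption: any fixed seed, used only so the run is repeatable

# Per-column approach: blank 20% of the values in every column.
df_new <- as.data.frame(apply(mtcars, 2, function(x) {
    x[sample(seq_along(x), round(length(x) * 0.2))] <- NA
    x
}))

# Share of missing cells in each column (6 of 32 values).
col_share <- colMeans(is.na(df_new))

# Share of rows that contain at least one NA.
row_share <- mean(rowSums(is.na(df_new)) > 0)

print(col_share)
print(row_share)
```

The share of affected rows is usually far above the 20% applied to each column.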

New answer after comment:

This sets a random number of values to NA in a random 20% of the rows (use 0.1 instead of 0.2 for 10%).

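df <- mtcars
# Pick a random 20% of the row indices.
random_rows <- sample(seq_len(nrow(df)), round(nrow(df) * 0.2))
for (i_row in random_rows) {
    # In each picked row, set a random number (1 to ncol) of random columns to NA.
    df[i_row, sample(seq_len(ncol(df)), sample(seq_len(ncol(df)), 1))] <- NA
}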
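
If this is needed for several benchmark datasets, the loop can be wrapped in a small function. This is only a sketch: add_row_nas and its arguments row_frac and seed are hypothetical names, not something from an existing package, and it assumes the input has no missing values to begin with.

```r
# Hypothetical helper: blank a random number of cells in a random
# fraction of the rows. Name and arguments are illustrative only.
add_row_nas <- function(df, row_frac = 0.2, seed = NULL) {
    # Optional seed so the same rows and cells are drawn on every run.
    if (!is.null(seed)) set.seed(seed)
    n_rows <- round(nrow(df) * row_frac)
    rows <- sample(seq_len(nrow(df)), n_rows)
    for (i in rows) {
        # Blank between 1 and ncol(df) randomly chosen cells in this row.
        n_cols <- sample(seq_len(ncol(df)), 1)
        cols <- sample(seq_len(ncol(df)), n_cols)
        df[i, cols] <- NA
    }
    df
}

df_na <- add_row_nas(mtcars, row_frac = 0.2, seed = 42)

# Number of rows with at least one NA: round(32 * 0.2) = 6 for mtcars.
print(sum(rowSums(is.na(df_na)) > 0))
```

Because whole rows are chosen first, the number of affected rows is fixed by row_frac, while the number of NA cells per affected row varies from 1 up to all columns.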

6 Comments

Hello! Thank you for your answer, but I don't think this will work for what I need. I already created a function that can remove a random percentage from all columns, but what I need is something that can change 10% of the rows. If you remove 10% randomly from each column, you change more than 10% of the rows. And if you set this to 40% or more, then all rows are changed.
So you mean that 10% of the rows contain exactly one NA or at least one NA?
A random 10% of the rows should have a random amount of NA in them.
I'm just going to play with your suggestion for a little bit before accepting it, but by the looks of it, it does exactly what I needed. I wasn't aware that one could use sample to change the dataframe it came from.
Glad to help. By the way, if you want to make your results reproducible, add set.seed(42) before running sample(). If you re-run at a later point in time, the random number generator will then draw the same random numbers.