
I'm a bit stuck after thinking and reading about this.

  • I have a dataframe with about 8x10^6 rows
  • and roughly 40 categories that I'm interested in
  • I'm trying to do two things (apologies for posting them together, but they seem closely related)
  • first, I'm looking for an efficient way to randomly sample 100 rows from each category, i.e. each value of var1 (which goes from 01 to 40)
  • ideally, I'd create a new dataframe with about 4,000 rows (40 categories x 100 rows, instead of 8 million)
  • second, I'd like to take the average of all the var2 and var3 values per var1 (i.e. within each category)
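For illustration, here is a minimal base R sketch of both goals. It assumes the column names var1, var2 and var3 from the description. The toy data frame, the seed and the helper names (n_per_group, idx_by_group, dat_small, group_means) are made up for the example and are not part of the real data.

```r
# Toy stand-in for the real data (hypothetical values; column names follow the question)
set.seed(42)
dat <- data.frame(
  var1 = sprintf("%02d", rep(1:40, each = 500)),
  var2 = runif(40 * 500, min = 200, max = 1000),
  var3 = rnorm(40 * 500)
)

# 1) Sample up to 100 row indices per category, then subset the data frame once
n_per_group <- 100
idx_by_group <- split(seq_len(nrow(dat)), dat$var1)
sampled_idx <- unlist(lapply(idx_by_group, function(idx) {
  # sample.int() on the group size avoids the sample(x) pitfall when a group has one row
  idx[sample.int(length(idx), min(length(idx), n_per_group))]
}), use.names = FALSE)
dat_small <- dat[sampled_idx, ]

# 2) Mean of var2 and var3 within each category of var1
group_means <- aggregate(cbind(var2, var3) ~ var1, data = dat, FUN = mean)
```

Sampling row indices rather than whole sub-dataframes keeps the intermediate objects small, which may matter with 8 million rows.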

Perhaps these are related in terms of methods.

My dataframe looks something like this (oversimplified):

              var1     var2     var3     var4
1             01       949.47   ..       ..
2             01       935.09   ..       ..
3             01       935.01   ..       ..
4             01       355.39   ..       ..
5             01       455.07   ..       ..
6             01       525.08   ..       ..
..
250000        02       485.82   ..       ..
250001        02       204.14   ..       ..
250002        02       388.22   ..       ..
..

I've tried splitting the dataframe in a for-loop, but this doesn't work (it never finishes, and I have to kill the process).

for (i in 1:8000000){
   out <- split(dat, f = dat$var1)
}

Also, I'm not sure what to do next, how to manage all the separate dataframes, or whether this is the best method.
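One observation on the loop above: split(dat, f = dat$var1) already splits on every category in a single call, so wrapping it in a loop over 8,000,000 iterations repeats the same full split each time. A hedged sketch of calling it once and then handling the resulting list with lapply() follows. The toy data and the names pieces, sampled and dat_small are assumptions for the example.

```r
# Toy stand-in for the real data (hypothetical values)
set.seed(1)
dat <- data.frame(
  var1 = sprintf("%02d", rep(1:40, each = 300)),
  var2 = runif(40 * 300),
  var3 = runif(40 * 300)
)

# One call is enough: the result is a named list with one data frame per category
pieces <- split(dat, f = dat$var1)

# Apply the same step to every piece: keep up to 100 random rows per category
sampled <- lapply(pieces, function(d) {
  d[sample.int(nrow(d), min(nrow(d), 100)), , drop = FALSE]
})

# Stack the pieces back into a single data frame
dat_small <- do.call(rbind, sampled)
```

Keeping the pieces in one list means there is nothing to manage by hand: names(pieces) gives the categories, and pieces[["01"]] returns the data frame for a single category.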

Many thanks for any tips!

  • Thanks! I tried, but I get an error: unused argument (by = var1) Commented Nov 25, 2018 at 20:39
  • Sounds like you are trying to use data.table functions on a data.frame. Use dt <- as.data.table(df) to make a second data set which is a data.table, or use setDT(df) to update your data frame by reference. Commented Nov 25, 2018 at 22:08
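To make the suggestion in this comment concrete, here is a small sketch that assumes the data.table package is installed. The toy df and the names dt_small and dt_means are made up for the example. The by = argument is only understood once the object is a data.table, which is consistent with the "unused argument (by = var1)" error reported above.

```r
library(data.table)

# Toy stand-in for the real data (hypothetical values; column names follow the question)
set.seed(7)
df <- data.frame(
  var1 = sprintf("%02d", rep(1:40, each = 250)),
  var2 = runif(40 * 250),
  var3 = runif(40 * 250)
)

# as.data.table() makes a converted copy; setDT(df) would convert df in place instead
dt <- as.data.table(df)

# Up to 100 random rows per category (.N is the number of rows in the current group)
dt_small <- dt[, .SD[sample(.N, min(.N, 100))], by = var1]

# Mean of var2 and var3 per category
dt_means <- dt[, lapply(.SD, mean), by = var1, .SDcols = c("var2", "var3")]
```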
