Subsetting using data.table instead of data.frame

Question

I am working with a data frame of 3 million rows and 10 columns, and subsetting it takes a long time. Would converting it to a data.table and subsetting that be faster? Here is some toy code:

s<-c(100,100,100,800,800,6662,33565,265653262,266532)
p<-c(5,5,5,10,10,10,8,9,10)
name<-c("bob","bob","bob","ed","ed","ed","joe","frank","ted")
time<- as.POSIXct(as.character(c("2014-10-27 18:11:36 PDT","2014-10-27 18:11:37 PDT","2014-10-27 18:11:38 PDT","2014-10-27 18:11:39 PDT","2014-10-27 18:11:40 PDT","2014-10-27 18:11:41 PDT","2014-10-27 19:11:36 PDT","2014-10-27 20:11:36 PDT","2014-10-27 21:11:36 PDT")))
dat<- data.frame(s,p,name,time)
dat

Here is the data frame:

          s  p  name                time
1       100  5   bob 2014-10-27 18:11:36
2       100  5   bob 2014-10-27 18:11:37
3       100  5   bob 2014-10-27 18:11:38
4       800 10    ed 2014-10-27 18:11:39
5       800 10    ed 2014-10-27 18:11:40
6      6662 10    ed 2014-10-27 18:11:41
7     33565  8   joe 2014-10-27 19:11:36
8 265653262  9 frank 2014-10-27 20:11:36
9    266532 10   ted 2014-10-27 21:11:36

Now I subset the data frame:

  result <- subset(dat,    as.numeric(s) == 100
                   &  p == 5
                   &  name  == "bob"
                   & time >= "2014-10-27 18:11:36 PDT"
                   & time <= "2014-10-27 18:12:00 PDT"
                   )
  result

    s p name                time
1 100 5  bob 2014-10-27 18:11:36
2 100 5  bob 2014-10-27 18:11:37
3 100 5  bob 2014-10-27 18:11:38

How can I do something similar using data.table?

Thank you.

Oliver Keyes · Accepted Answer · 2014-11-28 17:51:11Z

3

Well, your example code actually breaks for data frames thanks to the "time" selectors: you're trying to match POSIXct date-times (in the data frame) with character strings (in the selector). I think you want:

result <- subset(dat,    as.numeric(s) == 100
               &  p == 5
               &  name  == "bob"
               & time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
               & time <= as.POSIXlt("2014-10-27 18:12:00 PDT")
               )

result
    s p name                time
1 100 5  bob 2014-10-27 18:11:36
2 100 5  bob 2014-10-27 18:11:37
3 100 5  bob 2014-10-27 18:11:38

This syntax works perfectly well for data.tables:

library(data.table)

dat <- as.data.table(dat)
result <- subset(dat,
              as.numeric(s) == 100
              &  p == 5
              &  name  == "bob"
              & time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
              & time <= as.POSIXlt("2014-10-27 18:12:00 PDT")
)
result

     s p name                time
1: 100 5  bob 2014-10-27 18:11:36
2: 100 5  bob 2014-10-27 18:11:37
3: 100 5  bob 2014-10-27 18:11:38

If you want something more data.table-like, you can avoid "subset" entirely and instead just operate on the data.table directly:

dat <- as.data.table(dat)
result <- dat[as.numeric(s) == 100
              & p == 5
              & name  == "bob"
              & time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
              & time <= as.POSIXlt("2014-10-27 18:12:00 PDT"),]

result 

     s p name                time
1: 100 5  bob 2014-10-27 18:11:36
2: 100 5  bob 2014-10-27 18:11:37
3: 100 5  bob 2014-10-27 18:11:38
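
Going one step beyond the syntax above, data.table also supports keys, which allow binary-search lookups on the key columns instead of a scan of every row. The following is a minimal sketch, not something benchmarked on the 3-million-row data: it rebuilds the toy data under the name `dt`, fixes the time zone to UTC, and picks `name` and `p` as key columns. All three choices are assumptions made for illustration.

```r
library(data.table)

# Toy data rebuilt from the question; the name `dt` and tz = "UTC" are assumptions.
dt <- data.table(
  s    = c(100, 100, 100, 800, 800, 6662, 33565, 265653262, 266532),
  p    = c(5, 5, 5, 10, 10, 10, 8, 9, 10),
  name = c("bob", "bob", "bob", "ed", "ed", "ed", "joe", "frank", "ted"),
  time = as.POSIXct("2014-10-27 18:11:36", tz = "UTC") +
    c(0:5, 3600, 7200, 10800)
)

# Sort the table by (name, p) and mark those columns as the key.
setkey(dt, name, p)

from <- as.POSIXct("2014-10-27 18:11:36", tz = "UTC")
to   <- as.POSIXct("2014-10-27 18:12:00", tz = "UTC")

# Binary search on the key first, then an ordinary filter on the remaining columns.
keyed <- dt[.("bob", 5)][s == 100 & time >= from & time <= to]
keyed
```

Whether the key pays off depends on how often the same columns are queried, since `setkey` has to sort the table once up front.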


2 Comments

user3022875 Over a year ago

It did not break my code. If I use data.table, will it be faster than using data.frame?

Oliver Keyes Over a year ago

Then I'm not sure what environment you must be using to be able to check if POSIX timestamps are less than or greater than strings ;). For subset operations? Benchmark it and test. I tend to use data.table in situations where I'll want to perform subset-wise operations on the data to extract or synthesise values - there, they're a lot faster.
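
A rough way to run that benchmark is sketched below. It assumes synthetic data (the row count `n`, the sampled values and the UTC time zone are all made up for illustration) and uses base R's `system.time`. Timings will vary by machine and data, so treat this as a template to run on the real 3-million-row table rather than as a result.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumption: raise towards 3e6 to approximate the real data size

# Synthetic stand-in for the real data; the values are invented for illustration.
df <- data.frame(
  s    = sample(c(100, 800, 6662), n, replace = TRUE),
  p    = sample(c(5, 8, 9, 10), n, replace = TRUE),
  name = sample(c("bob", "ed", "joe", "frank", "ted"), n, replace = TRUE),
  time = as.POSIXct("2014-10-27 18:11:36", tz = "UTC") +
    sample(0:3600, n, replace = TRUE),
  stringsAsFactors = FALSE
)
dt <- as.data.table(df)

from <- as.POSIXct("2014-10-27 18:11:36", tz = "UTC")
to   <- as.POSIXct("2014-10-27 18:12:00", tz = "UTC")

# Time the same filter on the data.frame and on the data.table.
t_df <- system.time(
  res_df <- subset(df, s == 100 & p == 5 & name == "bob" &
                     time >= from & time <= to)
)
t_dt <- system.time(
  res_dt <- dt[s == 100 & p == 5 & name == "bob" &
                 time >= from & time <= to]
)

# Both approaches should select the same number of rows.
stopifnot(nrow(res_df) == nrow(res_dt))

# Elapsed seconds for each approach.
c(data.frame = t_df[["elapsed"]], data.table = t_dt[["elapsed"]])
```

A single `system.time` call is noisy; repeating each expression several times, or using a dedicated benchmarking package, gives a more reliable comparison.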
