I'm new to R and need some help. I have to analyse a large dataset (766K rows, 2 columns) in the form below:

G40 2003-04-09
Z11 1997-08-15
K60 2006-03-16
I10 2000-11-30

The dataset is named Rdiagnoses and has no header, so by default the columns are V1 and V2. The first column is the diagnosis and the second is the date on which it was made. My first idea was to create a separate subset for each year. This is how I'm trying to do it; however, it produces the warnings below.

diagnoses2009 <- as.Date( as.character(Rdiagnoses$V2), "%d-%m-%y")

Rdiagnoses_2009 <- subset(Rdiagnoses, V2 >= as.Date("2009-01-01") & V2 <= as.Date("2009-12-31") )

 Warning messages:

1: In eval(expr, envir, enclos) :
Incompatible methods ("Ops.factor", "Ops.Date") for ">="

2: In eval(expr, envir, enclos) :
Incompatible methods ("Ops.factor", "Ops.Date") for "<="

Any suggestions for correcting this, or for a better way of selecting each year, are highly appreciated. Thank you in advance for your help!

  • If I had to guess, V2 is a factor, not a date column. Commented May 22, 2014 at 14:13
  • I don't think you're coercing your date column correctly. Try x <- c("2003-04-09", "09-04-2003");as.Date(x, format = "%d-%m-%Y") (notice capital Y and order of days, months and years). Commented May 22, 2014 at 14:13
  • Are you planning to do some operations on each of the subsetted pieces of data? If so, there may be an easier way than creating a separate data.frame per year. If you describe in more detail what you want to do, I'm sure someone will suggest something. Commented May 22, 2014 at 14:19

1 Answer

So there are a couple of things going on here.

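First, you (try to) set diagnoses2009 to a set of dates, but your subset expression does not use that variable at all. It still compares the original V2 column, which is a factor, against Date values, hence the Ops.factor / Ops.Date warnings.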
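A minimal sketch of that first point, assuming a toy stand-in for Rdiagnoses: the first four rows are the sample from the question, and the fifth row (J45, 2009-06-01) is made up so that the 2009 subset is not empty. The idea is to convert the V2 column itself, so the comparison inside subset() sees Date values:

```r
# Toy stand-in for Rdiagnoses; the last row is hypothetical.
# V2 is deliberately a factor, as the warnings in the question suggest.
Rdiagnoses <- data.frame(
  V1 = c("G40", "Z11", "K60", "I10", "J45"),
  V2 = factor(c("2003-04-09", "1997-08-15", "2006-03-16",
                "2000-11-30", "2009-06-01"))
)

# Overwrite the column in place instead of storing the dates elsewhere
Rdiagnoses$V2 <- as.Date(as.character(Rdiagnoses$V2), format = "%Y-%m-%d")

# Both sides of the comparison are now Dates
Rdiagnoses_2009 <- subset(
  Rdiagnoses,
  V2 >= as.Date("2009-01-01") & V2 <= as.Date("2009-12-31")
)
```

With this toy data, Rdiagnoses_2009 contains only the made-up 2009 row.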

Second, as @joran points out, you are using the wrong format string: your dates are formatted as %Y-%m-%d, not %d-%m-%y. When you run as.Date(...) with a format string that does not match the data, you get NA for every date, so diagnoses2009 is a vector of NA.
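A quick way to see the effect of the format string, using two of the dates from the question (the variable names x, wrong and right are only for illustration):

```r
x <- c("2003-04-09", "1997-08-15")

# Format string from the question: does not match year-month-day input
wrong <- as.Date(x, format = "%d-%m-%y")

# Format string that matches the data
right <- as.Date(x, format = "%Y-%m-%d")

is.na(wrong)         # every element failed to parse
format(right, "%Y")  # the year of each parsed date, as character
```

Checking anyNA() on a converted column right after as.Date() is a cheap guard against this kind of silent failure.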

Third, there are much better ways to split a data frame. For example, with df standing for your data frame:

library(lubridate)
df.subsets <- split(df,year(as.Date(df$V2, "%Y-%m-%d")))

This creates a list of data frames, one for each year.
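As a rough illustration of working with the resulting list, here is a sketch on the four sample rows from the question. To keep it free of extra packages it uses base R's format(dates, "%Y") in place of lubridate's year(); that substitution, and the names dates and by_year, are choices made for this example:

```r
# The four sample rows from the question
Rdiagnoses <- data.frame(
  V1 = c("G40", "Z11", "K60", "I10"),
  V2 = c("2003-04-09", "1997-08-15", "2006-03-16", "2000-11-30"),
  stringsAsFactors = FALSE
)

dates <- as.Date(Rdiagnoses$V2, format = "%Y-%m-%d")

# One data frame per year, named by the year
by_year <- split(Rdiagnoses, format(dates, "%Y"))

names(by_year)        # the years present in the data
by_year[["2003"]]     # the rows diagnosed in 2003
sapply(by_year, nrow) # number of diagnoses per year
```

Each element is an ordinary data frame, so lapply() or sapply() over by_year applies the same operation to every year without creating one variable per year.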

Finally, as @beginnerR points out, you didn't tell us anything about what you are planning to do with the split datasets. There might be a much better way to deal with your overall problem.
