Suppose the following data frame (in reality my data frame has thousands of rows):

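year <- c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2013, 2013)
a1 <- rnorm(10)
a2 <- rnorm(10)
b1 <- rnorm(10)
b2 <- rnorm(10)
c1 <- rnorm(10)
c2 <- rnorm(10)

# combine the vectors into the data frame used below
df <- data.frame(year, a1, a2, b1, b2, c1, c2)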

I used the following code to create a list consisting of multiple data frames, which splits the original data frame into subsets by year.

library(stringr)  # provides str_c()

# split the dataset into one data frame per year
df.list <- split(df, df$year)

# name each data frame "df" plus its year
dfnames <- str_c("df", names(df.list))
names(df.list) <- dfnames

I want to apply the following loop to all data frames of the list:

#df_target is a new data frame that stores the results and j is the indicator for it:
df_target <- NULL
j <- 1

for(i in seq(2, 7, 2)) {
  df_target[[j]] <- (df[i]*df[i+1])/(sum(df[i+1]))
  j <- j+1
}

The code works fine for a single data frame. However, I want to split the data frame into multiple data frames grouped by year and then loop through the columns of each one.

Thus, I use the following function to apply the loop mentioned above to all data frames from the list:

df_target <- NULL
j <- 1

fnc <- function(x){
  for(i in seq(2, 7, 2)) {
  df_target[[j]] <- (x[i]*x[i+1])/(sum(x[i+1]))
  j <- j+1
  }
}

sapply(df.list, fnc)

With this code, I don't get any error messages, however both data frames from the list are NULL. What exactly am I doing wrong?

df_target should be a data frame containing the columns a_new = (a1*a2)/sum(a2), b_new = (b1*b2)/sum(b2) and c_new = (c1*c2)/sum(c2), computed for each year separately.

  • df has only 7 columns, but your i index tries to select columns 8 and 10, which don't exist, so I can't run your code. Also, can you explain what you're trying to achieve? You want df_target to be ...? Commented Oct 26, 2020 at 13:13
  • @RicardoSemiãoeCastro sorry, I have edited my question. It should work now. Commented Oct 26, 2020 at 13:15
  • Splitting the data into multiple small data.frames is rarely a good strategy; it is better to reshape the data, then use dplyr::group_by and dplyr::mutate. Commented Oct 26, 2020 at 13:33
  • @RichardTelford can I then use the loop? I do not want to select the columns by names or index because sometimes I have datasets with 90 columns. Therefore, I was using the loop and would like to apply the loop to the list of data frames. Commented Oct 26, 2020 at 13:41

2 Answers

You need to define j and df_target inside the function, and specify what it should return (as it is now, the function calculates df_target but doesn't return it):

fnc <- function(x){
  df_target <- NULL
  j <- 1
  for(i in seq(2, 7, 2)) {
    df_target[[j]] <- (x[i]*x[i+1])/(sum(x[i+1]))
    j <- j+1
  }
  return(df_target)
}
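To see why the original version gives NULL, here is a minimal sketch with made-up names (`result` and `fill_inside` are illustrative, not from the question). It assumes the standard R behaviour that `<-` inside a function only modifies a function-local copy, and that a function whose last expression is a `for` loop returns NULL:

```r
result <- NULL  # global object, analogous to df_target in the question

fill_inside <- function(values) {
  for (k in seq_along(values)) {
    result[[k]] <- values[k] * 2  # assigns to a function-local copy of `result`
  }
  # no return(): the function's value is the value of the `for` loop, i.e. NULL
}

out <- fill_inside(c(1, 2, 3))
is.null(out)     # TRUE: nothing was returned
is.null(result)  # TRUE: the global `result` was never modified
```

This is why sapply ran without error in the question but only collected NULLs.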

But keep in mind that this will output a matrix of lists: for each element of df.list that sapply processes, the function creates a 3-element list df_target, so the output will look like this in the console:

> sapply(df.list, fnc)
     df2010 df2011 df2012 df2013
[1,] List,1 List,1 List,1 List,1
[2,] List,1 List,1 List,1 List,1
[3,] List,1 List,1 List,1 List,1

But the underlying object will be this:

[Image: screenshot of the resulting object; not available]

To get a cleaner output, we can set df_target to create a data frame with the values from each year:

fnc <- function(x){
  df_target <- as.data.frame(matrix(nrow=nrow(x), ncol=3))
  for(i in seq(2, 7, 2)) {
    df_target[,i/2] <- (x[i]*x[i+1])/(sum(x[i+1]))
  }
  return(df_target)
}

This returns a data frame per year, but if we use sapply we'll again get a matrix of lists, so it's better to define the function so that it already loops through every year:

fnc <- function(y){
  df_target.list <- list()
  k=1
  for(j in y){
    df_target <- as.data.frame(matrix(nrow=nrow(j), ncol=3))
    for(i in seq(2, 7, 2)) {
      df_target[,i/2] <- (j[i]*j[i+1])/(sum(j[i+1]))
    }
    df_target.list[[names(y)[k]]] = df_target
    k=k+1
  }
  return(df_target.list)
}

Output:

> fnc(df.list)
$df2010
           V1         V2          V3
1 -0.10971160 0.01688244 -0.16339367
2  0.05440564 0.57554210 -0.06803244
3  0.03185178 0.90598561 -0.68692401

$df2011
           V1           V2         V3
1 -0.43090055  0.007152131  0.3930606
2  0.15050644  0.329092942 -0.1367295
3  0.07336839 -0.423631930 -0.1504056

$df2012
         V1         V2         V3
1 0.5540294  0.4561862 0.09169914
2 0.1153931 -1.1311450 0.81853691

$df2013
          V1        V2        V3
1  0.4322934 0.5286973 0.2136495
2 -0.2412705 0.1316942 0.1455196
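As an alternative sketch, the same list of per-year data frames can probably be obtained without the outer loop by using lapply, which always returns a list and never simplifies to a matrix. The helper name `weight_pairs` and its `first_cols` argument are hypothetical, and the block rebuilds the example data with an arbitrary seed so it runs on its own:

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
df <- data.frame(
  year = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2013, 2013),
  a1 = rnorm(10), a2 = rnorm(10),
  b1 = rnorm(10), b2 = rnorm(10),
  c1 = rnorm(10), c2 = rnorm(10)
)
df.list <- split(df, df$year)
names(df.list) <- paste0("df", names(df.list))

# Hypothetical helper: for each column pair (i, i + 1), compute
# x[i] * x[i + 1] / sum(x[i + 1]) and collect the results in a data frame.
weight_pairs <- function(x, first_cols = seq(2, 7, 2)) {
  out <- lapply(first_cols, function(i) x[[i]] * x[[i + 1]] / sum(x[[i + 1]]))
  names(out) <- names(x)[first_cols]  # reuse a1, b1, c1 as column names
  as.data.frame(out)
}

# lapply keeps one data frame per year in a named list
res <- lapply(df.list, weight_pairs)
str(res, max.level = 1)
```

Using `x[[i]]` rather than `x[i]` extracts plain numeric vectors, which keeps the arithmetic and the final `as.data.frame()` call straightforward.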

3 Comments

Is there any way to combine the lists belonging to one year?
Thanks that works! Is there a way to change the column names to the initial ones?
You can add the argument col.names=letters[1:3] to as.data.frame when defining df_target. If you want to be a little more flexible you can do col.names=colnames(j)[seq(2, 7, 2)].
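Another option, sketched here on a small made-up data frame, is to assign the names after the loop. This works regardless of how `as.data.frame` treats `col.names` for matrix input. The values in `x` are invented purely for illustration:

```r
# Toy stand-in for one year's data frame (values are invented)
x <- data.frame(
  year = 2010,
  a1 = c(1, 2), a2 = c(3, 1),
  b1 = c(2, 2), b2 = c(1, 1),
  c1 = c(0, 4), c2 = c(2, 2)
)

first_cols <- seq(2, 7, 2)  # positions of a1, b1, c1
df_target <- as.data.frame(matrix(nrow = nrow(x), ncol = length(first_cols)))

for (i in first_cols) {
  df_target[, i / 2] <- x[[i]] * x[[i + 1]] / sum(x[[i + 1]])
}

# Carry over the names of the first column of each pair
names(df_target) <- names(x)[first_cols]
df_target
```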

Here is a tidyverse solution. Try running this bit by bit so you can see what it does.

First it adds the rowid as a column to make sure unique rows can be identified later. Then it reshapes the data using pivot_longer to put the data into long format, and then pivot_wider to partially reverse this. Then the data are grouped and the calculation run. This is running a loop internally.

library(tidyverse)
set.seed(123)
tibble(
  year = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2013, 2013),
  a1 = rnorm(10),
  a2 = rnorm(10),
  b1 = rnorm(10),
  b2 = rnorm(10),
  c1 = rnorm(10),
  c2 = rnorm(10)
) %>% 
  rowid_to_column() %>% 
  pivot_longer(cols = -c(year, rowid), names_to = c("nameA", "name12"), names_pattern = "(\\w)(\\d)" ) %>% 
  pivot_wider(names_from = name12, values_from = value) %>% 
  group_by(nameA) %>% 
  mutate(j = `1` * `2` / (sum(`2`)))
#> # A tibble: 30 x 6
#> # Groups:   nameA [3]
#>    rowid  year nameA     `1`     `2`        j
#>    <int> <dbl> <chr>   <dbl>   <dbl>    <dbl>
#>  1     1  2010 a     -0.560   1.22   -0.329  
#>  2     1  2010 b     -1.07    0.426  -0.141  
#>  3     1  2010 c     -0.695   0.253  -0.0794 
#>  4     2  2010 a     -0.230   0.360  -0.0397 
#>  5     2  2010 b     -0.218  -0.295   0.0200 
#>  6     2  2010 c     -0.208  -0.0285  0.00268
#>  7     3  2010 a      1.56    0.401   0.299  
#>  8     3  2010 b     -1.03    0.895  -0.285  
#>  9     3  2010 c     -1.27   -0.0429  0.0245 
#> 10     4  2011 a      0.0705  0.111   0.00374
#> # … with 20 more rows

Created on 2020-10-26 by the reprex package (v0.3.0)
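Note that the pipeline above groups by nameA only, so sum(`2`) runs over all years. If the weights should be computed for each year separately, as the question asks, a sketch of the adjustment is to group by year as well. The object names `df` and `res` are introduced here for illustration:

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(123)
df <- tibble(
  year = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2013, 2013),
  a1 = rnorm(10), a2 = rnorm(10),
  b1 = rnorm(10), b2 = rnorm(10),
  c1 = rnorm(10), c2 = rnorm(10)
)

res <- df %>%
  rowid_to_column() %>%
  pivot_longer(
    cols = -c(year, rowid),
    names_to = c("nameA", "name12"),
    names_pattern = "(\\w)(\\d)"
  ) %>%
  pivot_wider(names_from = name12, values_from = value) %>%
  group_by(year, nameA) %>%              # one group per year and variable
  mutate(j = `1` * `2` / sum(`2`)) %>%   # sum(`2`) is now taken within the year
  ungroup()

res
```

The values of j will differ from the output shown above, because the denominator is now the within-year sum rather than the overall sum.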
