I have a list of sequences, aqi_range, and a data frame, df:

aqi_range = list(0:50,51:100,101:250)

df

   PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max
 1      85.6        3      264       75.7         3       240
 2     105.         6      243       76.4         3       191
 3      95.8       19      287       48.4         8       134
 4      85.5       50      166       64.8        32       103
 5      55.9       24      117       46.7        19        77
 6      37.5        6      116       31.3         3        87
 7      26          5       69       15.5         3        49
 8      82.3       34      169       49.6        25       120
 9      170        68      272       133         67       201
10      254       189      323       226        173       269

I have written two simple functions that I want to apply to this data frame to calculate the AQI (Air Quality Index) for each pollutant.

#a = column from a dataframe  **PM10_mean, PM2.5_mean**
#b = list of sequences defined above
min_max_diff <- function(a, b) {
  for (i in b) {
    if (a %in% i) {
      min_val = min(i)
      max_val = max(i)
      return(max_val - min_val)
    }
  }
}

#a = column from a dataframe  **PM10_mean, PM2.5_mean**
#b = list of sequences defined above
c_low <- function(a, b) {
  for (i in b) {
    if (a %in% i) {
      min_val = min(i)
      return(min_val)
    }
  }
}

The first function, min_max_diff, takes a value from the column df$PM10_mean (or df$PM2.5_mean), looks for it in the list aqi_range, and returns the difference between the maximum and minimum of the sequence that contains it. Similarly, the second function, c_low, returns just the minimum of that sequence.

I want to apply the following formula to the PM10_mean column to create a new column, PM10_AQI:

df$PM10_AQI  = min_max_diff(df$PM10_mean,aqi_range) / (df$PM10_max - df$PM10_min) / * (df$PM10_mean -  df$PM10_min) + c_low(df$PM10_mean,aqi_range)

I hope this explains the problem properly.

  • Can you share the output of dput(df)? Commented Jan 3, 2020 at 21:49
  • diff is a base R function. Please use another name Commented Jan 3, 2020 at 21:53
  • it's a pretty big dataframe. @IceCreamToucan Commented Jan 3, 2020 at 21:54
  • I've added the first 10 rows of the dataset. Can you please try to use the functions that I've defined? @IceCreamToucan Commented Jan 3, 2020 at 22:22
  • The first value in PM10_mean is 85.6, and the function tests if (a %in% i). No value in aqi_range satisfies this criterion, so a %in% i will never be true. Note that aqi_range contains only integers whereas the numbers in PM10_mean are decimals, and you are performing an exact match. Do you want to check whether the numbers fall within a range instead? Also, in the last part where you shared data, I am assuming your input has only two columns, PM10_mean and PM2.5_mean, and the rest are your expected output columns. Commented Jan 4, 2020 at 1:18

1 Answer

If your problem is just how to apply the given transformation to several columns of a data frame, you could write a for loop, construct the name of each variable involved using string functions (sub() is useful here), and refer to the columns with the [ notation rather than the $ notation, since [ accepts strings to specify columns.

Below is an example of such code, using a small sample of 3 observations:

(Note that I changed the definition of the AQI ranges: I now define only the breaks where the range changes, assuming they are all integers. I also collapsed your functions min_max_diff() and c_low() into a single function that returns the minimum and maximum of the AQI range in which each value is found; again, this assumes that the AQI range bounds are integers.)

# Definition of the AQI ranges (which are assumed to be based on integer values)
# Note that if the number of AQI ranges is k, the number of breaks is k+1
# Each break value defines the minimum of the range
# The maximum of each range is computed as the "minimum of the NEXT range" - 1
# (again this assumes integer values in AQI ranges)
# The values (e.g. PM10_mean) whose AQI range is searched for are assumed
# to NOT be larger than or equal to the largest break value.
aqi_range_breaks = c(0, 51, 101, 251)

# Example data (top 3 rows of the data frame you provided)
df = data.frame(PM10_mean=c(85.6, 105.0, 95.8),
                PM10_min=c(3, 6, 19),
                PM10_max=c(264, 243, 287),
                PM2.5_mean=c(75.7, 76.4, 48.4),
                PM2.5_min=c(3, 3, 8),
                PM2.5_max=c(240, 191, 134))

# Function that returns the minimum and maximum AQI values
# of the AQI range where the given values are found
# `values`: array of values that are searched for in the AQI ranges
# defined by the second parameter.
# `aqi_range_breaks`: breaks defining the minimum values of each AQI range
# plus one last value defining a value never attained by `values`.
# (all values in this parameter defining the AQI ranges are assumed integer values)
find_aqi_range_min_max <- function(values, aqi_range_breaks){
  aqi_range_groups = findInterval(values, aqi_range_breaks)
  return( list(min=aqi_range_breaks[aqi_range_groups],
               max=aqi_range_breaks[aqi_range_groups + 1] - 1))
}

# Run the variable transformation on the selected `_mean` columns
vars_mean = c("PM10_mean", "PM2.5_mean")
for (vmean in vars_mean) {
  vmin = sub("_mean$", "_min", vmean)
  vmax = sub("_mean$", "_max", vmean)
  vaqi = sub("_mean$", "_AQI", vmean)
  aqi_range_min_max = find_aqi_range_min_max(df[,vmean], aqi_range_breaks)
  df[,vaqi] = (aqi_range_min_max$max - aqi_range_min_max$min) / 
              (df[,vmax] - df[,vmin]) / (df[,vmean] -  df[,vmin]) +
              aqi_range_min_max$min
}

Note how findInterval() is used to find the range in which each value of a vector falls. That was the key to making your transformation work on a data frame column.
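
To make the difference concrete, here is a small sketch contrasting the exact-match test from the question with findInterval(), using the three PM10_mean values from the sample data. The variable names `x`, `breaks` and `idx` are only for illustration.

```r
x <- c(85.6, 105.0, 95.8)      # sample PM10_mean values from the question
breaks <- c(0, 51, 101, 251)   # same values as aqi_range_breaks above

# Exact matching against an integer sequence fails for non-integer values
# (and 105 is simply outside 51:100)
x %in% 51:100
# [1] FALSE FALSE FALSE

# findInterval() is vectorized: for each value it returns the index i
# such that breaks[i] <= value < breaks[i + 1]
idx <- findInterval(x, breaks)
idx
# [1] 2 3 2

breaks[idx]           # lower bound of each value's range
# [1]  51 101  51
breaks[idx + 1] - 1   # upper bound, assuming integer-valued ranges
# [1] 100 250 100
```

One consequence of working with breaks: a value such as 50.5, which lies between two of the original integer sequences, is assigned to the lower range (0 to 50) rather than being left unmatched.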

The expected output of this process is:

  PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max  PM10_AQI    PM2.5_AQI
1      85.6        3      264       75.7         3       240  51.00227 51.002843893
2     105.0        6      243       76.4         3       191 101.00635 51.003550930
3      95.8       19      287       48.4         8       134  51.00238  0.009822411

Please check the formula that computes the AQI, because it had a syntax error (look for / *, which I replaced with / in the formula in my code).
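
If the `/ *` was instead meant to be `*`, i.e. a linear interpolation of the mean within its AQI range, the computation might look like the sketch below. This is only an assumption about the intended formula, not something the question confirms. The helper name `compute_aqi` is hypothetical; it follows the same logic as find_aqi_range_min_max() and the column naming used above.

```r
# Assumption: "/ *" in the question was meant to be "*", giving
#   range_width / (max - min) * (mean - min) + range_min
aqi_range_breaks <- c(0, 51, 101, 251)

df <- data.frame(PM10_mean = c(85.6, 105.0, 95.8),
                 PM10_min  = c(3, 6, 19),
                 PM10_max  = c(264, 243, 287))

# Hypothetical helper: computes the AQI for one "_mean" column
compute_aqi <- function(df, vmean, breaks) {
  vmin <- sub("_mean$", "_min", vmean)
  vmax <- sub("_mean$", "_max", vmean)
  grp  <- findInterval(df[[vmean]], breaks)
  lo   <- breaks[grp]           # minimum of the matched AQI range
  hi   <- breaks[grp + 1] - 1   # maximum of the matched AQI range
  (hi - lo) / (df[[vmax]] - df[[vmin]]) * (df[[vmean]] - df[[vmin]]) + lo
}

df$PM10_AQI <- compute_aqi(df, "PM10_mean", aqi_range_breaks)
```

With this variant, (mean - min) / (max - min) lies between 0 and 1 whenever min <= mean <= max, so the result stays inside the matched AQI range.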

Note that the $ in the regular expression passed to sub() anchors the match at the end of the string, so "_mean" is replaced only when it occurs at the end of the variable name.
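
A quick illustration of the anchor, and of selecting a column by a name held in a string. The column name "PM10_mean_mean" is made up purely to show the effect of the anchor.

```r
# With the "$" anchor, only the trailing "_mean" is replaced
sub("_mean$", "_min", "PM10_mean_mean")
# [1] "PM10_mean_min"

# Without the anchor, sub() replaces the first match instead
sub("_mean", "_min", "PM10_mean_mean")
# [1] "PM10_min_mean"

# Column names built as strings work with [ and [[, but not with $
df <- data.frame(PM10_mean = c(85.6, 105.0), PM10_min = c(3, 6))
vmin <- sub("_mean$", "_min", "PM10_mean")
df[, vmin]
# [1] 3 6
df[[vmin]]
# [1] 3 6
df$vmin   # looks for a column literally named "vmin"
# NULL
```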

4 Comments

I tried this and it gave me this error: "Error in Ops.data.frame(min_max_diff(y[, vmean], aqi_range)/(y[, vmax] - : ‘*’ only defined for equally-sized data frames"
The functions min_max_diff() and c_low() work fine if I give them any random number as input, but they don't seem to work on data frame columns. Any advice on how I can solve this?
@astroluv I have just edited the answer so that now the transformation works with data frame columns. The key point was the use of the findInterval() function. Note how I redefined the AQI ranges (now defined simply as breaks), which, as noted, assumes that the values in the AQI ranges are all integer. As explained in the edit, I also simplified the two functions into a single function.
Thanks for the help. It worked for me, and your comments helped a lot in understanding the root cause.
