I know this can be done with a for-loop, but I am certain there is a more elegant solution within data.table itself.

I have two data tables, and will use 'iris' to illustrate my issue:

library("data.table")
A <- as.data.table(iris)                      #primary data table
B <- A[Sepal.Width > 3, .N, by = Species]     #count from A meeting condition

head(A, 3)
#       Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1:          5.1         3.5          1.4         0.2     setosa
#2:          4.9         3.0          1.4         0.2     setosa
#3:          4.7         3.2          1.3         0.2     setosa

B
#      Species  N
#1:     setosa 42
#2: versicolor  8
#3:  virginica 17

I would like to add a new variable to B which is simply the proportion of the data set that B represents, i.e. for the first row the output would be something like:

B[, Proportion := N/nrow(A[Species == "setosa"])]

The right-hand side of that assignment would obviously need to be dynamic, referencing the value of the first column of B for each row.

It is this iteration that eludes me (though I feel it has to do with the data table key(s) perhaps?); greatly appreciate any help!

2 Answers

I would approach this as follows:

A <- as.data.table(iris)
B <- A[Sepal.Width > 3, .N, by = .("spec" = Species)]

B[, Proportion := N/nrow(A[Species == spec]), by = spec]

which gives:

> B
         spec  N Proportion
1:     setosa 42       0.84
2: versicolor  8       0.16
3:  virginica 17       0.34

Explanation:

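  • Renaming the Species column to spec avoids a name clash: inside A[Species == spec], Species refers to the column of A, while spec is not a column of A and is therefore taken from B. Had both been called Species, the condition would compare A's column with itself.
  • Using by = spec ensures that spec is a single value within each group, so A[Species == spec] selects only the rows of A for that species. Without by, spec is the whole column of B rather than one value per row.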
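
Since the question asks whether keys or joins could play a role, here is a minimal sketch of the same calculation as an update join. It uses the question's original B (with the column still named Species); the names totals and total are made up for illustration and appear in neither the question nor the answer.

```r
library(data.table)

A <- as.data.table(iris)                     # primary data table
B <- A[Sepal.Width > 3, .N, by = Species]    # counts meeting the condition

# Per-species row totals; `totals` and `total` are illustrative names.
totals <- A[, .(total = .N), by = Species]

# Update join: each row of B is matched to its row in totals by Species,
# and Proportion is added to B by reference. The i. prefix refers to a
# column of the table passed in the first argument (totals).
B[totals, on = "Species", Proportion := N / i.total]

B[]
```

This variant needs neither a renamed column nor by, because the matching is done by the join itself.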

1 Comment

Jaap, this worked perfectly on my (much larger) data tables. I've marked it as accepted, but would you mind explaining it a bit in words? I believe your assignment of B differs in that you've given it a different column name ("spec" vs. "Species") to be used in the index for Proportion, yes? Why is the by necessary, though? I tested without it and saw that the results were not correct, but I can't seem to understand how this corrected it.

One question many solutions ;-)

library("data.table")
A <- as.data.table(iris)                      #primary data table

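# The chained query must start on the same line as the closing bracket of
# the first one; a new line starting with "[" is a syntax error in R.
B <- A[, .(group.count = nrow(.SD[Sepal.Width > 3]), total.count = .N), by = Species][
  , Proportion := group.count / total.count]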

# Just to validate the total counts:
A[, .N, by = Species][]

Result:

      Species group.count total.count Proportion
1:     setosa          42          50       0.84
2: versicolor           8          50       0.16
3:  virginica          17          50       0.34

How it works:

Group by species first, then count within each group: .SD (the "Subset of Data" for the current group) is filtered again so that only the relevant rows are counted, while .N gives the total number of rows in the group. The result is then used in a second, "chained" data.table query (the second pair of square brackets) to calculate the proportions.
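
The filter-and-count step can also be written without .SD, because summing a logical vector counts its TRUE values. This is only a sketch of an equivalent form under the same iris setup, and B_alt is an illustrative name:

```r
library(data.table)

A <- as.data.table(iris)

# sum() over the logical condition counts the matching rows in each group;
# .N is the total number of rows in the group.
B_alt <- A[, .(group.count = sum(Sepal.Width > 3), total.count = .N), by = Species][
  , Proportion := group.count / total.count]

B_alt[]
```

If the two helper columns are not wanted, mean(Sepal.Width > 3) in the first query would yield the proportion directly.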

In data.table, .() is an alias for the list constructor function list(); it is required here because I return more than one column.

The := operator creates a new column by reference, i.e. without copying the whole data table, which makes it very fast.
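
To make "by reference" concrete, here is a small sketch; the names B_same and B_copy are made up for illustration, and the denominator 50 relies on iris having 50 rows per species.

```r
library(data.table)

B <- as.data.table(iris)[Sepal.Width > 3, .N, by = Species]

B_same <- B          # plain assignment: both names refer to the same table
B_copy <- copy(B)    # copy() makes an independent copy

# := modifies the table in place instead of returning a modified copy.
B[, Proportion := N / 50]

"Proportion" %in% names(B_same)   # TRUE: B_same is the same object as B
"Proportion" %in% names(B_copy)   # FALSE: the copy was not touched
```

This is why no assignment back to B is needed after using :=.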

2 Comments

One question many solutions ;-) couldn't agree more! I like this also as it elucidates the use of .SD well, I only prefer the original answer as it avoids an additional column -- in this example it is not as noticeable but with the dataset I'm working with which is already large it is suboptimal for me to be able to keep track of everything. Nonetheless I am grateful you point out this version
@daRknight Could you give us a performance comparison of both solutions (let us learn from your experiences too :-) ?
