0

How does it work to aggregate a variable in a postgres db backend table to its unique value?

For example, I have the following table:

library(tidyverse)
library(dbplyr)

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
summarise(aggregatedSum = sum(b), 
          aggregatedUnique = unique(a))

But neither unique() nor distinct() are doing the job. Any ideas how to achieve my desired outcome like so when I collect() the table before summarise:

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
collect() %>%
summarise(aggregatedSum = sum(b), 
          aggregatedUnique = unique(a))

# A tibble: 1 x 2
  aggregatedSum aggregatedUnique
          <dbl>            <dbl>
1             9                2
4
  • Does column a in your database always only contain one distinct value? Or do you have several distinct values in column a and you want a row for each one in your output? Commented Oct 14, 2021 at 14:55
  • it always contains one distinct value Commented Oct 14, 2021 at 15:00
  • but your second point is also interesting Commented Oct 14, 2021 at 15:01
  • Related: stackoverflow.com/q/25758554/3358272 Commented Oct 14, 2021 at 15:18

2 Answers 2

1

You can just add a group_by to your dplyr pipe:

> dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
+   group_by(a) %>% 
+   summarise(aggregatedSum = sum(b)) %>% 
+   rename(aggregatedUnique = a) %>% 
+   select(aggregatedSum, aggregatedUnique)

# Source:   lazy query [?? x 2]
# Database: sqlite 3.34.1 [:memory:]
  aggregatedSum aggregatedUnique
          <dbl>            <dbl>
1             9                2

If there are multiple distinct values in column a, this will return a row per value (with the sum of the b values that occur alongside them).

Sign up to request clarification or add additional context in comments.

1 Comment

group_by() would be a possibility, but I always try to avoid it because it is more computationally intensive. But unfortunately it seems there are no unique/distinct alike functions. I'll will then just use another aggregate function like max()/min()/mean() for such cases, since every value in a is equal anyway.
1

I might be misinterpreting, but this seems like a grouping operation, where you might want the sum of b for each unique value of a. If so, then group_by(a):

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  group_by(a) %>%
  summarise(aggregatedSum = sum(b))
# Source:   lazy query [?? x 2]
# # Database: sqlite 3.33.0 [:memory:]
#       a aggregatedSum
#   <dbl>         <dbl>
# 1     2             9

This is related to How to combine SELECT DISTINCT and SUM() in that I believe SQL does not let you sum(.) and distinct(.) in the same query. The unchanged query looks like:

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  summarise(aggregatedSum = sum(b), 
          aggregatedUnique = distinct(a)) %>%
  show_query()
# <SQL>
# SELECT SUM(`b`) AS `aggregatedSum`, distinct(`a`) AS `aggregatedUnique`
# FROM `dbplyr_014`

whereas the updated query is

dbplyr::memdb_frame(a=c(2,2,2), b=c(2,3,4)) %>%
  group_by(a) %>%
  summarise(aggregatedSum = sum(b)) %>%
  show_query()
# <SQL>
# SELECT `a`, SUM(`b`) AS `aggregatedSum`
# FROM `dbplyr_016`
# GROUP BY `a`

which is aligned with the linked question/answer.

2 Comments

group_by() would be a possibility, but I always try to avoid it because it is more computationally intensive. But unfortunately it seems there are no unique/distinct alike functions. I'll will then just use another aggregate function like max()/min()/mean() for such cases, since every value in a is equal anyway.
The data-viewpoint here is confusing to me. I suspect that it is overly-simplified for the sake of a question on SO, where the variability of the larger real data is unknown to us. Good luck.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.