PostgreSQL: query with join and group by is taking too long

Question

I have two tables which look like this:

table_1:

-----------------------------------
| ID  |  customer_id  |   city    |
-----------------------------------
| 0   |  E100         |  Sydney   |
-----------------------------------
| 1   |  E200         |  Toronto  | 
-----------------------------------
| 2   |  E300         |  New York |
-----------------------------------

table_2:

----------------------------------------------
| customer_id  |    timestamp   |   receipt  |
----------------------------------------------
|    E200      |  '2019-03-25'  |    200$    | 
----------------------------------------------
|    E300      |  '2019-03-26'  |    300$    |
----------------------------------------------
|    E300      |  '2019-03-26'  |    100$    |
----------------------------------------------
|    E100      |  '2019-03-27'  |     50$    | 
----------------------------------------------
|    E100      |  '2019-03-28'  |     50$    |
----------------------------------------------
|    E100      |  '2019-03-29'  |     50$    |
----------------------------------------------

What I want to do is sum all receipts for each distinct customer_id. The result table should look like this:

----------------------------------------------
| customer_id |    city    |   sum(receipt)  |
----------------------------------------------
|    E100     |  Sydney    |      150$       |
----------------------------------------------
|    E200     |  Toronto   |      200$       | 
----------------------------------------------
|    E300     |  New York  |      400$       |
----------------------------------------------

In order to achieve this, I use the following PostgreSQL query:

SELECT a.customer_id, a.city, SUM(b.receipt) 
FROM public.table_1 a 
INNER JOIN public.table_2 b
   ON a.customer_id = b.customer_id
   WHERE b.timestamp > '2019-03-25 00:00:00' 
   AND b.timestamp < '2019-04-01 00:00:00' 
GROUP BY a.customer_id, a.city

However, table_2 has more than 300 million rows and table_1 has 129 rows, and the query takes too long (I don't know exactly how long; EXPLAIN ANALYZE on this query didn't finish either). I guess the INNER JOIN is the bottleneck here (please correct me if I'm wrong). I do know that the query returns the right result, as I have tried it with a filter of just one day (not one week).

My question is how to speed up this query. I have already considered adding an index like this:

CREATE INDEX table_2_index ON table_2(customer_id, timestamp)

But this statement is also taking forever.

Any suggestions?

If EXPLAIN ANALYZE takes too long, you can use a plain EXPLAIN instead. It is a lot less useful, but it still shows what the database is planning to do. You can then verify the steps manually (e.g. SELECT COUNT(*) FROM table WHERE ...).
– Wolph, Commented Oct 2, 2019 at 7:50
As mentioned in the answers below, "join then aggregate" can be slower than "aggregate then join". In your case public.table_2 should be the "master" table, so first optimize a query like select customer_id, sum(receipt) from table_2 where timestamp > '2019-03-25 00:00:00' and timestamp < '2019-04-01 00:00:00' group by customer_id and then join it with table_1 (I believe customer_id has the same uniqueness as the pair (customer_id, city)).
– Abelisto, Commented Oct 2, 2019 at 7:59
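As a rough sketch of Wolph's suggestion, the statements below show a plain EXPLAIN of the query from the question, followed by a manual row count for the time filter. Table and column names are taken from the question. The row count itself may be slow on a table of this size if no index supports the timestamp filter.

```sql
-- Plain EXPLAIN only plans the query; it does not execute it,
-- so it returns quickly even when the query itself would not finish.
EXPLAIN
SELECT a.customer_id, a.city, SUM(b.receipt)
FROM public.table_1 a
INNER JOIN public.table_2 b
   ON a.customer_id = b.customer_id
WHERE b.timestamp > '2019-03-25 00:00:00'
  AND b.timestamp < '2019-04-01 00:00:00'
GROUP BY a.customer_id, a.city;

-- Manual check: how many rows of table_2 survive the time filter?
SELECT COUNT(*)
FROM public.table_2
WHERE timestamp > '2019-03-25 00:00:00'
  AND timestamp < '2019-04-01 00:00:00';
```

Comparing the row estimates in the plan with the actual count is one way to spot a step where the planner's assumptions are far off.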

Ed Bangga · Accepted Answer · 2019-10-02 07:44:38Z

3

Let's try filtering table_2 first, before joining.

SELECT a.customer_id, a.city, SUM(b.receipt) 
FROM public.table_1 a
INNER JOIN 
(SELECT receipt, customer_id FROM public.table_2 
    WHERE timestamp > '2019-03-25 00:00:00' 
    AND timestamp < '2019-04-01 00:00:00') b ON a.customer_id = b.customer_id
GROUP BY a.customer_id, a.city



3 Comments

user7335295 Over a year ago

It takes 1 min now. Thank you so much!

user7335295 Over a year ago

Is there a way to add another column of receipt sums for another time interval within one query?

Ed Bangga Over a year ago

Copy the entire inner join with a new alias, c, and adjust its timestamps to the other interval:

INNER JOIN (SELECT receipt, customer_id FROM public.table_2 WHERE timestamp > '2019-03-25 00:00:00' AND timestamp < '2019-04-01 00:00:00') c ON a.customer_id = c.customer_id

Then add a new column, SUM(c.receipt).
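One caveat with joining a second row-level subquery: for each customer, every row of b is paired with every row of c, so both sums can come out inflated. A possible alternative, sketched here under assumptions, is conditional aggregation with FILTER (available since PostgreSQL 9.4) in a single pass over table_2. The second interval and the output column names are hypothetical, chosen only for illustration.

```sql
SELECT a.customer_id, a.city, b.receipt_sum_week_1, b.receipt_sum_week_2
FROM public.table_1 a
JOIN (
   SELECT t2.customer_id,
          -- Sum only the rows that fall into the first interval.
          SUM(t2.receipt) FILTER (
             WHERE t2.timestamp > '2019-03-25 00:00:00'
               AND t2.timestamp < '2019-04-01 00:00:00') AS receipt_sum_week_1,
          -- Sum only the rows that fall into the second (hypothetical) interval.
          SUM(t2.receipt) FILTER (
             WHERE t2.timestamp > '2019-04-01 00:00:00'
               AND t2.timestamp < '2019-04-08 00:00:00') AS receipt_sum_week_2
   FROM public.table_2 t2
   -- The outer filter spans both intervals, so rows outside them are skipped early.
   WHERE t2.timestamp > '2019-03-25 00:00:00'
     AND t2.timestamp < '2019-04-08 00:00:00'
   GROUP BY t2.customer_id
) b ON a.customer_id = b.customer_id;
```

Because each customer appears once in the aggregated subquery, the join to table_1 cannot multiply rows. A customer with receipts in only one of the intervals gets NULL for the other sum.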

user330315 · Accepted Answer · 2019-10-02 07:48:24Z

3

Try to aggregate first, then join:

SELECT a.customer_id, a.city, b.receipt_sum
FROM public.table_1 a 
 JOIN (
   SELECT t2.customer_id, sum(t2.receipt) as receipt_sum
   FROM public.table_2 t2
   WHERE t2.timestamp > '2019-03-25 00:00:00' 
     AND t2.timestamp < '2019-04-01 00:00:00' 
   GROUP BY t2.customer_id
 ) b ON a.customer_id = b.customer_id
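Both answers still have to locate the rows of table_2 inside the time range. The index attempted in the question has customer_id as its leading column, which is not the natural fit for a pure timestamp range filter. A hedged sketch of an index that leads with timestamp follows. The index name and the memory setting are assumptions, and whether the planner actually uses the index depends on how large a share of the table the range covers.

```sql
-- Optional: give the index build more memory for this session
-- (the value is an assumption; size it to the server).
SET maintenance_work_mem = '1GB';

-- Hypothetical index name. CONCURRENTLY avoids blocking writes during the
-- build, but it cannot run inside a transaction block and does not make
-- the build itself faster. INCLUDE requires PostgreSQL 11 or later.
CREATE INDEX CONCURRENTLY table_2_timestamp_idx
    ON public.table_2 (timestamp)
    INCLUDE (customer_id, receipt);
```

With customer_id and receipt stored in the index, the aggregation may be answered by an index-only scan, provided the table has been vacuumed recently enough for the visibility map to be current. On 300 million rows the build will still take a while.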

