I have two tables which look like this:

table_1:

-----------------------------------
| ID  |  customer_id  |   city    |
-----------------------------------
| 0   |  E100         |  Sydney   |
-----------------------------------
| 1   |  E200         |  Toronto  | 
-----------------------------------
| 2   |  E300         |  New York |
-----------------------------------

table_2:

----------------------------------------------
| customer_id  |    timestamp   |   receipt  |
----------------------------------------------
|    E200      |  '2019-03-25'  |    200$    | 
----------------------------------------------
|    E300      |  '2019-03-26'  |    300$    |
----------------------------------------------
|    E300      |  '2019-03-26'  |    100$    |
----------------------------------------------
|    E100      |  '2019-03-27'  |     50$    | 
----------------------------------------------
|    E100      |  '2019-03-28'  |     50$    |
----------------------------------------------
|    E100      |  '2019-03-29'  |     50$    |
----------------------------------------------

What I want to do is sum up all receipts for each distinct customer_id. The result table should look like this:

----------------------------------------------
| customer_id |    city    |   sum(receipt)  |
----------------------------------------------
|    E100     |  Sydney    |      150$       |
----------------------------------------------
|    E200     |  Toronto   |      200$       | 
----------------------------------------------
|    E300     |  New York  |      400$       |
----------------------------------------------

In order to achieve this, I use the following PostgreSQL query:

SELECT a.customer_id, a.city, SUM(b.receipt)
FROM public.table_1 a
INNER JOIN public.table_2 b
   ON a.customer_id = b.customer_id
WHERE b.timestamp > '2019-03-25 00:00:00'
  AND b.timestamp < '2019-04-01 00:00:00'
GROUP BY a.customer_id, a.city;

However, table_2 has more than 300 million rows (table_1 has only 129), and the query takes too long. I don't know exactly how long, because EXPLAIN ANALYZE on this query didn't finish either. I guess the INNER JOIN is the bottleneck here (please correct me if I am wrong). I do know that the query returns the right result, as I have tried it with a filter of just one day instead of one week.

My question is how to speed up this query. I have already considered adding an index like this:

CREATE INDEX table_2_index ON table_2(customer_id, timestamp)

But this statement is also taking forever.
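
Building any index on 300 million rows will take a while, but the column order may matter for this query: it filters only on timestamp, while the index above leads with customer_id. A minimal sketch of an alternative, assuming PostgreSQL 11 or later (for INCLUDE); the index name table_2_ts_idx is hypothetical:

```sql
-- Hypothetical index name. The index leads with the column used in the
-- range filter. INCLUDE (PostgreSQL 11+) stores the other two columns in
-- the index, so the planner may be able to use an index-only scan.
-- CONCURRENTLY avoids blocking writes during the build; it cannot be run
-- inside a transaction block.
CREATE INDEX CONCURRENTLY table_2_ts_idx
    ON public.table_2 (timestamp)
    INCLUDE (customer_id, receipt);
```

Whether the planner actually picks this index depends on how selective the one-week range is, so it is worth checking the plan with EXPLAIN afterwards.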

Any suggestions?

  • If EXPLAIN ANALYZE takes too long, you can use a plain EXPLAIN instead. It is a lot less useful, but it still shows what the database is planning. You can then verify the steps manually (e.g. SELECT COUNT(*) FROM table WHERE ...). Commented Oct 2, 2019 at 7:50
  • As mentioned in the answers below, "join then aggregate" can be slower than "aggregate then join". In your case public.table_2 should be the "master" table, so you should first optimize a query like select customer_id, sum(receipt) from table_2 where timestamp > '2019-03-25 00:00:00' and timestamp < '2019-04-01 00:00:00' group by customer_id and then join the result with table_1 (I believe customer_id has the same uniqueness as the pair (customer_id, city_id)). Commented Oct 2, 2019 at 7:59
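
The two comments above can be combined into a quick check. This is only a sketch against the table and column names from the question:

```sql
-- Plain EXPLAIN only plans the query; it does not execute it,
-- so it returns immediately even on a 300-million-row table.
EXPLAIN
SELECT t2.customer_id, SUM(t2.receipt) AS receipt_sum
FROM public.table_2 t2
WHERE t2.timestamp > '2019-03-25 00:00:00'
  AND t2.timestamp < '2019-04-01 00:00:00'
GROUP BY t2.customer_id;

-- Manual verification of one step: how many rows does the time filter keep?
-- Without a usable index on timestamp this count has to scan the table too.
SELECT COUNT(*)
FROM public.table_2
WHERE timestamp > '2019-03-25 00:00:00'
  AND timestamp < '2019-04-01 00:00:00';
```

If the plan shows a sequential scan on table_2, the time filter is not being served by an index.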

2 Answers

Let's try filtering table_2 first, before joining.

SELECT a.customer_id, a.city, SUM(b.receipt)
FROM public.table_1 a
INNER JOIN (
    SELECT receipt, customer_id
    FROM public.table_2
    WHERE timestamp > '2019-03-25 00:00:00'
      AND timestamp < '2019-04-01 00:00:00'
) b ON a.customer_id = b.customer_id
GROUP BY a.customer_id, a.city;

3 Comments

It takes 1 min now. Thank you so much!
Is there a way to add another column of receipt sums for another time interval within one query?
Copy the entire inner join, INNER JOIN (SELECT receipt, customer_id FROM public.table_2 WHERE timestamp > '2019-03-25 00:00:00' AND timestamp < '2019-04-01 00:00:00') c ON a.customer_id = c.customer_id, with the other time interval, then add a new column SUM(c.receipt).
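
A caveat on joining a second subquery on customer_id: each row of b gets paired with every row of c for the same customer, so both sums can come out inflated. One alternative is conditional aggregation with FILTER (PostgreSQL 9.4 or later). A sketch, where the second interval and the column aliases are assumptions made up for illustration:

```sql
SELECT a.customer_id,
       a.city,
       -- First interval: the week from the question.
       SUM(b.receipt) FILTER (WHERE b.timestamp > '2019-03-25 00:00:00'
                                AND b.timestamp < '2019-04-01 00:00:00') AS receipt_sum_week_1,
       -- Second interval: hypothetical, replace with the one you need.
       SUM(b.receipt) FILTER (WHERE b.timestamp > '2019-04-01 00:00:00'
                                AND b.timestamp < '2019-04-08 00:00:00') AS receipt_sum_week_2
FROM public.table_1 a
INNER JOIN (
    -- Pre-filter to the span covering both intervals, so table_2 is read once.
    SELECT receipt, customer_id, timestamp
    FROM public.table_2
    WHERE timestamp > '2019-03-25 00:00:00'
      AND timestamp < '2019-04-08 00:00:00'
) b ON a.customer_id = b.customer_id
GROUP BY a.customer_id, a.city;
```

A customer with no receipts in one of the intervals gets NULL in that column; wrap the sum in COALESCE(..., 0) if a zero is preferred.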

Try to aggregate first, then join:

SELECT a.customer_id, a.city, b.receipt_sum
FROM public.table_1 a 
 JOIN (
   SELECT t2.customer_id, sum(t2.receipt) as receipt_sum
   FROM public.table_2 t2
   WHERE t2.timestamp > '2019-03-25 00:00:00' 
     AND t2.timestamp < '2019-04-01 00:00:00' 
   GROUP BY t2.customer_id
 ) b ON a.customer_id = b.customer_id
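
One detail that may be worth checking in either answer is the lower bound. If timestamp is a timestamp column and the E200 row from the question is stored as '2019-03-25 00:00:00', the strict > comparison excludes it, although the expected result lists 200$ for E200. A sketch of the same aggregate-then-join query with a half-open range, assuming that is the intended behaviour:

```sql
SELECT a.customer_id, a.city, b.receipt_sum
FROM public.table_1 a
JOIN (
    SELECT t2.customer_id, SUM(t2.receipt) AS receipt_sum
    FROM public.table_2 t2
    WHERE t2.timestamp >= '2019-03-25 00:00:00'  -- inclusive lower bound
      AND t2.timestamp <  '2019-04-01 00:00:00'  -- exclusive upper bound
    GROUP BY t2.customer_id
) b ON a.customer_id = b.customer_id;
```

With >= and < the weekly ranges can be placed back to back without a row at exactly midnight being counted twice or dropped.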
