6

I am trying to write a SQL query that returns rows from a table containing data:

The table structure is as follows:

CREATE TABLE person(
    id INT PRIMARY KEY,
    name TEXT,
    operation TEXT);

I want to return all unique rows of name which have not been "cancelled" out. A row is considered to be "cancelled" out if the operation is either "insert" or "delete" and there exists another row with the same name with the opposite operation.

For example, if I have the following rows

id   name   operation
1    bob    insert
2    bob    delete
3    bob    insert

The first 2 rows "cancel" each other out because they share the same name with opposite operations. So the query should return row 3.

Here is another example:

id   name   operation
1    bob    insert
2    bob    delete
3    bob    insert
4    bob    delete

In this case, rows 1 and 2 cancel out, and rows 3 and 4 cancel out. So the query should not return any rows.

Last example:

id   name   operation
1    bob    insert
2    bob    insert

In this case, rows 1 and 2 do not cancel out because the operations are not opposite. So the query should return both rows.

I have the following query which handles the first two scenarios, but it doesn't handle the final scenario.

Does anyone have any suggestion for a query that can handle all 3 scenarios?

SELECT MAX(id),name 
FROM person z 
WHERE operation IN ('insert','delete') 
GROUP BY name 
HAVING count(1) % 2 = 1;
2
  • Is order important? If you have INSERT, INSERT, DELETE which insert should be preserved? Commented Jan 11, 2012 at 17:16
  • @Martin - the latest INSERT row is most relevant, so I guess the 2nd one. Commented Jan 11, 2012 at 17:34

3 Answers 3

4

One way is to compare the operation counts. Since you'll also need to get the number of INSERTS or DELETES that correspond with InsertCount - deleteCount or InsertCount - deleteCount and since PostgreSQL supports window function you should be able to use row_number().

Note: I've not tested this so but according to this PostgreSQL manual Chapter 3. Advanced Features, 3.5 Window functions you can refer to a Window Function in an inline query

SELECT
       id, name
FROM
   (
    SELECT 
            row_number() over (partition by p.name, p.operation order by p.id desc) rn , 
            id,  
            p.Name,
            p.operation, 
            operationCounts.InsertCount,
            operationCounts.deleteCount

    FROM 
       Person p
    INNER JOIN (

        SELECT 
          SUM(CASE WHEN operation = 'insert' then 1 else 0 END) InsertCount,
          SUM(CASE WHEN operation = 'delete' then 1 else 0 END) deleteCount,
          name 
        FROM 
           person 
        GROUP BY
           name ) operationCounts
    ON p.name = operationCounts.name
    WHERE 
      operationCounts.InsertCount <> operationCounts.deleteCount) data
WHERE
      (rn <=  (InsertCount -  deleteCount)
      and operation = 'insert')
      OR
     (rn <=  (deleteCount -  InsertCount)
      and operation = 'delete')
Sign up to request clarification or add additional context in comments.

6 Comments

That is pretty neet but still missing something. You are giving only the max(p.id) and you should give more than that (see question' example 3).
I get the following error when I run the query: ERROR: column reference "name" is ambiguous LINE 6: row_number() over (partition by name, operation ...
@Jin sorry about that. I forgot the table aliases in ROW_NUMBER. I fixed it.
@Conrad Thanks for your great help. I am getting a different error now though: ERROR: missing FROM-clause entry for table "operationcounts" LINE 27: (rn <= (operationCounts.InsertCount - operationCount...
@Jin doh copy/paste error stikes again. I fixed it. In the context of that where we don't need an alias reference but if we did it would be data
|
1

Best speed and shortest answer: The problem can be reduced to

  1. count the delete operations for every name(cnt_del)
  2. neglect the the first cnt_del inserts

This can be written in one shot in this way:(don't know if everything from this query works)

select * from(
    SELECT id, name, 
       row_number() over (partition by name order by case 
                                                     when operation = 'insert' 
                                                     then id 
                                                     else null end 
                                            nulls last ) rnk_insert,
       count(case 
             when operation='delete' then 1 
             else null 
             end) over (partition by name) as cnt_del 
    FROM person z 
    WHERE operation IN ('insert','delete') 
)
where rnk_insert > cnt_del

If previous don't work in postgres(AFAIK, Oracle can handle it) the solution can be implemented in this more relaxed way:

select i.id, i.name 
from

  (select id, name, 
         row_number over (partition by name order by id) as rnk_insert
  from person z
  where operation='insert') i

  left join 

  (select name, count(*) as cnt_del
  from person z 
  where operation='delete') d

  on d.name = i.name

where rnk_insert > coalesce(cnt_del, 0)

Comments

0

Testing revealed my original query to be slower than @Conrad's excellent query. Humbled, I tried a few things and came up with a query that actually is simpler and faster.

Test setup

INSERT INTO person
SELECT i
      ,'name' || (random() * 500)::int::text
      ,CASE WHEN random() >= 0.5 THEN 'insert' ELSE 'delete' END
FROM   generate_series(1,10000) AS i;

Query:

SELECT id, name, operation
FROM  (
    SELECT row_number() OVER (PARTITION BY name, operation ORDER by id) AS rn
          ,id
          ,name
          ,operation
          ,y.cancel
    FROM  (
       SELECT name
             ,least(ct_del, ct_all - ct_del) AS cancel
       FROM  (
          SELECT name
                ,count(*) AS ct_all
                ,count(NULLIF(operation, 'insert')) AS ct_del
          FROM   person
          GROUP  BY 1
          )   x
       WHERE (ct_all - ct_del) <> ct_del
       )   y
    JOIN   person USING (name)
    )   p
WHERE  rn > cancel

It ended up to be similar to @Conrad's query with a few simplifications / improvements. The crucial point is to eliminate names that are cancelled out early in the game.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.