SQL: filtering rows

Question

I am trying to write a SQL query that returns rows from a table containing data:

The table structure is as follows:

CREATE TABLE person(
    id INT PRIMARY KEY,
    name TEXT,
    operation TEXT);

I want to return all unique rows of name which have not been "cancelled" out. A row is considered to be "cancelled" out if the operation is either "insert" or "delete" and there exists another row with the same name with the opposite operation.

For example, if I have the following rows

id   name   operation
1    bob    insert
2    bob    delete
3    bob    insert

The first 2 rows "cancel" each other out because they share the same name with opposite operations. So the query should return row 3.

Here is another example:

id   name   operation
1    bob    insert
2    bob    delete
3    bob    insert
4    bob    delete

In this case, rows 1 and 2 cancel out, and rows 3 and 4 cancel out. So the query should not return any rows.

Last example:

id   name   operation
1    bob    insert
2    bob    insert

In this case, rows 1 and 2 do not cancel out because the operations are not opposite. So the query should return both rows.

I have the following query which handles the first two scenarios, but it doesn't handle the final scenario.

Does anyone have any suggestion for a query that can handle all 3 scenarios?

SELECT MAX(id),name 
FROM person z 
WHERE operation IN ('insert','delete') 
GROUP BY name 
HAVING count(1) % 2 = 1;

Is order important? If you have INSERT, INSERT, DELETE which insert should be preserved? — Martin Smith
– Martin Smith, Commented Jan 11, 2012 at 17:16
@Martin - the latest INSERT row is most relevant, so I guess the 2nd one. — Jin Kim
– Jin Kim, Commented Jan 11, 2012 at 17:34

Conrad Frix · Accepted Answer · 2012-01-11 17:55:16Z

4

One way is to compare the operation counts. Since you'll also need to get the number of INSERTS or DELETES that correspond with InsertCount - deleteCount or InsertCount - deleteCount and since PostgreSQL supports window function you should be able to use row_number().

Note: I've not tested this so but according to this PostgreSQL manual Chapter 3. Advanced Features, 3.5 Window functions you can refer to a Window Function in an inline query

SELECT
       id, name
FROM
   (
    SELECT 
            row_number() over (partition by p.name, p.operation order by p.id desc) rn , 
            id,  
            p.Name,
            p.operation, 
            operationCounts.InsertCount,
            operationCounts.deleteCount

    FROM 
       Person p
    INNER JOIN (

        SELECT 
          SUM(CASE WHEN operation = 'insert' then 1 else 0 END) InsertCount,
          SUM(CASE WHEN operation = 'delete' then 1 else 0 END) deleteCount,
          name 
        FROM 
           person 
        GROUP BY
           name ) operationCounts
    ON p.name = operationCounts.name
    WHERE 
      operationCounts.InsertCount <> operationCounts.deleteCount) data
WHERE
      (rn <=  (InsertCount -  deleteCount)
      and operation = 'insert')
      OR
     (rn <=  (deleteCount -  InsertCount)
      and operation = 'delete')

edited Jan 11, 2012 at 17:55

answered Jan 11, 2012 at 17:13

Conrad Frix

52.8k13 gold badges99 silver badges158 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

aF. Over a year ago

That is pretty neet but still missing something. You are giving only the max(p.id) and you should give more than that (see question' example 3).

Jin Kim Over a year ago

I get the following error when I run the query: ERROR: column reference "name" is ambiguous LINE 6: row_number() over (partition by name, operation ...

Conrad Frix Over a year ago

@Jin sorry about that. I forgot the table aliases in ROW_NUMBER. I fixed it.

Jin Kim Over a year ago

@Conrad Thanks for your great help. I am getting a different error now though: ERROR: missing FROM-clause entry for table "operationcounts" LINE 27: (rn <= (operationCounts.InsertCount - operationCount...

Conrad Frix Over a year ago

@Jin doh copy/paste error stikes again. I fixed it. In the context of that where we don't need an alias reference but if we did it would be data

|

Florin Ghita · Accepted Answer · 2012-01-11 20:39:48Z

Best speed and shortest answer: The problem can be reduced to

count the delete operations for every name(cnt_del)
neglect the the first cnt_del inserts

This can be written in one shot in this way:(don't know if everything from this query works)

select * from(
    SELECT id, name, 
       row_number() over (partition by name order by case 
                                                     when operation = 'insert' 
                                                     then id 
                                                     else null end 
                                            nulls last ) rnk_insert,
       count(case 
             when operation='delete' then 1 
             else null 
             end) over (partition by name) as cnt_del 
    FROM person z 
    WHERE operation IN ('insert','delete') 
)
where rnk_insert > cnt_del

If previous don't work in postgres(AFAIK, Oracle can handle it) the solution can be implemented in this more relaxed way:

select i.id, i.name 
from

  (select id, name, 
         row_number over (partition by name order by id) as rnk_insert
  from person z
  where operation='insert') i

  left join 

  (select name, count(*) as cnt_del
  from person z 
  where operation='delete') d

  on d.name = i.name

where rnk_insert > coalesce(cnt_del, 0)

Erwin Brandstetter · Accepted Answer · 2012-01-24 00:37:06Z

Testing revealed my original query to be slower than @Conrad's excellent query. Humbled, I tried a few things and came up with a query that actually is simpler and faster.

Test setup

INSERT INTO person
SELECT i
      ,'name' || (random() * 500)::int::text
      ,CASE WHEN random() >= 0.5 THEN 'insert' ELSE 'delete' END
FROM   generate_series(1,10000) AS i;

Query:

SELECT id, name, operation
FROM  (
    SELECT row_number() OVER (PARTITION BY name, operation ORDER by id) AS rn
          ,id
          ,name
          ,operation
          ,y.cancel
    FROM  (
       SELECT name
             ,least(ct_del, ct_all - ct_del) AS cancel
       FROM  (
          SELECT name
                ,count(*) AS ct_all
                ,count(NULLIF(operation, 'insert')) AS ct_del
          FROM   person
          GROUP  BY 1
          )   x
       WHERE (ct_all - ct_del) <> ct_del
       )   y
    JOIN   person USING (name)
    )   p
WHERE  rn > cancel

It ended up to be similar to @Conrad's query with a few simplifications / improvements. The crucial point is to eliminate names that are cancelled out early in the game.

Collectives™ on Stack Overflow

SQL: filtering rows

3 Answers 3

6 Comments

Comments

Test setup

Query:

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

6 Comments

Comments

Test setup

Query:

Comments

Your Answer

Sign up or log in

Post as a guest

Related