
In my table each row has some data columns and a Priority column (for example, a timestamp or just an integer). I want to group my data by id and then, within each group, take the latest non-null value of each column. For example, I have the following table:

id  A       B       C       Priority
1   NULL    3       4       1
1   5       6       NULL    2
1   8       NULL    NULL    3
2   634     346     359     1
2   34      NULL    734     2

The desired result is:

id  A   B   C   
1   8   6   4   
2   34  346 734 

In this example the table is small and has only 5 columns, but the real table will be much larger. I really want this script to run fast. I tried to do it myself, but my script only works on SQL Server 2012+, so I deleted it as not applicable.

Numbers: the table could have 150k rows and 20 columns, with 20-80k unique ids, and the average group size (SELECT COUNT(id) FROM T GROUP BY id) is 2..5.

Now I have working code (thanks to @ypercubeᵀᴹ), but it runs very slowly on big tables; in my case the script can take a minute or even more (with indexes and so on).

How can it be sped up?

SELECT 
    d.id,
    d1.A,
    d2.B,
    d3.C
FROM 
    ( SELECT id
      FROM T
      GROUP BY id
    ) AS d
  OUTER APPLY
    ( SELECT TOP (1) A
      FROM T 
      WHERE id = d.id
        AND A IS NOT NULL
      ORDER BY priority DESC
    ) AS d1 
  OUTER APPLY
    ( SELECT TOP (1) B
      FROM T 
      WHERE id = d.id
        AND B IS NOT NULL
      ORDER BY priority DESC
    ) AS d2 
  OUTER APPLY
    ( SELECT TOP (1) C
      FROM T 
      WHERE id = d.id
        AND C IS NOT NULL
      ORDER BY priority DESC
    ) AS d3 ;

In my test database, with a realistic amount of data, I get the following execution plan: [execution plan screenshot]
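
For reference, a covering index along these lines (a sketch; the index name is hypothetical) should at least let each APPLY do a backwards range seek on (id, priority) instead of scanning:

-- Hypothetical covering index for the OUTER APPLY approach: each subquery can
-- seek to the id, read rows in descending priority order, and stop at the
-- first non-null value without any key lookups.
CREATE INDEX IX_T_id_priority ON T (id, priority DESC) INCLUDE (A, B, C);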

4 Answers

This should do the trick: anything raised to the power of 0 returns 1, except NULL:

DECLARE @t table(id int, A int, B int, C int, Priority int)
INSERT @t
VALUES (1, NULL, 3,    4,    1),
       (1, 5,    6,    NULL, 2),
       (1, 8,    NULL, NULL, 3),
       (2, 634,  346,  359,  1),
       (2, 34,   NULL, 734,  2)

;WITH CTE as
(
  SELECT id,
  -- Priority*power(A,0) is Priority when A is non-null and NULL otherwise,
  -- and NULLs sort last under ORDER BY ... DESC, so row_number() = 1 marks
  -- the highest-priority row that has a non-null A (likewise for B and C).
  CASE WHEN row_number() over 
    (partition by id order by Priority*power(A,0) desc) = 1 THEN A END A,
  CASE WHEN row_number() over 
    (partition by id order by Priority*power(B,0) desc) = 1 THEN B END B,
  CASE WHEN row_number() over 
    (partition by id order by Priority*power(C,0) desc) = 1 THEN C END C
  FROM @t
)
SELECT id, max(a) a, max(b) b, max(c) c
FROM CTE
GROUP BY id

Result:

id  a   b   c
1   8   6   4
2   34  346 734
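
If the POWER trick feels too cryptic, the same ordering can be written with a plain CASE expression (equivalent logic, as also noted in the comments below):

;WITH CTE as
(
  SELECT id,
  CASE WHEN row_number() over
    (partition by id order by case when A is not null then Priority end desc) = 1 THEN A END A,
  CASE WHEN row_number() over
    (partition by id order by case when B is not null then Priority end desc) = 1 THEN B END B,
  CASE WHEN row_number() over
    (partition by id order by case when C is not null then Priority end desc) = 1 THEN C END C
  FROM @t
)
SELECT id, max(a) a, max(b) b, max(c) c
FROM CTE
GROUP BY id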

3 Comments

It looks tricky, but it performs everything in 1 second. Thank you, it really seems the best way to fulfill my needs.
@Alex Zhukovskiy: It's a math trick; you can easily replace Priority*power(A,0) with case when a is not null then priority end, which does exactly the same. If you still find this hard to read because one has to know how the particular DBMS handles NULLs in ORDER BY, you can extend this to case when a is not null then priority else -1 end. Anyway, it's a good way to approach the problem. 1 second is surprising, though. I tried the same with a rather big table (some million records) and it takes several minutes to complete on my system.
@ThorstenKettner my tables have good indexes and they are all in the 100k...500k row range. So they are not very big, but a full scan for every column (as in my original post) is definitely not applicable.
One alternative that might be faster is a multiple-join approach: get the highest relevant priority for each column, and then join back to the original table. For the first part:

select id,
       max(case when a is not null then priority end) as pa,
       max(case when b is not null then priority end) as pb,
       max(case when c is not null then priority end) as pc
from t
group by id;

Then join back to this table:

with pabc as (
      select id,
             max(case when a is not null then priority end) as pa,
             max(case when b is not null then priority end) as pb,
             max(case when c is not null then priority end) as pc
      from t
      group by id
     )
select pabc.id, ta.a, tb.b, tc.c
from pabc left join
     t ta
     on pabc.id = ta.id and pabc.pa = ta.priority left join
     t tb
     on pabc.id = tb.id and pabc.pb = tb.priority left join
     t tc
     on pabc.id = tc.id and pabc.pc = tc.priority ;

This can also take advantage of an index on t(id, priority).
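
A concrete version of that index could look like this (a sketch; the INCLUDE columns let the three self-joins read a, b and c without extra lookups):

CREATE INDEX ix_t_id_priority ON t (id, priority) INCLUDE (a, b, c);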

The previous code will work with the following syntax:

with pabc as (
      select id,
             max(case when a is not null then priority end) as pa,
             max(case when b is not null then priority end) as pb,
             max(case when c is not null then priority end) as pc
      from t
      group by id
     )
select pabc.id, ta.a, tb.b, tc.c
from pabc
     left join t ta on pabc.id = ta.id and pabc.pa = ta.priority
     left join t tb on pabc.id = tb.id and pabc.pb = tb.priority
     left join t tc on pabc.id = tc.id and pabc.pc = tc.priority ;

2 Comments

Isn't this the same answer as Mr. Gordon's, except in a different format?
As I mentioned, "the previous code will work with the following syntax". Sorry that I didn't mention that it is Mr. Gordon's code. His code had a mistake, and as I can see, someone has edited his code now...
This looks rather strange. You have a log table for all column changes, but no associated table with the current data. Now you are looking for a query to collect your current values from the log table, which is naturally a laborious task.

The solution is simple: have an additional table with the current data. You can even link the tables with a trigger (so either every time a record gets inserted into your log table you update the current table, or every time a change is written to the current table you write a log entry).

Then just query your current table:

select id, a, b, c from currenttable order by id;
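
A minimal sketch of such a trigger, assuming the log table is T, the current table is called CurrentTable (a hypothetical name), and log rows are inserted one at a time in priority order:

CREATE TRIGGER trg_T_UpdateCurrent
ON T
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Create a current row for ids seen for the first time.
    INSERT INTO CurrentTable (id, A, B, C)
    SELECT i.id, i.A, i.B, i.C
    FROM inserted AS i
    WHERE NOT EXISTS (SELECT 1 FROM CurrentTable AS c WHERE c.id = i.id);

    -- For existing ids, overwrite only the columns that are non-null
    -- in the new log row (this is a no-op for freshly inserted ids).
    UPDATE c
    SET A = COALESCE(i.A, c.A),
        B = COALESCE(i.B, c.B),
        C = COALESCE(i.C, c.C)
    FROM CurrentTable AS c
    JOIN inserted AS i ON i.id = c.id;
END;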

6 Comments

I have a lot of inserts and then one single select. So it's better to group the data when the select is performed, not when each row is inserted.
Adding calculated data like this is a bad idea: it adds unnecessary overhead and complexity to the database at the very least, and introduces the possibility of incorrect data at worst.
@Tom H: I don't agree. Let's take a product table. Of course I can have a history table showing when the price changed, when the stock changed, when the minimum order amount changed, and so on. But you will certainly agree that one should have an additional product table showing the current price, stock, and minimum order. It wouldn't make much sense to read the current data from the log table, as is the case with the OP's query.
But well, if that query only has to be run, say, once a year and you never need to see the current values otherwise, then this is a very special case where you can live with a log table alone. And if there really are that many inserts, maybe every millisecond, then this is a very special case too, which is information I also didn't have when writing this answer.
In the case of a product table, I wouldn't be adding a product table to hold the result of calculated values that now have to be maintained. I would start with a product table that shows the current state of the data. Any changes affect that table (or set of tables) directly. If I then have an additional requirement to capture historical information for a product, I would add that functionality with a log table - the difference being that the log table is generated from actions in the system, and as a result the data can never be "out of sync" because something was left uncalculated.
