-1

I have this table in SQL server :

date var val
2022-2-1 A 1.1
2022-3-1 A 2.3
2022-4-1 A 1.5
2022-5-1 A 1.7
2022-09-1 B 1.8
2022-10-1 B 1.9
2022-11-1 B 2.1
2022-12-1 B 2.22

I want to group by column var and date and implement the expanding average and standard deviation in order to create an upper and lower interval of one standard deviation. At the end I want each entry in column val (apart from the first one) to be checked if falls inside the interval of the previous time (or lag 1).

How can I do it in SQL Server?

My attempt

-- Create the table
CREATE TABLE DataTable (
    [date] DATE,
    var CHAR(1),
    val DECIMAL(4, 2)
);

-- Insert the data
INSERT INTO DataTable ([date], var, val)
VALUES 
('2022-02-01', 'A', 1.1),
('2022-03-01', 'A', 2.3),
('2022-04-01', 'A', 1.5),
('2022-05-01', 'A', 1.7),
('2022-09-01', 'B', 1.8),
('2022-10-01', 'B', 1.9),
('2022-11-01', 'B', 2.1),
('2022-12-01', 'B', 2.22);


WITH DataWithStats AS (
    SELECT 
        [date],
        var,
        val,
        AVG(val) OVER (PARTITION BY var ORDER BY [date] 
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_avg,
        STDEV(val) OVER (PARTITION BY var ORDER BY [date] 
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_std
    FROM DataTable
)
SELECT 
    [date],
    var,
    exp_avg - 2 * exp_std AS lower,
    val,
    exp_avg + 2 * exp_std AS upper,
    CASE 
        WHEN val < LAG(exp_avg - 2 * exp_std, 1) OVER (PARTITION BY var ORDER BY [date]) OR  
             val > LAG(exp_avg + 2 * exp_std, 1) OVER (PARTITION BY var ORDER BY [date]) 
        THEN 'Warning'
        ELSE 'ok'
    END AS status
FROM DataWithStats;

The problem with my attempt is that the check column does not properly evaluate the check.

https://sqlfiddle.com/sql-server/online-compiler?id=fe2e764f-9627-418f-bd08-dd13250b0dab

12
  • 2
    Expected results as well would help us help you. Commented Oct 12, 2024 at 16:17
  • I want to group by column var and date [...] is not what I'm seeing in your code sample. Also, what column is the check column? Be more clear please. Commented Oct 12, 2024 at 16:20
  • 1
    What is incorrect? If you believe the "warning" status is incorrect because 2.10 is not outside the 1.62782795366962 - 2.23883804633038 range, your problem is that your displayed values are not the ones used to set the status. The LAG() average and standard deviation values are use. The 2.10 value is outside the actual 1.7085786437627 - 1.9914213562373 range used by the test. See this db<>fiddle. Commented Oct 12, 2024 at 18:25
  • 1
    Side note: Your description states "an upper and lower interval of one standard deviation", but your calculation appears to be calculating and interval of two standard deviations - exp_avg +/- 2 * exp_std. Commented Oct 12, 2024 at 18:30
  • 1
    "I want each entry in column val (apart from the first one) to be checked..." - If only using prior values to calculate average and standard deviation, you will need at least two prior values before you can calculate a non-zero standard deviation. Commented Oct 12, 2024 at 20:17

1 Answer 1

1

If you believe the "warning" status is incorrect in the query results, because 2.10 is not outside the displayed 1.62782795366962 - 2.23883804633038 range, then I believe the issue that your displayed values are not the ones used to calculate the status. The LAG() average and standard deviation values are used instead of the current values. The 2.10 value is outside the actual 1.7085786437627 - 1.9914213562373 range used by the test.

The results were correct (as coded), but the presentation was misleading.

Original results:

date var lower val upper status
2022-02-01 A null 1.10 null ok
2022-03-01 A 0.00294372515228614 2.30 3.39705627484771 ok
2022-04-01 A 0.411312814678444 1.50 2.85535318532156 ok
2022-05-01 A 0.650000000000001 1.70 2.65 ok
2022-09-01 B null 1.80 null ok
2022-10-01 B 1.7085786437627 1.90 1.9914213562373 ok
2022-11-01 B 1.62782795366962 2.10 2.23883804633038 Warning
2022-12-01 B 1.62500000000001 2.22 2.38499999999999 ok

Results showing actual range used by status calculation.

date var lower val upper status
2022-02-01 A null 1.10 null ok
2022-03-01 A null 2.30 null ok
2022-04-01 A 0.00294372515228614 1.50 3.39705627484771 ok
2022-05-01 A 0.411312814678444 1.70 2.85535318532156 ok
2022-09-01 B null 1.80 null ok
2022-10-01 B null 1.90 null ok
2022-11-01 B 1.7085786437627 2.10 1.9914213562373 Warning
2022-12-01 B 1.62782795366962 2.22 2.23883804633038 ok

Note that STDEV() will be null and no range will be calculated until there are at least two prior values.

The following query modifies the window range in the CTE, so that LAG() is not needed in the final query. The displayed range limit values are the same as are used in the test.

WITH DataWithStats AS (
    SELECT 
        [date],
        var,
        val,
        AVG(val) OVER (PARTITION BY var ORDER BY [date] 
                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS exp_avg,
        STDEV(val) OVER (PARTITION BY var ORDER BY [date] 
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS exp_std
    FROM DataTable
)
SELECT 
    [date],
    var,
    exp_avg - 2 * exp_std AS lower,
    val,
    exp_avg + 2 * exp_std AS upper,
    CASE 
        WHEN val < exp_avg - 2 * exp_std OR  
             val > exp_avg + 2 * exp_std
        THEN 'Warning'
        ELSE 'ok'
    END AS status
FROM DataWithStats;

See this db<>fiddle.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.