I have the following block of code that calculates the formula for a trend line using linear regression (method of least squares). It finds the R-squared value and the line's coefficients (Alpha and Beta) for the X and Y values.

It calculates the correct values when the X and Y values are int or float.

CREATE FUNCTION [dbo].[LinearReqression] (@Data AS XML)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  x = n.value('@x', 'float'),
                y = n.value('@y', 'float')
        FROM @Data.nodes('/r/n') v(n)
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)
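
Written out, the CTEs above compute the following. This is a transcription of the code, not an independent derivation; the symbols mirror the names in the function, and n is the number of rows:

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \\
\beta &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
\alpha &= \bar{y} - \beta\,\bar{x} \\
SS_{tot} &= \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SS_{err} = \sum_{i=1}^{n} \bigl(y_i - (\alpha + \beta x_i)\bigr)^2 \\
R^2 &= 1 - \frac{SS_{err}}{SS_{tot}}
\end{aligned}
```

In the code, NULLIF makes Beta NULL when all x values are equal (the denominator would be zero), and r_squared is set to 1.0 when SS_tot is zero.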

Usage:

DECLARE @DataTable TABLE (
    SourceID    INT,
    x           FLOAT,
    y           FLOAT
) ;
INSERT INTO @DataTable ( SourceID, x, y )
SELECT ID = 0, x = 1.2, y = 1.0
UNION ALL SELECT 1, 1.6, 1
UNION ALL SELECT 2, 2.0, 1.5
UNION ALL SELECT 3, 2.0, 1.75
UNION ALL SELECT 4, 2.1, 1.85
UNION ALL SELECT 5, 2.1, 2
UNION ALL SELECT 6, 2.2, 3
UNION ALL SELECT 7, 2.2, 3
UNION ALL SELECT 8, 2.3, 3.5
UNION ALL SELECT 9, 2.4, 4
UNION ALL SELECT 10, 2.5, 4
UNION ALL SELECT 11, 3, 4.5 ;

-- Create and view XML data array
DECLARE @DataXML XML ;
SET @DataXML = (
    SELECT  -- FLOAT values are formatted in XML like "1.000000000000000e+000", increasing the character count
            -- Converting them to VARCHAR first keeps the XML small, but CAST keeps at most 6 significant digits
            -- (enough for this sample data); they are unpacked as FLOAT in the function either way
            [@x] = CAST(x AS VARCHAR(20)), 
            [@y] = CAST(y AS VARCHAR(20))
    FROM @DataTable
    FOR XML PATH('n'), ROOT('r') ) ;

SELECT @DataXML ;

-- Get the results
SELECT * FROM dbo.LinearReqression (@DataXML) ;

In my case, the X axis may also be a date column. How can I run the same regression analysis with date columns?

Date can be converted to float (fractional days since Jan 1, 1970) or bigint (seconds from any point in time that you choose). Commented Dec 27, 2014 at 10:19

1 Answer

The short answer: calculating a trend line for dates is pretty much the same as calculating one for floats.

For dates, you can choose a starting date and use the number of days between that starting date and each of your dates as X.
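
As a small illustration of that conversion, here is a sketch in which the anchor date 2001-01-01 and the two sample dates are arbitrary choices; DATEDIFF turns each date into a day count that can serve as X:

```sql
-- Sketch: map dates to a numeric X as days since an arbitrary anchor date.
DECLARE @StartDate date = '2001-01-01';  -- assumed anchor; any fixed date works

SELECT  d,
        DATEDIFF(day, @StartDate, d) AS x_days
FROM (VALUES (CAST('2001-01-13' AS date)),
             (CAST('2001-01-31' AS date))) AS v(d);
-- x_days is 12 for 2001-01-13 and 30 for 2001-01-31
```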

I didn't check your function itself; I assume that the formulas in it are correct.

Also, I don't understand why you generate XML out of the table and parse it back into a table inside the function. That is rather inefficient; you can simply pass the table.

I used your function to make two variants: one for floats and one for dates. I'm using SQL Server 2008 for this example.

First, create a user-defined table type so that a table can be passed into the function:

CREATE TYPE [dbo].[FloatRegressionDataTableType] AS TABLE(
    [x] [float] NOT NULL,
    [y] [float] NOT NULL
)
GO

Then create the function that accepts such table:

CREATE FUNCTION [dbo].[LinearRegressionFloat] (@ParamData dbo.FloatRegressionDataTableType READONLY)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  x,
                y
        FROM @ParamData
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)
GO

Similarly, create a table type for data with dates:

CREATE TYPE [dbo].[DateRegressionDataTableType] AS TABLE(
    [x] [date] NOT NULL,
    [y] [float] NOT NULL
)
GO

Then create a function that accepts such a table. For each date x, it calculates the number of days between 2001-01-01 and x using DATEDIFF, and then casts the result to float. The cast matters because DATEDIFF returns an int, and AVG over int values returns a truncated int, which would skew xbar and everything derived from it. You can remove the cast to see the result change. Any other starting date would work as well; it doesn't have to be 2001-01-01.

CREATE FUNCTION [dbo].[LinearRegressionDate] (@ParamData dbo.DateRegressionDataTableType READONLY)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  CAST(DATEDIFF(day, '2001-01-01', x) AS float) AS x,
                y
        FROM @ParamData
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)
GO
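
The effect of the cast is easy to see in isolation. A minimal sketch with made-up day counts (the table variable @Days is hypothetical): AVG over an int column returns a truncated int, while AVG over float keeps the fraction.

```sql
-- Sketch: AVG over int truncates; AVG over float does not.
DECLARE @Days TABLE (d int NOT NULL);   -- hypothetical day counts

INSERT INTO @Days (d)
VALUES (12), (16), (20), (21);          -- sum = 69

SELECT  AVG(d)                AS avg_int,    -- 17 (integer result, 69 / 4 truncated)
        AVG(CAST(d AS float)) AS avg_float   -- 17.25
FROM @Days;
```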

This is how to test the functions:

-- test float data
DECLARE @FloatDataTable [dbo].[FloatRegressionDataTableType];

INSERT INTO @FloatDataTable (x, y)
VALUES
(1.2, 1.0)
,(1.6, 1)
,(2.0, 1.5)
,(2.0, 1.75)
,(2.1, 1.85)
,(2.1, 2)
,(2.2, 3)
,(2.2, 3)
,(2.3, 3.5)
,(2.4, 4)
,(2.5, 4)
,(3, 4.5);

SELECT * FROM dbo.LinearRegressionFloat(@FloatDataTable);


-- test date data
DECLARE @DateDataTable [dbo].[DateRegressionDataTableType];

INSERT INTO @DateDataTable (x, y)
VALUES
 ('2001-01-13', 1.0)
,('2001-01-17', 1)
,('2001-01-21', 1.5)
,('2001-01-21', 1.75)
,('2001-01-22', 1.85)
,('2001-01-22', 2)
,('2001-01-23', 3)
,('2001-01-23', 3)
,('2001-01-24', 3.5)
,('2001-01-25', 4)
,('2001-01-26', 4)
,('2001-01-31', 4.5);

SELECT * FROM dbo.LinearRegressionDate(@DateDataTable);

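Here are the two result sets, first for the float data and then for the date data. Expressed as days since 2001-01-01, the test dates are exactly 10 times the float x values, so Beta is one tenth as large while r_squared and Alpha match (up to floating-point rounding):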

r_squared            Alpha                Beta
----------------------------------------------------------
0.798224907472009    -2.66524390243902    2.46417682926829


r_squared            Alpha                Beta
----------------------------------------------------------
0.79822490747201     -2.66524390243902    0.246417682926829
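
To use the fitted line, X for a new date has to be converted with the same starting date that the function uses. A minimal sketch, assuming the DateRegressionDataTableType type and the LinearRegressionDate function above already exist; the sample rows and the target date 2001-02-05 are made up for illustration:

```sql
-- Sketch: evaluate the trend line at a new date.
DECLARE @Sample dbo.DateRegressionDataTableType;

INSERT INTO @Sample (x, y)
VALUES ('2001-01-13', 1.0),
       ('2001-01-21', 1.5),
       ('2001-01-31', 4.5);

DECLARE @Target date = '2001-02-05';   -- hypothetical date to predict for

SELECT  Alpha, Beta,
        -- same anchor date as inside dbo.LinearRegressionDate
        predicted_y = Alpha + Beta * DATEDIFF(day, '2001-01-01', @Target)
FROM dbo.LinearRegressionDate(@Sample);
```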