Automate Regression Analysis in SQL Database

Question

I have a database that contains election results at the voting precinct level. I also have the age composition of each voting precinct, specifically the percentage of voters over the age of 65. I want to calculate a linear regression for each election, comparing the support for each candidate to the percentage of voters over 65 to see if there is any relationship between the two. For example, do precincts with a higher percentage of people over 65 support candidate X at a higher rate?

I can do this for a single election, but I'm wondering if there's a way to automate the process for the thousands of elections I have. Even if I have to break it up into multiple steps. I am completely at a loss on how to do this, so anything to nudge me in the right direction would be helpful.

My tables are:

results table:
id
contest_id
precinct_id
candidate_name
candidate_votes

precinct_table:
precinct_id
percent_over_65

Select all candidates where contest = 1, with brackets around the name so I can run a pivot table

select distinct ',' + quotename(candidate_name) as column_name
from results
where contest_id = 1

I then copy the list of candidates into the pivot table code

select * into temp_table1 from
(select precinct_id, candidate_name, candidate_votes from results where contest_id = 1) as base_data
pivot (
    sum(candidate_votes)
    for candidate_name
    in (
    [Jane Doe]
    ,[John Doe])
    ) as pivot_table
where [Jane Doe] is not null and [John Doe] > 0 and [Jane Doe] > 0 -- prevent divide by 0 errors

I then run the code to convert the integers to decimals and create alias for the candidates names

select precinct_id as p, cast([Jane Doe] as decimal(15,10)) as a, cast([John Doe] as decimal(15,10)) as b into temp_table2
from temp_table1

I then create ANOTHER table to get a nice and clean table that's ready for regression analysis

select p, a/(a+b) as a_support, cast(percent_over_65 as decimal(15,10)) as percent_over_65 into temp_table3
from temp_table2
inner join precinct_table on precinct_table.precinct_id = temp_table2.p

At this point, I finally have a table that looks like this

p     a_support      percent_over_65
1        .55               .78   
2        .33               .45    
3        .34               .65

Now I finally have my data cleaned and ready to go to run the regression analysis

declare @n as decimal(15,10)
select @n = count(*) from temp_table3
select (@n * sum(percent_over_65 * a_support) - sum(percent_over_65) * sum(a_support)) / (@n * sum(percent_over_65 * percent_over_65) - sum(percent_over_65)*sum(percent_over_65)) as m from temp_table3
select avg(a_support) - avg(percent_over_65) * (@n * sum(percent_over_65 * a_support) - sum(percent_over_65) * sum(a_support)) / (@n * sum(percent_over_65 * percent_over_65) - sum(percent_over_65)*sum(percent_over_65)) as intercept
from temp_table3

That's all fine and dandy if I only have one contest to run analysis on. But what if I have 100 different contests to analyze? Is there anything I can do to expedite this process? I realize that this may be outside the scope of SQL, but I figured I would ask before I give up. Please let me know if anyone has any suggestions, even if it's something outside of SQL. Is this something that R can do?

I can easily get to the point of getting a table that looks like below, but that's about as far as I can get.

Precinct_ID     Contest1_candidateA    Contest1_candidateB     Contest2_candidateA   Contest2_candidateB
1                     100                     200                     NULL                   NULL
2                     200                     200                     200                     150
3                     NULL                    NULL                    150                     250
4                     500                     100                     100                     300

Thanks!

Parfait · Accepted Answer · 2020-07-04 04:15:55Z

Simply understand the tools at your disposal. Use R for data science and SQL Server for data storage or persistence. Recall that SQL is a special-purpose, declarative language (i.e., designed for narrow, specific, set-based operations), while R is a general-purpose interpreted language used primarily as a statistical environment (i.e., designed for almost anything requiring mathematical and statistical calculation).

So yes, use R for automating your regression analysis with data derived from SQL Server. Even better, for SQL Server 2016, 2017, and 2019 for Windows, you can run R directly inside SQL Server with R Services!

Therefore, consider the following steps:

1. Connect R and SQL Server (see odbc package or run special stored proc).
2. Query all election data and import data into data frame.
3. Re-format data to wide structure with reshape (similar to PIVOT).
4. Convert numeric data with as.numeric (similar to CAST).
5. Run needed regression with lm or glm.
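To make the shape of that workflow concrete, here is a minimal sketch of the "import everything, then regress per contest" idea. It is written in Python rather than R purely for illustration, and it is not the answer's own code. Assumptions: the rows have already been fetched from the results table as tuples, the precinct data is a plain dict, and the names fit_line, regress_all, demo_results and demo_pct_over_65 are hypothetical. Instead of pivoting to a wide table, it keeps the data long and groups by (contest_id, candidate_name); support is the candidate's share of all votes cast in that precinct for that contest, which equals a/(a+b) when there are exactly two candidates.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b. Returns (m, b), or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # every precinct has the same x, so no slope can be estimated
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def regress_all(results, pct_over_65):
    """results: list of (contest_id, precinct_id, candidate_name, candidate_votes).
    pct_over_65: dict mapping precinct_id -> share of voters over 65.
    Returns {(contest_id, candidate_name): (slope, intercept) or None}."""
    # Total votes per (contest, precinct), used to turn raw votes into a share.
    totals = defaultdict(int)
    for contest_id, precinct_id, _, votes in results:
        totals[(contest_id, precinct_id)] += votes

    # Collect one (x, y) point per precinct for each (contest, candidate).
    points = defaultdict(lambda: ([], []))
    for contest_id, precinct_id, name, votes in results:
        total = totals[(contest_id, precinct_id)]
        if total == 0 or precinct_id not in pct_over_65:
            continue  # skip precincts that would divide by zero or lack age data
        xs, ys = points[(contest_id, name)]
        xs.append(pct_over_65[precinct_id])
        ys.append(votes / total)

    return {key: fit_line(xs, ys) for key, (xs, ys) in points.items()}


# Made-up illustration data, not real election results.
demo_pct_over_65 = {1: 0.2, 2: 0.4, 3: 0.6}
demo_results = [
    (1, 1, "Jane Doe", 30), (1, 1, "John Doe", 70),
    (1, 2, "Jane Doe", 40), (1, 2, "John Doe", 60),
    (1, 3, "Jane Doe", 50), (1, 3, "John Doe", 50),
]

if __name__ == "__main__":
    for key, fit in regress_all(demo_results, demo_pct_over_65).items():
        print(key, fit)
```

Because the grouping key includes contest_id, adding more contests only means more rows in the input; nothing has to be copied and pasted per election.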

Aside, sounds like an interesting data and analytics project! OPs have all the fun!

1 Comment

Billy B Over a year ago

This is exactly what I was hoping for. I'm fairly new to R and didn't know where to start, so now I have my work cut out for me! Thanks for the help!!

George Joseph · Accepted Answer · 2020-07-04 12:46:09Z

I am in complete agreement with the comments by @Parfait: you would get a much richer set of functions and statistical tools using R. It would also be the right tool for the job for anything related to statistical computation.

Coming to the question, I believe you have successfully worked out the necessary computation for a single contest and now wish to run it over the full list of contests.

If you are able to create a stored procedure that produces the desired output when given a contest_id as input, then you can call this stored procedure from any ETL tool, e.g., SSIS, using a for-loop container. The for-loop would take in each contest_id and compute the results one after the other.

Further, if you would like to optimize, you can run parallel workflows in SSIS, using a sequence container to kick off the stored procedure calls independently of one another.
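As a rough illustration of the "one parameterized routine, called once per contest_id" pattern, here is a small Python sketch. It is only a stand-in: an in-memory SQLite database plays the part of SQL Server, a Python function plays the part of the stored procedure, and a plain loop plays the part of the SSIS for-loop container. The table and column names follow the question; the function names (build_demo_db, regress_contest, regress_every_contest) and the demo rows are hypothetical. The SQL is SQLite dialect, so a T-SQL version would differ in details.

```python
import sqlite3

# Per-candidate sums for one contest; this plays the role of the stored procedure body.
SUMS_SQL = """
WITH totals AS (
    SELECT precinct_id, SUM(candidate_votes) AS total_votes
    FROM results
    WHERE contest_id = :cid
    GROUP BY precinct_id
    HAVING SUM(candidate_votes) > 0
),
points AS (
    SELECT r.candidate_name AS candidate_name,
           p.percent_over_65 AS x,
           r.candidate_votes * 1.0 / t.total_votes AS y
    FROM results AS r
    JOIN totals AS t ON t.precinct_id = r.precinct_id
    JOIN precinct_table AS p ON p.precinct_id = r.precinct_id
    WHERE r.contest_id = :cid
)
SELECT candidate_name, COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x)
FROM points
GROUP BY candidate_name
"""


def regress_contest(conn, contest_id):
    """Return {candidate_name: (slope, intercept)} for a single contest."""
    fits = {}
    for name, n, sx, sy, sxy, sxx in conn.execute(SUMS_SQL, {"cid": contest_id}):
        denom = n * sxx - sx * sx
        if n < 2 or abs(denom) < 1e-12:
            continue  # not enough variation in percent_over_65 to fit a line
        slope = (n * sxy - sx * sy) / denom
        fits[name] = (slope, (sy - slope * sx) / n)
    return fits


def regress_every_contest(conn):
    """The 'for-loop container': call regress_contest once per contest_id."""
    ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT contest_id FROM results ORDER BY contest_id")]
    return {cid: regress_contest(conn, cid) for cid in ids}


def build_demo_db():
    """In-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE precinct_table (precinct_id INTEGER, percent_over_65 REAL)")
    conn.execute("CREATE TABLE results (contest_id INTEGER, precinct_id INTEGER, "
                 "candidate_name TEXT, candidate_votes INTEGER)")
    conn.executemany("INSERT INTO precinct_table VALUES (?, ?)",
                     [(1, 0.2), (2, 0.4), (3, 0.6), (4, 0.8)])
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", [
        (1, 1, "Jane Doe", 30), (1, 1, "John Doe", 70),
        (1, 2, "Jane Doe", 40), (1, 2, "John Doe", 60),
        (1, 3, "Jane Doe", 50), (1, 3, "John Doe", 50),
        (2, 2, "A. Smith", 80), (2, 2, "B. Jones", 20),
        (2, 3, "A. Smith", 60), (2, 3, "B. Jones", 40),
        (2, 4, "A. Smith", 40), (2, 4, "B. Jones", 60),
    ])
    return conn


if __name__ == "__main__":
    for cid, fits in regress_every_contest(build_demo_db()).items():
        print(cid, fits)
```

Each call touches only one contest's rows, which is also what makes the calls independent enough to run in parallel. The slope uses the same sums-based formula as the query in the question.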
