I have a database that contains election results at the voting precinct level. I also have the age composition of each voting precinct, specifically the percentage of voters over the age of 65. I want to run a linear regression for each election, comparing the support for each candidate to the percentage of voters over 65, to see if there is any relationship between the two. For example, do precincts with a higher percentage of people over 65 support candidate X at a higher rate?
I can do this for a single election, but I'm wondering if there's a way to automate the process for the thousands of elections I have, even if I have to break it up into multiple steps. I am completely at a loss on how to do this, so anything to nudge me in the right direction would be helpful.
My tables are:
results table:
id
contest_id
precinct_id
candidate_name
candidate_votes
precinct_table:
precinct_id
percent_over_65
First, I select all candidates where contest_id = 1, with brackets around each name so I can paste the list into a pivot query:
select distinct ',' + quotename(candidate_name) as column_name
from results
where contest_id = 1
I then copy the list of candidates into the pivot table code:
select * into temp_table1 from
(select precinct_id, candidate_name, candidate_votes from results where contest_id = 1) as base_data
pivot (
sum(candidate_votes)
for candidate_name
in (
[Jane Doe]
,[John Doe])
) as pivot_table
where [Jane Doe] is not null and [John Doe] > 0 and [Jane Doe] > 0 -- prevent divide by 0 errors
I then run the code to convert the integers to decimals and create aliases for the candidates' names:
select precinct_id as p, cast([Jane Doe] as decimal(15,10)) as a, cast([John Doe] as decimal(15,10)) as b into temp_table2
from temp_table1
I then create ANOTHER table to get a nice and clean table that's ready for regression analysis
select p, a/(a+b) as a_support, cast(percent_over_65 as decimal(15,10)) as percent_over_65 into temp_table3
from temp_table2
inner join precinct_table on precinct_table.precinct_id = temp_table2.p
At this point, I finally have a table that looks like this
p a_support percent_over_65
1 .55 .78
2 .33 .45
3 .34 .65
Now I finally have my data cleaned and ready to go to run the regression analysis
declare @n as decimal(15,10)
select @n = count(*) from temp_table3
-- slope: m = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x*x) - sum(x)*sum(x))
select (@n * sum(percent_over_65 * a_support) - sum(percent_over_65) * sum(a_support)) / (@n * sum(percent_over_65 * percent_over_65) - sum(percent_over_65)*sum(percent_over_65)) as m from temp_table3
-- intercept: avg(y) - m * avg(x)
select avg(a_support) - avg(percent_over_65) * (@n * sum(percent_over_65 * a_support) - sum(percent_over_65) * sum(a_support)) / (@n * sum(percent_over_65 * percent_over_65) - sum(percent_over_65)*sum(percent_over_65)) as intercept
from temp_table3
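For reference, the two queries above appear to implement the usual closed-form least-squares estimates. Writing x for percent_over_65, y for a_support, and n for the number of rows in temp_table3 (the symbols m and b are just labels for the slope and intercept):

```latex
m = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}
         {n \sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad
b = \bar{y} - m\,\bar{x}
```

Because both estimates depend only on sums and counts, they can in principle be computed per group rather than per table.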
That's all fine and dandy if I only have one contest to run analysis on. But what if I have 100 different contests to run analysis on? Is there anything I can do to expedite this process? I realize that maybe this is something outside the scope of SQL, but I figured I would ask before I give up. Please let me know if anyone has any suggestions, even if it's something outside of SQL. Is this something that R can do?
I can easily get to the point of getting a table that looks like the one below, but that's about as far as I can get.
Precinct_ID  Contest1_candidateA  Contest1_candidateB  Contest2_candidateA  Contest2_candidateB
1            100                  200                  NULL                 NULL
2            200                  200                  200                  150
3            NULL                 NULL                 150                  250
4            500                  100                  100                  300
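One possible direction for the automation, as a minimal sketch under stated assumptions rather than a verified SQL Server solution: skip the wide pivot, keep the results in long form, and let a single GROUP BY compute one slope and intercept per (contest_id, candidate_name). The sketch runs against an in-memory SQLite database through Python's sqlite3 module purely so that it is self-contained; the query uses only common aggregates and may need adjusting for SQL Server. Two things differ from the steps described earlier and are assumptions: support is computed as the candidate's share of all votes cast in that contest and precinct (which equals a/(a+b) only in a two-candidate contest), and precincts where a candidate received zero votes are kept. The function names, the demo data, and the candidates in contest 2 are hypothetical.

```python
import sqlite3

# Assumption: table and column names follow the schema listed in the question.
# The query applies the same closed-form sums as the single-contest version,
# but grouped so every (contest, candidate) pair gets its own regression.
REGRESSION_SQL = """
WITH totals AS (
    -- total votes cast in each contest/precinct
    SELECT contest_id, precinct_id, SUM(candidate_votes) AS total_votes
    FROM results
    GROUP BY contest_id, precinct_id
),
points AS (
    -- one (x, y) point per contest, candidate and precinct
    SELECT r.contest_id,
           r.candidate_name,
           p.percent_over_65 AS x,
           1.0 * r.candidate_votes / t.total_votes AS y
    FROM results AS r
    JOIN totals AS t
      ON t.contest_id = r.contest_id AND t.precinct_id = r.precinct_id
    JOIN precinct_table AS p
      ON p.precinct_id = r.precinct_id
    WHERE t.total_votes > 0          -- avoid dividing by zero
),
sums AS (
    SELECT contest_id, candidate_name,
           COUNT(*) AS n,
           SUM(x) AS sx, SUM(y) AS sy,
           SUM(x * y) AS sxy, SUM(x * x) AS sxx
    FROM points
    GROUP BY contest_id, candidate_name
)
SELECT contest_id, candidate_name, n,
       (n * sxy - sx * sy) / (n * sxx - sx * sx) AS slope,
       sy / n - (sx / n) * (n * sxy - sx * sy) / (n * sxx - sx * sx) AS intercept
FROM sums
WHERE n * sxx - sx * sx <> 0         -- skip groups with no spread in x
ORDER BY contest_id, candidate_name
"""


def regress_all_contests(conn):
    """Return (contest_id, candidate_name, n, slope, intercept) rows."""
    return conn.execute(REGRESSION_SQL).fetchall()


def build_demo_db():
    """Create a tiny in-memory database with made-up data for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE results (
            id INTEGER PRIMARY KEY,
            contest_id INTEGER,
            precinct_id INTEGER,
            candidate_name TEXT,
            candidate_votes INTEGER
        );
        CREATE TABLE precinct_table (
            precinct_id INTEGER PRIMARY KEY,
            percent_over_65 REAL
        );
    """)
    conn.executemany(
        "INSERT INTO precinct_table VALUES (?, ?)",
        [(1, 0.2), (2, 0.4), (3, 0.6)],
    )
    conn.executemany(
        "INSERT INTO results (contest_id, precinct_id, candidate_name, candidate_votes)"
        " VALUES (?, ?, ?, ?)",
        [
            (1, 1, "Jane Doe", 30), (1, 1, "John Doe", 70),
            (1, 2, "Jane Doe", 40), (1, 2, "John Doe", 60),
            (1, 3, "Jane Doe", 50), (1, 3, "John Doe", 50),
            (2, 1, "Ann Lee", 50), (2, 1, "Bob Ray", 50),
            (2, 2, "Ann Lee", 50), (2, 2, "Bob Ray", 50),
            (2, 3, "Ann Lee", 50), (2, 3, "Bob Ray", 50),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in regress_all_contests(build_demo_db()):
        print(row)
```

In the made-up data, Jane Doe's share rises by 0.1 for every 0.2 increase in percent_over_65, so her slope should come out close to 0.5, while both candidates in contest 2 have flat support and a slope close to 0. The same query shape avoids the per-contest copy-and-paste of candidate names, since no pivot is involved.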
Thanks!