I have a data table (loaded from a CSV, shown below) with voting data. I want to know how many votes come in per day on average, by year, by fitting a linear regression of votesneeded ~ daysuntilelection for each year. The slope would be the average number of votes coming in per day.

How can I run a linear regression function over this dataframe by year?

date,year,daysuntilelection,votesneeded
2018-01-25,2018,9,40
2018-01-29,2018,5,13
2018-01-30,2018,4,-11
2018-02-03,2018,0,-28
2019-01-23,2019,17,81
2019-02-01,2019,8,-4
2019-02-09,2019,0,-44
2020-01-17,2020,22,119
2020-01-24,2020,15,58
2020-01-30,2020,9,12
2020-02-03,2020,5,-4
2020-02-07,2020,1,-12
2021-01-08,2021,29,120
2021-01-26,2021,11,35
2021-01-29,2021,8,17
2021-02-01,2021,5,-2
2021-02-03,2021,3,-8
2021-02-06,2021,0,-10

The preferred output would be a dataframe looking something like this:

year     averagevotesperday
2018       8.27
2019       7.40
2020       6.55
2021       4.60

Note: the full data sets and analyses are at https://github.com/robhanssen/glenlake-elections, for the curious.

1 Answer

Do you need something like this?

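library(dplyr)

# `dat` is the data frame read from the CSV in the question
dat |>
    group_by(year) |>
    summarize(
        # slope of the per-year fit = average votes per day
        avgvote_day = coef(lm(votesneeded ~ daysuntilelection))[2]
    )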

The output differs slightly from yours:

# A tibble: 4 x 2
   year avgvote_day
  <int>       <dbl>
1  2018        7.76
2  2019        7.40
3  2020        6.41
4  2021        4.74
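For anyone who wants to reproduce these numbers without dplyr, here is a minimal base R sketch. It assumes the sample rows from the question are pasted inline; the names `csv_text`, `slopes` and `result` are only illustrative and not part of the original post. It fits one `lm()` per year via `split()` and keeps the slope on `daysuntilelection`.

```r
# Inline copy of the sample rows from the question (assumption: this is the
# whole data set; the asker's full data gives slightly different slopes)
csv_text <- "date,year,daysuntilelection,votesneeded
2018-01-25,2018,9,40
2018-01-29,2018,5,13
2018-01-30,2018,4,-11
2018-02-03,2018,0,-28
2019-01-23,2019,17,81
2019-02-01,2019,8,-4
2019-02-09,2019,0,-44
2020-01-17,2020,22,119
2020-01-24,2020,15,58
2020-01-30,2020,9,12
2020-02-03,2020,5,-4
2020-02-07,2020,1,-12
2021-01-08,2021,29,120
2021-01-26,2021,11,35
2021-01-29,2021,8,17
2021-02-01,2021,5,-2
2021-02-03,2021,3,-8
2021-02-06,2021,0,-10
"
dat <- read.csv(text = csv_text)

# Split the rows by year, fit a separate regression for each piece,
# and keep only the slope coefficient
slopes <- sapply(split(dat, dat$year), function(d) {
  fit <- lm(votesneeded ~ daysuntilelection, data = d)
  unname(coef(fit)["daysuntilelection"])
})

# Assemble the result in the shape the question asks for
result <- data.frame(
  year = as.integer(names(slopes)),
  averagevotesperday = round(unname(slopes), 2)
)
print(result)
```

On these sample rows the slopes should round to the same values as the tibble above. Selecting the coefficient by name rather than by position `[2]` is a small design choice that keeps working if the formula gains more terms.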

1 Comment

Thanks! The discrepancy in the numbers is because I ran it individually on the larger dataset. Separate question: is |> the new version of %>%, and will the old pipe be replaced completely?
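On the pipe question, a small sketch may help show how the two relate. `|>` is the native pipe built into base R since version 4.1.0, while `%>%` comes from the magrittr package (and is re-exported by dplyr). The example below uses a made-up vector `x` and only base functions; the magrittr part is skipped if that package is not installed. It does not settle whether `%>%` will ever be retired, only that both give the same result for simple chains like the one in the answer.

```r
# Native pipe, built into R >= 4.1.0: x |> f() is parsed as f(x)
x <- c(1, 4, 9)
piped  <- x |> sqrt() |> sum()
nested <- sum(sqrt(x))
stopifnot(identical(piped, nested))

# magrittr pipe: needs the package to be installed and attached;
# this branch is skipped when magrittr is not available
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  piped_magrittr <- x %>% sqrt() %>% sum()
  stopifnot(identical(piped_magrittr, piped))
}
```

For a chain such as `dat |> group_by(year) |> summarize(...)`, either pipe can be used interchangeably; the two differ mainly in placeholder syntax and in a few extra magrittr features.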
