Introduction

Welcome to the first user-guided practical on linear regression and GLMs. In this practical you will:

  • Fit a basic linear regression model
  • Check the model fit and the residuals
  • Fit a binomial GLM model
  • Fit a Poisson GLM and compare it to a negative binomial one

There are four sections. You should work your way through the questions and put your hand up if you get stuck. There is an answer script given at the bottom but please try not to use it unless you are completely stuck and cannot ask a question.


You can run the code from these practicals is by loading up the .Rmd (Rmarkdown) file in the same directory in Rstudio. Feel free to add in your own answers, or edit the text to give yourself extra notes. You can also run the code directly by highlighting the relevant code and clicking Run. Much of this material overlaps with the class slides so sometimes if you get stuck you might get a clue by looking at the .Rmd file in the slides folder.

One final small note: if you are copying R commands from the pdf or html files into your R script or console window sometimes the inverted commas can copy across incorrectly. If you get a weird message saying Error: unexpected input you usually just need to delete/replace the inverted commas.

Task set 1: linear regression

The airquality data set is included with R. You can find a description of the data at help(airquality) and can get at it simply by typing airquality at the R command prompt. Let’s suppose we’re interested in predicting the variable Ozone from the other variables

Tasks:

  1. First create some plots of the variables and make sure you understand the relationship between them. Hint: A good start can be found in the help(airquality) file (see the example at the bottom of that page).

  2. Fit a linear regression model using lm with Ozone as the response and all the other variables as covariates. Use the summary method to interpret your findings. (Note that R here is automatically removing rows with missing variables)

  3. Have a look at the residuals of the model (e.g. histograms and QQ-plots). Does the model fit well?

  4. Try another model but this time using the log of Ozone instead. Does it fit better?

  5. Identify the one strange observation and see if you can work out what happened that day

  6. You can get the AIC of this model with e.g. AIC(my_model). Recall that lower AIC means a better model. Try some more models and see if you can get a lower AIC value. Some ideas might include: interactions between terms (e.g. include + Wind*Temp), quadratic functions (e.g. include + I(Wind^2)), and changing month and day to be factor rather than numerical variables.

Task set 2: logistic regression

  1. Load in the horseshoe.csv data set from the data directory and familiarise yourself with the data structure from the data_description.txt file

  2. Turn the color and spine variables into factors with their proper names

  3. Familiarise yourself by plotting the data and exploring the structure between the variables

  4. Create a binary variable which contains only whether the satellite variable is >0 or not. We will use this as our response variable. Create a plot which shows the relationship of this variable with width.

  5. Fit a binomial glm (a logistic regression) with the binary variable as the response and width as a covariate. Summarise your findings

  6. Create a plot of the fitted values on top of a scatter plot of the data (hint: width on x-axis, binary response variable on y-axis)

  7. Try fitting some more models to the data with more variables (and perhaps interactions) to see if you can get the AIC lower. Compare your new models’ fitted values to the first model

Task set 3: Poisson and Negative Binomial regression

  1. This time fit a Poisson GLM to the horseshoe data, using the original number of satellites rather than the binary version you fitted previously. Again use width as the sole covariate, and again plot the fitted values

  2. Now try a model with all of the covariates (make sure not to include the binary variable you created before). Summarise and see if there’s any improvement. You might notice that more variables are important (compared to the previous binary logistic regression) because we’re using more of the data

  3. A common occurrence is that the Poisson distribution is a poor fit to the data as the mean=variance relationship is rarely met. (You could check if the mean and the variance match for these data). A common alternative is to fit a Negative-Binomial GLM which has an extra parameter to measure excess variance (over-dispersion). The glm function doesn’t have this distribution in it by default so you need to call in the MASS library with library(MASS). The command will now be called glm.nb which will also try to estimate the over-dispersion parameter (the excess variance over the Poisson). Use glm.nb to fit a Negative Binomial GLM to these data, interpret your findings, see if the AIC improves, and plot your output.

Extra questions

If you did all the above and have time to spare, try these:

  1. Another data sets which is worth fitting a linear regression model to is the geese_isotopes.csv data. You might like to see if one of the isotope values is affected by some of the other variables (sex, adult, etc)
  2. The whitefly data set. This is binomial (as used in the lectures) but you might like to additionally try some of the other variables and see which are important and why. The data set has particular issues with zero-inflation. See if you can find which zero data points are poorly predicted by the model.

Answers

Task set 1

  str(airquality)
  ## 'data.frame':  153 obs. of  6 variables:
  ##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
  ##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
  ##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
  ##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
  ##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
  pairs(airquality) # Sees to be interesting relationship between wind, temp and ozone. Less obvious relationship between Solar and Ozone. Unsurprisingly relationship between month and temp non-linear

  mod_1 = lm(Ozone ~ ., data = airquality)
  summary(mod_1) # 'significant' variables are solr, wind, temp, month. Month being treated incorrectly? Perhaps should be input as a factor?
  ## 
  ## Call:
  ## lm(formula = Ozone ~ ., data = airquality)
  ## 
  ## Residuals:
  ##     Min      1Q  Median      3Q     Max 
  ## -37.014 -12.284  -3.302   8.454  95.348 
  ## 
  ## Coefficients:
  ##              Estimate Std. Error t value Pr(>|t|)
  ## (Intercept) -64.11632   23.48249  -2.730  0.00742
  ## Solar.R       0.05027    0.02342   2.147  0.03411
  ## Wind         -3.31844    0.64451  -5.149 1.23e-06
  ## Temp          1.89579    0.27389   6.922 3.66e-10
  ## Month        -3.03996    1.51346  -2.009  0.04714
  ## Day           0.27388    0.22967   1.192  0.23576
  ##                
  ## (Intercept) ** 
  ## Solar.R     *  
  ## Wind        ***
  ## Temp        ***
  ## Month       *  
  ## Day            
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 20.86 on 105 degrees of freedom
  ##   (42 observations deleted due to missingness)
  ## Multiple R-squared:  0.6249,   Adjusted R-squared:  0.6071 
  ## F-statistic: 34.99 on 5 and 105 DF,  p-value: < 2.2e-16
  # Histogram and qq plot of residuals
  hist(mod_1$residuals, breaks = 30) #A bit skewed

  qqnorm(mod_1$residuals)
  qqline(mod_1$residuals) # Not great at the top end

  mod_2 = lm(log(Ozone) ~ ., data = airquality)
  summary(mod_2) # Now only solar, wind and temp. 
  ## 
  ## Call:
  ## lm(formula = log(Ozone) ~ ., data = airquality)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -2.12941 -0.29248 -0.00188  0.28498  1.15883 
  ## 
  ## Coefficients:
  ##               Estimate Std. Error t value
  ## (Intercept) -0.2773962  0.5737028  -0.484
  ## Solar.R      0.0023988  0.0005721   4.193
  ## Wind        -0.0613804  0.0157461  -3.898
  ## Temp         0.0522667  0.0066914   7.811
  ## Month       -0.0378341  0.0369754  -1.023
  ## Day          0.0042116  0.0056111   0.751
  ##             Pr(>|t|)    
  ## (Intercept) 0.629734    
  ## Solar.R     5.77e-05 ***
  ## Wind        0.000171 ***
  ## Temp        4.53e-12 ***
  ## Month       0.308554    
  ## Day         0.454583    
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 0.5096 on 105 degrees of freedom
  ##   (42 observations deleted due to missingness)
  ## Multiple R-squared:  0.6694,   Adjusted R-squared:  0.6536 
  ## F-statistic: 42.51 on 5 and 105 DF,  p-value: < 2.2e-16
  hist(mod_2$residuals, breaks = 30) # Much better - still one weird value

  qqnorm(mod_2$residuals)
  qqline(mod_2$residuals) # Much better - apart from one

  which.min(mod_2$residuals) # Is it observations 17 or 21???
  ## 21 
  ## 17
  airquality[c(17,21),] # It's definitely observation 21. Has an Ozone value of 1. Possibly incorrect?
  ##    Ozone Solar.R Wind Temp Month Day
  ## 17    34     307 12.0   66     5  17
  ## 21     1       8  9.7   59     5  21
  AIC(mod_2) # 173.1773
  ## [1] 173.1773
  # Try some other models
  mod_3 = lm(log(Ozone) ~ .*., data = airquality) # All 2-way interactions
  summary(mod_3)
  ## 
  ## Call:
  ## lm(formula = log(Ozone) ~ . * ., data = airquality)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -1.96680 -0.29278 -0.03529  0.27074  1.16273 
  ## 
  ## Coefficients:
  ##                 Estimate Std. Error t value
  ## (Intercept)   -2.962e+00  4.000e+00  -0.740
  ## Solar.R        3.555e-03  6.784e-03   0.524
  ## Wind           1.354e-01  1.445e-01   0.937
  ## Temp           9.341e-02  5.287e-02   1.767
  ## Month          4.381e-02  4.662e-01   0.094
  ## Day           -3.190e-02  7.588e-02  -0.420
  ## Solar.R:Wind  -1.013e-04  1.955e-04  -0.518
  ## Solar.R:Temp  -1.504e-05  8.474e-05  -0.178
  ## Solar.R:Month  1.061e-04  4.255e-04   0.249
  ## Solar.R:Day    2.120e-05  8.241e-05   0.257
  ## Wind:Temp     -2.886e-03  2.080e-03  -1.387
  ## Wind:Month     3.440e-03  1.272e-02   0.271
  ## Wind:Day       1.362e-03  2.119e-03   0.643
  ## Temp:Month    -1.908e-03  5.205e-03  -0.367
  ## Temp:Day       2.235e-04  9.195e-04   0.243
  ## Month:Day      2.556e-04  5.128e-03   0.050
  ##               Pr(>|t|)  
  ## (Intercept)     0.4608  
  ## Solar.R         0.6015  
  ## Wind            0.3510  
  ## Temp            0.0805 .
  ## Month           0.9253  
  ## Day             0.6751  
  ## Solar.R:Wind    0.6054  
  ## Solar.R:Temp    0.8595  
  ## Solar.R:Month   0.8036  
  ## Solar.R:Day     0.7975  
  ## Wind:Temp       0.1686  
  ## Wind:Month      0.7873  
  ## Wind:Day        0.5220  
  ## Temp:Month      0.7147  
  ## Temp:Day        0.8085  
  ## Month:Day       0.9604  
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 0.5211 on 95 degrees of freedom
  ##   (42 observations deleted due to missingness)
  ## Multiple R-squared:  0.6872,   Adjusted R-squared:  0.6379 
  ## F-statistic: 13.92 on 15 and 95 DF,  p-value: < 2.2e-16
  AIC(mod_3) # 187.0112
  ## [1] 187.0112
  mod_4 = lm(log(Ozone) ~ (.*.)^3, data = airquality) # All 3-way interactions
  summary(mod_4)
  ## 
  ## Call:
  ## lm(formula = log(Ozone) ~ (. * .)^3, data = airquality)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -1.40242 -0.22194 -0.00753  0.24366  1.39709 
  ## 
  ## Coefficients:
  ##                               Estimate Std. Error
  ## (Intercept)                 -4.453e+01  5.200e+01
  ## Solar.R                      2.367e-01  2.782e-01
  ## Wind                         7.673e+00  4.649e+00
  ## Temp                         7.372e-01  6.941e-01
  ## Month                        9.724e+00  8.880e+00
  ## Day                          1.752e+00  2.982e+00
  ## Solar.R:Wind                -3.160e-02  2.523e-02
  ## Solar.R:Temp                -3.442e-03  3.633e-03
  ## Solar.R:Month               -5.018e-02  4.678e-02
  ## Solar.R:Day                 -1.273e-02  1.674e-02
  ## Wind:Temp                   -1.150e-01  6.210e-02
  ## Wind:Month                  -1.350e+00  8.335e-01
  ## Wind:Day                    -4.049e-01  2.621e-01
  ## Temp:Month                  -1.357e-01  1.109e-01
  ## Temp:Day                    -3.168e-02  4.278e-02
  ## Month:Day                   -4.454e-01  4.530e-01
  ## Solar.R:Wind:Temp            4.616e-04  3.284e-04
  ## Solar.R:Wind:Month           5.691e-03  4.389e-03
  ## Solar.R:Wind:Day             1.923e-03  1.471e-03
  ## Solar.R:Temp:Month           6.785e-04  5.859e-04
  ## Solar.R:Temp:Day             1.987e-04  2.311e-04
  ## Solar.R:Month:Day            2.704e-03  2.460e-03
  ## Wind:Temp:Month              1.888e-02  1.037e-02
  ## Wind:Temp:Day                6.452e-03  3.825e-03
  ## Wind:Month:Day               7.202e-02  4.043e-02
  ## Temp:Month:Day               6.747e-03  6.159e-03
  ## Solar.R:Wind:Temp:Month     -7.825e-05  5.479e-05
  ## Solar.R:Wind:Temp:Day       -2.940e-05  2.047e-05
  ## Solar.R:Wind:Month:Day      -3.372e-04  2.185e-04
  ## Solar.R:Temp:Month:Day      -3.845e-05  3.283e-05
  ## Wind:Temp:Month:Day         -1.071e-03  5.540e-04
  ## Solar.R:Wind:Temp:Month:Day  4.867e-06  2.924e-06
  ##                             t value Pr(>|t|)  
  ## (Intercept)                  -0.856   0.3945  
  ## Solar.R                       0.851   0.3975  
  ## Wind                          1.650   0.1028  
  ## Temp                          1.062   0.2914  
  ## Month                         1.095   0.2768  
  ## Day                           0.587   0.5585  
  ## Solar.R:Wind                 -1.252   0.2141  
  ## Solar.R:Temp                 -0.948   0.3463  
  ## Solar.R:Month                -1.073   0.2867  
  ## Solar.R:Day                  -0.761   0.4491  
  ## Wind:Temp                    -1.852   0.0678 .
  ## Wind:Month                   -1.619   0.1094  
  ## Wind:Day                     -1.545   0.1264  
  ## Temp:Month                   -1.224   0.2246  
  ## Temp:Day                     -0.740   0.4612  
  ## Month:Day                    -0.983   0.3285  
  ## Solar.R:Wind:Temp             1.406   0.1637  
  ## Solar.R:Wind:Month            1.296   0.1986  
  ## Solar.R:Wind:Day              1.307   0.1949  
  ## Solar.R:Temp:Month            1.158   0.2503  
  ## Solar.R:Temp:Day              0.860   0.3925  
  ## Solar.R:Month:Day             1.099   0.2750  
  ## Wind:Temp:Month               1.821   0.0724 .
  ## Wind:Temp:Day                 1.687   0.0956 .
  ## Wind:Month:Day                1.781   0.0787 .
  ## Temp:Month:Day                1.095   0.2767  
  ## Solar.R:Wind:Temp:Month      -1.428   0.1572  
  ## Solar.R:Wind:Temp:Day        -1.436   0.1549  
  ## Solar.R:Wind:Month:Day       -1.543   0.1268  
  ## Solar.R:Temp:Month:Day       -1.171   0.2451  
  ## Wind:Temp:Month:Day          -1.933   0.0568 .
  ## Solar.R:Wind:Temp:Month:Day   1.665   0.1000 .
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 0.4921 on 79 degrees of freedom
  ##   (42 observations deleted due to missingness)
  ## Multiple R-squared:  0.768,    Adjusted R-squared:  0.677 
  ## F-statistic: 8.436 on 31 and 79 DF,  p-value: 1.129e-14
  AIC(mod_4) # 185.8515
  ## [1] 185.8515
  # Model with month as a factor
  mod_5 = lm(log(Ozone) ~ . - Month + as.factor(Month), data = airquality)
  summary(mod_5)
  ## 
  ## Call:
  ## lm(formula = log(Ozone) ~ . - Month + as.factor(Month), data = airquality)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -2.13862 -0.27374 -0.00767  0.26599  1.14361 
  ## 
  ## Coefficients:
  ##                     Estimate Std. Error t value
  ## (Intercept)       -0.4449430  0.6691203  -0.665
  ## Solar.R            0.0024432  0.0005871   4.161
  ## Wind              -0.0592787  0.0163748  -3.620
  ## Temp               0.0519251  0.0085317   6.086
  ## Day                0.0035922  0.0057353   0.626
  ## as.factor(Month)6 -0.1597722  0.2263292  -0.706
  ## as.factor(Month)7 -0.1087654  0.1954027  -0.557
  ## as.factor(Month)8 -0.0395237  0.2038328  -0.194
  ## as.factor(Month)9 -0.1970522  0.1652345  -1.193
  ##                   Pr(>|t|)    
  ## (Intercept)        0.50757    
  ## Solar.R           6.62e-05 ***
  ## Wind               0.00046 ***
  ## Temp              2.05e-08 ***
  ## Day                0.53250    
  ## as.factor(Month)6  0.48184    
  ## as.factor(Month)7  0.57900    
  ## as.factor(Month)8  0.84664    
  ## as.factor(Month)9  0.23581    
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 0.5141 on 102 degrees of freedom
  ##   (42 observations deleted due to missingness)
  ## Multiple R-squared:  0.6732,   Adjusted R-squared:  0.6475 
  ## F-statistic: 26.26 on 8 and 102 DF,  p-value: < 2.2e-16
  AIC(mod_5) # 177.8937
  ## [1] 177.8937
  # Model without day
  mod_6 = lm(log(Ozone) ~ . - Day - Month + as.factor(Month), data = airquality)
  summary(mod_6)
  ## 
  ## Call:
  ## lm(formula = log(Ozone) ~ . - Day - Month + as.factor(Month), 
  ##     data = airquality)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -2.12662 -0.28197 -0.02742  0.28834  1.16785 
  ## 
  ## Coefficients:
  ##                     Estimate Std. Error t value
  ## (Intercept)       -0.3392540  0.6455802  -0.526
  ## Solar.R            0.0024410  0.0005854   4.170
  ## Wind              -0.0592068  0.0163260  -3.627
  ## Temp               0.0511975  0.0084272   6.075
  ## as.factor(Month)6 -0.1575432  0.2256326  -0.698
  ## as.factor(Month)7 -0.0952694  0.1936372  -0.492
  ## as.factor(Month)8 -0.0228939  0.2014990  -0.114
  ## as.factor(Month)9 -0.1929045  0.1646140  -1.172
  ##                   Pr(>|t|)    
  ## (Intercept)       0.600363    
  ## Solar.R           6.36e-05 ***
  ## Wind              0.000449 ***
  ## Temp              2.10e-08 ***
  ## as.factor(Month)6 0.486607    
  ## as.factor(Month)7 0.623767    
  ## as.factor(Month)8 0.909762    
  ## as.factor(Month)9 0.243957    
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 0.5125 on 103 degrees of freedom
  ##   (42 observations deleted due to missingness)
  ## Multiple R-squared:  0.6719,   Adjusted R-squared:  0.6496 
  ## F-statistic: 30.13 on 7 and 103 DF,  p-value: < 2.2e-16
  AIC(mod_6) # 176.3198 # Very slightly 'better'
  ## [1] 176.3198

Task set 2

  horseshoe = read.csv('../data/horseshoe.csv')
  str(horseshoe)
  ## 'data.frame':  173 obs. of  5 variables:
  ##  $ color : int  2 3 1 3 3 2 1 3 2 3 ...
  ##  $ spine : int  3 3 1 3 3 3 1 2 1 3 ...
  ##  $ width : num  28.3 22.5 26 24.8 26 23.8 26.5 24.7 23.7 25.6 ...
  ##  $ satell: int  8 0 9 0 4 0 0 0 0 0 ...
  ##  $ weight: num  3.05 1.55 2.3 2.1 2.6 2.1 2.35 1.9 1.95 2.15 ...
  horseshoe$color_fac = factor(horseshoe$color,
                             labels = c('light medium', 'medium', 
                                        'dark medium', 'dark'))
  horseshoe$spine_fac = factor(horseshoe$spine,
                             labels = c('both good', 
                                        'one worn or broken', 
                                        'both worn or broken'))
                                        
  pairs(horseshoe) # Strong relationship between width and weight

  # Response variable is satell so perhaps explore more
  boxplot(satell ~ spine_fac, data = horseshoe)

  boxplot(satell ~ color_fac, data = horseshoe)

  horseshoe$satell_bin = as.integer(horseshoe$satell > 0)
  boxplot(width ~ satell_bin, data = horseshoe) # Higher width with satell > 0

  logistic_mod = glm(satell_bin ~ width, data = horseshoe,
                   family = 'binomial')
  summary(logistic_mod)
  ## 
  ## Call:
  ## glm(formula = satell_bin ~ width, family = "binomial", data = horseshoe)
  ## 
  ## Deviance Residuals: 
  ##     Min       1Q   Median       3Q      Max  
  ## -2.0281  -1.0458   0.5480   0.9066   1.6942  
  ## 
  ## Coefficients:
  ##             Estimate Std. Error z value Pr(>|z|)
  ## (Intercept) -12.3508     2.6287  -4.698 2.62e-06
  ## width         0.4972     0.1017   4.887 1.02e-06
  ##                
  ## (Intercept) ***
  ## width       ***
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## (Dispersion parameter for binomial family taken to be 1)
  ## 
  ##     Null deviance: 225.76  on 172  degrees of freedom
  ## Residual deviance: 194.45  on 171  degrees of freedom
  ## AIC: 198.45
  ## 
  ## Number of Fisher Scoring iterations: 4
  with(horseshoe, plot(width, satell_bin))
  points(horseshoe$width, logistic_mod$fitted.values, col = 'red')

  # Must be able to make this plot look nicer!
  
  logistic_mod_2 = glm(satell_bin ~ width + color_fac + spine_fac, data = horseshoe,
                   family = 'binomial')
  summary(logistic_mod_2)
  ## 
  ## Call:
  ## glm(formula = satell_bin ~ width + color_fac + spine_fac, family = "binomial", 
  ##     data = horseshoe)
  ## 
  ## Deviance Residuals: 
  ##     Min       1Q   Median       3Q      Max  
  ## -2.1206  -0.9724   0.5076   0.8750   2.1158  
  ## 
  ## Coefficients:
  ##                               Estimate Std. Error
  ## (Intercept)                  -11.09953    2.97706
  ## width                          0.45624    0.10779
  ## color_facmedium               -0.14340    0.77838
  ## color_facdark medium          -0.52405    0.84685
  ## color_facdark                 -1.66833    0.93285
  ## spine_facone worn or broken   -0.05782    0.70308
  ## spine_facboth worn or broken   0.37703    0.50191
  ##                              z value Pr(>|z|)    
  ## (Intercept)                   -3.728 0.000193 ***
  ## width                          4.233 2.31e-05 ***
  ## color_facmedium               -0.184 0.853830    
  ## color_facdark medium          -0.619 0.536030    
  ## color_facdark                 -1.788 0.073706 .  
  ## spine_facone worn or broken   -0.082 0.934453    
  ## spine_facboth worn or broken   0.751 0.452540    
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## (Dispersion parameter for binomial family taken to be 1)
  ## 
  ##     Null deviance: 225.76  on 172  degrees of freedom
  ## Residual deviance: 186.61  on 166  degrees of freedom
  ## AIC: 200.61
  ## 
  ## Number of Fisher Scoring iterations: 4
  AIC(logistic_mod, logistic_mod_2) # New model not any better
  ##                df      AIC
  ## logistic_mod    2 198.4527
  ## logistic_mod_2  7 200.6119
  # Many more modelling options to try here

Task set 3

  pois_mod = glm(satell ~ width, data = horseshoe, family = poisson)
  summary(pois_mod)
  ## 
  ## Call:
  ## glm(formula = satell ~ width, family = poisson, data = horseshoe)
  ## 
  ## Deviance Residuals: 
  ##     Min       1Q   Median       3Q      Max  
  ## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
  ## 
  ## Coefficients:
  ##             Estimate Std. Error z value Pr(>|z|)
  ## (Intercept) -3.30476    0.54224  -6.095  1.1e-09
  ## width        0.16405    0.01997   8.216  < 2e-16
  ##                
  ## (Intercept) ***
  ## width       ***
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## (Dispersion parameter for poisson family taken to be 1)
  ## 
  ##     Null deviance: 632.79  on 172  degrees of freedom
  ## Residual deviance: 567.88  on 171  degrees of freedom
  ## AIC: 927.18
  ## 
  ## Number of Fisher Scoring iterations: 6
  # Example of a qq-plot for a Poisson distribution - bit of a hack!
  # qqrplot comes from the countreg package
  source('https://raw.githubusercontent.com/rforge/countreg/master/pkg/R/qqrplot.R')
  source('https://raw.githubusercontent.com/rforge/countreg/master/pkg/R/qresiduals.R')
  qqrplot(pois_mod, 
        main = "Poisson")

  pois_mod_2 = glm(satell ~ width + color_fac + spine_fac + weight, data = horseshoe, family = poisson)
  summary(pois_mod_2)
  ## 
  ## Call:
  ## glm(formula = satell ~ width + color_fac + spine_fac + weight, 
  ##     family = poisson, data = horseshoe)
  ## 
  ## Deviance Residuals: 
  ##     Min       1Q   Median       3Q      Max  
  ## -2.9875  -1.9013  -0.5923   0.9464   4.9012  
  ## 
  ## Coefficients:
  ##                              Estimate Std. Error
  ## (Intercept)                  -0.84651    0.93057
  ## width                         0.04531    0.04649
  ## color_facmedium              -0.25471    0.16809
  ## color_facdark medium         -0.50104    0.19541
  ## color_facdark                -0.49764    0.22679
  ## spine_facone worn or broken  -0.14869    0.21337
  ## spine_facboth worn or broken  0.07272    0.11941
  ## weight                        0.38814    0.15693
  ##                              z value Pr(>|z|)  
  ## (Intercept)                   -0.910   0.3630  
  ## width                          0.975   0.3297  
  ## color_facmedium               -1.515   0.1297  
  ## color_facdark medium          -2.564   0.0103 *
  ## color_facdark                 -2.194   0.0282 *
  ## spine_facone worn or broken   -0.697   0.4859  
  ## spine_facboth worn or broken   0.609   0.5425  
  ## weight                         2.473   0.0134 *
  ## ---
  ## Signif. codes:  
  ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## (Dispersion parameter for poisson family taken to be 1)
  ## 
  ##     Null deviance: 632.79  on 172  degrees of freedom
  ## Residual deviance: 552.25  on 165  degrees of freedom
  ## AIC: 923.55
  ## 
  ## Number of Fisher Scoring iterations: 6
  library(MASS)
  ## 
  ## Attaching package: 'MASS'
  ## The following object is masked from 'package:dplyr':
  ## 
  ##     select
  nb_mod = glm.nb(satell ~ width + color_fac + spine_fac + weight, data = horseshoe)
  summary(nb_mod)
  ## 
  ## Call:
  ## glm.nb(formula = satell ~ width + color_fac + spine_fac + weight, 
  ##     data = horseshoe, init.theta = 0.9521901429, link = log)
  ## 
  ## Deviance Residuals: 
  ##     Min       1Q   Median       3Q      Max  
  ## -1.8535  -1.3633  -0.3033   0.4685   2.2238  
  ## 
  ## Coefficients:
  ##                              Estimate Std. Error
  ## (Intercept)                  -1.58662    1.83022
  ## width                         0.07734    0.09053
  ## color_facmedium              -0.31434    0.37457
  ## color_facdark medium         -0.57764    0.41921
  ## color_facdark                -0.53423    0.46877
  ## spine_facone worn or broken  -0.21439    0.39927
  ## spine_facboth worn or broken  0.02570    0.24903
  ## weight                        0.37950    0.31432
  ##                              z value Pr(>|z|)
  ## (Intercept)                   -0.867    0.386
  ## width                          0.854    0.393
  ## color_facmedium               -0.839    0.401
  ## color_facdark medium          -1.378    0.168
  ## color_facdark                 -1.140    0.254
  ## spine_facone worn or broken   -0.537    0.591
  ## spine_facboth worn or broken   0.103    0.918
  ## weight                         1.207    0.227
  ## 
  ## (Dispersion parameter for Negative Binomial(0.9522) family taken to be 1)
  ## 
  ##     Null deviance: 219.08  on 172  degrees of freedom
  ## Residual deviance: 196.73  on 165  degrees of freedom
  ## AIC: 764.9
  ## 
  ## Number of Fisher Scoring iterations: 1
  ## 
  ## 
  ##               Theta:  0.952 
  ##           Std. Err.:  0.173 
  ## 
  ##  2 x log-likelihood:  -746.905
  AIC(pois_mod, pois_mod_2, nb_mod) # NegBin model way 'better' in terms of AIC
  ##            df      AIC
  ## pois_mod    2 927.1762
  ## pois_mod_2  8 923.5498
  ## nb_mod      9 764.9049
  # Plot of fitted values
  with(horseshoe, plot(width, satell))
  points(horseshoe$width, nb_mod$fitted.values, col = 'red')

  # Width not the only variable we could plot against (try e.g. color_fac or others)
  
  # QQ-plot for negative binomial
  qqrplot(nb_mod, 
        main = "Negative Binomial")

  # Compare with the Poisson fit
  qqrplot(pois_mod_2, 
        main = "Poisson")