Welcome to the first user-guided practical on linear regression and GLMs. In this practical you will:
There are four sections. You should work your way through the questions and put your hand up if you get stuck. There is an answer script given at the bottom but please try not to use it unless you are completely stuck and cannot ask a question.
You can run the code from these practicals is by loading up the .Rmd
(Rmarkdown) file in the same directory in Rstudio. Feel free to add in your own answers, or edit the text to give yourself extra notes. You can also run the code directly by highlighting the relevant code and clicking Run
. Much of this material overlaps with the class slides so sometimes if you get stuck you might get a clue by looking at the .Rmd
file in the slides
folder.
One final small note: if you are copying R commands from the pdf or html files into your R script or console window sometimes the inverted commas can copy across incorrectly. If you get a weird message saying Error: unexpected input
you usually just need to delete/replace the inverted commas.
The airquality
data set is included with R. You can find a description of the data at help(airquality)
and can get at it simply by typing airquality
at the R command prompt. Let’s suppose we’re interested in predicting the variable Ozone
from the other variables
Tasks:
First create some plots of the variables and make sure you understand the relationship between them. Hint: A good start can be found in the help(airquality)
file (see the example at the bottom of that page).
Fit a linear regression model using lm
with Ozone
as the response and all the other variables as covariates. Use the summary
method to interpret your findings. (Note that R here is automatically removing rows with missing variables)
Have a look at the residuals of the model (e.g. histograms and QQ-plots). Does the model fit well?
Try another model but this time using the log of Ozone instead. Does it fit better?
Identify the one strange observation and see if you can work out what happened that day
You can get the AIC of this model with e.g. AIC(my_model)
. Recall that lower AIC means a better model. Try some more models and see if you can get a lower AIC value. Some ideas might include: interactions between terms (e.g. include + Wind*Temp
), quadratic functions (e.g. include + I(Wind^2)
), and changing month and day to be factor rather than numerical variables.
Load in the horseshoe.csv
data set from the data directory and familiarise yourself with the data structure from the data_description.txt
file
Turn the color
and spine
variables into factors with their proper names
Familiarise yourself by plotting the data and exploring the structure between the variables
Create a binary variable which contains only whether the satellite variable is >0 or not. We will use this as our response variable. Create a plot which shows the relationship of this variable with width
.
Fit a binomial glm (a logistic regression) with the binary variable as the response and width
as a covariate. Summarise your findings
Create a plot of the fitted values on top of a scatter plot of the data (hint: width on x-axis, binary response variable on y-axis)
Try fitting some more models to the data with more variables (and perhaps interactions) to see if you can get the AIC lower. Compare your new models’ fitted values to the first model
This time fit a Poisson GLM to the horseshoe data, using the original number of satellites rather than the binary version you fitted previously. Again use width
as the sole covariate, and again plot the fitted values
Now try a model with all of the covariates (make sure not to include the binary variable you created before). Summarise and see if there’s any improvement. You might notice that more variables are important (compared to the previous binary logistic regression) because we’re using more of the data
A common occurrence is that the Poisson distribution is a poor fit to the data as the mean=variance relationship is rarely met. (You could check if the mean and the variance match for these data). A common alternative is to fit a Negative-Binomial GLM which has an extra parameter to measure excess variance (over-dispersion). The glm
function doesn’t have this distribution in it by default so you need to call in the MASS library with library(MASS)
. The command will now be called glm.nb
which will also try to estimate the over-dispersion parameter (the excess variance over the Poisson). Use glm.nb
to fit a Negative Binomial GLM to these data, interpret your findings, see if the AIC improves, and plot your output.
If you did all the above and have time to spare, try these:
geese_isotopes.csv
data. You might like to see if one of the isotope values is affected by some of the other variables (sex, adult, etc)Task set 1
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
pairs(airquality) # Sees to be interesting relationship between wind, temp and ozone. Less obvious relationship between Solar and Ozone. Unsurprisingly relationship between month and temp non-linear
mod_1 = lm(Ozone ~ ., data = airquality)
summary(mod_1) # 'significant' variables are solr, wind, temp, month. Month being treated incorrectly? Perhaps should be input as a factor?
##
## Call:
## lm(formula = Ozone ~ ., data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.014 -12.284 -3.302 8.454 95.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.11632 23.48249 -2.730 0.00742
## Solar.R 0.05027 0.02342 2.147 0.03411
## Wind -3.31844 0.64451 -5.149 1.23e-06
## Temp 1.89579 0.27389 6.922 3.66e-10
## Month -3.03996 1.51346 -2.009 0.04714
## Day 0.27388 0.22967 1.192 0.23576
##
## (Intercept) **
## Solar.R *
## Wind ***
## Temp ***
## Month *
## Day
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.86 on 105 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6249, Adjusted R-squared: 0.6071
## F-statistic: 34.99 on 5 and 105 DF, p-value: < 2.2e-16
# Histogram and qq plot of residuals
hist(mod_1$residuals, breaks = 30) #A bit skewed
qqnorm(mod_1$residuals)
qqline(mod_1$residuals) # Not great at the top end
mod_2 = lm(log(Ozone) ~ ., data = airquality)
summary(mod_2) # Now only solar, wind and temp.
##
## Call:
## lm(formula = log(Ozone) ~ ., data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.12941 -0.29248 -0.00188 0.28498 1.15883
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -0.2773962 0.5737028 -0.484
## Solar.R 0.0023988 0.0005721 4.193
## Wind -0.0613804 0.0157461 -3.898
## Temp 0.0522667 0.0066914 7.811
## Month -0.0378341 0.0369754 -1.023
## Day 0.0042116 0.0056111 0.751
## Pr(>|t|)
## (Intercept) 0.629734
## Solar.R 5.77e-05 ***
## Wind 0.000171 ***
## Temp 4.53e-12 ***
## Month 0.308554
## Day 0.454583
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5096 on 105 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6694, Adjusted R-squared: 0.6536
## F-statistic: 42.51 on 5 and 105 DF, p-value: < 2.2e-16
hist(mod_2$residuals, breaks = 30) # Much better - still one weird value
qqnorm(mod_2$residuals)
qqline(mod_2$residuals) # Much better - apart from one
which.min(mod_2$residuals) # Is it observations 17 or 21???
## 21
## 17
airquality[c(17,21),] # It's definitely observation 21. Has an Ozone value of 1. Possibly incorrect?
## Ozone Solar.R Wind Temp Month Day
## 17 34 307 12.0 66 5 17
## 21 1 8 9.7 59 5 21
AIC(mod_2) # 173.1773
## [1] 173.1773
# Try some other models
mod_3 = lm(log(Ozone) ~ .*., data = airquality) # All 2-way interactions
summary(mod_3)
##
## Call:
## lm(formula = log(Ozone) ~ . * ., data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.96680 -0.29278 -0.03529 0.27074 1.16273
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -2.962e+00 4.000e+00 -0.740
## Solar.R 3.555e-03 6.784e-03 0.524
## Wind 1.354e-01 1.445e-01 0.937
## Temp 9.341e-02 5.287e-02 1.767
## Month 4.381e-02 4.662e-01 0.094
## Day -3.190e-02 7.588e-02 -0.420
## Solar.R:Wind -1.013e-04 1.955e-04 -0.518
## Solar.R:Temp -1.504e-05 8.474e-05 -0.178
## Solar.R:Month 1.061e-04 4.255e-04 0.249
## Solar.R:Day 2.120e-05 8.241e-05 0.257
## Wind:Temp -2.886e-03 2.080e-03 -1.387
## Wind:Month 3.440e-03 1.272e-02 0.271
## Wind:Day 1.362e-03 2.119e-03 0.643
## Temp:Month -1.908e-03 5.205e-03 -0.367
## Temp:Day 2.235e-04 9.195e-04 0.243
## Month:Day 2.556e-04 5.128e-03 0.050
## Pr(>|t|)
## (Intercept) 0.4608
## Solar.R 0.6015
## Wind 0.3510
## Temp 0.0805 .
## Month 0.9253
## Day 0.6751
## Solar.R:Wind 0.6054
## Solar.R:Temp 0.8595
## Solar.R:Month 0.8036
## Solar.R:Day 0.7975
## Wind:Temp 0.1686
## Wind:Month 0.7873
## Wind:Day 0.5220
## Temp:Month 0.7147
## Temp:Day 0.8085
## Month:Day 0.9604
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5211 on 95 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6872, Adjusted R-squared: 0.6379
## F-statistic: 13.92 on 15 and 95 DF, p-value: < 2.2e-16
AIC(mod_3) # 187.0112
## [1] 187.0112
mod_4 = lm(log(Ozone) ~ (.*.)^3, data = airquality) # All 3-way interactions
summary(mod_4)
##
## Call:
## lm(formula = log(Ozone) ~ (. * .)^3, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.40242 -0.22194 -0.00753 0.24366 1.39709
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -4.453e+01 5.200e+01
## Solar.R 2.367e-01 2.782e-01
## Wind 7.673e+00 4.649e+00
## Temp 7.372e-01 6.941e-01
## Month 9.724e+00 8.880e+00
## Day 1.752e+00 2.982e+00
## Solar.R:Wind -3.160e-02 2.523e-02
## Solar.R:Temp -3.442e-03 3.633e-03
## Solar.R:Month -5.018e-02 4.678e-02
## Solar.R:Day -1.273e-02 1.674e-02
## Wind:Temp -1.150e-01 6.210e-02
## Wind:Month -1.350e+00 8.335e-01
## Wind:Day -4.049e-01 2.621e-01
## Temp:Month -1.357e-01 1.109e-01
## Temp:Day -3.168e-02 4.278e-02
## Month:Day -4.454e-01 4.530e-01
## Solar.R:Wind:Temp 4.616e-04 3.284e-04
## Solar.R:Wind:Month 5.691e-03 4.389e-03
## Solar.R:Wind:Day 1.923e-03 1.471e-03
## Solar.R:Temp:Month 6.785e-04 5.859e-04
## Solar.R:Temp:Day 1.987e-04 2.311e-04
## Solar.R:Month:Day 2.704e-03 2.460e-03
## Wind:Temp:Month 1.888e-02 1.037e-02
## Wind:Temp:Day 6.452e-03 3.825e-03
## Wind:Month:Day 7.202e-02 4.043e-02
## Temp:Month:Day 6.747e-03 6.159e-03
## Solar.R:Wind:Temp:Month -7.825e-05 5.479e-05
## Solar.R:Wind:Temp:Day -2.940e-05 2.047e-05
## Solar.R:Wind:Month:Day -3.372e-04 2.185e-04
## Solar.R:Temp:Month:Day -3.845e-05 3.283e-05
## Wind:Temp:Month:Day -1.071e-03 5.540e-04
## Solar.R:Wind:Temp:Month:Day 4.867e-06 2.924e-06
## t value Pr(>|t|)
## (Intercept) -0.856 0.3945
## Solar.R 0.851 0.3975
## Wind 1.650 0.1028
## Temp 1.062 0.2914
## Month 1.095 0.2768
## Day 0.587 0.5585
## Solar.R:Wind -1.252 0.2141
## Solar.R:Temp -0.948 0.3463
## Solar.R:Month -1.073 0.2867
## Solar.R:Day -0.761 0.4491
## Wind:Temp -1.852 0.0678 .
## Wind:Month -1.619 0.1094
## Wind:Day -1.545 0.1264
## Temp:Month -1.224 0.2246
## Temp:Day -0.740 0.4612
## Month:Day -0.983 0.3285
## Solar.R:Wind:Temp 1.406 0.1637
## Solar.R:Wind:Month 1.296 0.1986
## Solar.R:Wind:Day 1.307 0.1949
## Solar.R:Temp:Month 1.158 0.2503
## Solar.R:Temp:Day 0.860 0.3925
## Solar.R:Month:Day 1.099 0.2750
## Wind:Temp:Month 1.821 0.0724 .
## Wind:Temp:Day 1.687 0.0956 .
## Wind:Month:Day 1.781 0.0787 .
## Temp:Month:Day 1.095 0.2767
## Solar.R:Wind:Temp:Month -1.428 0.1572
## Solar.R:Wind:Temp:Day -1.436 0.1549
## Solar.R:Wind:Month:Day -1.543 0.1268
## Solar.R:Temp:Month:Day -1.171 0.2451
## Wind:Temp:Month:Day -1.933 0.0568 .
## Solar.R:Wind:Temp:Month:Day 1.665 0.1000 .
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4921 on 79 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.768, Adjusted R-squared: 0.677
## F-statistic: 8.436 on 31 and 79 DF, p-value: 1.129e-14
AIC(mod_4) # 185.8515
## [1] 185.8515
# Model with month as a factor
mod_5 = lm(log(Ozone) ~ . - Month + as.factor(Month), data = airquality)
summary(mod_5)
##
## Call:
## lm(formula = log(Ozone) ~ . - Month + as.factor(Month), data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.13862 -0.27374 -0.00767 0.26599 1.14361
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -0.4449430 0.6691203 -0.665
## Solar.R 0.0024432 0.0005871 4.161
## Wind -0.0592787 0.0163748 -3.620
## Temp 0.0519251 0.0085317 6.086
## Day 0.0035922 0.0057353 0.626
## as.factor(Month)6 -0.1597722 0.2263292 -0.706
## as.factor(Month)7 -0.1087654 0.1954027 -0.557
## as.factor(Month)8 -0.0395237 0.2038328 -0.194
## as.factor(Month)9 -0.1970522 0.1652345 -1.193
## Pr(>|t|)
## (Intercept) 0.50757
## Solar.R 6.62e-05 ***
## Wind 0.00046 ***
## Temp 2.05e-08 ***
## Day 0.53250
## as.factor(Month)6 0.48184
## as.factor(Month)7 0.57900
## as.factor(Month)8 0.84664
## as.factor(Month)9 0.23581
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5141 on 102 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6732, Adjusted R-squared: 0.6475
## F-statistic: 26.26 on 8 and 102 DF, p-value: < 2.2e-16
AIC(mod_5) # 177.8937
## [1] 177.8937
# Model without day
mod_6 = lm(log(Ozone) ~ . - Day - Month + as.factor(Month), data = airquality)
summary(mod_6)
##
## Call:
## lm(formula = log(Ozone) ~ . - Day - Month + as.factor(Month),
## data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.12662 -0.28197 -0.02742 0.28834 1.16785
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -0.3392540 0.6455802 -0.526
## Solar.R 0.0024410 0.0005854 4.170
## Wind -0.0592068 0.0163260 -3.627
## Temp 0.0511975 0.0084272 6.075
## as.factor(Month)6 -0.1575432 0.2256326 -0.698
## as.factor(Month)7 -0.0952694 0.1936372 -0.492
## as.factor(Month)8 -0.0228939 0.2014990 -0.114
## as.factor(Month)9 -0.1929045 0.1646140 -1.172
## Pr(>|t|)
## (Intercept) 0.600363
## Solar.R 6.36e-05 ***
## Wind 0.000449 ***
## Temp 2.10e-08 ***
## as.factor(Month)6 0.486607
## as.factor(Month)7 0.623767
## as.factor(Month)8 0.909762
## as.factor(Month)9 0.243957
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5125 on 103 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6719, Adjusted R-squared: 0.6496
## F-statistic: 30.13 on 7 and 103 DF, p-value: < 2.2e-16
AIC(mod_6) # 176.3198 # Very slightly 'better'
## [1] 176.3198
Task set 2
horseshoe = read.csv('../data/horseshoe.csv')
str(horseshoe)
## 'data.frame': 173 obs. of 5 variables:
## $ color : int 2 3 1 3 3 2 1 3 2 3 ...
## $ spine : int 3 3 1 3 3 3 1 2 1 3 ...
## $ width : num 28.3 22.5 26 24.8 26 23.8 26.5 24.7 23.7 25.6 ...
## $ satell: int 8 0 9 0 4 0 0 0 0 0 ...
## $ weight: num 3.05 1.55 2.3 2.1 2.6 2.1 2.35 1.9 1.95 2.15 ...
horseshoe$color_fac = factor(horseshoe$color,
labels = c('light medium', 'medium',
'dark medium', 'dark'))
horseshoe$spine_fac = factor(horseshoe$spine,
labels = c('both good',
'one worn or broken',
'both worn or broken'))
pairs(horseshoe) # Strong relationship between width and weight
# Response variable is satell so perhaps explore more
boxplot(satell ~ spine_fac, data = horseshoe)
boxplot(satell ~ color_fac, data = horseshoe)
horseshoe$satell_bin = as.integer(horseshoe$satell > 0)
boxplot(width ~ satell_bin, data = horseshoe) # Higher width with satell > 0
logistic_mod = glm(satell_bin ~ width, data = horseshoe,
family = 'binomial')
summary(logistic_mod)
##
## Call:
## glm(formula = satell_bin ~ width, family = "binomial", data = horseshoe)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0281 -1.0458 0.5480 0.9066 1.6942
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508 2.6287 -4.698 2.62e-06
## width 0.4972 0.1017 4.887 1.02e-06
##
## (Intercept) ***
## width ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 194.45 on 171 degrees of freedom
## AIC: 198.45
##
## Number of Fisher Scoring iterations: 4
with(horseshoe, plot(width, satell_bin))
points(horseshoe$width, logistic_mod$fitted.values, col = 'red')
# Must be able to make this plot look nicer!
logistic_mod_2 = glm(satell_bin ~ width + color_fac + spine_fac, data = horseshoe,
family = 'binomial')
summary(logistic_mod_2)
##
## Call:
## glm(formula = satell_bin ~ width + color_fac + spine_fac, family = "binomial",
## data = horseshoe)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1206 -0.9724 0.5076 0.8750 2.1158
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -11.09953 2.97706
## width 0.45624 0.10779
## color_facmedium -0.14340 0.77838
## color_facdark medium -0.52405 0.84685
## color_facdark -1.66833 0.93285
## spine_facone worn or broken -0.05782 0.70308
## spine_facboth worn or broken 0.37703 0.50191
## z value Pr(>|z|)
## (Intercept) -3.728 0.000193 ***
## width 4.233 2.31e-05 ***
## color_facmedium -0.184 0.853830
## color_facdark medium -0.619 0.536030
## color_facdark -1.788 0.073706 .
## spine_facone worn or broken -0.082 0.934453
## spine_facboth worn or broken 0.751 0.452540
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 186.61 on 166 degrees of freedom
## AIC: 200.61
##
## Number of Fisher Scoring iterations: 4
AIC(logistic_mod, logistic_mod_2) # New model not any better
## df AIC
## logistic_mod 2 198.4527
## logistic_mod_2 7 200.6119
# Many more modelling options to try here
Task set 3
pois_mod = glm(satell ~ width, data = horseshoe, family = poisson)
summary(pois_mod)
##
## Call:
## glm(formula = satell ~ width, family = poisson, data = horseshoe)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09
## width 0.16405 0.01997 8.216 < 2e-16
##
## (Intercept) ***
## width ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
# Example of a qq-plot for a Poisson distribution - bit of a hack!
# qqrplot comes from the countreg package
source('https://raw.githubusercontent.com/rforge/countreg/master/pkg/R/qqrplot.R')
source('https://raw.githubusercontent.com/rforge/countreg/master/pkg/R/qresiduals.R')
qqrplot(pois_mod,
main = "Poisson")
pois_mod_2 = glm(satell ~ width + color_fac + spine_fac + weight, data = horseshoe, family = poisson)
summary(pois_mod_2)
##
## Call:
## glm(formula = satell ~ width + color_fac + spine_fac + weight,
## family = poisson, data = horseshoe)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9875 -1.9013 -0.5923 0.9464 4.9012
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.84651 0.93057
## width 0.04531 0.04649
## color_facmedium -0.25471 0.16809
## color_facdark medium -0.50104 0.19541
## color_facdark -0.49764 0.22679
## spine_facone worn or broken -0.14869 0.21337
## spine_facboth worn or broken 0.07272 0.11941
## weight 0.38814 0.15693
## z value Pr(>|z|)
## (Intercept) -0.910 0.3630
## width 0.975 0.3297
## color_facmedium -1.515 0.1297
## color_facdark medium -2.564 0.0103 *
## color_facdark -2.194 0.0282 *
## spine_facone worn or broken -0.697 0.4859
## spine_facboth worn or broken 0.609 0.5425
## weight 2.473 0.0134 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 552.25 on 165 degrees of freedom
## AIC: 923.55
##
## Number of Fisher Scoring iterations: 6
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
nb_mod = glm.nb(satell ~ width + color_fac + spine_fac + weight, data = horseshoe)
summary(nb_mod)
##
## Call:
## glm.nb(formula = satell ~ width + color_fac + spine_fac + weight,
## data = horseshoe, init.theta = 0.9521901429, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8535 -1.3633 -0.3033 0.4685 2.2238
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -1.58662 1.83022
## width 0.07734 0.09053
## color_facmedium -0.31434 0.37457
## color_facdark medium -0.57764 0.41921
## color_facdark -0.53423 0.46877
## spine_facone worn or broken -0.21439 0.39927
## spine_facboth worn or broken 0.02570 0.24903
## weight 0.37950 0.31432
## z value Pr(>|z|)
## (Intercept) -0.867 0.386
## width 0.854 0.393
## color_facmedium -0.839 0.401
## color_facdark medium -1.378 0.168
## color_facdark -1.140 0.254
## spine_facone worn or broken -0.537 0.591
## spine_facboth worn or broken 0.103 0.918
## weight 1.207 0.227
##
## (Dispersion parameter for Negative Binomial(0.9522) family taken to be 1)
##
## Null deviance: 219.08 on 172 degrees of freedom
## Residual deviance: 196.73 on 165 degrees of freedom
## AIC: 764.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.952
## Std. Err.: 0.173
##
## 2 x log-likelihood: -746.905
AIC(pois_mod, pois_mod_2, nb_mod) # NegBin model way 'better' in terms of AIC
## df AIC
## pois_mod 2 927.1762
## pois_mod_2 8 923.5498
## nb_mod 9 764.9049
# Plot of fitted values
with(horseshoe, plot(width, satell))
points(horseshoe$width, nb_mod$fitted.values, col = 'red')
# Width not the only variable we could plot against (try e.g. color_fac or others)
# QQ-plot for negative binomial
qqrplot(nb_mod,
main = "Negative Binomial")
# Compare with the Poisson fit
qqrplot(pois_mod_2,
main = "Poisson")