```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
A glossary of terms for the course *Introduction to Bayesian Hierarchical Modelling*.
- Posterior distribution: The probability distribution of the parameters given the data. In MCMC output this is usually represented as a matrix where the rows represent each sample and the columns represent each parameter.
- Likelihood: The probability of observing the data given some parameter values. For a given set of parameters, the likelihood can be represented as a single numerical value (see the sketch below).
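As a minimal sketch of this (data and parameter values invented for illustration):

```{r}
# Hypothetical data: five observations
y <- c(2.1, 1.8, 2.5, 2.0, 2.3)

# Likelihood at mu = 2, sigma = 0.3: product of the individual densities
prod(dnorm(y, mean = 2, sd = 0.3))

# The log-likelihood (a sum) is more numerically stable and more common
sum(dnorm(y, mean = 2, sd = 0.3, log = TRUE))
```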
- Prior distribution: The probability distribution of the parameters obtained externally from the data (i.e. before you have seen the data, or, more practically, obtained from information separate from the current experiment).
- Prior predictive distribution: A pseudo-data set obtained (possibly repeatedly) by simulating first from the prior distribution of the parameters, and subsequently from the likelihood. If you have used good priors then this pseudo-data should ‘look like’ real data (see the sketch after the next entry).
- Posterior predictive distribution: A pseudo-data set obtained (usually repeatedly) by simulating first from the posterior distribution of the parameters, and subsequently from the likelihood. If the model is a good fit to the data, the pseudo-data set should be similar to the real data set.
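A minimal sketch of both predictive distributions, assuming a normal likelihood, with invented priors and fake ‘posterior’ samples standing in for real JAGS/Stan output:

```{r}
set.seed(123)
y <- c(2.1, 1.8, 2.5, 2.0, 2.3)    # the hypothetical data from above
n <- length(y)
S <- 1000                           # number of pseudo-data sets

# Prior predictive: draw parameters from the prior, then data from the likelihood
mu_pri    <- rnorm(S, 0, 10)        # assumed vague prior for mu
sigma_pri <- abs(rnorm(S, 0, 5))    # assumed half-normal prior for sigma
y_pri <- t(sapply(1:S, function(s) rnorm(n, mu_pri[s], sigma_pri[s])))

# Posterior predictive: identical, but using posterior samples
# (faked here; in practice take them from JAGS or Stan output)
mu_pos    <- rnorm(S, 2.1, 0.15)
sigma_pos <- abs(rnorm(S, 0.3, 0.05))
y_pos <- t(sapply(1:S, function(s) rnorm(n, mu_pos[s], sigma_pos[s])))

# Each row of y_pri / y_pos is one pseudo-data set to compare with y
```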
- Generalised Linear Model (GLM): A type of statistical model where a response variable is related to explanatory variables via a probability distribution (through the likelihood) and a link function.
- Link function: A function which transforms a set of parameters in a probability distribution from a restricted to an unrestricted range. The unrestricted range can then be used to incorporate the covariates (see the example below).
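For example, the logit link (available in R as `qlogis` and its inverse `plogis`), shown here with toy data invented for illustration:

```{r}
# The logit link maps a probability p in (0, 1) onto the whole real line
p <- 0.8
eta <- qlogis(p)   # logit: log(p / (1 - p))
plogis(eta)        # inverse link recovers p

# Toy logistic-regression GLM: binomial likelihood with a logit link
set.seed(42)
x <- rnorm(50)
yb <- rbinom(50, size = 1, prob = plogis(-0.5 + 1.2 * x))
coef(glm(yb ~ x, family = binomial(link = "logit")))
```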
- Directed Acyclic Graph (DAG): A graphical display of a model used to identify the links between the parameters and the data. Circles are used to display parameters, squares for data, and arrows for the links between them. Dashed lines are sometimes also used to display indexed variables (e.g. observations).
- Information criteria: A set of tools used to compare between models. They attempt to penalise the deviance (minus twice the log-likelihood) by a measure of the complexity of the model, with the idea that the ‘best’ models will represent the data well and be relatively simple. Common information criteria for Bayesian models are the Deviance Information Criterion (DIC) and the Widely Applicable Information Criterion (WAIC). Smaller values of an IC tend to indicate ‘better’ models.
- Deviance: A measure of the fit of the model using only the likelihood score, calculated as minus twice the log-likelihood. It is popular because, for normally distributed data, the deviance is equivalent (up to constants) to the residual sum of squares. JAGS reports the deviance as it is used as part of the Deviance Information Criterion, whereas Stan reports the log-posterior score (see the definition of lp__ below). A worked sketch follows.
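A sketch of the deviance and the DIC under a normal likelihood, with invented data and fake posterior samples standing in for real JAGS/Stan output:

```{r}
set.seed(1)
y_dev <- rnorm(20, mean = 2, sd = 0.5)   # toy data

# Deviance = -2 * log-likelihood at given parameter values
deviance_at <- function(mu, sigma) {
  -2 * sum(dnorm(y_dev, mu, sigma, log = TRUE))
}

# Stand-ins for posterior samples (in practice from JAGS or Stan)
mu_post    <- rnorm(1000, 2, 0.1)
sigma_post <- abs(rnorm(1000, 0.5, 0.05))

# DIC = deviance at the posterior mean + 2 * pD, where
# pD = mean deviance over the samples - deviance at the posterior mean
dev_samples <- mapply(deviance_at, mu_post, sigma_post)
dev_at_mean <- deviance_at(mean(mu_post), mean(sigma_post))
pD <- mean(dev_samples) - dev_at_mean
dev_at_mean + 2 * pD   # the DIC
```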
- Cross validation (CV): A technique whereby parts of the data are removed before fitting the model, and subsequently predicted from the new model output. CV is valuable because it tests the model's performance on data it has not seen. Common versions include k-fold CV, whereby the data are split into k groups and each is left out (and subsequently predicted) in turn, and leave-one-out CV (LOO-CV), where each individual observation is left out (and subsequently predicted) in turn.
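A k-fold CV sketch for a simple (non-Bayesian) linear model with invented data; the same idea carries over directly to Bayesian models:

```{r}
set.seed(2)
dat <- data.frame(x = rnorm(60))
dat$y <- 1 + 2 * dat$x + rnorm(60)

# 5-fold CV: hold out each fold in turn, refit, predict the held-out points
k <- 5
fold <- sample(rep(1:k, length.out = nrow(dat)))
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = dat[fold != i, ])
  pred <- predict(fit, newdata = dat[fold == i, ])
  mean((dat$y[fold == i] - pred)^2)
})
mean(cv_mse)   # average out-of-sample mean squared error
```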
- Hierarchical model: A model where some parameters are given further prior distributions that depend on other parameters.
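As an illustration, simulating data from a simple two-level hierarchical model (all values invented):

```{r}
set.seed(3)
J <- 8    # number of groups
n <- 10   # observations per group

# Hyper-parameters at the top of the hierarchy
mu0 <- 5
sigma0 <- 2

# Group-level parameters share a prior governed by the hyper-parameters...
theta <- rnorm(J, mu0, sigma0)

# ...and each group's data are generated around its own parameter
y <- matrix(rnorm(J * n, mean = rep(theta, each = n), sd = 1),
            nrow = J, byrow = TRUE)
```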
- Markov chain Monte Carlo (MCMC): A method for simulating values from a posterior distribution given a likelihood, a prior, and a data set. The method works by guessing initial values for the parameters and scoring them against the likelihood and the prior. It then proposes new parameter values and compares their score with the previous one, accepting or rejecting accordingly. Over thousands of iterations the sampler should eventually converge on the posterior distribution.
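A minimal Metropolis sampler, one of the simplest MCMC algorithms, for the mean of a normal model with known standard deviation (data and prior invented for illustration):

```{r}
set.seed(4)
y_obs <- rnorm(30, mean = 1.5, sd = 1)    # toy data; sd assumed known

# Unnormalised log posterior score: log-likelihood plus log-prior
log_post <- function(mu) {
  sum(dnorm(y_obs, mu, 1, log = TRUE)) + dnorm(mu, 0, 10, log = TRUE)
}

n_iter <- 5000
mu <- numeric(n_iter)
mu[1] <- 0                                # initial guess
for (t in 2:n_iter) {
  prop <- mu[t - 1] + rnorm(1, 0, 0.5)    # propose a new value nearby
  # Accept the proposal with probability given by the ratio of scores
  if (log(runif(1)) < log_post(prop) - log_post(mu[t - 1])) {
    mu[t] <- prop
  } else {
    mu[t] <- mu[t - 1]                    # otherwise keep the old value
  }
}
mean(mu[-(1:1000)])   # posterior mean after discarding a burn-in
```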
- Hyper-parameter: A parameter further removed from the data that depends on no other parameters in a hierarchical model.
- Rhat: A measure of how well a parameter has converged to the posterior distribution. Values less than 1.1 are usually considered acceptable. It can only be computed for runs with multiple chains started from different initial values.
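A sketch of the basic Gelman-Rubin calculation behind Rhat (this omits refinements, such as split chains, that modern JAGS and Stan implementations apply):

```{r}
rhat <- function(chains) {
  # chains: n x m matrix, one column per chain
  n <- nrow(chains)
  B <- n * var(colMeans(chains))        # between-chain variance
  W <- mean(apply(chains, 2, var))      # mean within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)   # potential scale reduction factor
}

# Three toy chains that have mixed well give a value close to 1
set.seed(5)
rhat(matrix(rnorm(3000), ncol = 3))
```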
- Convergence monitoring: The means by which a set of parameters is judged to have converged to the true posterior distribution. Convergence is controlled (in JAGS and Stan) by the number of iterations, the burn-in period (the number of iterations removed at the start), and the amount of thinning (keeping only every kth iteration to reduce autocorrelation). For models which haven't converged, it is common to increase one or more of these values.
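For example, applying a burn-in and thinning to the chain of `mu` from the Metropolis sketch above:

```{r}
burn_in <- 1000
thin <- 4
kept <- mu[seq(burn_in + 1, length(mu), by = thin)]
length(kept)   # 1000 thinned post-burn-in draws remain
```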
- Mixed effects models: A Frequentist model containing both fixed effects (parameters which do not borrow strength across groups) and random effects (parameters which do borrow strength across groups). In hierarchical models the aim is always to use the random effects approach.
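For comparison, a Frequentist mixed effects model as fitted by the lme4 package (assumed installed), using its built-in sleepstudy data:

```{r}
library(lme4)
# Fixed effect for Days (no borrowing of strength); a random intercept
# per Subject borrows strength across subjects
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(fit)
```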
- Marginal distribution: A probability distribution which does not depend on further (usually unknown) parameter values.
- Conditional distribution: A probability distribution that depends on further parameters and/or data.
- Conditional independence: A situation that occurs when two parameters are not directly linked in a DAG; they are then independent given the parameters that lie between them.
- Latent parameter: A parameter which is not directly observed in the data but can be estimated using a statistical model (such as a hierarchical model).
- lp__: Part of the output of a Stan model that represents the log posterior score. This is the log of the likelihood times the prior (or equivalently the sum of the log-likelihood plus the sum of the log-priors). It represents the overall score that Stan is trying to maximise (and subsequently hover around) to create the posterior distribution.
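A small worked example of the quantity behind lp__, assuming a normal likelihood with invented priors and an invented parameter draw:

```{r}
set.seed(6)
y_lp <- rnorm(30, 1.5, 1)                        # toy data
mu <- 1.4; sigma <- 1.1                          # one set of parameter values
lp <- sum(dnorm(y_lp, mu, sigma, log = TRUE)) +  # log-likelihood
  dnorm(mu, 0, 10, log = TRUE) +                 # log-prior for mu (assumed)
  dexp(sigma, 1, log = TRUE)                     # log-prior for sigma (assumed)
lp
```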