Chapter 3 - Model Fitting
Summary: Laws appear in data as continuous sequences of points, but
models appear as noisy clouds of points. As a result, it is much harder
to reverse engineer a model from data than it is to reverse engineer a
law. How should we pick the model that best describes the data? We can
use inverse probability, or likelihood, to find the model that is most
likely to have generated the data. This approach lets us fit many
different types of models: linear, linear with discrete terms, linear
multivariate, linear multivariate with interaction terms, generalized
linear models, and non-linear generalized additive models.
1. How to fit?
- Both laws and models appear as patterns in data. However, it is much
easier to spot a law from these patterns than it is to spot a model.
- Why should this be? For a law, each data point falls on the
function that describes the law.
- But for a model, each data point falls around the function
that describes the model; the data points are distributed
around it according to some probability distribution.
- With enough points, you may be able to notice the structure
of these distributions, and therefore make a good guess
about the form of the model, but you will rarely have this
many points.
- More commonly you will need to make the best guess that you
can about the model given the data that you have.
- In practice, you use an algorithm to select the best guess
and R takes care of the details: lm() fits the model, and
xyplot(), makeFun(), and plotFun(add = TRUE) let you plot the
fitted model over the data (see the sketch at the end of this
section).
- But what is R doing?
- Many algorithms exist to help you spot the “best” model given the
data. In this chapter we will look at how the most popular
algorithms work and learn to use them. In the following chapter, we
will consider how to choose between the algorithms when modeling a
particular data set.
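- A minimal sketch of this workflow in R (assuming a data frame dat with numeric columns x and y; these names are hypothetical, not from the chapter):

    library(lattice)                  # xyplot()
    library(mosaic)                   # makeFun(), plotFun()

    mod <- lm(y ~ x, data = dat)      # fit the most likely straight-line model
    xyplot(y ~ x, data = dat)         # plot the raw data
    f <- makeFun(mod)                 # turn the fitted model into a function of x
    plotFun(f(x) ~ x, add = TRUE)     # overlay the fitted line on the data plot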
2. Likelihood, the inverse of probability
- The most intuitive modeling algorithms rely on likelihood. In short,
they pick the model that is most likely to have generated the
data.
- We use the term likely in everyday speech, but in science
likelihood has a specific meaning that is closely related to
probability.
- Probability describes the chance that a certain observation will
come to pass given a model. It reasons from model to (future)
observation.
- Likelihood describes the chance that a given model caused the
observation that has come to pass. It reasons from observation
to model. For this reason, likelihood is sometimes called
inverse probability.
- Although likelihood and probability both deal with chances,
they behave differently mathematically. Most notably, the
probabilities of all possible disjoint outcomes must sum to one
given a model; the likelihoods of different models given the
data need not.
- You can calculate the likelihood that a specific model produced a
set of data.
- Example of a very simple model: the template goal(y ~ 1, data = ...)
fit with lm(). This intercept-only model describes y with a
single mean (see the likelihood sketch at the end of this section).
- To find the most likely model compare the likelihoods of
different models and choose the model with the highest
likelihood.
- There is no guarantee that the model with the highest likelihood
generated the data, but it is the most pragmatic conclusion to
draw. This method of reasoning is known as abduction.
- In practice, there is a logistical issue to address. There is an
infinite number of models to choose from. How will you ever stop
calculating likelihoods?
- Well, you handle it the same way you manage possibilities
whenever you make a decision in real life: you first narrow the
possibilities down to a smaller set for further consideration.
Each modeling algorithm applies only to a small set of models,
namely the models that share the structure over which the
algorithm optimizes likelihood.
- By choosing to use an algorithm, you narrow the set of
models to consider to a tractable number. The algorithm then
identifies the most likely model within that set.
- You should choose the algorithm that matches the structure
of the natural law that you wish to model (if you know this
structure).
- In the next chapter, we will consider how to proceed when
you do not know anything about this structure.
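- A minimal sketch of comparing likelihoods in R (assuming a data frame dat with columns y and x; hypothetical names):

    mod0 <- lm(y ~ 1, data = dat)   # intercept-only model: one mean for all of y
    mod1 <- lm(y ~ x, data = dat)   # straight-line model with one predictor
    logLik(mod0)                    # log-likelihood of the simple model
    logLik(mod1)                    # the model with the larger log-likelihood is
                                    # the one more likely to have generated the data

  Within its own set of models (here, all straight lines), lm() already returns the member with the highest likelihood; logLik() simply lets you inspect and compare those maxima.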
3. Fitting a linear model
- Linear models are among the oldest and most interpretable modeling
methods. A linear model uses a linear function to map a set of
values to a set of normal distributions.
- Linear models are widely useful because
- The normal distribution occurs frequently in the natural
world.
- Any continuous function can be approximated well with a
straight line over a short distance.
- Linear models also have another useful feature. For linear
models with normally distributed residuals, the model with
maximum likelihood will be the model that has the smallest
residual sum of squares (for generalized linear models, maximum
likelihood instead minimizes an analogous quantity, the deviance).
- A residual is the difference between an observed data point
and the mean of the distribution predicted by the model at
that point.
- Notice that residuals can be positive or negative. Squaring
the residuals is a way to measure the magnitude of a
residual regardless of its sign.
- Hence the least squares model will be the model that has the
smallest sum of squared residuals, i.e. the model that comes
closest to the data points.
- The lm() function fits a linear model to data: the template
goal(y ~ x, data = ...) becomes lm(y ~ x, data = ...).
- R’s modeling functions return an object that contains a lot of
information. To access it, store the model and then explore it with
resid(), coef(), and fitted(), or plot it with xyplot(), makeFun(),
and plotFun(add = TRUE). A store-then-explore sketch appears at the
end of this section.
- Linear models are particularly easy to interpret. The coefficient of
X is the number of units by which the best guess of Y ($\hat{Y}$)
increases when X increases by one unit.
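- A minimal sketch of store-then-explore (dat, x, and y are hypothetical names):

    mod <- lm(y ~ x, data = dat)   # fit and store the model object
    coef(mod)                      # the intercept and the coefficient of X
    head(fitted(mod))              # best guesses of Y at each observed X
    head(resid(mod))               # residuals: observed Y minus fitted Y
    sum(resid(mod)^2)              # the residual sum of squares that lm() minimizes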
4. Discrete terms
- You can also apply linear models to discrete terms.
- Consider this example. To explore the data, use diffmean(),
tally(), prop(), and perc().
- Build your model as you normally would:
goal(y ~ 1 | z, data = ...), fit with lm().
- Interpretation
- R will provide a coefficient for each level of the discrete
variable except one. This level will be used as a baseline.
- The intercept is the best guess of Y for the baseline group
- Each β coefficient is a modifier. It shows how to modify the
baseline coefficient to determine the best guess of Y for each
remaining group. In other words, it shows the change in the best
guess of Y that results from switching from the baseline group
to the coefficient’s group.
- Use factor() to change the baseline group. Your coefficients
will change, but the final results will not (see the sketch at
the end of this section).
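- A minimal sketch with a discrete term (assuming dat has a numeric y and a two-level grouping variable z; hypothetical names):

    library(mosaic)                   # tally(), diffmean()
    tally(~ z, data = dat)            # counts for each level of z
    diffmean(y ~ z, data = dat)       # difference between the two group means

    mod <- lm(y ~ z, data = dat)      # one coefficient per non-baseline level
    coef(mod)                         # intercept = baseline group's best guess

    # change the baseline level: the coefficients change, the predictions do not
    dat$z <- factor(dat$z, levels = rev(levels(factor(dat$z))))
    coef(lm(y ~ z, data = dat))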
5. Multivariate models
- Add additional terms to the formula to include additional predictors
in your model: goal(y ~ x | z, data = ...), goal(y ~ x + z, data = ...).
- Now each coefficient should be interpreted as the change in the best
guess of Y that results from changing Xi by one unit *while holding
the values of the other Xj constant* (see the sketch below).
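- A minimal sketch with two predictors (hypothetical columns x and z in dat):

    mod <- lm(y ~ x + z, data = dat)
    coef(mod)   # the x coefficient: change in the best guess of y per one-unit
                # change in x, holding z constant (and vice versa for z)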
6. Interaction terms
- Multivariate models create the possibility of interaction effects.
An interaction effect occurs when the values of one variable modify
the effect of another variable.
- A visual explanation
- Notation: goal(y ~ x + z + x:z, data = ...), or the shorthand
goal(y ~ x*z, data = ...) (see the sketch after this list).
- Interpretation
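- A minimal sketch of an interaction model (hypothetical columns x, z, and y):

    mod <- lm(y ~ x * z, data = dat)   # equivalent to y ~ x + z + x:z
    coef(mod)                          # the x:z coefficient: how the effect of x
                                       # on y changes as z changes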
7. Generalized models
- lm() uses likelihood to find the model that maps values of X to
normal distributions of Y. We saw in Chapter 2 that many natural
events have a normal distribution, but others do not. What if we want
a model that maps values to non-normal distributions? (This would be
appropriate if we are modeling a non-normal Y, or even a discrete Y.)
- This is an important change because the distributions associated
with a model determine how well it fits. They determine how
likely the model is to generate the data.
- It is easy to generalize linear models to non-normal cases by
modeling a function of Y.
- Compare linear model equation to glm model equation
- Such functions are known as link functions. They map
non-normal input (Y) to normal output, which can be fitted in
the usual way.
- These models are known as generalized linear models (GLM)
- The most common form of generalized linear model is the logistic
model (a worked sketch appears at the end of this section).
glm(..., family = ...)
options(na.action = "na.exclude")
predict(mod, type = "link"); predict(mod, type = "response")
resid(gmod, type = "deviance"); resid(gmod, type = "pearson")
- To interpret a generalized linear model, back transform the
coefficients through the link function, or simply try to understand
the predictions
- Interpreting a logistic regression
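- Where a linear model assumes that Y is normally distributed around β0 + β1X, a logistic model assumes that Y is a two-outcome (0/1) variable whose log-odds equal β0 + β1X. A minimal sketch (assuming dat has a 0/1 outcome won and a numeric predictor x; hypothetical names):

    options(na.action = "na.exclude")       # keep rows with NAs aligned in predict()/resid()
    gmod <- glm(won ~ x, data = dat, family = binomial)
    coef(gmod)                              # coefficients on the link (log-odds) scale
    exp(coef(gmod))                         # back-transformed into odds ratios
    head(predict(gmod, type = "link"))      # predictions as log-odds
    head(predict(gmod, type = "response"))  # predictions as probabilities
    head(resid(gmod, type = "deviance"))    # two common flavors of residuals
    head(resid(gmod, type = "pearson"))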
8. Non-linear models
- We can use a similar strategy to model non-linear relationships.
Instead of modeling Y on X, we can model Y on a function of
X.
- If you have a particular function of X in mind, you can put it
straight into the formula you pass to lm() (both approaches are
sketched at the end of this section).
- Don't forget to back-transform your results when interpreting.
- Generalized additive models (GAMs) fit a model that maps smooth
functions of the Xi to distributions of Y. These functions do not
need to be linear, only smooth, which makes GAM algorithms useful
for capturing many types of relationships.
- GAM model equation: $\hat{Y} = \beta_0 + f_1(X_1) + f_2(X_2) + \dots$,
where each $f_i$ is a smooth function estimated from the data.
- Fit them with gam() from the mgcv package (library(mgcv)),
wrapping each smooth term in s().
- interactions
- As before, we are using likelihood to identify the best model, but
our low-level tactics have changed. The model with the highest
likelihood will be the model whose functions of X put the model line
exactly through each data point, something that may now be feasible.
However, such a model is as unlikely to be true in the practical
sense as it is likely to be true in the mathematical sense. As a
result, gam() uses a penalized iterative method to select the most
likely sensible model.
- You can combine generalized linear methods and generalized additive
methods with gam().
- Model equation: $g(\hat{Y}) = \beta_0 + f_1(X_1) + f_2(X_2) + \dots$,
where $g$ is a link function.
- Hence, you can think of gam() as a type of generalized
modeling algorithm.
- Interpreting the results of GAMs is difficult. For GAMs as well as
other modeling methods that we will encounter later, interpret the
model through its predictions.
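- A minimal sketch of both approaches (assuming dat with numeric columns x and y; hypothetical names):

    # a known transformation of x can go straight into the lm() formula
    mod_log <- lm(y ~ log(x), data = dat)
    coef(mod_log)                      # slope is per unit of log(x); back-transform to interpret

    # a GAM lets gam() choose a smooth, penalized function of x
    library(mgcv)
    gmod <- gam(y ~ s(x), data = dat)                # s() marks a smooth term
    # gam(y ~ s(x), data = dat, family = binomial)   # combines smooths with a GLM family
    plot(gmod)                         # inspect the fitted smooth; interpret through predictions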
9. Summary
- At its heart, model fitting is an optimization problem. Each of
the methods above optimizes a likelihood function to find the “best
fitting” model.
- Recommended reading for the mathematics behind model fitting:
The Elements of Statistical Learning
- Each of these methods finds the best parametric model to fit your
data. It is hard to describe a model (which must describe all
possible data points) without using a parametric distribution.
- We will look at some non-parametric models in Chapter 6.
- This chapter covered the most popular (and the most accessible)
methods of model fitting, but many more modeling algorithms exist.
- Additional reference for other methods: Applied Predictive
Modeling
- How do you know which method you should use with your data?
Chapter 4 will begin with this question.