Summary: Laws look like continuous sequences of points in data, but
models look like noisy clouds of points. As a result, it is much harder
to reverse engineer a model from data than it is to reverse engineer a
law. How should we pick the model that best describes the data? We can
use inverse probability, or likelihood, to find the model that is *most
likely* to have generated the data. This allows us to fit many different
types of models: linear models, linear models with discrete predictors,
multivariate linear models, multivariate linear models with interaction
terms, generalized linear models, and non-linear generalized additive
models.

- Both laws and models appear as patterns in data. However, it is much
  easier to spot a law in these patterns than it is to spot a model.
  - Why should this be? For a law, each data point falls *on* the function that describes the law.
  - But for a model, each data point falls *around* the function that describes the model. In fact, the data points are *distributed* around the function (as you know).
  - With enough points, you may be able to notice the structure of these distributions, and therefore make a good guess about the form of the model, but you will rarely have this many points.
  - More commonly you will need to make the best guess that you can about the model given the data that you have.
  - In practice, you use an algorithm to select the best guess and R takes care of the details: `lm()`, `xyplot()`, `makeFun()`, `plotFun(add = TRUE)`
  - But what is R doing?
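To make the contrast between a law and a model concrete, here is a minimal sketch in base R. The function `f`, the noise level, and all variable names are assumptions made for illustration; they do not come from the text.

```r
# Hypothetical example: the same function seen as a law and as a model.
set.seed(1)                         # make the noise reproducible
x <- seq(1, 10, length.out = 50)
f <- function(x) 2 * x + 1          # assumed underlying function

y_law   <- f(x)                             # every point falls ON f
y_model <- f(x) + rnorm(length(x), sd = 2)  # points are distributed AROUND f

law_fit   <- lm(y_law ~ x)
model_fit <- lm(y_model ~ x)

max(abs(resid(law_fit)))   # essentially zero: the line passes through every point
sd(resid(model_fit))       # spread of the points around the fitted line
coef(model_fit)            # an estimate of the intercept and slope of f
```

With the law, the function can be read directly off the points. With the model, the fitted line is only a best guess at `f`.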

- Many algorithms exist to help you spot the “best” model given the data. In this chapter we will look at how the most popular algorithms work and learn to use them. In the following chapter, we will consider how to choose between the algorithms when modeling a particular data set.

- The most intuitive modeling algorithms rely on likelihood. In short,
  they pick the model that is most *likely* to have generated the data.
  - We use the term likely in everyday speech, but in science *likelihood* has a specific meaning that is closely related to probability.
  - Probability describes the chance that a certain observation will come to pass given a model. It reasons from model to (future) observation.
  - Likelihood describes the chance that a given model caused the observation that has come to pass. It reasons from observation to model. For this reason, likelihood is sometimes called inverse probability.
  - Although likelihood and probability both deal with chances, they behave differently mathematically. Most notably, the probabilities of a complete set of disjoint outcomes must sum to one. Not so for likelihoods.
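The difference between the two directions of reasoning can be sketched with a coin-flip example. The coin, the ten flips, and the seven observed heads are assumptions chosen for illustration, not data from the text.

```r
# Probability: fix the model, vary the outcome.
# Assumed model: number of heads in 10 flips of a coin with p = 0.5.
probs <- dbinom(0:10, size = 10, prob = 0.5)
sum(probs)                       # probabilities over all outcomes sum to one

# Likelihood: fix the observation (7 heads), vary the model.
p_grid <- seq(0, 1, by = 0.01)   # candidate models, one per value of p
lik    <- dbinom(7, size = 10, prob = p_grid)
sum(lik)                         # does not sum to one
best_p <- p_grid[which.max(lik)] # the candidate most likely to have produced 7 heads
best_p
```

The same function, `dbinom()`, computes both quantities. What changes is which argument is held fixed and which is allowed to vary.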

- You can calculate the likelihood that a specific model produced a
  set of data.
  - Example of a very simple model: `goal(y ~ 1, data = ...)`, `lm()`
  - To find the most likely model, compare the likelihoods of different models and choose the model with the highest likelihood.
  - There is no guarantee that the model with the highest likelihood generated the data, but it is the most pragmatic conclusion to draw. This method of reasoning is known as abduction.
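As a rough sketch of comparing likelihoods, the code below scores a few candidate versions of the very simple model `y ~ 1` and then searches for the best one. The simulated data and the fixed standard deviation of 1 are assumptions made to keep the example short. Log-likelihoods are used because they are easier to work with numerically, and the model with the highest log-likelihood also has the highest likelihood.

```r
# Hypothetical data: 30 measurements assumed to share one normal distribution.
set.seed(2)
y <- rnorm(30, mean = 5, sd = 1)

# Log-likelihood of the data under a normal model with mean mu (sd held at 1).
log_lik <- function(mu) sum(dnorm(y, mean = mu, sd = 1, log = TRUE))

candidates <- c(3, 4, 5, 6)
sapply(candidates, log_lik)      # higher means more likely

# Searching over all values of mu picks out the sample mean,
# which is also the intercept that lm() reports for y ~ 1.
best <- optimize(log_lik, interval = c(0, 10), maximum = TRUE)$maximum
fit  <- lm(y ~ 1)
c(optimize = best, mean = mean(y), lm = unname(coef(fit)))
```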

- In practice, there is a logistical issue to address. There is an
infinite number of models to choose from. How will you ever stop
calculating likelihoods?
- Well, the same way you manage possibilities whenever you make a
decision in real life. You first narrow the possibilities down
to a smaller set for further consideration. Each modeling
algorithm only applies to a small set of models, the set of
models that have the structure the algorithm optimizes
likelihoods over.
- By choosing to use an algorithm, you narrow the set of models to consider to a tractable number. The algorithm then identifies the most likely model within that set.
- You should choose the algorithm that matches the structure of the natural law that you wish to model (if you know this structure).
- In the next chapter, we will consider how to proceed when you do not know anything about this structure.


- Linear models are among the oldest and most interpretable modeling
  methods. A linear model uses a linear function to map a set of
  values to a set of normal distributions.
  - Linear models are widely useful because
    - The normal distribution occurs frequently in the natural world.
    - Any smooth function can be approximated well with a straight line over a short distance.
  - Linear models also have another useful feature. For linear models with normally distributed errors, the model with maximum likelihood will be the model that has the smallest residual sum of squares. (Generalized linear models with exponential family distributions are fit by a related method, iteratively reweighted least squares.)
    - A residual is the difference between the actual data point observed and the mean of the distribution predicted by a model.
    - Notice that residuals can be positive or negative. Squaring the residuals is a way to measure the magnitude of a residual.
    - Hence the least squares model will be the model that has the smallest sum of squared residuals, i.e. the model that comes closest to the data points.
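The link between maximum likelihood and least squares can be written out. As a sketch, assume independent observations with $y_i \sim \text{Normal}(\beta_0 + \beta_1 x_i, \sigma^2)$; the symbols $\beta_0$, $\beta_1$, and $\sigma$ are notation chosen here for illustration. The log-likelihood is then

$$
\log L(\beta_0, \beta_1, \sigma)
  = -\frac{n}{2}\log\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2
$$

For any fixed $\sigma$, the first term does not depend on $\beta_0$ or $\beta_1$, and the second term is a negative multiple of the residual sum of squares. Maximizing the likelihood over the coefficients is therefore the same as minimizing the sum of squared residuals.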

- The `lm()` function fits a linear model to data: `goal(y ~ x, data = ...)`, `lm()`

- R’s modeling functions return an object that contains a lot of data.
  To access the data, *store then explore*: `resid()`, `coef()`, `fitted()`, `xyplot()`, `makeFun()`, `plotFun(add = TRUE)`

- Linear models are particularly easy to interpret. The coefficient of
  *X* is the number of units that the best guess of *Y* ($\hat{Y}$) increases as *X* increases by one unit.
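The store-then-explore pattern and the reading of the slope can be sketched with base R and the built-in `cars` data set. The choice of data set and the object name `mod` are assumptions for illustration.

```r
# Store: keep the fitted model object.
mod <- lm(dist ~ speed, data = cars)

# Explore: pull individual pieces out of the stored object.
coef(mod)            # intercept and slope
head(fitted(mod))    # best guesses of dist for the first few rows
head(resid(mod))     # observed dist minus fitted dist

# The slope is the change in the best guess when speed increases by one unit.
new    <- data.frame(speed = c(10, 11))
change <- unname(diff(predict(mod, newdata = new)))
change               # equals coef(mod)["speed"]
```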

- You can also apply linear models to discrete terms.
  - Consider this example. To explore the data: `diffmean()`, `tally()`, `prop()`, `perc()`
  - Build your model as you normally would: `goal(y ~ 1 | z, data = ...)`, `lm()`

- Interpretation
  - R will provide a coefficient for each level of the discrete variable except one. This level will be used as a baseline.
  - The intercept is the best guess of *Y* for the baseline group.
  - Each *β* coefficient is a modifier. It shows how to modify the baseline coefficient to determine the best guess of *Y* for each remaining group. In other words, it shows the change in the best guess of *Y* that results from switching from the baseline group to the coefficient’s group.
  - Use `factor()` to change the baseline group. Your coefficients will change, but the final results will not.
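The baseline-and-modifier reading above can be checked on the built-in `PlantGrowth` data, which has one control group and two treatment groups. The data set and object names are assumptions for illustration, and the model is written with the base R formula `weight ~ group`.

```r
mod <- lm(weight ~ group, data = PlantGrowth)
coef(mod)   # (Intercept), grouptrt1, grouptrt2; "ctrl" is the baseline

group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means["ctrl"]                         # equals the intercept
group_means["trt1"] - group_means["ctrl"]   # equals the grouptrt1 coefficient

# Change the baseline by reordering the factor levels.
pg2 <- PlantGrowth
pg2$group <- factor(pg2$group, levels = c("trt1", "ctrl", "trt2"))
mod2 <- lm(weight ~ group, data = pg2)
coef(mod2)                                  # different coefficients
all.equal(fitted(mod), fitted(mod2))        # same best guesses for every plant
```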

- Add additional terms to the formula to include additional predictors
  in your model: `goal(y ~ x | z, data = ...)`, `goal(y ~ x + z, data = ...)`

- Now each coefficient should be interpreted as the change in the best
  guess of *Y* that results from changing $X_i$ by one unit *while holding the values of the other* $X_j$ *constant*.
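A small sketch of the "holding the others constant" reading, using the built-in `mtcars` data. The choice of predictors and the two hypothetical cars are assumptions for illustration.

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)
coef(mod)

# Two hypothetical cars that differ by one unit of wt but share the same hp.
two_cars <- data.frame(wt = c(3, 4), hp = c(120, 120))
change   <- unname(diff(predict(mod, newdata = two_cars)))
change   # equals coef(mod)["wt"]
```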

- Multivariate models create the possibility of interaction effects.
An interaction effect occurs when the values of one variable modify
the effect of another variable.
- A visual explanation

- Notation: `goal(y ~ x + z + x:z, data = ...)`, `goal(y ~ x*z, data = ...)`

- Interpretation
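One way to read an interaction, sketched on the built-in `mtcars` data (an assumed example, not one from the text): the two notations fit the same model, and the slope for one variable changes with the value of the other.

```r
m1 <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
m2 <- lm(mpg ~ wt * hp, data = mtcars)
all.equal(coef(m1), coef(m2))   # the two notations describe the same model

# With an interaction, the slope for wt depends on the value of hp.
b <- coef(m2)
slope_wt <- function(hp) unname(b["wt"] + b["wt:hp"] * hp)
slope_wt(100)   # effect of one more unit of wt when hp = 100
slope_wt(200)   # a different effect when hp = 200
```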

- `lm()` uses likelihood to find the model that maps values of *X* to *normal* distributions in *Y*. We saw in Chapter 2 that many natural events will have a normal distribution, but others will not. What if we want to use a model that maps values to non-normal distributions? (This would be appropriate if we are modeling a non-normal *Y* or even a discrete *Y*.)
  - This is an important change because the distributions associated with a model determine how well it fits. They determine how likely the model is to generate the data.

- It is easy to generalize linear models to non-normal cases by
  modeling a function of *Y*.
  - Compare linear model equation to glm model equation
  - Such functions are known as *link functions*. They map non-normal input (*Y*) to normal output, which can be fitted in the usual way.
  - These models are known as *generalized linear models (GLM)*.
  - The most common form of generalized linear model is the logistic model.
    - `glm(..., family = ...)`
    - `options(na.action = "na.exclude")`
    - `predict(mod, type = "link"); predict(mod, type = "response")`
    - `resid(gmod, type = "deviance"); resid(gmod, type = "pearson")`
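The comparison between the two model equations can be sketched as follows. The notation is an assumption made here: $\mu$ stands for the mean of *Y* at a given *X*, $g$ for the link function, and $\beta_0$, $\beta_1$ for the coefficients. In the usual formulation, the link is applied to the mean of *Y*.

$$
\text{linear model:}\quad \mu = \beta_0 + \beta_1 X, \qquad Y \sim \text{Normal}(\mu, \sigma^2)
$$

$$
\text{GLM:}\quad g(\mu) = \beta_0 + \beta_1 X, \qquad Y \sim \text{a chosen distribution with mean } \mu
$$

For a logistic model, *Y* is binomial and the link is the logit, $g(\mu) = \log\frac{\mu}{1 - \mu}$.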

- To interpret a generalized linear model, back transform the
coefficients through the link function, or simply try to understand
the predictions
- Interpreting a logistic regression
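A minimal sketch of interpreting a logistic regression, using the built-in `mtcars` data. The choice of outcome (`am`, manual transmission) and predictor (`wt`) is an assumption for illustration.

```r
# Does a car have a manual transmission (am = 1)? Modeled on weight.
gmod <- glm(am ~ wt, data = mtcars, family = binomial)

link_scale <- predict(gmod, type = "link")       # log-odds
resp_scale <- predict(gmod, type = "response")   # probabilities
all.equal(plogis(link_scale), resp_scale)        # the inverse logit connects the two

# Back-transformed coefficient: multiplicative change in the odds per unit of wt.
exp(coef(gmod)["wt"])

# Or interpret through predictions for two hypothetical weights.
p <- predict(gmod, newdata = data.frame(wt = c(2.5, 3.5)), type = "response")
p
```

Predictions on the response scale are often the easiest to explain, because they are probabilities rather than log-odds.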

- We can use a similar strategy to model non-linear relationships.
  Instead of modeling *Y* on *X*, we can model *Y* on a function of *X*.
  - If you have a particular function of *X* in mind, you can put it straight into the model equation: `lm()`
  - Don’t forget to back transform your results when interpreting.
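As a sketch of putting a function of *X* straight into the formula, the code below models `mpg` on `log(hp)` in the built-in `mtcars` data. The data set and the log transformation are assumptions for illustration.

```r
mod <- lm(mpg ~ log(hp), data = mtcars)
b   <- unname(coef(mod)["log(hp)"])

# The coefficient is per unit of log(hp), not per unit of hp.
# On the original scale: doubling hp changes the best guess of mpg by
b * log(2)

# predict() applies the transformation for you.
p <- predict(mod, newdata = data.frame(hp = c(100, 200)))
unname(diff(p))   # the same amount
```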

- *Generalized additive models (GAMs)* fit a model that maps smooth functions of the $X_i$ to distributions in *Y*. These functions do not need to be linear, only smooth, which makes GAM algorithms useful for mapping many types of relationships.
  - GAM model equation
  - `library(mgcv)`, `gam()`, `s()`
  - Interactions
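A minimal sketch of fitting a GAM with `mgcv`, assuming the package is installed (it is distributed with most R installations). The simulated sine-shaped data are an assumption for illustration.

```r
library(mgcv)

# Hypothetical data: a smooth but non-linear relationship plus noise.
set.seed(3)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)
d <- data.frame(x = x, y = y)

gam_fit <- gam(y ~ s(x), data = d)   # s() requests a smooth function of x
lm_fit  <- lm(y ~ x, data = d)       # a straight line, for comparison

deviance(gam_fit)   # residual sum of squares for the smooth fit
deviance(lm_fit)    # larger: a line cannot follow the curve

# Interpret the fit through its predictions.
p <- predict(gam_fit, newdata = data.frame(x = c(pi / 2, 3 * pi / 2)))
p
```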

- As before, we are using likelihood to identify the best model, but
  our low level tactics have changed. The model with the highest
  likelihood will be the model whose *X* functions put the model line exactly through each data point, something that may now be feasible. However, such a model is as unlikely to be true in the practical sense as it is likely to be true in the mathematical sense. As a result, `gam()` uses a penalized iterative method to select the most likely sensible model.
- You can combine generalized linear methods and generalized additive
  methods with `gam()`.
  - Model equation
  - Hence, you can think of `gam()` as a type of generalized modeling algorithm.
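Combining the two ideas might look like the sketch below: a smooth term from `s()` together with a non-normal family. The simulated binary outcome and all object names are assumptions for illustration.

```r
library(mgcv)

# Hypothetical binary outcome whose probability varies smoothly with x.
set.seed(4)
x <- runif(300, -3, 3)
y <- rbinom(300, size = 1, prob = plogis(2 * sin(x)))
d <- data.frame(x = x, y = y)

# family = binomial supplies the logit link; s(x) supplies the smooth term.
bgam <- gam(y ~ s(x), family = binomial, data = d)

p <- predict(bgam, newdata = data.frame(x = c(-1.5, 1.5)), type = "response")
p   # predicted probabilities, back on the 0 to 1 scale
```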

- Interpreting the results of GAMs is difficult. For GAMs as well as other modeling methods that we will encounter later, interpret the model through its predictions.

- At its heart, model fitting is an optimization algorithm. Each of
the methods above optimizes a likelihood function to find the “best
fitting” model.
- Recommended reading for the mathematics behind model fitting: The Elements of Statistical Learning
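To illustrate model fitting as optimization, the sketch below hands a likelihood function to R's general-purpose optimizer `optim()` and compares the result with `lm()`. This is only an illustration under assumed settings (the built-in `cars` data, a normal error model, BFGS as the search method). It is not how `lm()` computes its answer internally.

```r
# Negative log-likelihood of a straight-line model with normal errors.
# par = (intercept, slope, log of the standard deviation);
# working with the log keeps the standard deviation positive.
neg_log_lik <- function(par, x, y) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

start <- c(0, 0, log(sd(cars$dist)))
opt <- optim(start, neg_log_lik, x = cars$speed, y = cars$dist,
             method = "BFGS", control = list(maxit = 500, reltol = 1e-12))

mod <- lm(dist ~ speed, data = cars)
rbind(optim = opt$par[1:2], lm = coef(mod))           # nearly identical estimates
c(optim = -opt$value, lm = as.numeric(logLik(mod)))   # nearly identical log-likelihoods
```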

- Each of these methods finds the best *parametric* model to fit your data. It is hard to describe a model (which must describe all possible data points) without using a parametric distribution.
  - We will look at some *non-parametric* models in Chapter 6.

- This chapter covered the most popular (and the most accessible)
methods of model fitting, but many more modeling algorithms exist.
- Additional reference for other methods: Applied Predictive Modeling
- How do you know which method you should use with your data? Chapter 4 will begin with this question.