Summary: Distributions reveal useful information, but the information is probabilistic.

- Models predict more than the range of values a group of points will
fall into. They also predict the *distribution* of the points.
- A distribution contains information about probabilities associated with the points. For example, distributions reveal which values are typical, which are rare, and which are seemingly impossible. A distribution also reveals the “best guess” for predicting future values, as well as how certain or uncertain that guess may be.
- You can think of a distribution as the “boundary” conditions on a variable’s values.

- The easiest way to understand a distribution is to visualize it.
`library(mosaic); library(lattice)`

`dotPlot()`

`histogram()`

`freqpolygon()`

`densityplot()`
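
As a minimal sketch of the plotting functions listed above, the following uses only `lattice` and the built-in `faithful` data set. The choice of data set and the number of bins are assumptions made for illustration. `dotPlot()` and `freqpolygon()` from `mosaic` are called with the same kind of one-sided formula.

```r
library(lattice)

# Built-in data: waiting times between Old Faithful eruptions (minutes).
# Using this data set is an assumption made for illustration.
data(faithful)

# Histogram: share of observations falling in each bin.
hist_plot <- histogram(~ waiting, data = faithful, nint = 20)

# Density plot: a smoothed estimate of the same distribution.
dens_plot <- densityplot(~ waiting, data = faithful, plot.points = FALSE)

# Lattice plots are objects; print them to draw them.
print(hist_plot)
print(dens_plot)
```

Both plots show the same variable, so comparing them is a quick way to see how binning and smoothing each shape the picture.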

- So far we have examined a continuous variable. A continuous variable
can take any value on a segment of the continuous number line. Other
variables are *discrete*: the values of a discrete variable fall in a countable set of values that may or may not have an implied order.
- Visualize a discrete distribution with a bar graph.
`bargraph()`
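
A minimal sketch of a discrete distribution, assuming the built-in `mtcars` data and base R's `barplot()` in place of `bargraph()` (the `mosaic` call would take a formula such as `~ cyl`):

```r
# Built-in data: number of cylinders in each car in `mtcars`
# (an illustrative choice, not from the original text).
cyl_counts <- table(mtcars$cyl)

# Relative frequency of each discrete value.
cyl_props <- prop.table(cyl_counts)
print(cyl_props)

# One bar per distinct value; bar height is the number of observations.
barplot(cyl_counts, xlab = "cylinders", ylab = "count")
```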

- The most useful properties of a distribution can be described with
familiar statistics.
- Statistic defined - a number computed with an algorithm, that describes a group of individuals
- Motivating examples

- Typical values
`median()`: “typical” value

`mean()`: best guess

`min()`: smallest value

`max()`: largest value

- Describe a prediction as $\bar{Y} + \epsilon$, where
$\epsilon$ denotes the structure of the distribution.
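
The statistics above can be computed directly in base R. This is a sketch on the built-in `faithful` data (an assumed example); reading $\epsilon$ as each observation's deviation from the mean is one possible interpretation, not necessarily the author's.

```r
# Built-in data: Old Faithful waiting times (illustrative choice).
waiting <- faithful$waiting

typical <- c(
  median = median(waiting),  # middle value: half the data lie on each side
  mean   = mean(waiting),    # arithmetic average, the usual "best guess"
  min    = min(waiting),     # smallest observed value
  max    = max(waiting)      # largest observed value
)
print(typical)

# Deviations of each observation from the mean, playing the role of epsilon.
epsilon <- waiting - mean(waiting)
```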

- Uncertainty
- The “more” a variable varies, the less certain you can be when
predicting its values. The spread of the distribution quantifies
this uncertainty.
`range()`

`var()`

`sd()`

`IQR()`

`bwplot()`
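
A short sketch of the spread statistics, again assuming the built-in `faithful` data as the example:

```r
library(lattice)

waiting <- faithful$waiting

spread <- c(
  range_width = diff(range(waiting)),  # range() returns c(min, max)
  variance    = var(waiting),          # squared deviations, n - 1 denominator
  sd          = sd(waiting),           # square root of the variance
  iqr         = IQR(waiting)           # width of the middle 50% of the data
)
print(spread)

# Box-and-whisker plot of the same variable.
box_plot <- bwplot(~ waiting, data = faithful)
print(box_plot)
```

Larger values of any of these statistics indicate a wider distribution, and so a less certain prediction.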

- What is the relationship between a distribution and an individual case?
- You can use probability to reason from a distribution to individual
cases
- Probability defined as a frequency. Each variable takes each value with a certain frequency.
- If the next observation is similar to the observations in your distribution, it will have the same probability of taking each value as the previous observations did.
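
Treating probability as a frequency can be sketched with a frequency table. The `mtcars` data and the variable `gear` are assumptions chosen for illustration.

```r
# Relative frequencies of a discrete variable serve as probability estimates.
# `mtcars$gear` is an illustrative choice of data.
gear_freq <- prop.table(table(mtcars$gear))
print(gear_freq)

# Estimated probability that the next (similar) car has 4 gears.
p_four <- gear_freq[["4"]]
print(p_four)
```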

- Simulation. You can use the frequencies of a distribution to
simulate new values from the distribution, a technique known as
Monte Carlo simulation.
`sample()`

`resample()`

`do()`
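
A minimal simulation sketch in base R. Here `replicate()` with `sample(..., replace = TRUE)` stands in for the `do()` and `resample()` pattern from `mosaic`; the data set and the number of repetitions are assumptions.

```r
set.seed(1)  # for reproducibility of this illustration only

waiting <- faithful$waiting

# Draw 5 new values with the observed frequencies (sampling with replacement).
new_values <- sample(waiting, size = 5, replace = TRUE)

# Repeat a resampling experiment many times: resample the whole data set
# and record the mean each time.
sim_means <- replicate(1000, mean(sample(waiting, replace = TRUE)))

# The simulated means vary around the observed mean.
print(summary(sim_means))
```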

- prediction intervals
`quantile()`
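
One way to build an empirical prediction interval is to take the central quantiles of the observed values. The sketch below assumes a 95% level and the built-in `faithful` data.

```r
waiting <- faithful$waiting

# Empirical 95% prediction interval: the central 95% of observed values.
pred_int <- quantile(waiting, probs = c(0.025, 0.975))
print(pred_int)

# Share of the observed data that falls inside the interval.
coverage <- mean(waiting >= pred_int[1] & waiting <= pred_int[2])
print(coverage)
```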

- How do you know that you’ve collected enough data to have an accurate picture of a variable’s distribution?
- In some situations, you can deduce the type of distribution that
your data follows
`plotDist()`

- Binomial distribution
- Normal distribution
- t distribution
- Chi-squared distribution
- F distribution
- Poisson distribution
- Uniform distribution
- etc.

- In these cases, calculations become simple
`rnorm()`, etc.

`pnorm()`, etc.

`xpnorm()`, etc.

`dnorm()`, etc.

`qnorm()`, etc.
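
The `r`, `p`, `d`, and `q` prefixes follow the same pattern for every distribution family in R. A sketch for the normal case, with an assumed mean of 100 and standard deviation of 15 (`xpnorm()` from `mosaic` reports the same probability as `pnorm()` and also draws a picture):

```r
set.seed(1)

# Assumed parameters for illustration: mean 100, standard deviation 15.
mu <- 100
sigma <- 15

draws   <- rnorm(5, mean = mu, sd = sigma)      # r: random draws
p_below <- pnorm(115, mean = mu, sd = sigma)    # p: P(X <= 115)
height  <- dnorm(100, mean = mu, sd = sigma)    # d: density at x = 100
cutoff  <- qnorm(0.975, mean = mu, sd = sigma)  # q: value with 97.5% of the mass below it

# q and p are inverses of each other.
stopifnot(isTRUE(all.equal(pnorm(cutoff, mean = mu, sd = sigma), 0.975)))
```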

- Before modern computers, statisticians relied heavily on parametric
distributions.
- A common pattern of reasoning was to
- Assume that data follows a distribution
- Try to disprove the assumption:
`qqmath()`, `xqqmath()`, `qqplot()`

- Goodness of fit tests

- Proceed as if the assumption were true
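
The "try to disprove" step can be sketched with a quantile-quantile plot and one goodness-of-fit test. The simulated data and the choice of the Shapiro-Wilk test are assumptions made for illustration; other tests would fit the same pattern.

```r
library(lattice)
set.seed(1)

# Simulated data for illustration: 50 draws from a normal distribution.
x <- rnorm(50, mean = 10, sd = 2)

# Visual check: points near a straight line are consistent with normality.
qq_plot <- qqmath(~ x, distribution = qnorm)
print(qq_plot)

# Goodness-of-fit test: Shapiro-Wilk, null hypothesis = data are normal.
# A small p-value is evidence against the assumption; a large one is
# not proof that the assumption holds.
sw <- shapiro.test(x)
print(sw$p.value)
```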

- But this reasoning has a weakness: it relies on an assumption
that has not been proven. Moreover, we’re tempted to believe
that the assumption is true because the assumption will help us.
- Unfortunately, tests that would disprove the assumption if it were false are not very powerful
- Once you assume a distribution, all of your results are “conditional” on the assumption being correct. This makes it very difficult to interpret parametric probabilities.
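
The power of a test can be estimated by simulation. The sketch below is one hypothetical setup: small samples are drawn from a distribution that is known not to be normal, and the code counts how often a normality test rejects. The sample size, the t distribution with 5 degrees of freedom, the 5% level, and the Shapiro-Wilk test are all assumptions.

```r
set.seed(1)

# Illustrative setup (all choices here are assumptions): samples of size 20
# from a t distribution with 5 degrees of freedom, which is NOT normal.
n_sims <- 2000
rejected <- replicate(n_sims, {
  x <- rt(20, df = 5)
  shapiro.test(x)$p.value < 0.05
})

# Estimated power: share of simulations in which the false assumption
# of normality was rejected at the 5% level.
power_est <- mean(rejected)
print(power_est)
```

An estimate well below 1 means the test often fails to detect that the assumption is false.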

- Modern computing makes it easier to avoid parametric assumptions
- It is easier to collect more data
- It is easier to calculate probabilities computationally, which reduces the need to calculate them analytically (from a distribution)
- However, it is important to understand parametric distributions because many methods of data science rely on them
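
To illustrate the difference between analytical and computational probabilities, this sketch computes the same tail probability both ways. The standard normal distribution, the threshold 1.5, and the number of draws are assumptions.

```r
set.seed(1)

# Analytical: P(Z > 1.5) for a standard normal variable.
p_analytic <- pnorm(1.5, lower.tail = FALSE)

# Computational: simulate many draws and count how often the event occurs.
z <- rnorm(100000)
p_simulated <- mean(z > 1.5)

print(c(analytic = p_analytic, simulated = p_simulated))
```

The simulated value only approximates the analytical one, and the approximation tightens as the number of draws grows.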

- Distributions provide a tool for working with variation.
- Variation is what makes the world dynamic and uncertain.
- As a scientist, your goal is to understand and explain variation (natural laws explain how a variable varies), which makes distributions a very important tool.

- Variation defined - the tendency for a value to change each time we measure it
- Variation is omnipresent
- Example of speed of light

- Causes of variation
- unidentified laws
- randomness

- Models partition the variation in a variable into explained
variation and unexplained variation.
- The distribution predicted by the model captures the unexplained variation
- As you add components of the law to your model, you transfer unexplained variation to explained variation. As a result, the distributions predicted by the model become narrower, and the individual predictions more certain.
- Eventually, a distribution will collapse to a single point as you include the entire law in your model.
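
The narrowing described above can be sketched with nested linear models. The `mtcars` data, the response `mpg`, and the predictors `wt` and `hp` are assumptions chosen for illustration; linear regression is only one kind of model.

```r
# Illustrative data: fuel economy (mpg), weight (wt), horsepower (hp).
# Model 0: no explanatory variables; all variation is unexplained.
fit0 <- lm(mpg ~ 1, data = mtcars)

# Model 1: add weight as one component of the "law".
fit1 <- lm(mpg ~ wt, data = mtcars)

# Model 2: add horsepower as well.
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Spread of the unexplained variation (residuals) under each model.
resid_sd <- c(
  none  = sd(residuals(fit0)),
  wt    = sd(residuals(fit1)),
  wt_hp = sd(residuals(fit2))
)
print(resid_sd)
```

With least squares, adding a predictor to a nested model cannot increase the residual spread, so the three values never increase from left to right.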

- Scientists search for natural laws. A complete law provides a function from a set of values to a single value. An incomplete law is a model: it provides a function from a set of values to a distribution.
- Distributions reveal useful information, but the information is probabilistic.
- As models improve (include more components of the underlying law),
the distributions that they predict become narrower.
- As a result, their individual predictions become more certain.
- If a model includes all of the components of a law, its distributions will collapse into single points. The model has become a law, a function from a set of values to a single value.

- Unfortunately, the nature of distributions makes models a little more
difficult to work with than laws. Recall the two steps of the
scientific method.
- Identify laws as patterns in data
- Which function is “correct”?

- Test laws against new data
- Does the data “disprove” the model?

- Due to these problems, data scientists must modify the tactics they use to implement the strategy of the scientific method. We’ll examine these tactics in the rest of this Part.