Distributions
Chapter 2 - Distributions
Summary: Distributions reveal useful information, but the information is
probabilistic.
1. Distributions
- Models predict more than the range of values a group of points will
fall into. They also predict the distribution of the points.
- A distribution contains information about probabilities
associated with the points. For example, distributions reveal
which values are typical, which are rare, and which are
seemingly impossible. A distribution also reveals the “best
guess” for predicting future values, as well as how certain or
uncertain that guess may be.
- You can think of a distribution as the “boundary” conditions on
a variable’s values.
2. Visualizing Distributions
- The easiest way to understand a distribution is to visualize it.
library(mosaic); library(lattice)
dotPlot()
histogram()
freqpolygon()
densityplot()
- So far we have examined a continuous variable. A continuous variable
can take any value on a segment of the continuous number line. Other
variables are discrete: the values of a discrete variable fall in
a countable set of values that may or may not have an implied order.
- Visualize a discrete distribution with a bar graph.
bargraph()
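As a rough sketch of what these plots summarize, the base R snippet below computes the counts behind a histogram and a bar graph without drawing them. The `heights` and `shirt` variables are made-up data, and base R `hist()`, `density()`, and `table()` stand in for the mosaic/lattice functions named above.

```r
set.seed(1)                                   # make the simulated data reproducible

# Hypothetical continuous variable: 200 simulated heights in cm.
heights <- rnorm(200, mean = 170, sd = 10)

# A histogram bins a continuous variable and counts the points in each bin.
h <- hist(heights, breaks = 10, plot = FALSE)
h$counts                                      # one count per bin; plot(h) draws it

# A density estimate is a smoothed version of the same picture.
d <- density(heights)                         # plot(d) draws the curve

# Hypothetical discrete variable: a bar graph is a table of counts per value.
shirt <- sample(c("red", "green", "blue"), 200, replace = TRUE)
shirt_counts <- table(shirt)
shirt_counts                                  # barplot(shirt_counts) draws it
```

Every observation lands in exactly one bin or one category, so both sets of counts sum to the number of points.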
3. Statistics
- The most useful properties of a distribution can be described with
familiar statistics.
- Statistic defined - a number, computed with an algorithm, that
describes a group of individuals
- Motivating examples
- Typical values
median()
- “typical” value
mean()
- best guess
min()
- smallest value
max()
- largest value
- Describe a prediction as $\bar{Y} + \epsilon$, where $\epsilon$
denotes the structure of the distribution.
- Uncertainty
- The “more” a variable varies, the less certain you can be when
predicting its values. The spread of the distribution quantifies
this uncertainty.
range()
var()
sd()
IQR()
bwplot()
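The statistics in this section can be tried on a small vector. The sketch below uses a made-up sample `x` (an assumption for illustration, not data from the chapter) and treats the deviations from the mean as one simple reading of the $\epsilon$ term.

```r
# Hypothetical sample of seven measurements.
x <- c(2, 4, 4, 5, 7, 9, 11)

# Typical values
median(x)   # 5, the middle value
mean(x)     # 6, the "best guess"
min(x)      # 2
max(x)      # 11

# Uncertainty: how widely the values spread
range(x)    # 2 11
var(x)      # 10
sd(x)       # square root of the variance, about 3.16
IQR(x)      # 4, the width of the middle half of the data

# One reading of Y-bar + epsilon: each value is the mean plus a deviation.
eps <- x - mean(x)
mean(x) + eps   # recovers the original values
```

The deviations `eps` average to zero, so the mean carries the best guess and `eps` carries the spread around it.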
4. Probability
- What is the relationship between a distribution and an individual
case?
- You can use probability to reason from a distribution to individual
cases
- Probability defined as a frequency. Each variable takes each
value with a certain frequency.
- If the next observation is similar to the observations in your
distribution, it will have the same probability of taking each
value as the previous observations.
- Simulation. You can use the frequencies of a distribution to
simulate new values from the distribution, a technique known as
Monte Carlo simulation.
sample()
resample()
do()
- prediction intervals
quantile()
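The ideas in this section can be sketched in base R. The `observed` vector below is made-up data; `sample(..., replace = TRUE)` and `replicate()` are used here as base R stand-ins for mosaic's `resample()` and `do()`, which is an assumption about how the chapter would apply them.

```r
set.seed(42)                                   # reproducible simulation

# Hypothetical observations of a variable.
observed <- c(3, 5, 5, 6, 6, 6, 7, 7, 8, 10)

# Probability as frequency: the share of observations taking each value.
freq <- table(observed) / length(observed)
freq

# Monte Carlo simulation: draw new values with the observed frequencies.
new_values <- sample(observed, size = 1000, replace = TRUE)

# Repeat a whole resampling experiment many times, keeping one mean per run.
resampled_means <- replicate(1000, mean(sample(observed, replace = TRUE)))

# A rough 95% prediction interval: the middle 95% of the simulated values.
interval <- quantile(new_values, probs = c(0.025, 0.975))
interval
```

Because the simulation only redraws values that were already observed, the interval can never extend beyond the smallest and largest observations.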
5. Parametric distributions
- How do you know that you’ve collected enough data to have an
accurate picture of a variable’s distribution?
- In some situations, you can deduce the type of distribution that
your data follows
plotDist()
- Binomial distribution
- Normal distribution
- t distribution
- Chi-squared distribution
- F distribution
- Poisson distribution
- uniform
- etc.
- In these cases, calculations become simple
rnorm(), etc.
pnorm(), etc.
xpnorm(), etc.
dnorm(), etc.
qnorm(), etc.
- Before modern computers, statisticians relied heavily on parametric
distributions.
- A common pattern of reasoning was to
- Assume that data follows a distribution
- Try to disprove the assumption:
qqmath(), xqqmath(), qqplot()
- Goodness of fit tests
- Proceed as if the assumption were true
- But this reasoning has a weakness: it relies on an assumption
that has not been proven. Moreover, we’re tempted to believe
that the assumption is true because the assumption will help us.
- Unfortunately, tests that would disprove the assumption if
it were false are not very powerful
- Once you assume a distribution, all of your results are
“conditional” on the assumption being correct. This makes it
very difficult to interpret parametric probabilities.
- Modern computing makes it easier to avoid parametric assumptions
- It is easier to collect more data
- It is easier to calculate probabilities computationally, which
reduces the need to calculate them analytically (from a
distribution)
- However, it is important to understand parametric distributions
because many methods of data science rely on them
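To illustrate why calculations become simple once a distribution is assumed, the sketch below uses the base R helpers for the normal distribution named earlier in this section. The standard normal (mean 0, sd 1) is chosen only for illustration.

```r
set.seed(7)                             # reproducible random draws

# r: simulate values from the assumed distribution.
draws <- rnorm(5, mean = 0, sd = 1)

# p: probability of a value at or below a cutoff.
p <- pnorm(1.96)                        # about 0.975

# q: the inverse of p, the cutoff below which a given share of values falls.
q <- qnorm(0.975)                       # about 1.96

# d: height of the density curve at a point.
d <- dnorm(0)                           # 1 / sqrt(2 * pi), about 0.399

# A 95% interval follows directly from the assumed distribution.
qnorm(c(0.025, 0.975))                  # about -1.96 and 1.96
```

The other distributions listed above follow the same naming pattern in base R, for example `rbinom()`, `pt()`, and `qchisq()`. Every number produced this way is conditional on the assumed distribution being correct.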
6. Variation
- Distributions provide a tool for working with variation.
- Variation is what makes the world dynamic and uncertain.
- As a scientist, your goal is to understand and explain variation
(natural laws explain how a variable varies), which makes
distributions a very important tool.
- Variation defined - the tendency for a value to change each time we
measure it
- Variation is omnipresent
- Example of speed of light
- Causes of variation
- unidentified laws
- randomness
- Models partition the variation in a variable into explained
variation and unexplained variation.
- The distribution predicted by the model captures the unexplained
variation
- As you add components of the law to your model, you will
transfer unexplained variation to explained variation. As a
result, the distributions predicted by the model will become
narrower, and the individual predictions more certain.
- Eventually, a distribution will collapse to a single point as
you include the entire law in your model.
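The narrowing described above can be sketched with simulated data. The "law" `y = 2 * x + 3 * z` below is entirely hypothetical, and `lm()` is used only as a convenient way to fit a model with some or all of the law's components.

```r
set.seed(3)                                  # reproducible simulated data

# Two hypothetical causes and a complete "law" with no randomness.
x <- runif(100, min = 0, max = 10)
z <- runif(100, min = 0, max = 10)
y <- 2 * x + 3 * z

# No model: all of the variation in y is unexplained.
sd_none <- sd(y)

# Partial model: x explains some variation, the rest stays in the residuals.
sd_partial <- sd(resid(lm(y ~ x)))

# Full law: every component included, so the residual spread collapses.
sd_full <- sd(resid(lm(y ~ x + z)))

c(none = sd_none, partial = sd_partial, full = sd_full)
```

Each added component moves variation from the residuals into the model, so the three spreads shrink in order, and the last is zero up to floating-point error.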
7. Summary
- Scientists search for natural laws. A complete law provides a
function from a set of values to a single value. An incomplete law
is a model. It provides a function from a set of values to a
distribution.
- Distributions reveal useful information, but the information is
probabilistic.
- As models improve (include more components of the underlying law),
the distributions that they predict become narrower.
- As a result, their individual predictions become more certain.
- If a model includes all of the components of a law, its
distributions will collapse into single points. The model has
become a law, a function from a set of values to a single value.
- Unfortunately, the nature of distributions makes models a little more
difficult to work with than laws. Recall the two steps of the
scientific method.
- Identify laws as patterns in data
- Which function is “correct”?
- Test laws against new data
- Does the data “disprove” the model?
- Due to these problems, data scientists must modify the tactics they
use to implement the strategy of the scientific method. We’ll
examine these tactics in the rest of this Part.