Statistical Analysis Overview
Major Entities
There are two major entities in Applied Statistics:
• Data
• Metadata
Data
Data comprise the raw sample information we collect as well as the results of analyzing the samples. Canonical notation presents these as multivariate variables in an array, e.g.:
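As a minimal sketch (the values below are randomly generated, purely for illustration), such an array can be held in R as a matrix whose columns are the variables and whose rows are the observations:

# Hypothetical multivariate sample: 10 observations of 3 variables
set.seed(1)
X <- matrix(rnorm(30), nrow = 10,
            dimnames = list(NULL, c("X1", "X2", "X3")))
head(X)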
A sequence or other collection of random variables is independent and identically distributed ("i.i.d.") if each random variable has the same probability distribution as the others and all are mutually independent.
The results of a function of a variable are also data:
R = F(X)
These are our test data for the R Project for Statistical Computing program that we are using:
We will use these for demonstration in the remainder of this paper.
Metadata
Metadata present the overlying processes and methods that interrelate the data. Processes, functions, and analytic parameters are metadata. The processes of statistical analysis comprise models and tests on both the model and its results.
Statistics
Descriptive Statistics
Descriptive Statistics comprise calculations that describe the sample set:
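For example, base R computes the usual descriptive statistics directly; the sample vector here is hypothetical:

# Hypothetical sample vector, used only to illustrate the calls
set.seed(1)
x <- rnorm(25, mean = 10, sd = 2)
mean(x)      # sample mean
median(x)    # sample median
sd(x)        # sample standard deviation
var(x)       # sample variance
summary(x)   # minimum, quartiles, median, mean, maximum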
Models
We build a mathematical model (a “regression”) to describe the relationships between input variables and the observed results. Most frequently the model is a linear regression of the form:
[R] = [A]·[X] + [B]
in standard matrix algebra notation.
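In R such a model is typically fitted with lm(); a minimal sketch, using hypothetical x and y:

# Hypothetical data, for illustration only
set.seed(1)
x <- rnorm(30)
y <- 2 * x + 1 + rnorm(30, sd = 0.5)

model <- lm(y ~ x)    # fit y = A*x + B
summary(model)        # coefficients, R-squared, significance tests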
Correlation analysis
The first step is to validate the model. We do
this with correlation analysis to determine how closely the model
matches the observed samples. The sample data are used to compute r,
the correlation coefficient for the sample. The symbol for the
population correlation coefficient is ρ, the Greek letter
"rho":
• ρ = population correlation
coefficient (unknown)
• r = sample correlation coefficient
(known; calculated from sample data)
If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is "significant". Significance is judged against a chosen significance level α, which corresponds to a confidence level of 1 − α:
α = 0.05 corresponds to 95% confidence and is the conventional threshold for calling a correlation “significant”.
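In R, cor() computes r and cor.test() tests whether it differs significantly from 0; a sketch with hypothetical data:

# Hypothetical paired observations, for illustration only
set.seed(1)
x <- rnorm(30)
y <- 2 * x + 1 + rnorm(30, sd = 0.5)
cor(x, y)        # sample correlation coefficient r
cor.test(x, y)   # tests H0: rho = 0 and reports a p-value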
Factor Analysis
The analyst must seek the causes if the model does not adequately match the samples. Factor Analysis is one tool for this purpose.
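One possible sketch uses R's built-in factanal(), which fits a maximum-likelihood factor model; the data below are hypothetical and constructed to contain two underlying factors:

# Hypothetical data: 6 observed variables driven by 2 latent factors
set.seed(1)
f1 <- rnorm(100)
f2 <- rnorm(100)
X <- cbind(X1 = f1 + rnorm(100, sd = 0.5),
           X2 = f1 + rnorm(100, sd = 0.5),
           X3 = f1 + rnorm(100, sd = 0.5),
           X4 = f2 + rnorm(100, sd = 0.5),
           X5 = f2 + rnorm(100, sd = 0.5),
           X6 = f2 + rnorm(100, sd = 0.5))
factanal(X, factors = 2)   # loadings show which variables share a factor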
Statistical Control
The process of statistical quality control is one of determining whether a process and its results are “under control” or “out of control”. A process that is operating with only chance causes of variation present is said to be in statistical control. A process that is operating in the presence of assignable causes is said to be an out-of-control process. A process is considered to be out of control when its results exceed the Upper Specification Limit (USL) or Lower Specification Limit (LSL):
Control Charts
The USL/LSL correspond to the Upper Control Limit (UCL) and Lower Control Limit (LCL) in a control chart. These limits typically are taken to be three standard deviations (3σ) above and below the process mean:
The latter chart is an example of the R program output.
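A minimal sketch of computing such control limits in base R, assuming a hypothetical vector of process measurements:

# Hypothetical process measurements, for illustration only
set.seed(1)
measurements <- rnorm(50, mean = 100, sd = 2)

xbar <- mean(measurements)
s    <- sd(measurements)
UCL  <- xbar + 3 * s   # Upper Control Limit
LCL  <- xbar - 3 * s   # Lower Control Limit

# Flag any points outside the control limits
which(measurements > UCL | measurements < LCL)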
Hypothesis Testing
A hypothesis test examines two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement being tested. Usually the null hypothesis is a statement of "no effect" or "no difference". The alternative hypothesis is the statement you want to be able to conclude is true.
Based on the sample data, the test determines whether to reject the null hypothesis (that is, to decide in favor of the alternative hypothesis). You use a “p-value” to make the determination. If the p-value is less than or equal to the level of significance α, then you can reject the null hypothesis.
p-value
The p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true. In layman's terms, it measures how surprising the observed result would be if the null hypothesis were true. So we reject the null hypothesis when the p-value is sufficiently small, that is, less than the significance level α, which is a cut-off point that you define.
A number of different tests compute the p-value; the appropriate one depends on the known characteristics of the sample sets at hand:
t-test
This delivers a test statistic t, from which the p-value is computed. A t-test is used for testing the mean of one population against a standard, or comparing the means of two populations, when you do not know the populations’ standard deviations and when you have a limited sample (n < 30).
R returns the following paired t-test result:
data: y and V2
t = -4.2636, df = 5, p-value = 0.007987
indicating a statistically significant difference between the means of y and V2.
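Output of this form is produced by t.test() with paired = TRUE; in the sketch below, y and V2 are filled with hypothetical values only so that the call runs:

# y and V2 must be paired numeric vectors of equal length;
# these hypothetical values stand in for the real test data
set.seed(1)
y  <- rnorm(6, mean = 10)
V2 <- y + rnorm(6, mean = 1)
t.test(y, V2, paired = TRUE)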
z-test
A z-test is used for testing the mean of a population versus a standard, or comparing the means of two populations, with large (n ≥ 30) samples, whether you know the population standard deviation or not.
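Base R has no built-in z-test function, but the statistic is easily computed directly; a sketch for testing a sample mean against a hypothesized value mu0, with hypothetical data:

# Hypothetical large sample, for illustration only
set.seed(1)
x   <- rnorm(50, mean = 10.4, sd = 2)
mu0 <- 10                          # hypothesized population mean

z <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_value <- 2 * pnorm(-abs(z))      # two-sided p-value from the normal distribution
z
p_value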
Other tests
We may wish to compare other statistics (characteristics) of different sample sets.
F-test
An F-test is used to decide if two populations’ variances are the same, assuming both populations are normally distributed. The samples can be any size.
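In R, var.test() performs this F-test; a sketch with two hypothetical samples:

# Two hypothetical samples, for illustration only
set.seed(1)
a <- rnorm(20, sd = 1)
b <- rnorm(25, sd = 1.5)
var.test(a, b)   # F-test of H0: the two variances are equal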
Levene's test
Levene's test is used to decide if two populations’ variances are the same, assuming both populations are continuous but not necessarily normally distributed.
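One common implementation is leveneTest() in the car add-on package (an assumption here; other packages offer equivalents); a sketch with hypothetical data:

library(car)   # provides leveneTest()

# Hypothetical data: one measurement column and one grouping factor
set.seed(1)
d <- data.frame(values = c(rnorm(20, sd = 1), rnorm(20, sd = 1.5)),
                group  = factor(rep(c("A", "B"), each = 20)))
leveneTest(values ~ group, data = d)   # H0: the group variances are equal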
Anderson–Darling test
The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
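One option in R is ad.test() from the nortest add-on package, which applies the Anderson–Darling test against the normal distribution (the package choice and the data here are assumptions):

library(nortest)   # provides ad.test()

# Hypothetical sample, for illustration only
set.seed(1)
x <- rnorm(40, mean = 5, sd = 2)
ad.test(x)   # H0: x is drawn from a normal distribution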
Confidence interval
A confidence interval is a range of likely values for a population parameter (such as the mean μ) that is based on sample data. Use a confidence interval to make inferences about one or more populations from sample data, or to quantify the precision of your estimate of a population parameter, such as μ.
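In R, t.test() reports a confidence interval for the mean as part of its output; a sketch with hypothetical data:

# Hypothetical sample, for illustration only
set.seed(1)
x <- rnorm(30, mean = 10, sd = 2)
t.test(x, conf.level = 0.95)$conf.int   # 95% confidence interval for the mean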
Test and CI for Two Variances
This calculates the ratio of the variances (s₁²/s₂²) of two sample sets.
This summarizes the various tests:
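Drawing only on the descriptions above:

Test                    Compares                                  Assumptions
t-test                  One mean vs. a standard, or two means     Population standard deviation unknown; small samples (n < 30)
z-test                  One mean vs. a standard, or two means     Large samples (n ≥ 30); standard deviation known or unknown
F-test                  Two populations’ variances                Both populations normally distributed; any sample size
Levene's test           Two populations’ variances                Continuous populations, not necessarily normal
Anderson–Darling test   A sample vs. a given distribution         The distribution to test against is specified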