**Statistical Analysis Overview**^{1}

**Major Entities**
There are two major entities in Applied Statistics:

• Data

• Metadata

**Data**
Data comprise the raw sample information we collect as well as the results of analyzing the samples.

Canonical notation presents these as multivariate variables^{2} in an array, *e.g.*:

A sequence or other collection of random variables is independent and identically distributed ("i.i.d.") if each random variable has the same probability distribution as the others and all are mutually independent.^{3}

The results of a function of a variable are also data:

**R = F(X)**

These are our test data for the **R Project for Statistical Computing**^{4} program that we are using:

We will use these for demonstration in the remainder of this paper.

**Metadata**
Metadata present the overlying processes and methods that interrelate the data.

Processes, functions, and analytic parameters are metadata.

The processes of statistical analysis comprise models and tests on both the model and its results.^{5}

**Statistics**

**Descriptive Statistics**
We build a mathematical model (a “regression”) to describe the relationships between input variables and the observed results. Most frequently the model is a linear regression of the form:

**[R] = [A] × [X] + [B]**

in standard matrix algebra notation.
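
As a hedged illustration of the one-variable case of this model, the following Python sketch estimates A and B by ordinary least squares (Python is used only to keep the example self-contained; the paper itself works in R). The function name `fit_line` and the sample values are illustrative assumptions, not data from this paper.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the model r = a*x + b."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = y_bar - a * x_bar    # intercept
    return a, b

# Hypothetical inputs X and observed results R.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(xs, ys)
print(f"R = {a:.2f} * X + {b:.2f}")
```

With several input variables the same idea applies, but A becomes a matrix and the fit is normally delegated to a linear algebra library.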

__Correlation analysis__^{6}

The first step is to validate the model. We do this with correlation analysis to determine how closely the model matches the observed samples. The sample data are used to compute *r*, the correlation coefficient for the sample. The symbol for the population correlation coefficient is *ρ*, the Greek letter "rho":

• *ρ* = population correlation coefficient (unknown)

• *r* = sample correlation coefficient (known; calculated from sample data)

If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is "significant".

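Significance is judged against a significance level *α*, which corresponds to a confidence level of 1 − *α*. With the common choice *α* = 0.05 (a 95% confidence level), the correlation is considered “significant” when the test's *p*-value is at most *α*.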
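
One common form of this test converts *r* into a Student's t statistic with n − 2 degrees of freedom. The Python sketch below assumes that form; the helper names, the sample values, and the tabulated critical value are illustrative assumptions rather than anything taken from this paper.

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_t(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Hypothetical paired sample (n = 10).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r = pearson_r(xs, ys)
t = correlation_t(r, len(xs))

# Two-sided critical value of Student's t for alpha = 0.05 and df = 8,
# taken from a standard t table.
T_CRIT = 2.306
significant = abs(t) > T_CRIT
print(f"r = {r:.4f}, t = {t:.2f}, significant: {significant}")
```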

__Factor Analysis__

The analyst must seek the causes if the model
does not adequately match the samples. Factor Analysis is one tool
for this purpose.

**Statistical Control**
The process of statistical quality control^{7} is one of determining whether a process and its results are “under control” or “out of control”.

A process that is operating with only chance causes of variation present is said to be **in statistical control**.

A process that is operating in the presence of assignable causes is said to be an **out-of-control** process. A process is considered to be out of control when its results fall above the Upper Control Limit (UCL) or below the Lower Control Limit (LCL).

__Control Charts__

A control chart plots the process results against the UCL and LCL. These limits typically are taken to be three standard deviations (3*σ*) above and below the process mean. They are computed from the process data and are distinct from the Upper and Lower Specification Limits (USL/LSL), which are set externally by design or customer requirements.

The latter chart is an example of the R program output.
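
The 3*σ* rule can be sketched in a few lines of Python. This is a simplified illustration: it estimates *σ* with the plain sample standard deviation of a baseline run, whereas production control charts often use range-based estimates. The function names and all values are hypothetical.

```python
from statistics import fmean, stdev

def control_limits(baseline, k=3.0):
    """Return (LCL, center line, UCL) as mean -/+ k standard deviations."""
    center = fmean(baseline)
    sigma = stdev(baseline)  # simplifying assumption: sample standard deviation
    return center - k * sigma, center, center + k * sigma

def out_of_control(values, lcl, ucl):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, v) for i, v in enumerate(values) if v < lcl or v > ucl]

# Hypothetical in-control baseline measurements.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
lcl, center, ucl = control_limits(baseline)

# Hypothetical new measurements checked against those limits.
flagged = out_of_control([10.1, 9.5, 10.7, 9.3], lcl, ucl)
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}, flagged = {flagged}")
```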

**Hypothesis Testing**^{8}
A hypothesis test examines two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement being tested. Usually the null hypothesis is a statement of "no effect" or "no difference". The alternative hypothesis is the statement you want to be able to conclude is true.

Based on the sample data, the test determines whether to reject the null hypothesis (and so to conclude in favor of the alternative hypothesis). You use a “*p*-value” to make the determination. If the *p*-value is less than or equal to the level of significance *α*, then you can reject the null hypothesis.

*p*-value
The *p*-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true. In layman's terms, it measures how surprising the observed data would be if the null hypothesis were true; it is not the probability that rejecting the null hypothesis is a mistake. So we reject the null hypothesis when the *p*-value is sufficiently small, that is, less than or equal to the significance level *α*, which is a cut-off point that you define.
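
To make the "equal to or more extreme" wording concrete, here is a small Python sketch for a case where the probability can be counted directly: testing whether a coin is fair after seeing 9 heads in 10 flips. The scenario, the function name, and the choice of *α* = 0.05 are illustrative assumptions.

```python
from math import comb

def binomial_two_sided_p(heads, flips):
    """P(outcome at least as far from flips/2 as observed | fair coin)."""
    observed_dev = abs(heads - flips / 2)
    extreme = [k for k in range(flips + 1)
               if abs(k - flips / 2) >= observed_dev]
    return sum(comb(flips, k) for k in extreme) / 2 ** flips

ALPHA = 0.05                       # significance level chosen in advance
p = binomial_two_sided_p(9, 10)    # outcomes 0, 1, 9, 10 heads count as extreme
reject_null = p <= ALPHA
print(f"p-value = {p:.4f}, reject null hypothesis: {reject_null}")
```

Here 22 of the 1024 equally likely outcomes are at least as extreme as the one observed, so the *p*-value is 22/1024 ≈ 0.0215 and the null hypothesis of a fair coin is rejected at *α* = 0.05.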
The *p*-value is calculated from a test statistic, and there are a number of different tests for computing it depending on the known characteristics of the sample sets at hand:^{9}

#### t-test^{10}

This delivers a test statistic *t* from which the *p*-value is derived. A t-test is used for testing the mean of one population against a standard or comparing the means of two populations if you do not know the populations’ standard deviation and when you have a *limited* sample (n < 30).

R returns the following paired t-test result:

data: y and V2

t = -4.2636, df = 5, p-value = 0.007987

indicating a statistically significant difference between the means of y and V2 at *α* = 0.05.
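
For readers who want to see where such a *t* and df come from, the Python sketch below computes a paired t statistic by hand. The six pairs are hypothetical stand-ins (they are not the paper's y and V2, so the numbers differ from the R output), and the critical value is an assumption taken from a standard t table.

```python
from math import sqrt
from statistics import fmean, stdev

def paired_t(first, second):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical matched measurements (6 pairs, so df = 5).
first = [12.0, 14.0, 11.0, 15.0, 13.0, 12.0]
second = [14.0, 15.0, 14.0, 16.0, 16.0, 14.0]
t, df = paired_t(first, second)

# Two-sided critical value of Student's t for alpha = 0.05 and df = 5,
# taken from a standard t table.
T_CRIT = 2.571
reject_null = abs(t) > T_CRIT
print(f"t = {t:.4f}, df = {df}, reject null hypothesis: {reject_null}")
```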

#### z-test

A z-test is used for testing the mean of a population versus a standard, or comparing the means of two populations, with *large* (n ≥ 30) samples whether you know the population standard deviation or not.

__Other tests__

We may wish to compare other statistics
(characteristics) of different sample sets.

#### F-test

An F-test is used to decide if 2 populations’
variances are the same, assuming both populations are normally
distributed. The samples can be any size.
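
A minimal Python sketch of the statistic behind this test follows. It only computes the variance ratio and its degrees of freedom; turning that into a *p*-value requires the F distribution from a table or a statistics library. The function name and both samples are hypothetical.

```python
from statistics import variance

def f_statistic(a, b):
    """Return (F, df_numerator, df_denominator), larger sample variance on top."""
    var_a, var_b = variance(a), variance(b)
    if var_a >= var_b:
        return var_a / var_b, len(a) - 1, len(b) - 1
    return var_b / var_a, len(b) - 1, len(a) - 1

# Hypothetical samples from two processes.
sample_a = [4.0, 6.0, 8.0, 10.0, 12.0]   # sample variance 10.0
sample_b = [7.0, 8.0, 9.0, 8.0, 8.0]     # sample variance 0.5
f, df_num, df_den = f_statistic(sample_a, sample_b)
print(f"F = {f:.1f} on ({df_num}, {df_den}) degrees of freedom")
```

Putting the larger variance in the numerator is a convention that keeps F at or above 1, so only the upper tail of the F distribution needs to be consulted.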

#### Levene's test

Levene's test is used to decide if two populations’ variances are the same when the populations are continuous but cannot be assumed to be normally distributed.
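
The statistic can be sketched in Python as follows. This version uses absolute deviations from each group mean (a median-based variant also exists), and the resulting W is compared against an F distribution with (k − 1, N − k) degrees of freedom. The function name and samples are hypothetical.

```python
from statistics import fmean

def levene_w(*groups):
    """Levene's W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # z[i][j] = |y_ij - mean of group i|
    z = []
    for g in groups:
        center = fmean(g)
        z.append([abs(y - center) for y in g])
    z_means = [fmean(zi) for zi in z]
    z_grand = fmean([v for zi in z for v in zi])
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical samples with visibly different spread.
sample_a = [4.0, 6.0, 8.0, 10.0, 12.0]
sample_b = [7.0, 8.0, 9.0, 8.0, 8.0]
w = levene_w(sample_a, sample_b)
print(f"W = {w:.4f}")
```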

#### Anderson–Darling test^{11}

The Anderson–Darling test is
a statistical test of whether a given sample of data is
drawn from a given probability distribution.
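
As a rough illustration, the Python sketch below computes the Anderson–Darling statistic A² for a sample against a fully specified normal distribution. A larger A² means a poorer fit; deciding significance requires published critical values, which are not included here. The function name and the sample are hypothetical.

```python
from math import log
from statistics import NormalDist

def anderson_darling(sample, dist):
    """A^2 statistic of `sample` against a fully specified distribution."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        # Pair the i-th smallest value with the i-th largest value.
        total += (2 * i - 1) * (log(dist.cdf(x)) + log(1.0 - dist.cdf(xs[n - i])))
    return -n - total / n

# Hypothetical sample, compared with a plausible and an implausible model.
sample = [-1.2, -0.5, 0.0, 0.4, 1.1]
a2_good = anderson_darling(sample, NormalDist(mu=0.0, sigma=1.0))
a2_bad = anderson_darling(sample, NormalDist(mu=5.0, sigma=1.0))
print(f"A^2 vs N(0, 1): {a2_good:.3f}; A^2 vs N(5, 1): {a2_bad:.3f}")
```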

#### Confidence interval

A confidence interval is a range of likely values for a population parameter (such as the mean *μ*) that is based on sample data.
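
A minimal Python sketch of a 95% confidence interval for *μ* is shown here. It assumes a large sample and uses the normal quantile; for small samples a Student's t quantile would be the more appropriate choice. The function name and the synthetic sample are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(n)
    center = fmean(sample)
    return center - margin, center + margin

# Synthetic sample of 35 values cycling through 47..53 (sample mean 50).
sample = [50 + (i % 7) - 3 for i in range(35)]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```
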

Use a confidence interval to make inferences about one or more populations from sample data, or to quantify the precision of your estimate of a population parameter, such as *μ*.

#### Test and CI for Two Variances

This tests whether the variances (*σ*²) of two sample sets are equal and gives a confidence interval for their ratio.

This summarizes the various tests:^{12}
1 © Privus Technologies LLC, P.O. Box 149, Newport, RI 02840, 2017

2 Richard A. Johnson and Dean W. Wichern, *Applied Multivariate Statistical Analysis*, 6th Ed., ISBN 0-13-187715-1

5 Douglas C. Montgomery, *Introduction to Statistical Quality Control*, 6th Ed., ISBN 978-0-470-16992-6

6 Barbara Illowsky, “Testing the Significance of the Correlation Coefficient.” *Collaborative Statistics*, Boundless, 26 May 2016. Retrieved from https://www.boundless.com/users/235422/textbooks/collaborative-statistics/linear-regression-and-correlation-13/testing-the-significance-of-the-correlation-coefficient-181/testing-the-significance-of-the-correlation-coefficient-424-15972/

7 Montgomery, Section 5.2