Thursday, January 26, 2017

Statistical Analysis Overview

Statistical Analysis Overview [1]
Major Entities
There are two major entities in Applied Statistics:
• Data
• Metadata
Data comprise the raw sample information we collect as well as the results of analyzing the samples.
Canonical notation presents these as multivariate variables [2] in an array, e.g.:
A sequence or other collection of random variables is independent and identically distributed ("i.i.d.") if each random variable has the same probability distribution as the others and all are mutually independent. [3]
The results of a function of a variable are also data:
R = F(X)
These are our test data for the R Project for Statistical Computing [4] program that we are using:
We will use these for demonstration in the remainder of this paper.
Metadata present the overlying processes and methods that interrelate the data.
Processes, functions, and analytic parameters are metadata.
The processes of statistical analysis comprise models and tests on both the model and its results. [5]
Descriptive Statistics
Descriptive Statistics comprise calculations that describe the sample set:
We build a mathematical model (a “regression”) to describe the relationships between input variables and the observed results. Most frequently the model is a linear regression of the form:
[R] = [A]×[X] + [B]
in standard matrix algebra notation.
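For a single input variable the model reduces to r = a·x + b, and the coefficients can be estimated by ordinary least squares. The sketch below is a minimal illustration in plain Python rather than R; the function name fit_line and the sample values are hypothetical and are not the test data used in this paper.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b for one input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = y_bar - a * x_bar    # intercept
    return a, b

# Hypothetical sample that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
a, b = fit_line(xs, ys)
print(a, b)
```

With several input variables the same idea is expressed in matrix form, which is where a statistics package such as R earns its keep.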
Correlation Analysis [6]
The first step is to validate the model. We do this with correlation analysis to determine how closely the model matches the observed samples. The sample data are used to compute r, the correlation coefficient for the sample. The symbol for the population correlation coefficient is ρ, the Greek letter "rho":
• ρ = population correlation coefficient (unknown)
• r = sample correlation coefficient (known; calculated from sample data)
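The sample correlation coefficient r is computed directly from paired observations. The following is a minimal sketch in plain Python; pearson_r is a hypothetical helper name and the data are invented for illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Sample (Pearson) correlation coefficient of two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical, nearly linear sample.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
r = pearson_r(xs, ys)
print(round(r, 4))
```

A value of r near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little or none.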
If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is "significant".
Significance is judged against the significance level α, which is the complement of the confidence level:
α = 1 − (confidence level)
α = 0.05 corresponds to a 95% confidence level and is the conventional threshold for “significant”.
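One common way to test whether r differs significantly from 0 is to convert it to a t statistic with n − 2 degrees of freedom, t = r·√(n − 2) / √(1 − r²), and compare it with a critical value from a t table. The sketch below assumes hypothetical values of r and n; the critical value 2.776 is the two-sided table value for α = 0.05 and 4 degrees of freedom.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Hypothetical sample correlation and sample size.
r, n = 0.90, 6
t = correlation_t_statistic(r, n)

# Two-sided critical t for alpha = 0.05 and df = n - 2 = 4 (from a t table).
T_CRIT = 2.776
significant = abs(t) > T_CRIT
print(round(t, 4), significant)
```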
Factor Analysis
The analyst must seek the causes if the model does not adequately match the samples. Factor Analysis is one tool for this purpose.
Statistical Control
The process of statistical quality control [7] is one of determining whether a process and its results are “under control” or “out of control”.
A process that is operating with only chance causes of variation present is said to be in statistical control.
A process that is operating in the presence of assignable causes is said to be an out-of-control process. A process is judged to be out of control when its results fall outside the control limits on a control chart. These are distinct from the Upper Specification Limit (USL) and Lower Specification Limit (LSL), which are set by requirements rather than computed from the process:
Control Charts
The Upper Control Limit (UCL) and Lower Control Limit (LCL) in a control chart are computed from the process data. These limits typically are taken to be three standard deviations (3σ) above and below the process mean:

The latter chart is an example of the R program output.
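The 3σ rule can be sketched in a few lines. This is a simplified illustration in plain Python, not the R program referred to above: it assumes σ is estimated by the sample standard deviation of an in-control baseline, and the function names and data are hypothetical.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (LCL, center line, UCL) as mean -/+ k sample standard deviations."""
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center, center + k * sigma

def out_of_control(points, lcl, ucl):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(points) if x < lcl or x > ucl]

# Hypothetical in-control baseline and new observations.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
lcl, center, ucl = control_limits(baseline)
flagged = out_of_control([10.0, 10.3, 11.5, 9.9], lcl, ucl)
print(flagged)
```

Production control charts often estimate σ from subgroup ranges or moving ranges instead; the sample standard deviation is used here only to keep the sketch short.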
Hypothesis Testing [8]
A hypothesis test examines two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement being tested. Usually the null hypothesis is a statement of "no effect" or "no difference". The alternative hypothesis is the statement you want to be able to conclude is true.
Based on the sample data, the test determines whether to reject the null hypothesis in favor of the alternative hypothesis. You use a “p-value” to make the determination: if the p-value is less than or equal to the level of significance α, you can reject the null hypothesis.
The p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true. In layman's terms, it measures how surprising the observed data would be if the null hypothesis were true; it is not the probability that the null hypothesis is true. So we reject the null hypothesis when the p-value is sufficiently small, that is, less than or equal to the significance level α, which is a cut-off point that you define before running the test.
The p-value is computed from a test statistic, and there are a number of different tests to choose from depending on the known characteristics of the sample sets at hand: [9]


This delivers a test statistic t, from which the p-value is computed. A t-test is used for testing the mean of one population against a standard, or for comparing the means of two populations, when you do not know the populations’ standard deviation and have a limited sample (n < 30).
R returns the following paired t-test result:
data: y and V2
t = -4.2636, df = 5, p-value = 0.007987
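Because the p-value (0.007987) is below α = 0.05, we reject the null hypothesis: the mean difference between y and V2 is significantly different from zero. (A paired t-test compares means; it does not measure correlation.)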
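The paired t statistic is the mean of the pairwise differences divided by its standard error, with n − 1 degrees of freedom. The sketch below, in plain Python, uses hypothetical before/after values (not the y and V2 data above) chosen so that df = 5 as in the R output; 2.571 is the two-sided table value for α = 0.05 and df = 5.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = mean(d) / (stdev(d) / sqrt(n))
    return t, n - 1

# Hypothetical matched measurements.
before = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]
after = [14.0, 18.0, 12.0, 17.0, 15.0, 19.0]
t, df = paired_t(before, after)

# Two-sided critical t for alpha = 0.05 and df = 5 (from a t table).
T_CRIT = 2.571
reject_null = abs(t) > T_CRIT
print(round(t, 4), df, reject_null)
```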


A z-test is used for testing the mean of a population versus a standard, or comparing the means of two populations, with large (n ≥ 30) samples whether you know the population standard deviation or not.
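For a large sample, the z statistic and its two-sided p-value can be computed from the standard normal distribution. This is a minimal sketch in plain Python; one_sample_z is a hypothetical helper, the sample is synthetic, and the sample standard deviation stands in for σ when σ is not supplied.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_z(sample, mu0, sigma=None):
    """z statistic and two-sided p-value for H0: population mean == mu0."""
    n = len(sample)
    s = sigma if sigma is not None else stdev(sample)
    z = (mean(sample) - mu0) / (s / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Synthetic sample of 40 values cycling through 48..52 (mean exactly 50).
sample = [50 + (i % 5) - 2 for i in range(40)]
z, p = one_sample_z(sample, mu0=49.0)
print(round(z, 3), p < 0.05)
```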
Other tests
We may wish to compare other statistics (characteristics) of different sample sets.


An F-test is used to decide whether two populations’ variances are the same, assuming both populations are normally distributed. The samples can be any size.
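The F statistic is simply the ratio of the two sample variances. The sketch below follows the common convention of placing the larger variance in the numerator, which is an assumption of this example; the resulting ratio is then compared with a critical value from an F table for the two degrees of freedom. The data are hypothetical.

```python
from statistics import variance

def f_ratio(a, b):
    """Return (F, numerator df, denominator df), larger variance on top."""
    va, vb = variance(a), variance(b)
    if va >= vb:
        return va / vb, len(a) - 1, len(b) - 1
    return vb / va, len(b) - 1, len(a) - 1

# Hypothetical samples: the second has twice the spread of the first.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
f, df_num, df_den = f_ratio(a, b)
print(f, df_num, df_den)
```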

Levene's test

Levene's test is used to decide whether two populations’ variances are the same, assuming both populations are continuous but NOT normally distributed.
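Levene's statistic W is a one-way analysis of variance applied to the absolute deviations of each observation from its group mean. The following is a sketch of that calculation in plain Python with hypothetical data; levene_w is an illustrative name. It uses the group mean as the center (the Brown–Forsythe variant uses the median instead), and W is compared against an F table with k − 1 and N − k degrees of freedom.

```python
from statistics import mean

def levene_w(*groups):
    """Levene's W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group mean.
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical samples with different spreads.
g1 = [1.0, 2.0, 3.0, 4.0, 5.0]
g2 = [2.0, 4.0, 6.0, 8.0, 10.0]
w = levene_w(g1, g2)
print(round(w, 4))
```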

Anderson–Darling test [11]

The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. 
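As an illustration, the A² statistic for the case where the given distribution is a normal distribution can be computed as below. This sketch assumes the mean and standard deviation are estimated from the sample itself, and it stops at the raw statistic: the small-sample adjustment and critical values come from published tables and are not reproduced here. The function name and data are hypothetical.

```python
from math import log
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(sample):
    """A^2 statistic for normality, with mean and sigma estimated from the sample."""
    xs = sorted(sample)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))
    cdf = [dist.cdf(x) for x in xs]
    s = sum(
        (2 * i - 1) * (log(cdf[i - 1]) + log(1 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    return -n - s / n

# Hypothetical samples: evenly spaced normal quantiles vs. a heavily skewed set.
normal_like = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13, 21, 50]
a2_normal = anderson_darling_normal(normal_like)
a2_skewed = anderson_darling_normal(skewed)
print(a2_normal < a2_skewed)
```

A larger A² means the sample departs further from the hypothesized distribution.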

Confidence interval

A confidence interval is a range of likely values for a population parameter (such as the mean μ) that is based on sample data.
Use a confidence interval to make inferences about one or more populations from sample data, or to quantify the precision of your estimate of a population parameter, such as μ.
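For a large sample, a confidence interval for the mean μ can be sketched as the sample mean plus or minus a normal quantile times the standard error. The example below is plain Python with a synthetic sample; the normal approximation is an assumption of this sketch, and for small samples a t quantile would replace the normal one.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * stdev(sample) / sqrt(n)
    m = mean(sample)
    return m - half_width, m + half_width

# Synthetic sample of 40 values cycling through 48..52 (mean exactly 50).
sample = [50 + (i % 5) - 2 for i in range(40)]
lo, hi = mean_confidence_interval(sample)
print(round(lo, 3), round(hi, 3))
```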

Test and CI for Two Variances

This calculates the ratio of the variances (σ²) of two sample sets.
This summarizes the various tests: [12]

[1] © Privus Technologies LLC, P.O. Box 149, Newport, RI 02840, 2017
[2] Richard A. Johnson and Dean W. Wichern, Applied Multivariate Statistical Analysis, 6th Ed., ISBN 0-13-187715-1
[5] Douglas C. Montgomery, Introduction to Statistical Quality Control, 6th Edition, ISBN 978-0-470-16992-6
[6] Barbara Illowsky, “Testing the Significance of the Correlation Coefficient,” Collaborative Statistics, Boundless, 26 May 2016. Retrieved from
[7] Montgomery, Section 5.2

Thursday, January 12, 2017

Wireshark question: How to get Wireshark to see usbmon0?

I seem to have a habit of embarking on projects and getting to a point at which neither I nor anyone else seems able to find an answer.
Typically, I then go find a forum to post a question. This usually works. But this time it has not.
So, it occurred to me, since all my hours of Googling have failed me, perhaps I should try having Google do the work by posting the question to a blog. That way, anyone halfway interested in the components of my question will be directed here. It may be gratifying to them to know that someone else has the same problem. Perhaps then we can all contribute and we all win.
So here goes. I'll let you know how it works out... If you have thoughts please leave comments by clicking Comments at the bottom of the post. Thanks in advance.
How to get Wireshark to see usbmon0?
ls -l /dev/usbmon* shows
crw-r--r-- 1 root root 248, 0 Jan 10 14:50 /dev/usbmon0 
crw-r--r-- 1 root root 248, 1 Jan 10 14:50 /dev/usbmon1 
crw-r--r-- 1 root root 248, 2 Jan 10 14:50 /dev/usbmon2
but Wireshark only sees the latter two.
1. We have a piece of boat gear (RayMarine C120W) that bridges NMEA 0183 (ASCII) and EtherNet ("SeaTalk-HS") data for transmission to Windows software (RayTech Navigation System—RNS). The bridged data are wired to a DB-9F chassis connector near the laptop. We did have a Serial to Ethernet cable that connected to an older laptop running the software that had an Ethernet Socket. It worked fine.
2. We have not touched the boat wiring, but have lost the cable and necessarily moved the software to a new laptop (openSUSE Leap 42.1 Linux) that does not have an Ethernet socket, only USB.
3. We have a Gigaware 2603487 USB-A to Serial Cable. It is recognized by the laptop and connected to ttyUSB0. We can read that port at the command line interface—CLI—with cat /dev/ttyUSB0 and see the NMEA 0183 ASCII sentences but not the Ethernet stream. 
3.1 I understand that the EtherNet traffic is higher frequency and multiplexed, yada yada, so will address that aspect ("EtherNet over USB") in due course, but first we need Wireshark to see the basic USB data that we can see on the CLI (presumably on usbmon0) to ensure that Wireshark is reading the USB connection.
4. We have laboriously followed and many of its adherents, particularly — yes, they misspelled Wireshark. As a result we have:
4.1 Sorted out usbmon. It needs to be restarted after each reboot (modprobe usbmon), a PITA we'll address later.
4.2 Added the requisite capabilities to dumpcap
4.3 Changed permissions as directed (644) on /dev/usbmon*, added the wireshark group and added the user to the group.
4.4 Configured Wireshark for non-root use, but that shows the same results as running it as root (yes, I know, a no-no).
5. says the special "usbmon0" interface receives events from all USB buses.
5.1 After a fresh modprobe usbmon following a reboot, ls -l /dev/usbmon* returns
crw-r--r-- 1 root root 248, 0 Jan 10 14:50 /dev/usbmon0 
crw-r--r-- 1 root root 248, 1 Jan 10 14:50 /dev/usbmon1 
crw-r--r-- 1 root root 248, 2 Jan 10 14:50 /dev/usbmon2
so others (user, wireshark group) should be able to read.
5.2 So indeed usbmon0 exists but it does not appear in Wireshark. Wireshark only shows usbmon1 and usbmon2.  Neither has any interesting traffic, certainly not the ASCII stream that we can see on the CLI.
6. We have attempted using a USB-connected EtherNet-to-USB adapter with a Serial to Ethernet cable. It is recognized by the OS and Wireshark sees it as eth0 but there is zero traffic on it.
We can proceed further with EtherNet over USB once we have determined that Wireshark can read usbmon0 (ttyUSB0).
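For anyone wanting to reproduce the permission check from step 5.1 as the same user that runs Wireshark, here is a small Python sketch. It only reports what the current user can see and read under /dev/usbmon*; it does not explain why Wireshark omits usbmon0, and the function name list_capture_nodes is just an illustrative choice.

```python
import glob
import os
import stat

def list_capture_nodes(pattern="/dev/usbmon*"):
    """Report mode, device type, and readability of each matching node."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        st = os.stat(path)
        rows.append({
            "path": path,
            "mode": stat.filemode(st.st_mode),          # e.g. crw-r--r--
            "char_device": stat.S_ISCHR(st.st_mode),
            "readable": os.access(path, os.R_OK),       # for the current user
        })
    return rows

if __name__ == "__main__":
    for row in list_capture_nodes():
        print(row)
```

If all three nodes show up as readable here but usbmon0 is still missing from Wireshark's interface list, file permissions are probably not the cause.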
How to get Wireshark to see usbmon0?