Statistics

Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: 'There are three kinds of lies: lies, damned lies and statistics.'

Mark Twain

Once we have actually done experiments, we need a mathematical way of assessing the reliability of our results. It is this need to assess reliability that requires scientists to use statistics.

Scientists use three terms to describe the reliability of their data: accuracy, precision and error.

Accuracy is how close a measured value is to the true, or accepted, value, while precision is how carefully a single measurement was made or how reproducible measurements in a series are. The terms accuracy and precision are not synonymous, but they are related, as we will see. Error is anything that lessens a measurement's accuracy or its precision.

Accuracy and Precision
Region 10 ESC (YouTube)

How can we determine Accuracy?

The easiest way to show how accurate a measured value may be is to calculate the percent error using the known or true value:

$$\text{Percent Error} = {\text{measured value} - \text{true value} \over \text{true value}} \times 100 \%$$

Percent error is often shown as an absolute value, but in the laboratory experiments we will be conducting the sign is kept. With the formula above, the percent error is positive if the experimental value is greater than the known or true value and negative if the experimental value is less than the known or true value. In this manner, the sign tells us whether the experimental value overshot or undershot the true value.
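As a minimal sketch, the percent error formula can be written as a small Python function. The function name `percent_error` and the density figures are hypothetical, chosen only for illustration. With the formula as written, a measurement below the true value comes out negative.

```python
def percent_error(measured: float, true_value: float) -> float:
    """Signed percent error of a measured value relative to the true value."""
    if true_value == 0:
        # The formula divides by the true value, so zero is not allowed.
        raise ValueError("percent error is undefined when the true value is zero")
    return (measured - true_value) / true_value * 100.0


# Hypothetical example: a density measured as 2.65 g/mL
# against an accepted value of 2.70 g/mL.
error = percent_error(2.65, 2.70)
print(f"{error:.2f} %")  # -1.85 %
```

Taking `abs(error)` would give the unsigned form that is often reported instead.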

When the true value is not known, no conclusion about accuracy may be made using a percent error. In this case, standards must be run or other statistical methods based on the precision can be used.

Repeating the same measurement many times generally improves the likelihood that the average value of the data set is close to the true value, because random errors tend to cancel out when many values are averaged.

The arithmetic average, or mean ($\bar{x}$), is calculated simply by adding all of the values of a repeated measurement ($x$) together and then dividing by the number of measurements added ($N$).

$$\bar{x} = {\sum x \over N}$$
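The same calculation can be sketched in a few lines of Python. The function name `arithmetic_mean` and the mass readings are hypothetical, used only to illustrate the formula.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    if not values:
        raise ValueError("need at least one measurement")
    return sum(values) / len(values)


# Hypothetical repeated mass readings, in grams.
readings = [10.1, 9.9, 10.0, 10.2, 9.8]
print(round(arithmetic_mean(readings), 3))  # 10.0
```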

The precision of the data that you collect is best described by a statistic called the Standard Deviation (S or s).

The Standard Deviation explained
Blahzinga (YouTube)

Standard deviation (s) is a measure of the amount of variation, or dispersion, in a set of data values; in other words, how spread out the values are. A small standard deviation means that the values are closely grouped together and therefore more precise. A large standard deviation means that the values are widely scattered and therefore less precise.

To calculate the standard deviation you need a sample of values and the average (or mean) of those values.
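To make the link between spread and precision concrete, here is a small sketch that compares two made-up data sets with the same mean. It uses `statistics.stdev` from the Python standard library, which computes the sample standard deviation; the data and variable names are hypothetical.

```python
import statistics

# Hypothetical replicate measurements: both sets average 10.0,
# but the second set is far more scattered.
tight = [9.9, 10.0, 10.1, 10.0, 10.0]
loose = [8.0, 12.0, 9.0, 11.0, 10.0]

s_tight = statistics.stdev(tight)
s_loose = statistics.stdev(loose)

print(f"tight: {s_tight:.3f}, loose: {s_loose:.3f}")  # tight: 0.071, loose: 1.581
```

The tightly grouped set has the smaller standard deviation, so it is the more precise of the two.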

\begin{align} s & = \sqrt {\sum (x - \bar{x})^2 \over N - 1} \\ \text{where} \\ s & = \text{the standard deviation} \\ x & = \text{each value in the sample} \\ \bar{x} & = \text{the mean of the values} \\ N & = \text{the number of values (the sample size)} \end{align}

Example

Here is some sample data:

$$4\quad2\quad5\quad8\quad6$$

Now, let's calculate the standard deviation:

1. Calculate the mean:

\begin{align} \bar{x} & = {\sum x \over N} \\ & = {x_1 + x_2 + \dotsb + x_N \over N} \\ & = {4 + 2 + 5 + 8 + 6 \over 5} \\ & = 5 \end{align}

2. Subtract the mean from each value in the sample:

\begin{align} x_1 & - \bar{x} = 4 - 5 = - 1 \\ x_2 & - \bar{x} = 2 - 5 = - 3 \\ x_3 & - \bar{x} = 5 - 5 = 0 \\ x_4 & - \bar{x} = 8 - 5 = 3 \\ x_5 & - \bar{x} = 6 - 5 = 1 \end{align}

3. Square each of these differences and add the squares together:

\begin{align} \sum (x - \bar{x})^2 & = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dotsb + (x_N - \bar{x})^2 \\ & = (-1)^2 + (-3)^2 + 0^2 + 3^2 + 1^2 \\ & = 20 \end{align}

4. Divide this sum by the number of values minus one and take the square root:

\begin{align} s & = \sqrt {\sum (x - \bar{x})^2 \over N - 1} \\ & = \sqrt {20 \over 5 - 1} \\ & \approx 2.24 \end{align}

The standard deviation of the data is 2.24.
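The four steps of this example can be checked with a short Python sketch. The variable names are illustrative only, and the last line compares the hand calculation against `statistics.stdev` from the standard library, which uses the same N - 1 divisor.

```python
import math
import statistics

data = [4, 2, 5, 8, 6]

# Step 1: calculate the mean.
mean = sum(data) / len(data)

# Step 2: subtract the mean from each value.
deviations = [x - mean for x in data]

# Step 3: square each difference and add the squares together.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by N - 1 and take the square root.
s = math.sqrt(sum_of_squares / (len(data) - 1))

print(mean, sum_of_squares, round(s, 2))  # 5.0 20.0 2.24

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(s, statistics.stdev(data))
```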