Robust online calculation of means, variances, and covariance

nbi_stat defines a number of methods for robust calculation of the sample means, variances, and covariance.

These algorithms provide

  • A single pass through data to calculate mean and (co)variance. This also allows for observations to be discarded as soon as we are done with them - that is, we do not need to keep the whole sample around to calculate the statistics.

  • Stablity with respect to rounding errors normally endured by single-pass calculations of the (co)variance.

The “challenge”

A typical single pass calculation of the variance is done by calculating the sum and sum of squares, and then evaluate

\[s^{2}_{x} = \frac{N}{N-\delta}\left[\frac{1}{N}\sum_{i=1}^{N} x_{i}^{2} - \left(\frac{1}{N}\sum_{i=1}^{N} x_{i}\right)^2\right]\]

where we may sum incrementally. However, although mathematically correct, suffers from catastrophic cancellation due to finite machine precision, as illustrated in the example below

>>> def var(x,ddof=1):
...     "Variance by difference - ddof=1 for unbiased"
...     n      = len(x)
...     sum_x  = sum(x)
...     sum_x2 = sum([xx**2 for xx in x])
...     avg_x  = sum_x / n
...     avg_x2 = sum_x2 / n
...     return n / (n-ddof) * (avg_x2 - avg_x**2)
... 
>>> x = [4,7,13,16]
>>> y = [xx + 1e8 for xx in x]
>>> 
>>> print(f'Variance of x: {var(x)}')
Variance of x: 30.0
>>> print(f'Variance of y: {var(y)}')
Variance of y: 29.333333333333332

Since for \(c\) a constant

\[\mathbb{V}[x+c] = \mathbb{V}[x]\]

we see that the calculation above fails. Note, that the data above is sorted, meaning we will gain nothing from the traditional heuristic of sorting the data first. A typical remedy is to require a two-pass calculation

\[s_{x}^{2} = \frac{1}{N-\delta}\sum_{i=1}^{N} (x_{i} - \overline{x})^{2} \qquad\overline{x} = \frac{1}{N}\sum_{i=1}^{N} x_{i}\]

where we first calculate the sample average, and then calculate the sample average over the square residuals \(r_i=x_i-\overline{x}\).

>>> def var(x,ddof=1):
...     "Variance by difference - ddof=1 for unbiased"
...     n      = len(x)
...     sum_x  = sum(x)
...     avg_x  = sum_x / n
...     sum_r2 = sum([(xx-avg_x)**2 for xx in x])
...     return 1 / (n-ddof) * sum_r2
... 
>>> x = [4,7,13,16]
>>> y = [xx + 1e8 for xx in x]
>>> 
>>> print(f'Variance of x: {var(x)}')
Variance of x: 30.0
>>> print(f'Variance of y: {var(y)}')
Variance of y: 30.0

This calculation works as we expect, but requires two passes. This is how, for example, NumPy calculates the variance, and thus require that the whole sample is avaiable

The algorithms present here alleviate both of these problems

  • No catastrophic cancellation since numbers of equal magnitude are added at all times

  • Single pass over data, allowing us to free up resources as quickly as possible.

Unweighted samples

For unweighted samples of 1 or more dimensions, we use the function nbi_stat.welford_update().

For example, for a 1 dimensional sample, we can do

Traceback (most recent call last):
  File "<input>", line 3, in <module>
  File "../nbi_stat.py", line 1005, in welford_update
    cv = zeros(covar.shape)
AttributeError: 'int' object has no attribute 'shape'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: 'int' object is not subscriptable

Note, nbi_stat.welford_update() (and it’s sibling nbi_stat.west_update()) always returns arrays, even if the states were initially declared as scalars.

The function nbi_stat.welford_init() can help us declare our data structure. It needs the number of dimensions, and whether to calculate the covariance matrix or just the variances.

For example, for a 3-dimensional sample where we calculate means and variances, we can do

>>> state = nbi_stat.welford_init(3)
>>> for _ in range(100):
...     state = nbi_stat.welford_update(numpy.random.normal(size=3),*state)
... 
>>> print(state)
(array([-0.22974946,  0.04478776,  0.07586245]), array([1.17592224, 1.05133256, 1.17284443]), 100)

To also calculate the covariances, we must pass covar=True to nbi_stat.welford_init().

>>> state = nbi_stat.welford_init(3,covar=True)
>>> for _ in range(100):
...     state = nbi_stat.welford_update(numpy.random.normal(size=3),*state)
... 
>>> print(state)
(array([-0.22974946,  0.04478776,  0.07586245]), array([[1.17592224, 0.02753376, 0.05772808],
       [0.02753376, 1.05133256, 0.04300817],
       [0.05772808, 0.04300817, 1.17284443]]), 100)

Note, nbi_stat.welford_init() always returns 3 variables

  • mean - the means

  • cv - (co)variances

  • n - Counter

If \(N\) is the number of dimensions (variables) declared, then mean is \(N\) vector, while cv is either \(N\) or \(N\times N\) if covar is False or True, respectively. n is always scalar.

Merging statistics

Suppose we have calculated the means and (co)variances of sample \(A\) and \(B\), and now we want to combine these to into one. We can do that using nbi_stat.welford_merge()

>>> 
>>> a = nbi_stat.welford_init(3,covar=True)
>>> b = nbi_stat.welford_init(3,covar=True)
>>> 
>>> for _ in range(100):
...     a = nbi_stat.welford_update(numpy.random.normal(size=3),*a)
...     b = nbi_stat.welford_update(numpy.random.normal(size=3),*b)
... 
>>> print(a)
(array([-0.16815221, -0.036316  , -0.09186943]), array([[ 1.23957299, -0.00317658,  0.0066787 ],
       [-0.00317658,  0.85353364,  0.11352925],
       [ 0.0066787 ,  0.11352925,  1.23276861]]), 100)
>>> print(b)
(array([-0.07224508, -0.14834497, -0.02915475]), array([[1.1096925 , 0.10538353, 0.20331521],
       [0.10538353, 1.21522231, 0.03345735],
       [0.20331521, 0.03345735, 1.08457403]]), 100)
>>> 
>>> c = nbi_stat.welford_merge(*a,*b)
>>> print(c)
(array([-0.12019865, -0.09233048, -0.06051209]), array([[1.17693229, 0.04841738, 0.10650065],
       [0.04841738, 1.0375156 , 0.07173684],
       [0.10650065, 0.07173684, 1.1596546 ]]), 200)

Object oriented interface

The class nbi_stat.Welford provides an object oriented interface to the Welford algorithm.

>>> 
>>> w1 = nbi_stat.Welford(3,covar=True)
>>> w2 = nbi_stat.Welford(3,covar=True)
>>> 
>>> for _ in range(100):
...     w1 += numpy.random.normal(size=3)
...     w2 += numpy.random.normal(size=3)
... 
>>> print(f'Sample 1 {w1.n} (mean,sem,covariance)\n{w1}')
Sample 1 100 (mean,sem,covariance)
[[-0.16815221  0.11133611  1.23957299 -0.00317658  0.0066787 ]
 [-0.036316    0.09238688 -0.00317658  0.85353364  0.11352925]
 [-0.09186943  0.11103011  0.0066787   0.11352925  1.23276861]]
>>> print(f'Sample 2 {w2.n} (mean,sem,covariance)\n{w2}')
Sample 2 100 (mean,sem,covariance)
[[-0.07224508  0.10534194  1.1096925   0.10538353  0.20331521]
 [-0.14834497  0.11023712  0.10538353  1.21522231  0.03345735]
 [-0.02915475  0.10414288  0.20331521  0.03345735  1.08457403]]
>>> 
>>> w3 = w1 + w2
>>> 
>>> print(f'Merged sample {w3.n} (mean,sem,covariance)\n{w3}')
Merged sample 200 (mean,sem,covariance)
[[-0.12019865  0.07671155  1.17693229  0.04841738  0.10650065]
 [-0.09233048  0.07202484  0.04841738  1.0375156   0.07173684]
 [-0.06051209  0.07614639  0.10650065  0.07173684  1.1596546 ]]

Note, we can query the object for the standard error on the mean (SEM) using the method nbi_stat.Welford.sem().

Weighted samples

Weighted samples are samples of 1 or more dimensions where each observation has a weight. Weights come in different flavours though, and the nbi_stat.west_update() supports them all.

Here, we will use the function nbi_stat.west_init() to initialise our data structure, but as we saw above, it is also possible to declare the variables directly.

The function nbi_stat.west_init() always returns 5 variables

  • mean - means

  • cv - (co)variances

  • sumw - sum of weights

  • sumw2 - sum of square weights (possibly None)

  • summw - sum of mean weights (possibly None)

and these are always arrays. The dimensionality of these variables depends on the various parameters given to :nbi_stat:`west_init`

Frequency and non-frequency weights

Weights primarily comes in 2 flavours: frequency and non-frequency weights.

  • Frequency weights are positive whole numbers and represents the number of times a given value as seen. That is,

    >>> mean,var,sumw,_ = nbi_stat.west_update(1,3,mean,var,sumw)
    

    and

    >>> mean,var,n = nbi_stat.welford_update(1,mean,var,n)
    >>> mean,var,n = nbi_stat.welford_update(1,mean,var,n)
    >>> mean,var,n = nbi_stat.welford_update(1,mean,var,n)
    

    are equivalent. This kind of weights is assumed if no sumw2 is None. However, here we will use nbi_stat.west_init() to help us out.

    >>> state = nbi_stat.west_init(1,frequency=True)
    >>> for _ in range(100):
    ...     state = nbi_stat.west_update(numpy.random.normal(),
    ...                                  numpy.random.poisson(3),*state)
    ... 
    >>> mean, var, sumw, *_ = state
    >>> print(f'Mean        {mean[0]:.3f}\n'
    ...       f'Variance    {var[0]:.3f}\n'
    ...       f'Sum weights {sumw[0]:.3f}')
    Mean        -0.098
    Variance    1.113
    Sum weights 311.000
    
  • Non-frequency weights are positive, real numbers. These may represent, for example efficiencies, acceptance, fiducial values, or other such quantities. This interpretation is triggered by passing a sumw2 argument to nbi_stat.west_update() or frequency=False to nbi_stat.west_init()

      File "<input>", line 1
        f'Sum squared-weights {sumw2[0]:.3f}')
        ^
    IndentationError: unexpected indent
    

Multi-dimensional samples

We can also use nbi_stat.welford_update() and nbi_stat.west_update() to calculate the individual component (weighted) means and (co)variances of a multi-dimensional sample. To do that, we simply pass an array of observations and appropriate sized states.

We can, as for the 1-dimensional case, distinguish between frequency and non-frequency weights by not specifying or specifying the sumw2 argument or passing the frequency keyword to nbi_stat.west_init().

Observation and component weights

However, for multi-dimensional weighted samples, we also (that is, in addition) distinguish between observation and component (or element) weights.

  • Observation weights are either frequency or non-frequency weights that apply to the “event” as a whole. That is, if all components of a given “event” have the same weight, we pass a single value for the weight.

    Let’s take an example of frequency, observation weights.

    >>> state = nbi_stat.west_init(2,frequency=True,component=False)
    >>> 
    >>> for _ in range(100):
    ...     state = nbi_stat.west_update(numpy.random.normal(size=2),
    ...                                  numpy.random.poisson(3),*state)
    ... 
    >>> mean, var, sumw, *_ = state
    >>> 
    >>> numpy.set_printoptions(precision=3, suppress=True)
    >>> print(f'Mean         {mean}\n'
    ...       f'Variance:    {var}\n'
    ...       f'Sum weights: {sumw[0]:.3f}')
    Mean         [-0.056  0.039]
    Variance:    [0.964 1.153]
    Sum weights: 299.000
    

    To enable non-frequency weights we need to pass sumw2. To make life easier, nbi_stat.west_init() helps us get the data structure right

    >>> state = nbi_stat.west_init(2,frequency=False,component=False)
    >>> 
    >>> for _ in range(100):
    ...     state = nbi_stat.west_update(numpy.random.normal(size=2),
    ...                                  numpy.random.random(),*state)
    ... 
    >>> mean, var, sumw, sumw2, _ = state
    >>> numpy.set_printoptions(precision=3, suppress=True)
    >>> print(f'Mean                 {mean}\n'
    ...       f'Variance:            {var}\n'
    ...       f'Sum weights:         {sumw[0]:.3f}\n'
    ...       f'Sum squared-weights: {sumw2[0]:.3f}')
    Mean                 [-0.037  0.046]
    Variance:            [1.129 1.204]
    Sum weights:         46.243
    Sum squared-weights: 29.393
    
  • Component weights are either frequency or non-frequency weights that apply to the individual components (or elements) of an “event”. These are then passed as an array to nbi_stat.west_update().

    First, let’s do frequency, component weights. We need to pass an appropriate sized state for sumw to enable this, and we of course need to give as many weights as there are components in the observation.

    >>> state = nbi_stat.west_init(2,frequency=True,component=True)
    >>> 
    >>> for _ in range(100):
    ...     state = nbi_stat.west_update(numpy.random.normal(size=2),
    ...                                  numpy.random.poisson(3,size=2),*state)
    ... 
    >>> mean, var, sumw, *_ = state
    >>> numpy.set_printoptions(precision=3, suppress=True)
    >>> print(f'Mean         {mean}\n'
    ...       f'Variance:    {var}\n'
    ...       f'Sum weights: {sumw}')
    Mean         [-0.199  0.075]
    Variance:    [1.074 0.839]
    Sum weights: [313 342]
    

    For non-frequency, component weights we also need to pass an appropriately sized sumw2

    >>> state = nbi_stat.west_init(2,frequency=False,component=True)
    >>> for _ in range(100):
    ...     state = nbi_stat.west_update(numpy.random.normal(size=2),
    ...                                  numpy.random.random(size=2),*state)
    ... 
    >>> mean, var, sumw, sumw2, _ = state
    >>> numpy.set_printoptions(precision=3, suppress=True)
    >>> print(f'Mean                 {mean}\n'
    ...       f'Variance:            {var}\n'
    ...       f'Sum weights:         {sumw}\n'
    ...       f'Sum squared-weights: {sumw2}')
    Mean                 [-0.273 -0.089]
    Variance:            [1.527 0.787]
    Sum weights:         [50.67  48.474]
    Sum squared-weights: [33.09  31.808]
    

Thus, for multi-dimensional samples, we have the following matrix of weight types

Weight type combinations for multi-dimensional samples

Frequency

Non-frequency

Observation

  • sumw scalar

  • sumw2=None

  • sumw scalar

  • sumw2 scalar

Component

  • sumw array

  • sumw2=None

  • sumw array

  • sumw2 array

Covariance

Here, we distinguish between observation and component weights, as well as frequency and non-frequency weights (see also Weight type combinations for multi-dimensional samples and surrounding discussion).

Let’s take as the single example non-frequency, component weights and calculate the means and covariance. For this particular case, we need to pass summw - an array of the size of the input. Remember, that the variance is of each component is the corresponding diagonal element of the covariance matrix

>>> state = nbi_stat.west_init(2,covar=True,frequency=False,component=True)
>>> for _ in range(100):
...     state = nbi_stat.west_update(numpy.random.normal(size=2),
...                                  numpy.random.random(size=2),
...                                  *state)
... 
>>> mean, covar, sumw, sumw2, summw = state
>>> numpy.set_printoptions(precision=3, suppress=True)
>>> print(f'Means: {mean}\n'
...       f'Covariances:\n{covar}\n'
...       f'Sum weights:\n{sumw}\n'
...       f'Sum squared weights:\n{sumw2}\n'
...       f'Sum mean weights:\n{summw}')
Means: [-0.273 -0.089]
Covariances:
[[ 1.698 -0.055]
 [-0.051  0.785]]
Sum weights:
[[33.09  22.565]
 [22.565 31.808]]
Sum squared weights:
[[19.216  8.703]
 [ 8.703 19.053]]
Sum mean weights:
[50.67  48.474]

Note, for component weights, the sum of weights and squared-weights are \(N\times N\) matrices, while summw is an \(N\) vector.

Merging sample statistics

Suppose we have calculated the means and (co)variances of sample \(A\) and \(B\), and now we want to combine these to into one. We can do that using nbi_stat.west_merge()

Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: west_merge() missing 1 required positional argument: 'ddof'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
NameError: name 'c' is not defined

Object oriented interface

The class nbi_stat.West provides an object oriented interface to the West algorithm.

>>> 
>>> opts = dict(frequency=True,component=False,covar=True)
>>> w1 = nbi_stat.West(3,**opts)
>>> w2 = nbi_stat.West(3,**opts)
>>> 
>>> def obs():
...     return numpy.random.normal(size=3),numpy.random.poisson(3)
... 
>>> for _ in range(100):
...     w1 += obs()
...     w2 += obs()
... 
>>> print(f'Sample 1 {w1.sumw} (mean,sem,covariance)\n{w1}')
Sample 1 [297] (mean,sem,covariance)
[[-0.09971652  0.05150481  0.78786544 -0.12346418 -0.02198087]
 [ 0.09953701  0.05626693 -0.12346418  0.94029228  0.13652254]
 [ 0.25089228  0.05895792 -0.02198087  0.13652254  1.03238271]]
>>> print(f'Sample 2 {w2.sumw} (mean,sem,covariance)\n{w2}')
Sample 2 [323] (mean,sem,covariance)
[[-0.10590161  0.06253401  1.2630922   0.15755895 -0.24979371]
 [ 0.01239253  0.05816508  0.15755895  1.09276589 -0.14111137]
 [-0.06433251  0.05355152 -0.24979371 -0.14111137  0.92628827]]
>>> 
>>> w3 = w1 + w2
>>> 
>>> print(f'Merged sample {w3.sumw} (mean,sem,covariance)\n{w3}')
Merged sample [620] (mean,sem,covariance)
[[-0.10293875  0.04086668  1.0354528   0.02307432 -0.14017745]
 [ 0.05413755  0.04059281  0.02307432  1.02162132 -0.00126032]
 [ 0.08667033  0.04019928 -0.14017745 -0.00126032  1.00190891]]

Note, we can query the object for the standard error on the mean (SEM) using the method nbi_stat.West.sem().