Robust online calculation of means, variances, and covariance¶
nbi_stat defines a number of methods for robust calculation of
the sample means, variances, and covariances.
These algorithms provide

- A single pass through the data to calculate the mean and (co)variance. This also allows observations to be discarded as soon as we are done with them; that is, we do not need to keep the whole sample around to calculate the statistics.
- Stability with respect to the rounding errors normally incurred by single-pass calculations of the (co)variance.
The “challenge”¶
A typical single-pass calculation of the variance is done by calculating the sum and sum of squares, and then evaluating

\[s^2 = \frac{n}{n-\mathrm{ddof}}\left(\overline{x^2} - \overline{x}^2\right),\]

where we may sum incrementally. However, although mathematically correct, this expression suffers from catastrophic cancellation due to finite machine precision, as illustrated in the example below
>>> def var(x,ddof=1):
...     "Variance by difference - ddof=1 for unbiased"
...     n = len(x)
...     sum_x = sum(x)
...     sum_x2 = sum([xx**2 for xx in x])
...     avg_x = sum_x / n
...     avg_x2 = sum_x2 / n
...     return n / (n - ddof) * (avg_x2 - avg_x**2)
...
>>> x = [4,7,13,16]
>>> y = [xx + 1e8 for xx in x]
>>>
>>> print(f'Variance of x: {var(x)}')
Variance of x: 30.0
>>> print(f'Variance of y: {var(y)}')
Variance of y: 29.333333333333332
Since, for \(c\) a constant,

\[\mathrm{Var}(x+c) = \mathrm{Var}(x),\]

we see that the calculation above fails. Note that the data above is sorted, meaning we gain nothing from the traditional heuristic of sorting the data first. A typical remedy is to require a two-pass calculation

\[s^2 = \frac{1}{n-\mathrm{ddof}}\sum_i r_i^2,\]

where we first calculate the sample average \(\overline{x}\), and then sum the squared residuals \(r_i = x_i - \overline{x}\).
>>> def var(x,ddof=1):
...     "Variance by residuals - ddof=1 for unbiased"
...     n = len(x)
...     sum_x = sum(x)
...     avg_x = sum_x / n
...     sum_r2 = sum([(xx - avg_x)**2 for xx in x])
...     return 1 / (n - ddof) * sum_r2
...
>>> x = [4,7,13,16]
>>> y = [xx + 1e8 for xx in x]
>>>
>>> print(f'Variance of x: {var(x)}')
Variance of x: 30.0
>>> print(f'Variance of y: {var(y)}')
Variance of y: 30.0
This calculation works as we expect, but requires two passes over the data. This is how, for example, NumPy calculates the variance, and it thus requires that the whole sample is available.

The algorithms presented here alleviate both of these problems:

- No catastrophic cancellation, since numbers of equal magnitude are added at all times.
- A single pass over the data, allowing us to free up resources as quickly as possible.
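The idea behind such single-pass calculations is Welford's update: the running mean and the running sum of squared residuals are updated one observation at a time, so only quantities of comparable magnitude are ever combined. A minimal, self-contained sketch of the technique (the helper names welford_step and online_var are illustrative, not part of the nbi_stat API):

```python
def welford_step(x, mean, m2, n):
    """One Welford update step.

    mean is the running mean, m2 the running sum of squared
    residuals, and n the number of observations seen so far."""
    n += 1
    delta = x - mean            # residual w.r.t. the old mean
    mean += delta / n           # move the mean toward x
    m2 += delta * (x - mean)    # old residual times new residual
    return mean, m2, n

def online_var(xs, ddof=1):
    """Single-pass mean and variance via Welford's algorithm."""
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        mean, m2, n = welford_step(x, mean, m2, n)
    return mean, m2 / (n - ddof)

# The shifted data from the example above no longer loses precision
x = [4, 7, 13, 16]
y = [xx + 1e8 for xx in x]
print(online_var(x)[1], online_var(y)[1])  # 30.0 30.0
```

Unlike the sum-of-squares formula, the update never subtracts two large, nearly equal numbers.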
Unweighted samples¶
For unweighted samples of 1 or more dimensions, we use the function nbi_stat.welford_update().
For example, for a 1-dimensional sample we can declare the state directly as three scalar variables and do

>>> mean, var, n = 0, 0, 0
>>> for _ in range(100):
...     mean, var, n = nbi_stat.welford_update(numpy.random.normal(),mean,var,n)
...
Note, nbi_stat.welford_update() (and its sibling nbi_stat.west_update()) always returns arrays, even if the states were initially declared as scalars.
The function nbi_stat.welford_init()
can help us declare our
data structure. It needs the number of dimensions, and whether to
calculate the covariance matrix or just the variances.
For example, for a 3-dimensional sample where we calculate means and variances, we can do
>>> state = nbi_stat.welford_init(3)
>>> for _ in range(100):
... state = nbi_stat.welford_update(numpy.random.normal(size=3),*state)
...
>>> print(state)
(array([0.22974946, 0.04478776, 0.07586245]), array([1.17592224, 1.05133256, 1.17284443]), 100)
To also calculate the covariances, we must pass covar=True to nbi_stat.welford_init().
>>> state = nbi_stat.welford_init(3,covar=True)
>>> for _ in range(100):
... state = nbi_stat.welford_update(numpy.random.normal(size=3),*state)
...
>>> print(state)
(array([0.22974946, 0.04478776, 0.07586245]), array([[1.17592224, 0.02753376, 0.05772808],
[0.02753376, 1.05133256, 0.04300817],
[0.05772808, 0.04300817, 1.17284443]]), 100)
Note, nbi_stat.welford_init() always returns 3 variables

- mean - the means
- cv - the (co)variances
- n - the counter

If \(N\) is the number of dimensions (variables) declared, then mean is an \(N\)-vector, while cv is either an \(N\)-vector or an \(N\times N\) matrix if covar is False or True, respectively. n is always a scalar.
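Since the states can also be declared directly, the shapes involved can be sketched with plain NumPy (an illustration of the data structures described above, not a call into nbi_stat):

```python
import numpy as np

N = 3  # number of dimensions (variables)

# Variances only (covar=False): mean and cv are both N-vectors
mean, cv, n = np.zeros(N), np.zeros(N), 0
print(mean.shape, cv.shape)  # (3,) (3,)

# Full covariance (covar=True): cv becomes an N x N matrix
mean, cv, n = np.zeros(N), np.zeros((N, N)), 0
print(mean.shape, cv.shape)  # (3,) (3, 3)
```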
Merging statistics¶
Suppose we have calculated the means and (co)variances of samples \(A\) and \(B\), and now we want to combine these into one. We can do that using nbi_stat.welford_merge()
>>>
>>> a = nbi_stat.welford_init(3,covar=True)
>>> b = nbi_stat.welford_init(3,covar=True)
>>>
>>> for _ in range(100):
... a = nbi_stat.welford_update(numpy.random.normal(size=3),*a)
... b = nbi_stat.welford_update(numpy.random.normal(size=3),*b)
...
>>> print(a)
(array([0.16815221, 0.036316 , 0.09186943]), array([[ 1.23957299, 0.00317658, 0.0066787 ],
[0.00317658, 0.85353364, 0.11352925],
[ 0.0066787 , 0.11352925, 1.23276861]]), 100)
>>> print(b)
(array([0.07224508, 0.14834497, 0.02915475]), array([[1.1096925 , 0.10538353, 0.20331521],
[0.10538353, 1.21522231, 0.03345735],
[0.20331521, 0.03345735, 1.08457403]]), 100)
>>>
>>> c = nbi_stat.welford_merge(*a,*b)
>>> print(c)
(array([0.12019865, 0.09233048, 0.06051209]), array([[1.17693229, 0.04841738, 0.10650065],
[0.04841738, 1.0375156 , 0.07173684],
[0.10650065, 0.07173684, 1.1596546 ]]), 200)
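Under the hood, such a merge is the standard pairwise combination of Welford states (due to Chan et al.). A self-contained one-dimensional sketch, written in terms of the sum of squared residuals \(M_2\) rather than the variance (the helper names are illustrative, not the nbi_stat implementation):

```python
def welford_merge_1d(mean_a, m2_a, n_a, mean_b, m2_b, n_b):
    """Combine two Welford states (mean, sum of squared residuals, count)."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n              # weighted combination of means
    m2 = m2_a + m2_b + delta**2 * n_a * n_b / n  # cross term for the mean shift
    return mean, m2, n

def stats(data):
    """Two-pass mean and sum of squared residuals, for checking."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean)**2 for x in data)
    return mean, m2, n

# Merging the two halves reproduces the full-sample statistics
xs = [4.0, 7.0, 13.0, 16.0]
mean, m2, n = welford_merge_1d(*stats(xs[:2]), *stats(xs[2:]))
print(mean, m2 / (n - 1))  # 10.0 30.0
```

The cross term \(\delta^2 n_A n_B / n\) accounts for the two halves having different means, which is why simply adding the two variances would be wrong.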
Object oriented interface¶
The class nbi_stat.Welford provides an object-oriented interface to the Welford algorithm.
>>>
>>> w1 = nbi_stat.Welford(3,covar=True)
>>> w2 = nbi_stat.Welford(3,covar=True)
>>>
>>> for _ in range(100):
... w1 += numpy.random.normal(size=3)
... w2 += numpy.random.normal(size=3)
...
>>> print(f'Sample 1 {w1.n} (mean,sem,covariance)\n{w1}')
Sample 1 100 (mean,sem,covariance)
[[0.16815221 0.11133611 1.23957299 0.00317658 0.0066787 ]
[0.036316 0.09238688 0.00317658 0.85353364 0.11352925]
[0.09186943 0.11103011 0.0066787 0.11352925 1.23276861]]
>>> print(f'Sample 2 {w2.n} (mean,sem,covariance)\n{w2}')
Sample 2 100 (mean,sem,covariance)
[[0.07224508 0.10534194 1.1096925 0.10538353 0.20331521]
[0.14834497 0.11023712 0.10538353 1.21522231 0.03345735]
[0.02915475 0.10414288 0.20331521 0.03345735 1.08457403]]
>>>
>>> w3 = w1 + w2
>>>
>>> print(f'Merged sample {w3.n} (mean,sem,covariance)\n{w3}')
Merged sample 200 (mean,sem,covariance)
[[0.12019865 0.07671155 1.17693229 0.04841738 0.10650065]
[0.09233048 0.07202484 0.04841738 1.0375156 0.07173684]
[0.06051209 0.07614639 0.10650065 0.07173684 1.1596546 ]]
Note, we can query the object for the standard error on the mean (SEM) using the method nbi_stat.Welford.sem().
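Judging from the tables printed above (e.g. variance 1.23957299 with \(n=100\) alongside SEM 0.11133611), the SEM of each component is \(\sqrt{s^2/n}\). A small sketch of that calculation (assumed formula; the helper sem here is hypothetical, not nbi_stat's):

```python
import math

def sem(variance, n):
    """Standard error of the mean: square root of variance over count."""
    return math.sqrt(variance / n)

# Cross-check against the first component of sample 1 above
print(round(sem(1.23957299, 100), 8))  # 0.11133611
```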
Weighted samples¶
Weighted samples are samples of 1 or more dimensions where each observation has a weight. Weights come in different flavours, though, and nbi_stat.west_update() supports them all.
Here, we will use the function nbi_stat.west_init()
to
initialise our data structure, but as we saw above, it is also
possible to declare the variables directly.
The function nbi_stat.west_init() always returns 5 variables

- mean - the means
- cv - the (co)variances
- sumw - the sum of weights
- sumw2 - the sum of squared weights (possibly None)
- summw - the sum of mean weights (possibly None)

and these are always arrays. The dimensionality of these variables depends on the various parameters given to nbi_stat.west_init().
Frequency and non-frequency weights¶
Weights primarily come in 2 flavours: frequency and non-frequency weights.
Frequency weights are positive whole numbers and represent the number of times a given value was seen. That is,
>>> mean,var,sumw,_ = nbi_stat.west_update(1,3,mean,var,sumw)
and
>>> mean,var,n = nbi_stat.welford_update(1,mean,var,n)
>>> mean,var,n = nbi_stat.welford_update(1,mean,var,n)
>>> mean,var,n = nbi_stat.welford_update(1,mean,var,n)
are equivalent. This kind of weights is assumed if sumw2 is None. However, here we will use nbi_stat.west_init() to help us out.

>>> state = nbi_stat.west_init(1,frequency=True)
>>> for _ in range(100):
...     state = nbi_stat.west_update(numpy.random.normal(),
...                                  numpy.random.poisson(3),*state)
...
>>> mean, var, sumw, *_ = state
>>> print(f'Mean {mean[0]:.3f}\n'
...       f'Variance {var[0]:.3f}\n'
...       f'Sum weights {sumw[0]:.3f}')
Mean 0.098
Variance 1.113
Sum weights 311.000
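The equivalence of the two update styles can be checked with a plain-Python sketch (the helpers welford_step and freq_step are illustrative, not nbi_stat code): a single frequency-weighted update with weight \(w\) gives exactly the same state as \(w\) repeated unweighted updates.

```python
def welford_step(x, mean, m2, n):
    """Plain (unweighted) Welford update."""
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    return mean, m2, n

def freq_step(x, w, mean, m2, sumw):
    """Frequency-weighted update: observation x seen w times at once."""
    sumw += w
    delta = x - mean
    mean += w * delta / sumw
    m2 += w * delta * (x - mean)
    return mean, m2, sumw

# Three plain updates of the same value ...
a = welford_step(2.0, 0.0, 0.0, 0)   # state after first seeing 2.0
for _ in range(3):
    a = welford_step(5.0, *a)
# ... equal one weighted update with w = 3
b = freq_step(5.0, 3, *welford_step(2.0, 0.0, 0.0, 0))
print(a, b)  # (4.25, 6.75, 4) (4.25, 6.75, 4)
```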
Non-frequency weights are positive real numbers. These may represent, for example, efficiencies, acceptances, fiducial values, or other such quantities. This interpretation is triggered by passing a sumw2 argument to nbi_stat.west_update() or frequency=False to nbi_stat.west_init().
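For such reliability weights, the sum of squared weights enters the unbiased normalisation: one divides by \(\sum w - \sum w^2/\sum w\) instead of \(n-1\). A sketch of that standard estimator (illustrative helpers, not the nbi_stat implementation):

```python
def west_step(x, w, mean, m2, sumw, sumw2):
    """West-style update for a non-frequency weight w."""
    sumw += w
    sumw2 += w * w
    delta = x - mean
    mean += w * delta / sumw
    m2 += w * delta * (x - mean)
    return mean, m2, sumw, sumw2

def weighted_var(m2, sumw, sumw2):
    """Unbiased reliability-weighted variance."""
    return m2 / (sumw - sumw2 / sumw)

# With all weights equal, the estimator reduces to the plain sample variance
state = (0.0, 0.0, 0.0, 0.0)
for x in [4.0, 7.0, 13.0, 16.0]:
    state = west_step(x, 0.5, *state)
mean, m2, sumw, sumw2 = state
print(mean, weighted_var(m2, sumw, sumw2))  # 10.0 30.0
```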
Multidimensional samples¶
We can also use nbi_stat.welford_update()
and
nbi_stat.west_update()
to calculate the individual component
(weighted) means and (co)variances of a multidimensional sample. To
do that, we simply pass an array of observations and appropriately sized states.
We can, as in the 1-dimensional case, distinguish between frequency and non-frequency weights by omitting or specifying the sumw2 argument, or by passing the frequency keyword to nbi_stat.west_init().
Observation and component weights¶
However, for multidimensional weighted samples, we additionally distinguish between observation and component (or element) weights.
Observation weights are either frequency or nonfrequency weights that apply to the “event” as a whole. That is, if all components of a given “event” have the same weight, we pass a single value for the weight.
Let’s take an example of frequency, observation weights.
>>> state = nbi_stat.west_init(2,frequency=True,component=False)
>>>
>>> for _ in range(100):
...     state = nbi_stat.west_update(numpy.random.normal(size=2),
...                                  numpy.random.poisson(3),*state)
...
>>> mean, var, sumw, *_ = state
>>>
>>> numpy.set_printoptions(precision=3, suppress=True)
>>> print(f'Mean {mean}\n'
...       f'Variance: {var}\n'
...       f'Sum weights: {sumw[0]:.3f}')
Mean [0.056 0.039]
Variance: [0.964 1.153]
Sum weights: 299.000
To enable non-frequency weights we need to pass sumw2. To make life easier, nbi_stat.west_init() helps us get the data structure right

>>> state = nbi_stat.west_init(2,frequency=False,component=False)
>>>
>>> for _ in range(100):
...     state = nbi_stat.west_update(numpy.random.normal(size=2),
...                                  numpy.random.random(),*state)
...
>>> mean, var, sumw, sumw2, _ = state
>>> numpy.set_printoptions(precision=3, suppress=True)
>>> print(f'Mean {mean}\n'
...       f'Variance: {var}\n'
...       f'Sum weights: {sumw[0]:.3f}\n'
...       f'Sum squared weights: {sumw2[0]:.3f}')
Mean [0.037 0.046]
Variance: [1.129 1.204]
Sum weights: 46.243
Sum squared weights: 29.393
Component weights are either frequency or non-frequency weights that apply to the individual components (or elements) of an "event". These are then passed as an array to nbi_stat.west_update().

First, let's do frequency, component weights. We need to pass an appropriately sized state for sumw to enable this, and we of course need to give as many weights as there are components in the observation.

>>> state = nbi_stat.west_init(2,frequency=True,component=True)
>>>
>>> for _ in range(100):
...     state = nbi_stat.west_update(numpy.random.normal(size=2),
...                                  numpy.random.poisson(3,size=2),*state)
...
>>> mean, var, sumw, *_ = state
>>> numpy.set_printoptions(precision=3, suppress=True)
>>> print(f'Mean {mean}\n'
...       f'Variance: {var}\n'
...       f'Sum weights: {sumw}')
Mean [0.199 0.075]
Variance: [1.074 0.839]
Sum weights: [313 342]
For non-frequency, component weights we also need to pass an appropriately sized sumw2

>>> state = nbi_stat.west_init(2,frequency=False,component=True)
>>> for _ in range(100):
...     state = nbi_stat.west_update(numpy.random.normal(size=2),
...                                  numpy.random.random(size=2),*state)
...
>>> mean, var, sumw, sumw2, _ = state
>>> numpy.set_printoptions(precision=3, suppress=True)
>>> print(f'Mean {mean}\n'
...       f'Variance: {var}\n'
...       f'Sum weights: {sumw}\n'
...       f'Sum squared weights: {sumw2}')
Mean [0.273 0.089]
Variance: [1.527 0.787]
Sum weights: [50.67 48.474]
Sum squared weights: [33.09 31.808]
Thus, for multidimensional samples, we have the following matrix of weight types (summarising the four examples above)

              Frequency                 Non-frequency
Observation   scalar whole number       scalar real number (with sumw2)
Component     array of whole numbers    array of real numbers (with sumw2)
Covariance¶
Here, we distinguish between observation and component weights, as well as frequency and non-frequency weights (see also the weight type combinations for multidimensional samples and the surrounding discussion above).

Let's take as the single example non-frequency, component weights, and calculate the means and covariance. For this particular case, we need to pass summw, an array of the size of the input. Remember that the variance of each component is the corresponding diagonal element of the covariance matrix
>>> state = nbi_stat.west_init(2,covar=True,frequency=False,component=True)
>>> for _ in range(100):
... state = nbi_stat.west_update(numpy.random.normal(size=2),
... numpy.random.random(size=2),
... *state)
...
>>> mean, covar, sumw, sumw2, summw = state
>>> numpy.set_printoptions(precision=3, suppress=True)
>>> print(f'Means: {mean}\n'
... f'Covariances:\n{covar}\n'
... f'Sum weights:\n{sumw}\n'
... f'Sum squared weights:\n{sumw2}\n'
... f'Sum mean weights:\n{summw}')
Means: [0.273 0.089]
Covariances:
[[ 1.698 0.055]
[0.051 0.785]]
Sum weights:
[[33.09 22.565]
[22.565 31.808]]
Sum squared weights:
[[19.216 8.703]
[ 8.703 19.053]]
Sum mean weights:
[50.67 48.474]
Note, for component weights, the sum of weights and the sum of squared weights are \(N\times N\) matrices, while summw is an \(N\)-vector.
Merging sample statistics¶
Suppose we have calculated the means and (co)variances of samples \(A\) and \(B\), and now we want to combine these into one. We can do that using nbi_stat.west_merge() (note that it takes a required ddof argument).
Object oriented interface¶
The class nbi_stat.West provides an object-oriented interface to the West algorithm.
>>>
>>> opts = dict(frequency=True,component=False,covar=True)
>>> w1 = nbi_stat.West(3,**opts)
>>> w2 = nbi_stat.West(3,**opts)
>>>
>>> def obs():
... return numpy.random.normal(size=3),numpy.random.poisson(3)
...
>>> for _ in range(100):
... w1 += obs()
... w2 += obs()
...
>>> print(f'Sample 1 {w1.sumw} (mean,sem,covariance)\n{w1}')
Sample 1 [297] (mean,sem,covariance)
[[0.09971652 0.05150481 0.78786544 0.12346418 0.02198087]
[ 0.09953701 0.05626693 0.12346418 0.94029228 0.13652254]
[ 0.25089228 0.05895792 0.02198087 0.13652254 1.03238271]]
>>> print(f'Sample 2 {w2.sumw} (mean,sem,covariance)\n{w2}')
Sample 2 [323] (mean,sem,covariance)
[[0.10590161 0.06253401 1.2630922 0.15755895 0.24979371]
[ 0.01239253 0.05816508 0.15755895 1.09276589 0.14111137]
[0.06433251 0.05355152 0.24979371 0.14111137 0.92628827]]
>>>
>>> w3 = w1 + w2
>>>
>>> print(f'Merged sample {w3.sumw} (mean,sem,covariance)\n{w3}')
Merged sample [620] (mean,sem,covariance)
[[0.10293875 0.04086668 1.0354528 0.02307432 0.14017745]
[ 0.05413755 0.04059281 0.02307432 1.02162132 0.00126032]
[ 0.08667033 0.04019928 0.14017745 0.00126032 1.00190891]]
Note, we can query the object for the standard error on the mean (SEM) using the method nbi_stat.West.sem().