Sample Variance vs Population Variance
There are two different ways of computing a variance for a data set, depending on whether the data set contains all the values from the population or is just a sampling of values from some larger population.
If you ask 100 people on the street how many friends they have, you get a sample variance. If you ask all the people in the world, you get a population variance.
If you ask all the people in your city, you get a sample variance for the world and a population variance for your city.
The subtle difference is that if you don't have the entire population, you will get a bias error in the variance, which needs to be corrected for by using the sample variance formula.
Population Variance
Given a population of size \$N\$, for which we have all \$N\$ samples \$x_1,...,x_N\$, the population variance is: $$\sigma^2=\frac{1}{N}\sum^N_{i=1}\left(x_i -\mu\right)^2 = \frac{1}{N}\sum_{i=1}^N\left\{x_i^2 - 2x_i\mu+\mu^2\right\}$$ let's simplify it a bit to: $$\sigma^2=\frac{1}{N}\left(\sum_{i=1}^Nx_i^2 - 2\mu\sum_{i=1}^Nx_i + N\mu^2\right)\qquad(1)$$ where \$\mu\$ is the population mean: $$\mu=\frac{1}{N}\sum^N_{j=1}x_j \qquad(2)$$
Inserting \$(2)\$ into \$(1)\$ gives: $$\sigma^2=\frac{1}{N}\left(\sum_{i=1}^Nx_i^2 - \frac{2}{N}\sum^N_{i=1}x_i\sum_{j=1}^Nx_j + N\left(\frac{1}{N}\sum^N_{j=1}x_j\right)^2\right)$$ which again simplifies to: $$\sigma^2=\frac{1}{N}\left(\sum_{i=1}^Nx_i^2 - \frac{1}{N}\left(\sum^N_{i=1}x_i\right)^2\right)$$ once we realise that the summation indexes \$i\$ and \$j\$ can be interchanged.
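To see that the two forms really agree, here is a tiny C++ sketch (the data set and all the names in it are just made up for illustration) that computes the population variance once from the definition and once from the expanded sum-of-squares form:

    #include <iostream>
    #include <vector>

    int main() {
        // Toy population, picked only to make the numbers come out nicely.
        std::vector<double> x{2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        const double N = static_cast<double>(x.size());

        double sum = 0.0, square_sum = 0.0;
        for (double xi : x) {
            sum += xi;
            square_sum += xi * xi;
        }
        const double mu = sum / N;

        // Definition: mean of the squared deviations from the population mean.
        double var_def = 0.0;
        for (double xi : x) var_def += (xi - mu) * (xi - mu);
        var_def /= N;

        // Expanded form: (1/N) * (sum of squares - (sum)^2 / N).
        const double var_expanded = (square_sum - sum * sum / N) / N;

        std::cout << var_def << " == " << var_expanded << '\n';  // both print 4
    }

Both lines print the same value, as the algebra above promises.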
Sample Variance
With the same population as above, this time we only have \$K<N\$ samples: \$y_1,...,y_K\$.
We let the (biased) sample variance be: $$\sigma_y^2 = \frac{1}{K}\sum^K_{i=1}\left(y_i -\bar{y}\right)^2$$ where \$\bar{y}\$ is the sample mean $$ \bar{y}=\frac{1}{K}\sum^K_{j=1}y_j$$.
Note that the formulas for the sample mean and the population mean are technically the same: "take the arithmetic average of all your samples". However, they differ semantically. The population mean is the "true" mean of the entire population; the sample mean is just an estimate of the "true" mean and contains some error (unless \$K=N\$, but then you have a population mean).
To realise that \$\sigma_y^2\$ is biased, we have to get dirty with some probability theory, so hold on to your britches!
We use something called "the expected value", which in layman's terms means "the average of very many attempts". So if you throw a fair six-sided die, we will call the outcome of one throw \$z\$. We call \$z\$ a random variable (a discrete random variable, to be exact) as it doesn't have a known value; it's random. However, we know the values it can take and the probability of each value.
Thus we can calculate the expected value as the weighted average of each value and its probability. So for our population, the expected value of any sample \$x\$ taken randomly is: $$E\left[x\right]= \sum^N_{j=1}x_jP\left(x_j\right)$$ where \$P\left(x_j\right)\$ is the probability of \$x_j\$ being the value of the random variable. As we are drawing samples from the population randomly, we will assume that the distribution is uniform, thus \$P\left(x_j\right)=\frac{1}{N}\$, and hence we get:
$$ E\left[x\right] = \frac{1}{N}\sum^N_{j=1}x_j =\mu $$
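As a quick worked example with the die from before: each face \$1,\dots,6\$ has probability \$\frac{1}{6}\$, so $$E\left[z\right]=\sum^6_{k=1}k\cdot\frac{1}{6}=\frac{21}{6}=3.5$$ even though no single throw can ever show 3.5.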
How does this help us? Well, we computed the (biased) sample variance for one possible sample set of the population. If we use the above to compute the variance for every possible sample set from the population and average them (the expected sample variance), we get the variance we expect to get when we randomly pick a sample set from the population. We will compare this value to the "true" population variance \$\sigma^2\$.
Before we go on, though, there are some ground rules. Let \$a\$ and \$b\$ be random variables and \$c\$ a constant, then:
$$E\left[a+b\right] = E\left[a\right]+E\left[b\right]$$ and $$E\left[c\cdot a\right] = c\cdot E\left[a\right]$$ these follow easily from the definition. And then we have $$E\left[E\left[a\right]\right]=E\left[a\right]$$ which means that the expected value of the expected value is, well, the expected value, which you can prove to yourself or just accept intuitively.
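As a quick sanity check of those rules with the die again: $$E\left[2z+3\right]=2E\left[z\right]+3=2\cdot 3.5+3=10$$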
Also we will need the very definition of variance and its expansion: $$Var\left(x\right)=E\left[\left(x-\mu\right)^2\right] = E\left[x^2-2x\mu + \mu^2\right]$$ remembering that \$E\left[x\right] = \mu\$: $$Var\left(x\right)=E\left[x^2\right]-2E\left[x\right]E\left[x\right] + \left(E\left[x\right]\right)^2=E\left[x^2\right]-\left(E\left[x\right]\right)^2$$
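With the die once more as a worked example: \$E\left[z^2\right]=\frac{1}{6}\left(1+4+9+16+25+36\right)=\frac{91}{6}\$, so $$Var\left(z\right)=E\left[z^2\right]-\left(E\left[z\right]\right)^2=\frac{91}{6}-\left(\frac{7}{2}\right)^2=\frac{35}{12}\approx 2.92$$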
Okay so we have:
$$\sigma_y^2 = \frac{1}{K}\sum^K_{i=1}\left(y_i -\bar{y}\right)^2 = \frac{1}{K}\sum^K_{i=1}\left(y_i -\frac{1}{K}\sum^K_{j=1}y_j\right)^2$$ because all the samples \$y_1,...y_K\$ are selected randomly, \$\sigma_y^2\$ is a random variable.
The expected value is:
$$E\left[\sigma_y^2\right] = \frac{1}{K}\sum^K_{i=1}E\left[\left(y_i -\frac{1}{K}\sum^K_{j=1}y_j\right)^2\right]$$
(remember the ground rules: constants move out, and the expected value of a sum is the sum of the expected values). We'll expand the square:
$$E\left[\sigma_y^2\right] = \frac{1}{K}\sum^K_{i=1}E\left[y_i^2 -\frac{2y_i}{K}\sum^K_{j=1}y_j + \frac{1}{K^2}\sum^K_{j=1}y_j\sum^K_{j=1}y_j\right]$$ now things will get a bit messy. We will move the \$y_i\$ term into the sum as it is independent of the summation index and we will change the summation index on the last sum from \$j\$ to \$l\$: $$E\left[\sigma_y^2\right] = \frac{1}{K}\sum^K_{i=1}E\left[y_i^2 -\frac{2}{K}\sum^K_{j=1}y_iy_j + \frac{1}{K^2}\sum^K_{j=1}y_j\sum^K_{l=1}y_l\right]$$ now we will remove the elements from those sums that have the same index and make them into separate terms: $$E\left[\sigma_y^2\right] = \frac{1}{K}\sum^K_{i=1}E\left[y_i^2 -\frac{2}{K}\left(y_i^2 + \sum^K_{j\neq i}y_iy_j\right) + \frac{1}{K^2}\left( \sum^K_{j=1}y_j\sum^K_{l\neq j}y_l + \sum_{j=1}^Ky_j^2\right)\right]$$ now we propagate the \$E\$ by using the rules described earlier:
$$E\left[\sigma_y^2\right] = \frac{1}{K}\sum^K_{i=1}\left\{E\left[y_i^2\right] -\frac{2}{K}\left(E\left[y_i^2\right] + \sum^K_{j\neq i}E\left[y_iy_j\right]\right) + \frac{1}{K^2}\left(\sum^K_{j=1}E\left[y_j\right]\sum^K_{l\neq j}E\left[y_l\right] + \sum_{j=1}^KE\left[y_j^2\right]\right)\right\}$$
collect terms (note that the \$y_i\$ are identically distributed, so e.g. \$E\left[y_i^2\right]\$ is the same for every \$i\$ and \$\sum_{i=1}^KE\left[y_i^2\right]=KE\left[y_i^2\right]\$):
$$E\left[\sigma_y^2\right] = E\left[y_i^2\right]\frac{K-1}{K} -\frac{2}{K}\left(\sum^K_{j\neq i}E\left[y_iy_j\right]\right) + \frac{1}{K^2}\left(KE\left[y_j\right]\left(K-1\right)E\left[y_l\right]\right)$$
Almost there. In the general case \$E[ab]\neq E[a]E[b]\$; however, when \$a\$ and \$b\$ are independent random variables the equality holds. Are \$y_j\$ and \$y_i\$ independent? Well, if we sample with replacement (or the population is large), picking one \$y_j\$ doesn't affect the probability of \$y_i\$, as all values stay equally likely, and we have \$j\neq i\$ in our summation. (To any statisticians out there: yeah, I know it's a weak argument and I don't remember the long proof, you'll just have to believe me here.) So if we buy this argument $$E\left[y_iy_j\right]=E\left[y_i\right]E\left[y_j\right]=\left(E\left[y_i\right]\right)^2$$ then the whole shebang reduces to: $$E\left[\sigma_y^2\right] = E\left[y_i^2\right]\frac{K-1}{K} -\frac{2}{K}(K-1)\left(E\left[y_i\right]\right)^2 + \frac{1}{K}\left(K-1\right)\left(E\left[y_i\right]\right)^2$$ then reduce further $$E\left[\sigma_y^2\right] = \frac{K-1}{K}\left(E\left[y_i^2\right] -\left(E\left[y_i\right]\right)^2\right)$$ remember \$\sigma^2=E\left[x^2\right]-\left(E\left[x\right]\right)^2\$ and finally we have:
$$E\left[\sigma_y^2\right] = \frac{K-1}{K}\sigma^2$$ because every \$y_j\$ is drawn from the same population, so \$E\left[y_j\right]=E\left[x_i\right]=\mu\$ and \$E\left[y_j^2\right]=E\left[x_i^2\right]\$.
Which essentially means that calculating the sample variance as above gives a systematic error factor of \$\frac{K-1}{K}\$ compared to the true population variance.
This is why we use the unbiased sample variance: $$s^2 = \frac{K}{K-1}\sigma_y^2 = \frac{1}{K-1}\left(\sum_{i=1}^Ky_i^2 - \frac{1}{K}\left(\sum^K_{i=1}y_i\right)^2\right)$$
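If you don't trust the algebra, you can check the bias numerically. Below is a rough C++ simulation sketch (my own toy population, sample size and names, nothing from your code); it samples with replacement so the independence assumption above holds, averages the biased sample variance over many random sample sets, and compares the result with \$\frac{K-1}{K}\sigma^2\$:

    #include <cstddef>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        const std::vector<double> population{2, 4, 4, 4, 5, 5, 7, 9};
        const double N = static_cast<double>(population.size());

        // True population mean and variance.
        double mu = 0.0;
        for (double x : population) mu += x;
        mu /= N;
        double sigma2 = 0.0;
        for (double x : population) sigma2 += (x - mu) * (x - mu);
        sigma2 /= N;

        const std::size_t K = 5;           // sample size
        const std::size_t trials = 200000; // number of random sample sets

        std::mt19937 gen(42);
        std::uniform_int_distribution<std::size_t> pick(0, population.size() - 1);

        double avg_biased = 0.0;
        for (std::size_t t = 0; t < trials; ++t) {
            double sum = 0.0, square_sum = 0.0;
            for (std::size_t i = 0; i < K; ++i) {
                const double y = population[pick(gen)];
                sum += y;
                square_sum += y * y;
            }
            // Biased sample variance: divide by K instead of K - 1.
            avg_biased += (square_sum - sum * sum / K) / K;
        }
        avg_biased /= trials;

        std::cout << "sigma^2              = " << sigma2 << '\n';
        std::cout << "(K-1)/K * sigma^2    = " << (K - 1.0) / K * sigma2 << '\n';
        std::cout << "average biased s^2   = " << avg_biased << '\n';
        std::cout << "after K/(K-1) factor = " << avg_biased * K / (K - 1.0) << '\n';
    }

The average of the biased estimator lands near \$\frac{K-1}{K}\sigma^2\$, and multiplying by \$\frac{K}{K-1}\$ brings it back to \$\sigma^2\$.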
Your code is Sample Variance
We have the population variance: $$\sigma^2=\frac{1}{N}\left(\sum_{i=1}^Nx_i^2 - \frac{1}{N}\left(\sum^N_{i=1}x_i\right)^2\right)$$ and the sample variance: $$s^2 = \frac{1}{K-1}\left(\sum_{i=1}^Ky_i^2 - \frac{1}{K}\left(\sum^K_{i=1}y_i\right)^2\right)$$
Compare to your code here:
    FloatingPoint variance() const {
        // sum of squares minus (sum)^2 / K: the bracketed term in s^2 above
        FloatingPoint step1 = m_square_sum - (m_sum * m_sum) / m_size;
        // dividing by K - 1 makes this the unbiased *sample* variance
        return step1 / (m_size - 1);
    }
which matches the formula for the sample variance. But I don't know your application; you need to think about it and figure out whether you need the population or the sample variance.
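If it turns out you want both, one option is to expose the two variants side by side. This is only a rough sketch; I'm reusing the member names from your snippet (m_sum, m_square_sum, m_size), but the class name, method names and the assumption that m_size is an integer count are all made up:

    #include <cstddef>

    template <typename FloatingPoint>
    class RunningStats {
    public:
        void add(FloatingPoint x) {
            m_sum += x;
            m_square_sum += x * x;
            ++m_size;
        }

        // sigma^2 = (1/N) * (sum of squares - (sum)^2 / N); needs m_size >= 1.
        FloatingPoint population_variance() const {
            const FloatingPoint ss = m_square_sum - (m_sum * m_sum) / m_size;
            return ss / m_size;
        }

        // s^2 = (1/(K-1)) * (sum of squares - (sum)^2 / K); needs m_size >= 2.
        FloatingPoint sample_variance() const {
            const FloatingPoint ss = m_square_sum - (m_sum * m_sum) / m_size;
            return ss / (m_size - 1);
        }

    private:
        FloatingPoint m_sum{};
        FloatingPoint m_square_sum{};
        std::size_t m_size{};
    };

Usage would then be a series of add(x) calls followed by whichever variance matches your situation.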
Gosh, look at what you did! You made me do maths lol :)