What is the proper way to compute 95% confidence intervals with PyTorch for classification and regression?

I wanted to report 90, 95, 99, etc. confidence intervals on my data using PyTorch. But confidence intervals seems too important to leave my implementation untested or criticized so I wanted feedback - should be checked by at least some expert. Furthermore, I already noticed I got NaN values when my values when negative which make me think my code only works for classification (at the very least) but I also do regression. I am also surprised that using the numpy code directly actually gave me differentiable tensors…not something I was expecting.

So is this correct?:

import numpy as np
import scipy
import torch
from torch import Tensor

P_CI = {0.90: 1.64,
        0.95: 1.96,
        0.98: 2.33,
        0.99: 2.58,

def mean_confidence_interval_rfs(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, h

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h

def ci(a, p=0.95):
    import numpy as np, scipy.stats as st
    st.t.interval(p, len(a) - 1, loc=np.mean(a), scale=st.sem(a))

# def ci(a, p=0.95):
#     import statsmodels.stats.api as sms
#     sms.DescrStatsW(a).tconfint_mean()

def compute_confidence_interval_classification(data: Tensor,
                                               by_pass_30_data_points: bool = False,
                                               p_confidence: float = 0.95
                                               ) -> Tensor:
    Computes CI interval
        [B] -> [1]
    According to [1] CI the confidence interval for classification error can be calculated as follows:
        error +/- const * sqrt( (error * (1 - error)) / n)

    The values for const are provided from statistics, and common values used are:
        1.64 (90%)
        1.96 (95%)
        2.33 (98%)
        2.58 (99%)
    Use of these confidence intervals makes some assumptions that you need to ensure you can meet. They are:

    Observations in the validation data set were drawn from the domain independently (e.g. they are independent and
    identically distributed).
    At least 30 observations were used to evaluate the model.
    This is based on some statistics of sampling theory that takes calculating the error of a classifier as a binomial
    distribution, that we have sufficient observations to approximate a normal distribution for the binomial
    distribution, and that via the central limit theorem that the more observations we classify, the closer we will get
    to the true, but unknown, model skill.

        - computed according to: https://machinelearningmastery.com/report-classifier-performance-confidence-intervals/

        - how does it change for other types of losses
    B: int = data.size(0)
    # assert data >= 0
    assert B >= 30 and (not by_pass_30_data_points), f' Not enough data for CI calc to be valid and approximate a' \
                                                     f'normal, you have: {B=} but needed 30.'
    const: float = P_CI[p_confidence]
    error: Tensor = data.mean()
    val = torch.sqrt((error * (1 - error)) / B)
    ci_interval: float = const * val
    return ci_interval

def compute_confidence_interval_regression():
    raise NotImplementedError

# - tests

def ci_test():
    x: Tensor = abs(torch.randn(35))
    ci_pytorch = compute_confidence_interval_classification(x)
    ci_rfs = mean_confidence_interval(x)

    x: Tensor = abs(torch.randn(35, requires_grad=True))
    ci_pytorch = compute_confidence_interval_classification(x)
    ci_rfs = mean_confidence_interval(x)

    x: Tensor = torch.randn(35) - 10
    ci_pytorch = compute_confidence_interval_classification(x)
    ci_rfs = mean_confidence_interval(x)

if __name__ == '__main__':
    print('Done, success! \a')


ci_rfs=(tensor(0.8259), tensor(0.5654), tensor(1.0864))
tensor(0.0796, grad_fn=<SqrtBackward>)
x.var()=tensor(0.4391, grad_fn=<VarBackward>)
ci_pytorch=tensor(0.1559, grad_fn=<MulBackward0>)
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/brandomiranda/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/metrics/metrics.py", line 154, in <module>
  File "/Users/brandomiranda/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/metrics/metrics.py", line 144, in ci_test
    ci_pytorch = compute_confidence_interval_classification(x, by_pass_30_data_points)

how do I fix the code above for regression e.g. negative values of arbitrary magnitude?

Sort of surprised there isn’t an implementation already and especially not an official PyTorch one, given how important CI is supposed to be…perhaps a deep learning bad habit? Rarely seen it in papers, unfortunately.



Confidence intervals (ci) compute:

  • the probability that the true mean is in the given interval (usually written mu_n +- ci


  • traditional confidence intervals statements only hold for statements about the value (parameter, random quantitiy, etc) we want to estimate being the mean
  • you have enough samples so that the analysis holds (e.g. the mean $mu_n = 1/n sum_i x_i$, where n>=30 is recommended)

If those assumptions hold (**i.e. your esitmating the true mean via the sample mean with a ± value **) then use the code bellow that I provided called torch_compute_confidence_interval for regression, classification, anything you want.

First, asfaik confidence intervals (ci) is an open research problem in deep learning (DL) - so more sophisticated answers probably exist. But I will provide a practical answer that I plan to use (and see others using when reporting results in DL).

To compute confidence intervals we have to understand a little bit of ci first. What they are is a probabilistic statement over the random surveys/samples of data sets that the mean you are trying to report is withing the reported interval. So when people say:

mean_error +- CI for p=95%

it means if you sampled 95 data sets you’d expect the true mean to lie in that interval 95 of the time (but you wouldn’t know which ones, so you can’t say for any specific interval you compute that the mean will be there).

This means you can only use it for reporting means. This is because the maths that goes behind it (which isn’t very hard) approximates the computation of the probability that the bound holds (or the confidence interval holds) by taking advantage that we can compute probabilities analytically for sample means because the approximate a normal according to the central limit theorem CLT. So the specific CI that is computed assumes the quanity you want to compute is a sample mean and computes your ± numbers using this normal approximation. Thus, usually it’s recomended to have n>=30 data points for the specific data set you are using but things can still work out nicely since ci can be computed with a t distribution instead of a normal (denoted z in stats software).

Given those assumptions you can simply do the following:

def torch_compute_confidence_interval(data: Tensor,
                                           confidence: float = 0.95
                                           ) -> Tensor:
    Computes the confidence interval for a given survey of a data set.
    n = len(data)
    mean: Tensor = data.mean()
    # se: Tensor = scipy.stats.sem(data)  # compute standard error
    # se, mean: Tensor = torch.std_mean(data, unbiased=True)  # compute standard error
    se: Tensor = data.std(unbiased=True) / (n**0.5)
    t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
    ci = t_p * se
    return mean, ci

I’ve tested it and compared it to things specialized for classification and they agree in values up to 1e-2 so the code works. Output:

Connected to pydev debugger (build 213.5744.248)
x_bernoulli.std()=tensor(0.5085, grad_fn=<StdBackward>)
ci_95_torch=tensor(0.1867, grad_fn=<MulBackward0>)
x.std()=tensor(1.0181, grad_fn=<StdBackward>)
ci_95_torch=tensor(0.3802, grad_fn=<MulBackward0>)

For more details see my ultimate-utils library where I comment on the maths in the docs: ultimate-utils/confidence_intervals.py at e81a8c3c4425b33e00b3ade172705f20b626b2b1 · brando90/ultimate-utils · GitHub

Comments on DL

If you are reporting the error of a specific model e.g. neural net, like this you are more or less reporting that the true mean error for that very specific neural net and weights lies in those bounds. But as I said this is an open research area so fancier things must be available e.g. consider some layers are actually random, etc.