Add statistics module: Mathematical statistics functions #17

@youknowone

Summary

Port Python's statistics module to Rust — averages, measures of spread, and probability distributions.

In CPython this is a pure-Python module (Lib/statistics.py) that relies heavily on fractions.Fraction for exact intermediate arithmetic, which avoids cumulative rounding errors.

Public API (21 items)

Central tendency

  • mean(data) — arithmetic mean (exact via Fraction)
  • fmean(data, weights=None) — fast float mean (via fsum)
  • geometric_mean(data) — geometric mean (via log/exp)
  • harmonic_mean(data, weights=None) — harmonic mean
  • median(data) — median (average of middle two for even length)
  • median_low(data) / median_high(data) — low/high median
  • median_grouped(data, interval=1.0) — grouped data median
  • mode(data) — single most common value
  • multimode(data) — all modes
  • quantiles(data, *, n=4, method='exclusive') — cut points
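
A minimal sketch of how the three median variants listed above differ, assuming f64 inputs and `Option` as a stand-in for the eventual error type. The helper name `sorted_copy` and the use of `total_cmp` for ordering (which also decides where NaN lands) are assumptions for illustration, not part of the proposal.

```rust
/// Return a sorted copy of the input (hypothetical helper).
/// `total_cmp` gives a total order over f64, so NaN placement is
/// determined by that order rather than by CPython's behavior.
fn sorted_copy(data: &[f64]) -> Vec<f64> {
    let mut v = data.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    v
}

/// Median: the middle value, or the average of the two middle values
/// when the length is even.
pub fn median(data: &[f64]) -> Option<f64> {
    let v = sorted_copy(data);
    let n = v.len();
    if n == 0 {
        return None;
    }
    if n % 2 == 1 {
        Some(v[n / 2])
    } else {
        Some((v[n / 2 - 1] + v[n / 2]) / 2.0)
    }
}

/// Low median: always an actual data point (the lower of the two
/// middle values for even lengths).
pub fn median_low(data: &[f64]) -> Option<f64> {
    let v = sorted_copy(data);
    let n = v.len();
    if n == 0 {
        return None;
    }
    Some(v[(n - 1) / 2])
}

/// High median: always an actual data point (the higher of the two
/// middle values for even lengths).
pub fn median_high(data: &[f64]) -> Option<f64> {
    let v = sorted_copy(data);
    let n = v.len();
    if n == 0 {
        return None;
    }
    Some(v[n / 2])
}
```

For `[1.0, 3.0, 5.0, 7.0]` the three functions return 4.0, 3.0, and 5.0 respectively; for odd-length input they all agree.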

Spread

  • variance(data, xbar=None) — sample variance
  • pvariance(data, mu=None) — population variance
  • stdev(data, xbar=None) — sample standard deviation
  • pstdev(data, mu=None) — population standard deviation

Bivariate

  • covariance(x, y) — sample covariance
  • correlation(x, y, *, method='linear') — Pearson or Spearman correlation
  • linear_regression(x, y, *, proportional=False) — OLS regression
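
As a rough illustration of the bivariate functions, the sketch below computes sample covariance and an ordinary least squares fit in plain f64. The `Option` return type, the `(slope, intercept)` tuple, and the private `fmean` helper are assumptions made for this example; the real port would presumably return a `StatisticsError` and use a more accurate summation.

```rust
/// Plain float mean (hypothetical helper; naive summation).
fn fmean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample covariance with an (n - 1) denominator.
/// Returns None for mismatched lengths or fewer than two points.
pub fn covariance(x: &[f64], y: &[f64]) -> Option<f64> {
    let n = x.len();
    if n != y.len() || n < 2 {
        return None;
    }
    let (xbar, ybar) = (fmean(x), fmean(y));
    let sxy: f64 = x
        .iter()
        .zip(y)
        .map(|(xi, yi)| (xi - xbar) * (yi - ybar))
        .sum();
    Some(sxy / (n - 1) as f64)
}

/// OLS fit returning (slope, intercept).
/// With `proportional = true` the line is forced through the origin,
/// so the data is not centered and the intercept is 0.0.
pub fn linear_regression(x: &[f64], y: &[f64], proportional: bool) -> Option<(f64, f64)> {
    let n = x.len();
    if n != y.len() || n < 2 {
        return None;
    }
    let (xbar, ybar) = if proportional {
        (0.0, 0.0)
    } else {
        (fmean(x), fmean(y))
    };
    let sxy: f64 = x
        .iter()
        .zip(y)
        .map(|(xi, yi)| (xi - xbar) * (yi - ybar))
        .sum();
    let sxx: f64 = x.iter().map(|xi| (xi - xbar) * (xi - xbar)).sum();
    if sxx == 0.0 {
        // x is constant, so the slope is undefined.
        return None;
    }
    let slope = sxy / sxx;
    let intercept = if proportional { 0.0 } else { ybar - slope * xbar };
    Some((slope, intercept))
}
```

For x = 1..5 and y = 3x + 2 the fit recovers slope 3.0 and intercept 2.0.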

Kernel density estimation

  • kde(data, h, kernel='normal', *, cumulative=False) — returns PDF/CDF callable
  • kde_random(data, h, kernel='normal', *, seed=None) — returns sampling callable
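
One way the "returns a callable" shape of kde could map to Rust is a function that returns a closure. The sketch below covers only the normal kernel and the PDF case; the name `kde_normal` and the `Option` return are assumptions for illustration, and cumulative mode is left out because it needs erf.

```rust
use std::f64::consts::PI;

/// Build a PDF estimate from `data` using a normal kernel with
/// bandwidth `h`. Returns None for empty data or a non-positive
/// bandwidth (hypothetical stand-in for StatisticsError).
pub fn kde_normal(data: &[f64], h: f64) -> Option<impl Fn(f64) -> f64> {
    if data.is_empty() || !(h > 0.0) {
        return None;
    }
    // Own the sample so the closure can outlive the borrowed slice.
    let points = data.to_vec();
    // 1 / (n * h * sqrt(2*pi)) folds the kernel constant and the
    // averaging factor into a single multiplier.
    let norm = 1.0 / (points.len() as f64 * h * (2.0 * PI).sqrt());
    Some(move |x: f64| {
        points
            .iter()
            .map(|xi| {
                let t = (x - xi) / h;
                (-0.5 * t * t).exp()
            })
            .sum::<f64>()
            * norm
    })
}
```

With a single data point at 0.0 and h = 1.0, the returned closure is just the standard normal density.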

Distribution

  • NormalDist(mu=0.0, sigma=1.0) — normal distribution class
    • Methods: pdf, cdf, inv_cdf, overlap, zscore, samples, quantiles
    • Class method: from_samples(data)
    • Arithmetic: +, -, *, / with scalars and other NormalDist
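
The arithmetic rules for NormalDist can be expressed through `std::ops` traits. This is a partial sketch under the assumption of f64 fields and `Option` in place of StatisticsError; it shows pdf, zscore, addition (with a scalar and with another independent NormalDist), and scalar multiplication only. cdf and inv_cdf are omitted since they need erf and AS241.

```rust
use std::f64::consts::TAU;
use std::ops::{Add, Mul};

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NormalDist {
    pub mu: f64,
    pub sigma: f64,
}

impl NormalDist {
    /// A negative sigma is rejected.
    pub fn new(mu: f64, sigma: f64) -> Option<Self> {
        if sigma < 0.0 {
            None
        } else {
            Some(Self { mu, sigma })
        }
    }

    /// Probability density at x; undefined when sigma is zero.
    pub fn pdf(&self, x: f64) -> Option<f64> {
        let variance = self.sigma * self.sigma;
        if variance == 0.0 {
            return None;
        }
        let diff = x - self.mu;
        Some((diff * diff / (-2.0 * variance)).exp() / (TAU * variance).sqrt())
    }

    /// Number of standard deviations x lies from the mean.
    pub fn zscore(&self, x: f64) -> Option<f64> {
        if self.sigma == 0.0 {
            None
        } else {
            Some((x - self.mu) / self.sigma)
        }
    }
}

/// Sum of two independent normals: means add, sigmas combine via hypot.
impl Add for NormalDist {
    type Output = NormalDist;
    fn add(self, rhs: NormalDist) -> NormalDist {
        NormalDist { mu: self.mu + rhs.mu, sigma: self.sigma.hypot(rhs.sigma) }
    }
}

/// Adding a scalar shifts the mean only.
impl Add<f64> for NormalDist {
    type Output = NormalDist;
    fn add(self, rhs: f64) -> NormalDist {
        NormalDist { mu: self.mu + rhs, sigma: self.sigma }
    }
}

/// Multiplying by a scalar scales the mean and |k| scales sigma.
impl Mul<f64> for NormalDist {
    type Output = NormalDist;
    fn mul(self, k: f64) -> NormalDist {
        NormalDist { mu: self.mu * k, sigma: self.sigma * k.abs() }
    }
}
```

Subtraction and division would follow the same pattern with `Sub` and `Div`.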

Exception

  • StatisticsError (subclass of ValueError)
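
Rust has no exception hierarchy, so the ValueError relationship would have to be handled by whatever layer maps errors back to Python. A minimal sketch of the error type itself, assuming a message-only struct (the field layout, the constructor name, and the message text are illustrative assumptions):

```rust
use std::fmt;

/// Error returned by statistics functions on invalid input.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StatisticsError {
    message: String,
}

impl StatisticsError {
    pub fn new(message: impl Into<String>) -> Self {
        Self { message: message.into() }
    }
}

impl fmt::Display for StatisticsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.message)
    }
}

impl std::error::Error for StatisticsError {}

/// Example use: a float mean that rejects empty input.
/// Naive summation is used here purely to keep the sketch short.
pub fn fmean(data: &[f64]) -> Result<f64, StatisticsError> {
    if data.is_empty() {
        return Err(StatisticsError::new("fmean requires at least one data point"));
    }
    Ok(data.iter().sum::<f64>() / data.len() as f64)
}
```

Implementing `std::error::Error` lets callers propagate it with `?` into `Box<dyn Error>` or a wider error enum.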

Key design considerations

Dependency on fractions

CPython's statistics module uses Fraction internally for exact arithmetic in mean, variance, stdev, and harmonic_mean (through the shared _sum and _ss helpers). The bivariate functions covariance, correlation, and linear_regression work in floats in current CPython. The fractions module (#16) should therefore be implemented first, or at least concurrently.

Precision strategy

| Function group | CPython approach | Rust approach |
| --- | --- | --- |
| mean, variance, harmonic_mean | Fraction-exact intermediate arithmetic | Use Fraction<BigInt> from #16 |
| fmean, geometric_mean | Float with fsum/log-exp | Use math::fsum (already in pymath) |
| stdev, pstdev | _float_sqrt_of_frac(n, d) specialized sqrt | Implement equivalent |
| NormalDist.inv_cdf | Wichura's Algorithm AS241 (rational approximations) | Direct port of the piecewise approximation |
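
To show why the float paths go through fsum rather than a plain loop, here is a small compensated-summation sketch. It uses Neumaier's variant of Kahan summation, which is not the algorithm behind fsum and is not exactly rounded in every case; it is only a stand-in to illustrate the kind of error that naive accumulation introduces. The function name is hypothetical.

```rust
/// Neumaier compensated summation (illustrative stand-in for fsum).
/// Tracks the low-order bits lost at each addition in `comp` and
/// adds them back at the end.
pub fn compensated_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64;
    for &x in values {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            // Low-order bits of x were lost in the addition.
            comp += (sum - t) + x;
        } else {
            // Low-order bits of sum were lost in the addition.
            comp += (x - t) + sum;
        }
        sum = t;
    }
    sum + comp
}
```

For `[1.0, 1e100, 1.0, -1e100]` a naive left-to-right sum yields 0.0, while the compensated version returns 2.0.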

Type system

CPython statistics functions are polymorphic over int, float, Fraction, and Decimal. For the Rust port, the initial scope should focus on f64 inputs with exact Fraction-based intermediates where CPython does so, and return f64. Full type polymorphism can be added later via generics.

Implementation plan

Phase 1: Core averages (depends on #16)

  • Internal _sum() helper using Fraction for exact summation
  • mean, fmean, geometric_mean, harmonic_mean
  • StatisticsError error type
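
For the two Phase 1 averages that are less familiar, a float-only sketch may help fix the definitions. It assumes f64 input, `Option` in place of StatisticsError, naive summation, and no weights; the handling of zero and negative values shown here is an assumption and should be checked against the CPython version being targeted.

```rust
/// Geometric mean via exp(mean(ln x)).
/// This sketch rejects empty input and any value that is not
/// strictly positive.
pub fn geometric_mean(data: &[f64]) -> Option<f64> {
    if data.is_empty() || data.iter().any(|&x| !(x > 0.0)) {
        return None;
    }
    let log_sum: f64 = data.iter().map(|x| x.ln()).sum();
    Some((log_sum / data.len() as f64).exp())
}

/// Harmonic mean: n divided by the sum of reciprocals.
/// Negative values are rejected; any zero makes the result 0.0.
pub fn harmonic_mean(data: &[f64]) -> Option<f64> {
    if data.is_empty() || data.iter().any(|&x| x < 0.0) {
        return None;
    }
    if data.iter().any(|&x| x == 0.0) {
        return Some(0.0);
    }
    let recip_sum: f64 = data.iter().map(|x| 1.0 / x).sum();
    Some(data.len() as f64 / recip_sum)
}
```

As a sanity check, the harmonic mean of 40 and 60 is 48, and the geometric mean of 54, 24, and 36 is 36 (up to float rounding).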

Phase 2: Median & mode

  • median, median_low, median_high, median_grouped
  • mode, multimode
  • quantiles (exclusive and inclusive methods)
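
The exclusive method interpolates between neighboring sorted data points using integer index arithmetic. The sketch below follows that scheme for f64 data; the function name, the `Option` return, and the requirement of at least two data points are assumptions for illustration.

```rust
/// Cut points dividing `data` into `n` intervals (exclusive method).
/// Returns n - 1 values, or None if n < 1 or there are fewer than
/// two data points.
pub fn quantiles_exclusive(data: &[f64], n: usize) -> Option<Vec<f64>> {
    let ld = data.len();
    if n < 1 || ld < 2 {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Signed integers: delta can go negative when j is clamped up to 1.
    let (n_i, ld_i) = (n as i64, ld as i64);
    let m = ld_i + 1;
    let mut cuts = Vec::with_capacity(n - 1);
    for i in 1..n_i {
        // Index of the upper neighbor, kept inside the data range.
        let j = (i * m / n_i).clamp(1, ld_i - 1);
        // Interpolation weight numerator, out of n.
        let delta = i * m - j * n_i;
        let lo = sorted[(j - 1) as usize];
        let hi = sorted[j as usize];
        cuts.push((lo * (n_i - delta) as f64 + hi * delta as f64) / n_i as f64);
    }
    Some(cuts)
}
```

For the values 1 through 10 with n = 4 this gives the quartiles 2.75, 5.5, and 8.25. The inclusive method differs in using m = ld - 1 and a different index mapping.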

Phase 3: Variance & standard deviation

  • Internal _ss() helper (sum of squared deviations via Fraction)
  • variance, pvariance, stdev, pstdev
  • _float_sqrt_of_frac() for precision-preserving sqrt
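
Until the Fraction-based `_ss()` exists, the relationship between the four spread functions can be sketched in plain f64 with a two-pass sum of squared deviations. This float version is an assumption for illustration and does not provide the exactness the plan calls for; the helper name `sum_sq_dev` is hypothetical, and the optional xbar/mu arguments are omitted.

```rust
/// Two-pass sum of squared deviations from the mean (float stand-in
/// for the Fraction-based _ss helper). Caller must pass non-empty data.
fn sum_sq_dev(data: &[f64]) -> f64 {
    let mean = data.iter().sum::<f64>() / data.len() as f64;
    data.iter().map(|x| (x - mean) * (x - mean)).sum()
}

/// Sample variance: divides by n - 1, needs at least two points.
pub fn variance(data: &[f64]) -> Option<f64> {
    if data.len() < 2 {
        return None;
    }
    Some(sum_sq_dev(data) / (data.len() - 1) as f64)
}

/// Population variance: divides by n, needs at least one point.
pub fn pvariance(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    Some(sum_sq_dev(data) / data.len() as f64)
}

/// Sample standard deviation.
pub fn stdev(data: &[f64]) -> Option<f64> {
    variance(data).map(f64::sqrt)
}

/// Population standard deviation.
pub fn pstdev(data: &[f64]) -> Option<f64> {
    pvariance(data).map(f64::sqrt)
}
```

For `[1.0, 2.0, 3.0, 4.0, 5.0]` the sample variance is 2.5 and the population variance is 2.0.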

Phase 4: Bivariate statistics

  • covariance
  • correlation (linear and ranked methods)
  • linear_regression (with proportional option)

Phase 5: NormalDist

  • Constructor, properties (mean, median, mode, stdev, variance)
  • pdf, cdf (via erf from math::erf)
  • inv_cdf (Wichura's Algorithm AS241)
  • overlap, zscore
  • Arithmetic operators
  • from_samples, samples, quantiles

Phase 6: KDE

  • kde with all kernel types (normal, logistic, rectangular, triangular, etc.)
  • kde_random
  • Cumulative mode support

Phase 7: Testing

  • Property-based tests (proptest) that compare results against CPython's statistics module via pyo3
  • Edge cases: empty data, single element, identical values, NaN/Inf handling
  • Precision verification for Fraction-based functions

Feature flag

[features]
statistics = ["fractions"]  # depends on fractions module

Out of scope

  • Decimal input support (separate concern)
  • random module dependency for sampling (NormalDist.samples, kde_random)
