Summary
Port Python's statistics module to Rust — averages, measures of spread, and probability distributions.
In CPython this is a pure Python implementation (Lib/statistics.py) that heavily uses fractions.Fraction internally for exact intermediate arithmetic to avoid cumulative rounding errors.
Public API (21 items)
Central tendency
- mean(data): arithmetic mean (exact via Fraction)
- fmean(data, weights=None): fast float mean (via fsum)
- geometric_mean(data): geometric mean (via log/exp)
- harmonic_mean(data, weights=None): harmonic mean
- median(data): median (average of middle two for even length)
- median_low(data) / median_high(data): low/high median
- median_grouped(data, interval=1.0): grouped data median
- mode(data): single most common value
- multimode(data): all modes
- quantiles(data, *, n=4, method='exclusive'): cut points
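The median family is a convenient first target because it needs no Fraction arithmetic. The following is a minimal sketch, assuming f64-only input and using `Option` in place of StatisticsError to keep the block self-contained; the function signatures are illustrative, not a settled API.

```rust
/// Return a sorted copy of the data. `f64::total_cmp` gives a total
/// order, so the sort does not panic when NaN is present.
fn sorted(data: &[f64]) -> Vec<f64> {
    let mut v = data.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    v
}

/// Median: the middle value, or the mean of the two middle values
/// when the length is even. `None` stands in for StatisticsError.
pub fn median(data: &[f64]) -> Option<f64> {
    let v = sorted(data);
    let n = v.len();
    if n == 0 {
        return None;
    }
    if n % 2 == 1 {
        Some(v[n / 2])
    } else {
        Some((v[n / 2 - 1] + v[n / 2]) / 2.0)
    }
}

/// Low median: the smaller of the two middle values for even lengths.
pub fn median_low(data: &[f64]) -> Option<f64> {
    let v = sorted(data);
    let n = v.len();
    if n == 0 {
        return None;
    }
    Some(if n % 2 == 1 { v[n / 2] } else { v[n / 2 - 1] })
}

/// High median: the larger of the two middle values for even lengths.
pub fn median_high(data: &[f64]) -> Option<f64> {
    let v = sorted(data);
    v.get(v.len() / 2).copied()
}
```

Unlike `median`, the low and high variants always return a value that is present in the data, which is why they need no averaging step.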
Spread
- variance(data, xbar=None): sample variance
- pvariance(data, mu=None): population variance
- stdev(data, xbar=None): sample standard deviation
- pstdev(data, mu=None): population standard deviation
Bivariate
- covariance(x, y): sample covariance
- correlation(x, y, *, method='linear'): Pearson or Spearman correlation
- linear_regression(x, y, *, proportional=False): OLS regression
Kernel density estimation
- kde(data, h, kernel='normal', *, cumulative=False): returns PDF/CDF callable
- kde_random(data, h, kernel='normal', *, seed=None): returns sampling callable
Distribution
- NormalDist(mu=0.0, sigma=1.0): normal distribution class
  - Methods: pdf, cdf, inv_cdf, overlap, zscore, samples, quantiles
  - Class method: from_samples(data)
  - Arithmetic: + and - with scalars and other NormalDist; * and / with scalars
Exception
- StatisticsError (subclass of ValueError)
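Rust has no exception hierarchy, so the "subclass of ValueError" relationship has no direct equivalent. One possible shape, sketched here as an assumption rather than a decided design, is a plain error struct implementing `std::error::Error` and returned through `Result`. The `fmean` below only demonstrates the error path; it uses plain summation where the port would call fsum, and the message text is illustrative.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical Rust counterpart of Python's StatisticsError.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StatisticsError {
    message: String,
}

impl StatisticsError {
    pub fn new(message: impl Into<String>) -> Self {
        Self { message: message.into() }
    }
}

impl fmt::Display for StatisticsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.message)
    }
}

impl Error for StatisticsError {}

/// Float mean that reports empty input through the error type.
/// Plain summation is used here for brevity, not fsum.
pub fn fmean(data: &[f64]) -> Result<f64, StatisticsError> {
    if data.is_empty() {
        return Err(StatisticsError::new(
            "fmean requires at least one data point",
        ));
    }
    Ok(data.iter().sum::<f64>() / data.len() as f64)
}
```

Callers can propagate the error with `?` or box it as `Box<dyn Error>`, since the type implements the standard error trait.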
Key design considerations
Dependency on fractions
CPython's statistics module uses Fraction internally for exact arithmetic in mean, variance, stdev, harmonic_mean, covariance, correlation, and linear_regression. This means the fractions module (#16) should be implemented first, or at minimum concurrently.
Precision strategy
| Function group | CPython approach | Rust approach |
|---|---|---|
| mean, variance, harmonic_mean | Fraction-exact intermediate arithmetic | Use Fraction<BigInt> from #16 |
| fmean, geometric_mean | Float with fsum/log-exp | Use math::fsum (already in pymath) |
| stdev, pstdev | _float_sqrt_of_frac(n, d) specialized sqrt | Implement equivalent |
| NormalDist.inv_cdf | Wichura's Algorithm AS241 (rational approximations) | Direct port of the piecewise approximation |
Type system
CPython statistics functions are polymorphic over int, float, Fraction, and Decimal. For the Rust port, the initial scope should focus on f64 inputs with exact Fraction-based intermediates where CPython does so, and return f64. Full type polymorphism can be added later via generics.
Implementation plan
Phase 1: Core averages (depends on #16)
- Internal _sum() helper using Fraction for exact summation
- mean, fmean, geometric_mean, harmonic_mean
- StatisticsError error type
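Exact summation starts from the fact that every finite f64 is an integer times a power of two, which is what makes a lossless conversion to a Fraction possible. The sketch below shows that decomposition only; `decompose` is a hypothetical helper name, and wiring the result into the Fraction type from #16 is left out.

```rust
/// Split a finite f64 into (mantissa, exponent) such that
/// x == mantissa * 2^exponent holds exactly.
/// Returns None for NaN and infinities, which have no exact ratio.
fn decompose(x: f64) -> Option<(i64, i32)> {
    if !x.is_finite() {
        return None;
    }
    let bits = x.to_bits();
    let sign: i64 = if bits >> 63 == 0 { 1 } else { -1 };
    let exp_bits = ((bits >> 52) & 0x7ff) as i32;
    let frac = (bits & 0x000f_ffff_ffff_ffff) as i64;
    let (mantissa, exponent) = if exp_bits == 0 {
        // Zero and subnormals: no implicit leading bit.
        (frac, -1074)
    } else {
        // Normal numbers: restore the implicit leading 1 bit.
        (frac | (1_i64 << 52), exp_bits - 1075)
    };
    Some((sign * mantissa, exponent))
}
```

A negative exponent corresponds to a power-of-two denominator, so each input maps to a rational with no rounding, and the sum of those rationals can then be rounded to f64 once at the end.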
Phase 2: Median & mode
- median, median_low, median_high, median_grouped
- mode, multimode
- quantiles (exclusive and inclusive methods)
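The two quantiles methods differ only in how a cut point index is mapped onto the sorted data. The sketch below follows the integer-index interpolation used in Lib/statistics.py, restricted to f64 input; the `Method` enum and the `Option` return are assumptions made for this example, and it requires at least two data points.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Method {
    Exclusive,
    Inclusive,
}

/// Return the n - 1 cut points dividing the data into n intervals.
/// None stands in for StatisticsError (n < 1 or fewer than 2 points).
pub fn quantiles(data: &[f64], n: usize, method: Method) -> Option<Vec<f64>> {
    if n < 1 || data.len() < 2 {
        return None;
    }
    let mut v = data.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let ld = v.len() as i64;
    let mut result = Vec::with_capacity(n - 1);
    let n = n as i64;
    for i in 1..n {
        // (lo, hi) are the neighbouring indices; delta / n is the
        // interpolation weight, kept in exact integer arithmetic.
        let (lo, hi, delta) = match method {
            Method::Inclusive => {
                let m = ld - 1;
                let j = i * m / n;
                (j, j + 1, i * m % n)
            }
            Method::Exclusive => {
                let m = ld + 1;
                let j = (i * m / n).clamp(1, ld - 1);
                (j - 1, j, i * m - j * n)
            }
        };
        let low = v[lo as usize] * (n - delta) as f64;
        let high = v[hi as usize] * delta as f64;
        result.push((low + high) / n as f64);
    }
    Some(result)
}
```

In the exclusive method, clamping the index can make `delta` negative or larger than `n`, which extrapolates beyond the sample range instead of failing.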
Phase 3: Variance & standard deviation
- Internal _ss() helper (sum of squared deviations via Fraction)
- variance, pvariance, stdev, pstdev
- _float_sqrt_of_frac() for precision-preserving sqrt
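For orientation, here is how the four spread functions relate to a shared sum-of-squares helper. This is a float-only two-pass sketch and deliberately not the Fraction-exact approach planned above: `sum_sq_dev` is a hypothetical stand-in for _ss(), `f64::sqrt` stands in for _float_sqrt_of_frac(), and the xbar/mu parameters are omitted.

```rust
/// Two-pass sum of squared deviations from the mean, in plain f64.
/// The planned port would compute this exactly with Fraction.
fn sum_sq_dev(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let mean = data.iter().sum::<f64>() / data.len() as f64;
    Some(data.iter().map(|x| (x - mean).powi(2)).sum())
}

/// Sample variance: divides by n - 1, so it needs two or more points.
pub fn variance(data: &[f64]) -> Option<f64> {
    if data.len() < 2 {
        return None;
    }
    Some(sum_sq_dev(data)? / (data.len() - 1) as f64)
}

/// Population variance: divides by n.
pub fn pvariance(data: &[f64]) -> Option<f64> {
    Some(sum_sq_dev(data)? / data.len() as f64)
}

/// Sample standard deviation.
pub fn stdev(data: &[f64]) -> Option<f64> {
    variance(data).map(f64::sqrt)
}

/// Population standard deviation.
pub fn pstdev(data: &[f64]) -> Option<f64> {
    pvariance(data).map(f64::sqrt)
}
```

A float-only version like this is useful as a baseline in tests: the exact implementation should agree with it to within a few ulps on well-conditioned data and outperform it on data with a large mean and small spread.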
Phase 4: Bivariate statistics
- covariance
- correlation (linear and ranked methods)
- linear_regression (with proportional option)
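The proportional option changes only whether the data is centred before fitting. A minimal sketch of that logic, assuming f64 slices and returning a `(slope, intercept)` tuple in `Option` (both choices are illustrative; CPython returns a named tuple and raises StatisticsError):

```rust
/// Ordinary least squares fit of y = slope * x + intercept.
/// With `proportional` set, the line is forced through the origin.
/// None stands in for StatisticsError.
pub fn linear_regression(
    x: &[f64],
    y: &[f64],
    proportional: bool,
) -> Option<(f64, f64)> {
    let n = x.len();
    if n != y.len() || n < 2 {
        return None;
    }
    // Proportional fits skip centring, which is the same as using
    // zero for both means.
    let (xbar, ybar) = if proportional {
        (0.0, 0.0)
    } else {
        (
            x.iter().sum::<f64>() / n as f64,
            y.iter().sum::<f64>() / n as f64,
        )
    };
    let sxy: f64 = x
        .iter()
        .zip(y)
        .map(|(xi, yi)| (xi - xbar) * (yi - ybar))
        .sum();
    let sxx: f64 = x.iter().map(|xi| (xi - xbar).powi(2)).sum();
    if sxx == 0.0 {
        return None; // x is constant, so the slope is undefined
    }
    let slope = sxy / sxx;
    let intercept = if proportional { 0.0 } else { ybar - slope * xbar };
    Some((slope, intercept))
}
```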
Phase 5: NormalDist
- Constructor, properties (mean, median, mode, stdev, variance)
- pdf, cdf (via erf from math::erf)
- inv_cdf (Wichura's Algorithm AS241)
- overlap, zscore
- Arithmetic operators
- from_samples, samples, quantiles
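A possible starting shape for the type, covering only the parts that need no erf or AS241: the constructor, a few properties, pdf, zscore, and addition through `std::ops::Add`. The struct layout and the `Option`-returning constructor are assumptions for this sketch, and the sigma == 0 case in pdf and zscore (an error in CPython) is not handled.

```rust
use std::f64::consts::PI;
use std::ops::Add;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NormalDist {
    mu: f64,
    sigma: f64,
}

impl NormalDist {
    /// None stands in for StatisticsError on a negative or NaN sigma.
    pub fn new(mu: f64, sigma: f64) -> Option<Self> {
        if sigma >= 0.0 {
            Some(Self { mu, sigma })
        } else {
            None
        }
    }

    pub fn mean(&self) -> f64 {
        self.mu
    }

    pub fn stdev(&self) -> f64 {
        self.sigma
    }

    pub fn variance(&self) -> f64 {
        self.sigma * self.sigma
    }

    /// Probability density at x. Assumes sigma > 0.
    pub fn pdf(&self, x: f64) -> f64 {
        let var = self.variance();
        let diff = x - self.mu;
        (-diff * diff / (2.0 * var)).exp() / (2.0 * PI * var).sqrt()
    }

    /// Number of standard deviations x lies from the mean.
    /// Assumes sigma > 0.
    pub fn zscore(&self, x: f64) -> f64 {
        (x - self.mu) / self.sigma
    }
}

/// Sum of two independent normal variables: means add and the
/// standard deviations combine as a hypotenuse.
impl Add for NormalDist {
    type Output = NormalDist;
    fn add(self, rhs: NormalDist) -> NormalDist {
        NormalDist {
            mu: self.mu + rhs.mu,
            sigma: self.sigma.hypot(rhs.sigma),
        }
    }
}

/// Adding a constant shifts the mean and leaves sigma unchanged.
impl Add<f64> for NormalDist {
    type Output = NormalDist;
    fn add(self, rhs: f64) -> NormalDist {
        NormalDist { mu: self.mu + rhs, sigma: self.sigma }
    }
}
```

Subtraction and scalar multiplication and division would follow the same pattern with `Sub`, `Mul<f64>`, and `Div<f64>`.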
Phase 6: KDE
- kde with all kernel types (normal, logistic, rectangular, triangular, etc.)
- kde_random
- Cumulative mode support
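Because kde returns a callable, the natural Rust counterpart is a function that returns a closure. The sketch below covers only the normal kernel in PDF mode; `kde_normal` is a hypothetical name, `Option` stands in for StatisticsError, and a real port would presumably dispatch on a kernel enum and a cumulative flag.

```rust
use std::f64::consts::PI;

/// Build a kernel density estimate with a normal kernel and
/// bandwidth h, returned as a closure evaluating the PDF at x.
/// None stands in for StatisticsError (empty data or h <= 0).
pub fn kde_normal(data: &[f64], h: f64) -> Option<impl Fn(f64) -> f64> {
    if data.is_empty() || !(h > 0.0) {
        return None;
    }
    // The closure owns its own copy of the sample.
    let sample = data.to_vec();
    let n = sample.len() as f64;
    let norm = (2.0 * PI).sqrt();
    Some(move |x: f64| {
        let total: f64 = sample
            .iter()
            .map(|xi| {
                let t = (x - xi) / h;
                (-t * t / 2.0).exp() / norm
            })
            .sum();
        total / (n * h)
    })
}
```

Each call to the closure is linear in the sample size, since every data point contributes to a normal-kernel estimate.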
Phase 7: Testing
- pyo3 proptest against CPython statistics module
- Edge cases: empty data, single element, identical values, NaN/Inf handling
- Precision verification for Fraction-based functions
Feature flag
[features]
statistics = ["fractions"] # depends on fractions module
Out of scope
- Decimal input support (separate concern)
- random module dependency for sampling (NormalDist.samples, kde_random)
References
- CPython Lib/statistics.py
- https://docs.python.org/3/library/statistics.html
- Wichura, M.J. (1988) Algorithm AS241