Add hsm_mode: Half-Sample Mode for continuous data by anurag-mds · Pull Request #984 · JuliaStats/StatsBase.jl

anurag-mds · 2025-12-27T09:12:39Z

Summary

This PR adds hsm_mode(), an implementation of the half-sample mode (HSM), a robust estimator of the mode for continuous distributions.

It is introduced as a separate function (not an overload of mode()) to preserve existing behaviour while providing a statistically meaningful alternative for continuous data.

This addresses and closes issue #957.

Motivation

StatsBase.mode() is frequency-based and works well for discrete data. For continuous distributions, however, samples are usually unique, so every value has a count of one and the result is unstable and highly variable in practice.

Issue #957 documents this behaviour, particularly for heavy-tailed distributions, where mode() can show extreme variance.

This PR provides an estimator designed specifically for continuous data.
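
To illustrate the motivation, here is a minimal sketch that uses only Base and the Random standard library. `frequency_mode` is a hypothetical stand-in for a counting-based mode, written for this example; it is not StatsBase's implementation.

```julia
using Random

# Hypothetical counting-based mode: returns the first value to reach the highest count.
function frequency_mode(x)
    counts = Dict{eltype(x),Int}()
    best, best_count = first(x), 0
    for v in x
        c = get(counts, v, 0) + 1
        counts[v] = c
        if c > best_count
            best, best_count = v, c
        end
    end
    return best
end

rng = MersenneTwister(1)
sample = randn(rng, 1_000)   # continuous draws: exact repeats are practically impossible

# With no repeated values every count is 1, so the result is simply the first element.
println(allunique(sample))
println(frequency_mode(sample) == first(sample))

# On discrete data the same function behaves as expected.
println(frequency_mode([1, 2, 2, 3]))
```

Under these assumptions the "mode" of the continuous sample depends only on the order of the data, which is the instability described above.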

Approach

hsm_mode() implements the standard half-sample method described in the literature:

Non-finite values (NaN, Inf) are filtered
The data are sorted
The algorithm repeatedly selects the contiguous half-sample with the smallest width
Once ≤ 2 points remain, the midpoint of the final interval is returned
The midpoint may not be a sample value, but provides a stable estimate of the location of highest density.
Time complexity is dominated by sorting (O(n log n)); space complexity is O(n).
After sorting, the contraction loop operates on SubArray views to avoid allocations.
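
The steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the PR's code: `half_sample_mode` is a hypothetical name, the half-sample size is taken as `cld(len, 2)` (the `ceil(len/2)` definition mentioned later in the thread), ties keep the leftmost window, and the window is tracked with two indices rather than SubArray views.

```julia
# Hypothetical helper illustrating the half-sample method.
function half_sample_mode(x::AbstractVector{<:Real})
    v = sort!(float.(filter(isfinite, x)))   # drop NaN/Inf, promote to floating point, sort
    isempty(v) && throw(ArgumentError("no finite values"))
    lo, hi = 1, length(v)                    # current window is v[lo:hi]
    while hi - lo + 1 > 2
        half = cld(hi - lo + 1, 2)           # size of the half-sample
        best_i = lo
        best_width = v[lo + half - 1] - v[lo]
        for i in lo+1:hi-half+1              # slide the half-sample across the window
            w = v[i + half - 1] - v[i]
            if w < best_width
                best_i, best_width = i, w
            end
        end
        lo, hi = best_i, best_i + half - 1   # contract to the narrowest half-sample
    end
    return (v[lo] + v[hi]) / 2               # midpoint of the final one or two points
end

println(half_sample_mode([1.0, 2.0, 2.25, 2.75, 10.0]))   # prints 2.125
```

For this hypothetical input the first pass keeps [2.0, 2.25, 2.75] and the second keeps [2.0, 2.25], so the outlier 10.0 never influences the result.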

API Design

The estimator is exposed as a new function:

hsm_mode(x::AbstractVector{T}) where T<:Real

It is NOT added as an overload of mode() in order to:

avoid changing existing semantics
clearly distinguish frequency-based and density-based estimation
let users choose the appropriate method explicitly

The return type is an AbstractFloat, promoted from the input element type (e.g. integers → Float64, Float32 → Float32).
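
The promotion rule described here matches what Base's `float` does for element types. The snippet below is a small sketch of that behaviour; whether the PR uses `float` internally is an assumption, and `midpoint` is a hypothetical helper.

```julia
# `float` maps a numeric type to its floating-point counterpart.
println(float(Int))       # Float64
println(float(Float32))   # Float32

# Hypothetical helper: a midpoint computed on promoted values keeps the promoted type.
midpoint(a, b) = (float(a) + float(b)) / 2

println(typeof(midpoint(1, 2)))       # Float64
println(typeof(midpoint(1f0, 2f0)))   # Float32
```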

Testing and Documentation

Tests cover basic correctness, edge cases, robustness to outliers, handling of non-finite values, and type behavior. All tests pass.

The docstring explains intended use cases, compares with mode(), documents complexity, provides examples, and cites the relevant literature.

References

Robertson & Cryer (1974), JASA

Bickel & Fruehwirth (2006), CSDA

Notes

This PR is intentionally small and focused. Extensions such as weighted HSM or support for missing values are left for future work.

I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.


Feedback on naming or API placement is welcome.

devmotion · 2025-12-27T11:01:55Z

Was this PR AI-generated?

anurag-mds · 2025-12-27T12:10:38Z

The implementation, tests, and commits are mine. I reviewed existing StatsBase PRs to match project conventions, and I used AI assistance only for wording clarity in the PR description and for a general review of performance and of bugs I had made. It was not used for the algorithm or the code. The CI pipeline flagged documentation issues that I had completely overlooked, and I am actively fixing them, along with details such as using ceil(len/2) as per the definition. I am happy to explain any design choices in detail.

ForceBru · 2025-12-28T00:06:10Z

introduced as a separate function (not an overload of mode()) to preserve existing behaviour

I feel like the existing behavior (for floating-point numbers) is a bug. If your data are not integers, we must assume that they come from a continuous distribution and use the HSM algorithm or another estimator for continuous data.

Or perhaps introduce a keyword argument:

mode(x::AbstractVector{<:AbstractFloat}; discrete::Bool=false) =
  discrete ? mode_discrete(x) : mode_hsm(x)

mode(x::AbstractVector) = mode_discrete(x)

Here, mode_discrete is what's currently called mode. The keyword argument lets one use the high-variance estimator for AbstractFloat.

I'm not a contributor to Distributions.jl, though (although I'd like to become a contributor; my complete PR with hyperbolic distributions seems to have joined its peers, unfortunately), that's just my opinion.

anurag-mds · 2025-12-28T06:08:38Z

I think using frequency-based mode for floats can be misleading.

But making HSM the default for AbstractFloat might break cases where floats are actually categories or rounded values.

Maybe a keyword like discrete=false or a dispatch mode(x, HSM()) works better. I can change the API like that if the maintainers think it’s right.

anurag-mds · 2025-12-28T07:14:36Z

@ForceBru You have made an excellent point about API design consistency. You're right that distinguishing frequency-based vs density-based estimation matters.

Here's the evidence for why hsm_mode() should remain separate:

Insight: Frequency-based mode() counts collisions on continuous data, where samples are usually unique. The "most frequent" element becomes arbitrary and order-dependent. HSM finds the actual density peak instead.

Design:

mode(): Frequency-based, correct for discrete data [keep as-is]
hsm_mode(): Density-based, essential for continuous data [separate]

Users can explicitly choose the right tool. No silent behavior changes.

So are you okay with this separate-function approach, or shall we explore the keyword argument variant further?

anurag-mds · 2025-12-29T20:58:20Z

All tests pass, edge cases handled, documentation added. @devmotion are there any remaining technical concerns with hsm_mode() or its API before approval?

nalimilan · 2026-01-05T21:12:16Z

Thanks for the PR!

I tend to agree that a keyword argument would make sense here. The mode is a statistical concept which can be estimated in various ways. The one currently used by mode is the simplest, and of course it's bad for continuous samples unless you apply some binning. But mode(Normal()) returns zero already so the concept makes sense in general.

The API could look like mode(x, method=:halfsample).
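
A keyword-based API of this shape could be sketched as below. Everything here is hypothetical and for illustration only: `my_mode`, `frequency_mode`, and `halfsample` are made-up names, and the symbols `:frequency` and `:halfsample` are assumptions based on names that appear in this thread.

```julia
# Hypothetical counting-based estimator: first value in `x` with the highest count.
function frequency_mode(x)
    counts = Dict{eltype(x),Int}()
    for v in x
        counts[v] = get(counts, v, 0) + 1
    end
    top = maximum(values(counts))
    return first(v for v in x if counts[v] == top)
end

# Hypothetical recursive half-sample estimator; `v` must be sorted and non-empty.
function halfsample(v)
    n = length(v)
    n <= 2 && return (first(v) + last(v)) / 2
    h = cld(n, 2)
    i = argmin([v[j + h - 1] - v[j] for j in 1:n-h+1])   # leftmost narrowest half-sample
    return halfsample(view(v, i:i+h-1))
end

# One entry point; the estimator is selected by keyword.
function my_mode(x::AbstractVector{<:Real}; method::Symbol=:frequency)
    isempty(x) && throw(ArgumentError("mode is not defined for empty collections"))
    if method === :frequency
        return frequency_mode(x)
    elseif method === :halfsample
        return halfsample(sort(float.(x)))
    else
        throw(ArgumentError("unknown method :$method; expected :frequency or :halfsample"))
    end
end

println(my_mode([1, 2, 2, 3, 4, 4, 4, 5]))                       # prints 4
println(my_mode([1, 2, 2, 3, 4, 4, 4, 5]; method=:halfsample))   # prints 4.0
```

Keeping the default on the counting estimator means existing callers see no change, while an unknown symbol fails loudly with an ArgumentError.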

src/scalarstats.jl

test/scalarstats.jl

anurag-mds · 2026-01-06T08:53:05Z

Thanks for your detailed review. This is very helpful. I'll address the points you raised (API via keyword, iterator support, middle, non-finite handling, test adjustments) and fix the CI failures.

anurag-mds · 2026-01-06T23:54:28Z

Apologies for the back-and-forth and the CI noise. I'm currently away from my main development setup, but I’ve noted all changes precisely and will push a consolidated update once I’m back. I’ll comment again once the updates are in.
Thanks again for the detailed guidance.

anurag-mds · 2026-01-25T16:24:19Z

@nalimilan Does everything seem good?

nalimilan

Thanks!

src/scalarstats.jl

test/scalarstats.jl


anurag-mds · 2026-02-03T17:28:01Z

Thanks @nalimilan for the thorough review. I agree with the remaining points (doc wording, method naming, references, and test adjustments). I’m addressing them now and will push a final cleanup commit so that everything is consolidated.

anurag-mds · 2026-02-03T18:20:05Z

@nalimilan I have made the necessary changes. If there is anything to fix, modify, or add, do let me know!
Thanks a lot for the guidance.

anurag-mds · 2026-02-10T07:30:40Z

@nalimilan What do you think?

src/scalarstats.jl

test/scalarstats.jl

nalimilan · 2026-02-10T20:39:44Z

test/scalarstats.jl

+@test mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:default) == 4
+@test mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:halfsample) == 4.0  # Test halfsample with range
+@test_throws ArgumentError mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:invalid)


Test with 2:4 to actually test that argument.

test/scalarstats.jl

src/scalarstats.jl


anurag-mds · 2026-02-11T13:55:09Z

Everything noted.
I will make sure that all issues are properly addressed.

anurag-mds · 2026-02-20T20:00:10Z

Rebasing took more time than expected. Is everything good, @nalimilan?

anurag-mds · 2026-02-27T11:12:31Z

Is there anything left, @nalimilan?

src/scalarstats.jl

devmotion · 2026-02-28T23:07:39Z

src/scalarstats.jl

+    filtered = sort([x for x in a if isfinite(x)])
+    len = length(filtered)
+
+    len == 0 && throw(ArgumentError("mode is not defined for collections with no finite values"))


I can't see this restriction in the other mode implementation above. So maybe that is only a limitation of half-sample mode?

Yes, it is

As you can see for Regular Frequency Mode:

Consider [NaN, Inf, -Inf]
Here,

It can still count frequencies

NaN appears once, Inf appears once

Returns first one

Now, for HSM Mode:

Consider the same array,
Here,

Filters out all non-finite values

Nothing left to calculate

MUST throw error

Cannot find a cluster with no data!
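
The contrast described above can be reproduced with a few lines of Base Julia. This is a sketch of the reasoning, not StatsBase's code; `check_nonempty` is a hypothetical guard that reuses the message from the diff under review.

```julia
data = [NaN, Inf, -Inf]

# Counting still works: each value is a distinct Dict key (NaN keys match via `isequal`).
counts = Dict{Float64,Int}()
for v in data
    counts[v] = get(counts, v, 0) + 1
end
println(length(counts))   # prints 3: every value occurs exactly once

# Filtering non-finite values leaves nothing to estimate from.
finite = filter(isfinite, data)
println(isempty(finite))  # prints true

# Hypothetical guard: fail explicitly instead of indexing into an empty vector.
function check_nonempty(v)
    isempty(v) &&
        throw(ArgumentError("mode is not defined for collections with no finite values"))
    return v
end
```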

Then the error message should be more explicit:

Suggested change

-    len == 0 && throw(ArgumentError("mode is not defined for collections with no finite values"))
+    len == 0 && throw(ArgumentError("mode with `method=:halfsample` is not defined " *
+                                    "for collections with no finite values"))

anurag-mds · 2026-03-02T15:33:39Z

I've addressed the final performance suggestions (sort! and LazyString). Ready for final review.
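
For readers unfamiliar with the two suggestions, here is a minimal sketch of what they typically look like. How the PR applies them is not shown in this thread, so the code below is an assumption; `check_method` is a hypothetical helper, and `LazyString` requires Julia 1.8 or later.

```julia
a = [3.0, NaN, 1.0, Inf, 2.0]

# `filter` already returns a fresh vector, so sorting it in place with `sort!`
# avoids the extra copy that `sort` would make. The input `a` is left untouched.
filtered = sort!(filter(isfinite, a))
println(filtered)   # prints [1.0, 2.0, 3.0]

# `LazyString` defers building the message until the error is actually displayed,
# which keeps string construction out of the non-error path.
function check_method(method::Symbol)
    method in (:frequency, :halfsample) ||
        throw(ArgumentError(LazyString("unsupported method: ", method)))
    return method
end

println(check_method(:halfsample))   # prints halfsample
```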
@nalimilan @devmotion

anurag-mds · 2026-03-06T07:38:28Z

Can you please review the updated changes now?

nalimilan

Sorry for the delay. Looks almost ready to me.

@devmotion Do you have better ideas of a name for method=:frequency? The definition of the mode implies frequency whatever the estimation method, but I can't find a better name...

src/scalarstats.jl

test/scalarstats.jl

anurag-mds · 2026-03-13T06:58:07Z

@nalimilan Done, I have applied the suggested changes.
I am extremely sorry for such a noisy PR, but I have learned a lot about the standards and style that Julia follows.

src/scalarstats.jl

test/scalarstats.jl

src/scalarstats.jl

anurag-mds · 2026-03-14T08:58:14Z

Fixed the remaining points: the "at most" typo is corrected, the exact test value is pinned (1.0750000000000002), the error message is updated to 'for collections containing only NaN values', the filteredv view is used throughout for type stability, the frequency tests are updated to use a longer vector, and the docstring signature is added.

One change I made proactively, based on your comment about Inf: I changed the filter from isfinite to !isnan, so Inf and -Inf are now allowed through.
As a result, mode([NaN, Inf, -Inf]) returns NaN rather than throwing.

Happy to revert this if you'd prefer the stricter behaviour.

nalimilan · 2026-03-14T21:51:49Z

mode([NaN, Inf, -Inf]) returns NaN rather than throwing.

Please throw an ArgumentError instead in the presence of NaN, this seems more explicit to me and it's easier to change later if we want.
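
One reason an explicit error is easier to reason about: NaN makes every ordering comparison false, so interval widths involving it cannot be ranked and the NaN silently propagates. A minimal sketch, with `reject_nan` as a hypothetical validation helper and an illustrative message:

```julia
# Every ordering comparison against NaN is false, so such a width can never be "smallest".
width = NaN - 1.0
println((width < 0.5, width > 0.5))   # prints (false, false)

# The midpoint of the infinite endpoints is NaN as well.
println(isnan((Inf + -Inf) / 2))      # prints true

# Hypothetical validation step: reject NaN up front instead of returning NaN.
function reject_nan(x)
    any(isnan, x) &&
        throw(ArgumentError("mode with `method=:halfsample` is not defined for collections containing NaN"))
    return x
end
```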

nalimilan · 2026-03-14T21:52:22Z

src/scalarstats.jl

+    while len > 2
+        half = cld(len, 2)
+        best_i = 1
+        best_width = filtered[half] - filtered[1]


Need to use filteredv here and below, right? I'm a bit worried that this wasn't caught by tests. Isn't this covered?

The reason the tests didn't catch it is that the previous test cases were too simple: they didn't 'squeeze' the data enough to show the mistake. I've now updated the code to use filteredv everywhere and added a more complex test case ([1.0, 2.0, ... 13.0]) that would have failed with the old bug.

I also switched to a strict ArgumentError for any non-finite values as you suggested. Ready for another look!
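
The kind of mistake discussed in this thread can be reproduced with a small reconstruction. The loop shape below is an assumption modelled on the diff excerpt, `hsm` and `use_view` are hypothetical names, and the data are made up for illustration (they are not the PR's test case). The only difference between the two variants is whether widths are measured on the current window or on the full sorted array.

```julia
function hsm(sorted::Vector{Float64}; use_view::Bool)
    v = view(sorted, 1:length(sorted))       # current window
    while length(v) > 2
        len = length(v)
        half = cld(len, 2)
        src = use_view ? v : sorted          # the bug: measuring widths on the full array
        best_i = 1
        best_width = src[half] - src[1]
        for i in 2:len-half+1
            w = src[i + half - 1] - src[i]
            if w < best_width
                best_i, best_width = i, w
            end
        end
        v = view(v, best_i:best_i+half-1)    # contract the window
    end
    return (first(v) + last(v)) / 2
end

data = [1.0, 2.0, 4.0, 7.0, 8.0, 8.5, 8.75, 9.5]   # densest region sits at the upper end

println(hsm(data; use_view=true))    # prints 8.625
println(hsm(data; use_view=false))   # prints 8.25
```

The first contraction is identical in both variants because the window still covers the whole array. The results diverge only once the window starts at an offset, which is why inputs that need a single contraction, or whose densest region sits at the start, cannot expose the problem.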

anurag-mds · 2026-03-14T22:59:53Z

I also wanted to say I'm genuinely sorry for the extra work this PR has caused. When this PR was created it was my very first PR, and I'm honestly a bit embarrassed that I ended up increasing your workload instead of making it easier. I’m learning a lot about the project's standards through this process, and I really appreciate your patience and guidance in helping me get this implementation right.

anurag-mds closed this Dec 27, 2025

anurag-mds reopened this Dec 27, 2025

nalimilan reviewed Jan 5, 2026


anurag-mds marked this pull request as draft January 25, 2026 15:49

anurag-mds marked this pull request as ready for review January 25, 2026 16:21

nalimilan reviewed Jan 31, 2026


anurag-mds marked this pull request as draft February 3, 2026 18:11

anurag-mds marked this pull request as ready for review February 3, 2026 18:19

nalimilan reviewed Feb 10, 2026


anurag-mds marked this pull request as draft February 20, 2026 19:50

anurag-mds marked this pull request as ready for review February 20, 2026 19:59

devmotion reviewed Feb 28, 2026


anurag-mds requested review from devmotion and nalimilan March 1, 2026 06:26

nalimilan reviewed Mar 12, 2026


anurag-mds force-pushed the hsm-mode branch 9 times, most recently from 95ab2f5 to fe1bc03 Compare March 13, 2026 06:47

anurag-mds requested a review from nalimilan March 13, 2026 06:56

nalimilan reviewed Mar 13, 2026


anurag-mds force-pushed the hsm-mode branch from fe1bc03 to f5ca39a Compare March 14, 2026 08:43

anurag-mds requested a review from nalimilan March 14, 2026 09:00

nalimilan reviewed Mar 14, 2026


Add half-sample mode via mode(x; method=:halfsample)

9c163d9

anurag-mds force-pushed the hsm-mode branch from f5ca39a to 9c163d9 Compare March 14, 2026 22:54

anurag-mds requested a review from nalimilan March 14, 2026 23:00
