Add hsm_mode: Half-Sample Mode for continuous data#984
Add hsm_mode: Half-Sample Mode for continuous data#984anurag-mds wants to merge 1 commit intoJuliaStats:masterfrom
Conversation
|
Was this PR AI-generated? |
|
The implementation, test and commits are mine. I reviewed existing StatsBase prs to match project conventions, and I used AI assistance only for wording clarity in the pr description and for general performance review and bugs made by me. it was not used for algorithm or code . since in ci pipeline documentation issues which I totally forgot was missing I am actively fixing that like using ceil(len/2) as per the definition. I am eager to explain any design or choices in detail |
I feel like the existing behavior (for floating-point numbers) is a bug. If your data are not integers, we must assume that they come from a continuous distribution and use the HSM algorithm or another estimator for continuous data. Or perhaps introduce a keyword argument: mode(x::AbstractVector{<:AbstractFloat}; discrete::Bool=false) =
discrete ? mode_discrete(x) : mode_hsm(x)
mode(x::AbstractVector) = mode_discrete(x)Here, I'm not a contributor to Distributions.jl, though (although I'd like to become a contributor; my complete PR with hyperbolic distributions seems to have joined its peers, unfortunately), that's just my opinion. |
|
I think using frequency-based mode for floats can be misleading. But making HSM the default for AbstractFloat might break cases where floats are actually categories or rounded values. Maybe a keyword like discrete=false or a dispatch mode(x, HSM()) works better. I can change the API like that if the maintainers think it’s right. |
|
@ForceBru Here's the evidence for why
As you can see Design:
Users can explicitly choose the right tool. No silent behavior changes. So are you okay with this separate function approach or shall we explore the keyword argument variant further ? |
|
All tests pass, edge cases handled, documentation added. @devmotion are there any remaining technical concerns with hsm_mode() or its API before approval? |
|
Thanks for the PR! I tend to agree that a keyword argument would make sense here. The mode is a statistical concept which can be estimated in various ways. The one currently used by The API could look like |
|
Thanks for your detailed review |
|
Apologies for the back-and-forth and the CI noise I'm currently away from my main development setup, but I’ve noted all changes precisely and will push a consolidated update shortly once I’m back. I’ll comment again once the updates are in. |
|
@nalimilan Does everything seems good? |
|
Thanks @nalimilan for the thorough review. I agree with the remaining points (doc wording, method naming, references, and test adjustments). I’m addressing them now and will push a final cleanup commit so that |
|
@nalimilan I have made necessary changes if anything to fix, modify or add do let me know! |
|
@nalimilan What do you think? |
test/scalarstats.jl
Outdated
| @test mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:default) == 4 | ||
| @test mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:halfsample) == 4.0 # Test halfsample with range | ||
| @test_throws ArgumentError mode([1, 2, 2, 3, 4, 4, 4, 5], 1:5, method=:invalid) |
There was a problem hiding this comment.
Test with 2:4 to actually test that argument.
|
Everything Noted. |
|
Rebasing took more time Is everything good @nalimilan ? |
|
Is there anything left @nalimilan ? |
src/scalarstats.jl
Outdated
| filtered = sort([x for x in a if isfinite(x)]) | ||
| len = length(filtered) | ||
|
|
||
| len == 0 && throw(ArgumentError("mode is not defined for collections with no finite values")) |
There was a problem hiding this comment.
I can't see this restriction in the other mode implementation above. So maybe that is only a limitation of half-sample mode?
There was a problem hiding this comment.
Yes, it is
As you can see for Regular Frequency Mode:
Consider [Nan, Inf, -Inf]
Here,
- It can still count frequencies
- Nan appears once, Inf appears once
- Returns first one
Now, for HSM Mode:
Consider the same array,
Here,
- Filters out all non-finite values
- Nothing left to calculate
- MUST throw error
- Cannot find a cluster with no data!
There was a problem hiding this comment.
Then the error message should be more explicit:
| len == 0 && throw(ArgumentError("mode is not defined for collections with no finite values")) | |
| len == 0 && throw(ArgumentError("mode with `method=:halfsample` is not defined " * | |
| "for collections with no finite values")) |
|
I've addressed the final performance suggestions (sort! and LazyString). Ready for final review. |
|
Can you please review the updated changes now? |
nalimilan
left a comment
There was a problem hiding this comment.
Sorry for the delay. Looks almost ready to me.
@devmotion Do you have better ideas of a name for method=:frequency? The definition of the mode implies frequency whatever the estimation method, but I can't find a better name...
95ab2f5 to
fe1bc03
Compare
|
@nalimilan Done applied the suggested changes |
|
Fixed the remaining points One change I made proactively based on your comment about:
Happy to revert this if you'd prefer the stricter behaviour. |
Please throw an |
src/scalarstats.jl
Outdated
| while len > 2 | ||
| half = cld(len, 2) | ||
| best_i = 1 | ||
| best_width = filtered[half] - filtered[1] |
There was a problem hiding this comment.
Need to use filteredv here and below, right? I'm a bit worried that this wasn't caught by tests. Isn't this covered?
There was a problem hiding this comment.
The reason the tests didn't catch it is that the previous test cases were too simple they didn't 'squeeze' the data enough to show the mistake. I've now updated the code to use filteredv everywhere and added a more complex test case ([1.0, 2.0, ... 13.0]) that would have failed with the old bug.
I also switched to a strict ArgumentError for any non-finite values as you suggested. Ready for another look!
|
I also wanted to say I'm genuinely sorry for the extra work this PR has caused. When this PR was created it was my very first PR, and I'm honestly a bit embarrassed that I ended up increasing your workload instead of making it easier. I’m learning a lot about the project's standards through this process, and I really appreciate your patience and guidance in helping me get this implementation right. |

Summary
This PR adds hsm_mode(), an implementation of the half-sample mode (HSM) which is a robust estimator of the mode for continuous distributions.It is introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour while providing a statistically meaningful alternative for continuous data
This addresses and closes issue #957.
Motivation
StatsBase.mode() is frequency-based and works well for discrete data. For continuous distributions, however, samples are usually unique, which makes frequency counts unstable and highly variable in practiceIssue #957 documents this behaviour, particularly for heavy-tailed distributions, where mode() can show extreme variance.
This PR provides an estimator designed specifically for continuous data
Approach
hsm_mode() implements the standard half-sample method described in the literature:Non-finite values (NaN, Inf) are filtered
The data are sorted
The algorithm repeatedly selects the contiguous half-sample with the smallest width
Once ≤ 2 points remain, the midpoint of the final interval is returned
The midpoint may not be a sample value, but provides a stable estimate of the location of highest density.
Time complexity is dominated by sorting (O(n log n)); space complexity is O(n).
After sorting, the contraction loop operates on SubArray views to avoid allocations.
API Design
The estimator is exposed as a new function:
hsm_mode(x::AbstractVector{T}) where T<:Real
It is NOT added as an overload of mode() in order to:
avoid changing existing semantics
clearly distinguish frequency-based and density-based estimation
let users choose the appropriate method explicitly
The return type is an AbstractFloat, promoted from the input element type (e.g. integers → Float64, Float32 → Float32).
Testing and Documentation
Tests cover basic correctness, edge cases, robustness to outliers, handling of non-finite values, and type behavior. All tests pass.
The docstring explains intended use cases, compares with mode(), documents complexity, provides examples, and cites the relevant literature.
References
Robertson & Cryer (1974), JASA
Bickel & Fruehwirth (2006), CSDA
Notes
This PR is intentionally small and focused. Extensions such as weighted HSM or support for missing values are left for future work.
I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.
Images:






Feedback on naming or API placement is welcome.