Missing values and weighting

We currently have an efficient and consistent solution to skip missing values for unweighted single-argument functions via `f(skipmissing(x))`. For multiple-argument functions like `cor` we don't have a great solution yet (https://github.com/JuliaLang/Statistics.jl/pull/34). Another case where we don't have a good solution is weighted functions, which are not currently in Statistics but should be imported from StatsBase (https://github.com/JuliaLang/Statistics.jl/issues/87).

A reasonable solution would be to use `f(skipmissing(x), weights=w)`, with a typical definition being:
```julia
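# `SkipMissing` is not exported from Base, so it is qualified here.
function f(s::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
    size(s.x) == size(weights) ||
        throw(DimensionMismatch("data and weights must have the same size"))
    # Indices of the non-missing entries in the original vector
    inds = findall(!ismissing, s.x)
    # Forward to the method for plain vectors, dropping the matching weights
    f(view(s.x, inds), weights=view(weights, inds))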
end
```
That is, we would assume that weights refer to the original vector so that we skip those corresponding to missing entries. This is admittedly a bit weird in terms of implementation as weights are not wrapped in `skipmissing`. A wrapper like `skipmissing(weighted(x, w))` (inspired by what was proposed at https://github.com/JuliaLang/julia/pull/33310) would be cleaner in that regard. But that would still be quite ad-hoc, as `skipmissing` currently only accepts collections (and `weighted` cannot be one since it's not just about multiplying weights and values), and the resulting object would basically be only used for dispatch without implementing any common methods.
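To make the two spellings concrete, here is a minimal sketch using a weighted mean. All names (`wmean`, `Weighted`, `weighted`) are hypothetical and exist only for illustration; none of them is part of Statistics or StatsBase. The sketch also reads the internal field `x` of `Base.SkipMissing`, which is an implementation detail rather than public API.

```julia
# Hypothetical names throughout: `wmean`, `Weighted` and `weighted` are
# illustrations only, not part of Statistics or StatsBase.

# Plain weighted mean on vectors that contain no missing values.
function wmean(x::AbstractVector; weights::AbstractVector)
    size(x) == size(weights) ||
        throw(DimensionMismatch("x and weights must have the same size"))
    return sum(x .* weights) / sum(weights)
end

# Keyword form: wmean(skipmissing(x), weights=w).
# Weights refer to the original vector; those matching missing entries are dropped.
# Relies on the internal field `x` of Base.SkipMissing.
function wmean(s::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
    size(s.x) == size(weights) ||
        throw(DimensionMismatch("x and weights must have the same size"))
    inds = findall(!ismissing, s.x)
    return wmean(view(s.x, inds); weights=view(weights, inds))
end

# Wrapper form: wmean(skipmissing(weighted(x, w))).
# `Weighted` implements no collection methods; it is used only for dispatch.
struct Weighted{T<:AbstractVector,W<:AbstractVector}
    x::T
    w::W
end
weighted(x, w) = Weighted(x, w)
wmean(s::Base.SkipMissing{<:Weighted}) = wmean(skipmissing(s.x.x); weights=s.x.w)

x = [1.0, missing, 3.0]
w = [1.0, 5.0, 3.0]
a = wmean(skipmissing(x), weights=w)     # (1*1 + 3*3) / (1 + 3) = 2.5
b = wmean(skipmissing(weighted(x, w)))   # same computation via the wrapper
```

In this sketch the wrapper method simply forwards to the keyword method, which illustrates the point above: the `Weighted` object carries no behavior of its own beyond selecting a method.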

The generalization to multiple-argument functions poses the same challenges as `cor`. For these, the simplest solution would be to use a `skipmissing` keyword argument, a bit like [`pairwise`](https://juliastats.org/StatsBase.jl/latest/misc/#StatsAPI.pairwise). Again, the alternative would be to use wrappers like `skipmissing(weighted(w, x, y))`.
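As a rough sketch of the keyword approach for a two-argument function, the following wraps `cor` from Statistics and drops every position where either input is missing. The name `cor_skip` and its Boolean `skipmissing` keyword are assumptions made for this example, not an existing or proposed API.

```julia
using Statistics

# Hypothetical wrapper; `cor_skip` is not part of Statistics or StatsBase.
function cor_skip(x::AbstractVector, y::AbstractVector; skipmissing::Bool=false)
    size(x) == size(y) ||
        throw(DimensionMismatch("x and y must have the same size"))
    skipmissing || return cor(x, y)
    # Keep only the positions where both entries are present.
    inds = [i for i in eachindex(x, y) if !ismissing(x[i]) && !ismissing(y[i])]
    # The comprehensions collect the retained values into new vectors.
    xs = [x[i] for i in inds]
    ys = [y[i] for i in inds]
    return cor(xs, ys)
end

x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, missing, 6.0, 8.0, 10.0]
# Complete pairs are (1, 2), (4, 8) and (5, 10), which lie on a straight line.
r = cor_skip(x, y, skipmissing=true)
```

Note that inside `cor_skip` the keyword shadows the `skipmissing` function, which is one practical cost of reusing that name for a keyword argument.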

Overall, the problem is that we have conflicting goals:
- be able to skip missing values with functions that don't have any special support for them using `f(skipmissing(x))`
- use a similar syntax for unweighted and weighted functions, e.g. `f(skipmissing(x))` vs `f(skipmissing(x), weights=w)`, or `f(skipmissing(x))` vs `f(skipmissing(weighted(x, w)))`, or `f(x, skipmissing=true)` vs `f(x, skipmissing=true, weights=w)`
- use a similar syntax for single- and multiple-argument functions, e.g. `f(skipmissing(x))` vs `f(skipmissing(x, y))`, or `f(x, skipmissing=true)` vs `f(x, y, skipmissing=true)`
- use a similar syntax for simple functions operating on vectors (like `mean`) and complex functions operating on whole tables (like `fit(MODEL, ..., data=df, weights=w)`, which skip missing values by default)
