We currently have an efficient and consistent solution to skip missing values for unweighted single-argument functions via f(skipmissing(x)). For multiple-argument functions like cor we don't have a great solution yet (https://github.com/JuliaLang/Statistics.jl/pull/34). Another case where we don't have a good solution is weighted functions, which are not currently in Statistics but should be imported from StatsBase (https://github.com/JuliaLang/Statistics.jl/issues/87).
A reasonable solution would be to use f(skipmissing(x), weights=w), with a typical definition being:
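    function f(s::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
        # weights are indexed like the wrapped vector, not like the filtered values
        size(s.x) == size(weights) || throw(DimensionMismatch())
        # indices of the non-missing entries (find was renamed to findall in Julia 1.0)
        inds = findall(!ismissing, s.x)
        f(view(s.x, inds), weights=view(weights, inds))
    end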
That is, we would assume that weights are indexed like the original vector, so that the weights corresponding to missing entries are skipped along with them. This is admittedly a bit weird in terms of implementation, as weights are not wrapped in skipmissing. A wrapper like skipmissing(weighted(x, w)) (inspired by what was proposed at JuliaLang/julia#33310) would be cleaner in that regard. But that would still be quite ad hoc, as skipmissing currently only accepts collections (and weighted cannot be one since it's not just about multiplying weights and values), and the resulting object would essentially be used only for dispatch, without implementing any common methods.
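To make the two spellings concrete, here is a minimal sketch using a weighted mean. The names wmean, Weighted and weighted are hypothetical and only serve the illustration; none of them is an existing Statistics API. As far as we can tell, skipmissing does not validate its argument, so wrapping a non-collection goes through mechanically, but the result is not iterable and is only useful for dispatch, which is exactly the drawback described above.

```julia
# Hypothetical weighted mean, used only for illustration.
wmean(x::AbstractVector; weights::AbstractVector) = sum(x .* weights) / sum(weights)

# Spelling 1: weights passed as a keyword and indexed like the original vector.
function wmean(s::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
    size(s.x) == size(weights) ||
        throw(DimensionMismatch("weights must have the same size as the wrapped vector"))
    inds = findall(!ismissing, s.x)
    wmean(view(s.x, inds), weights=view(weights, inds))
end

# Spelling 2: a hypothetical wrapper type that exists only to drive dispatch.
struct Weighted{T<:AbstractVector, W<:AbstractVector}
    x::T
    w::W
end
weighted(x::AbstractVector, w::AbstractVector) = Weighted(x, w)

function wmean(s::Base.SkipMissing{<:Weighted})
    d = s.x
    wmean(skipmissing(d.x), weights=d.w)  # reuse spelling 1
end

x = [1.0, missing, 3.0]
w = [1.0, 5.0, 3.0]

a = wmean(skipmissing(x), weights=w)    # (1*1 + 3*3) / (1 + 3)
b = wmean(skipmissing(weighted(x, w)))  # same computation via the wrapper
```

Both calls drop the weight 5.0 that belongs to the missing entry; the only difference is where the weights travel.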
The generalization to multiple-argument functions poses the same challenges as cor. For these, the simplest solution would be to use a skipmissing keyword argument, a bit like pairwise. Again, the alternative would be to use wrappers like skipmissing(weighted(w, x, y)).
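As a rough sketch of the keyword approach for a two-argument function, the helper below drops every position where either input is missing before calling cor. The name cor_skip is hypothetical, and cor itself does not accept a skipmissing keyword; this only illustrates the intended semantics.

```julia
using Statistics

# Hypothetical helper: Statistics.cor does not take a `skipmissing` keyword.
# Note that the keyword shadows the `skipmissing` function inside the body.
function cor_skip(x::AbstractVector, y::AbstractVector; skipmissing::Bool=false)
    skipmissing || return cor(x, y)
    # eachindex(x, y) throws a DimensionMismatch if the two index sets differ
    keep = [i for i in eachindex(x, y) if !ismissing(x[i]) && !ismissing(y[i])]
    # identity.(v) narrows the element type once no missing values remain
    cor(identity.(x[keep]), identity.(y[keep]))
end

x = [1.0, 2.0, missing, 4.0, 5.0]
y = [2.0, 4.0, 1.0, 8.0, missing]

# Complete pairs are (1, 2), (2, 4) and (4, 8), so y = 2x on the retained data.
r = cor_skip(x, y, skipmissing=true)
```

The point of the design is that the function sees both vectors at once, so it can drop whole observations, which f(skipmissing(x), skipmissing(y)) could not do consistently.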
Overall, the problem is that we have conflicting goals:
- be able to skip missing values with functions that don't have any special support for them using f(skipmissing(x))
- use a similar syntax for unweighted and weighted functions, e.g. f(skipmissing(x)) vs f(skipmissing(x), weights=w), or f(skipmissing(x)) vs f(skipmissing(weighted(x, w))), or f(x, skipmissing=true) vs f(x, skipmissing=true, weights=w)
- use a similar syntax for single- and multiple-argument functions, e.g. f(skipmissing(x)) vs f(skipmissing(x, y)), or f(x, skipmissing=true) vs f(x, y, skipmissing=true)
- use a similar syntax for simple functions operating on vectors (like mean) and complex functions operating on whole tables (like fit(MODEL, ..., data=df, weights=w) and which skip missing values by default)