Add spreadmissings#122
Conversation
|
I just did some re-factoring. New implementation is Miraculously, this infers function infers. |
|
Okay after playing around with this here's what I think an API should be. Consider the call Recall that
It automatically finds the vector values and calls
The problem with If we use |
|
We also need
|
|
Maybe it would be more elegant to restrict it to functions for which all arguments are AbstractVector, i.e. |
|
Thats what I had initially. Then I thought about broadcasting and how it automatically figures out what should be broadcasted. So I updated this PR to emulate that behavior a little. |
|
Another thing is that So it may be better to use in function extend_missings(v::AbstractVector{T}, esample::BitVector)
out = Vector{Union{Missing, T}(missing, length(esample))
out[esample] = v
return out
endThis would allow external packages to define specialized methods of |
|
That's a good point. Maybe I will try to formalize all of this behavior into a separate package first and see how it goes. |
|
It'd be great to have such a function! It's just hard to find the right abstraction, so it's good to experiment. |
|
Requiring packages to extend |
I changed it, but I don't think the implementation is quite correct. In particular, I'm worried about the scalar case. I'm not sure what the importance of the first argument in |
|
I think it would be totally awesome if @nalimilan or @quinnj took over this PR. It's over my head to deal with all the performance considerations and necessary benchmarking. If possible maybe we can merge it as an undocumented feature and see if it works? |
| else | ||
| return f.f(xs...; kwargs...) | ||
| end |
There was a problem hiding this comment.
How about throwing an error in this case for now? It doesn't seem very useful.
There was a problem hiding this comment.
No I disagree. I want people to be able to code defensively with this tool. throwing an error would defeat the purpose a bit.
There was a problem hiding this comment.
OK. Something that occurred to me though: should we do something special if one of the non-vector arguments is missing (like returning missing)? Should we reserve this behavior for the future in case we want spreadmissings to be a more general version of passmissing?
There was a problem hiding this comment.
I don't think so. We definitely can't return missing, we would have to return a vector of missings, which seems very restrictive.
There was a problem hiding this comment.
Not sure what "restrictive" means here. That would be similar to e.g. broadcast(+, missing, [1, 2, 3]).
Whether that makes sense depends on what f does with its inputs, but I can't find examples of functions to which one would pass one or more vectors plus a missing argument and would not expect the resulting vector to be full of missings. So reserving the situation where one argument is missing could be a good idea in case we want to handle it later.
There was a problem hiding this comment.
What do you think? At least throwing an error if any of the inputs is missing for now doesn't seem too problematic?
|
@nalimilan - please ping me when the first round of comments is resolved. Then I will review. Thank you! |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
Okay after the merging of just to confirm, do we want this new feature to live in Missings.jl for now? Or do we want me to move all this stuff to DataFramesMeta. |
|
Can you please summarize briefly the consensus how things should work? Thank you! |
|
(Quoting myself.)
Of course, cor(collect.(skipmissings(x, y))...)
Well, to be honest, I'm more interested in reduction functions. I feel like the need for a uniform missing-skipping syntax for reduction functions is more pressing than the need to handle functions like But I did address the "DataFrames scenario" with non-reduction functions. I proposed either writing them by hand, like this, transform(df, :x => (x -> x .- mean(skipmissing(x))) => :x_c)or adding keyword arguments to I think trying to combine the behavior for reduction functions and non-reduction functions into one function produces a complicated, hard to understand function. |
|
Making cor(collect.(skipmissings(x, y))...)look a little nicer without the |
Base devs rejected that proposal. see here. Looking back, I agree with the logic a bit more. Vectors are a natural way to enforce indices matching, whereas iterators will never be as clear. |
|
Yeah, it's debatable. But it seems like they might be open to a new method like |
|
Copying over from Discourse... It's a little goofy, but we could signal to Setup: struct NeedsArray{T}
x::T
end
const ⋆ = NeedsArrayWhat it would look like: skipm(cor)(⋆x, ⋆y) |
|
Wow, the In view of that unfortunate situation, perhaps the Base devs could be convinced to allow |
|
@CameronBieganek I think you should develop your use cases and/or make a detailed proposal somewhere else. You have hijacked a pull request about a well-defined function which was in the final review phase. This is not a design issue intended at experimenting with various ideas. The result is a confusing discussion which hasn't allowed me to grasp what are your main requirements. |
|
Apologies for hijacking a pull request. I have concerns about the design and usability of |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
|
@nalimilan I have fixed all the issues you have highlighted. Thank you for your feedback! Three things to keep in mind
I've determined this is not due to the fact that How much should I worry about this? Is there anything we can do? Maybe I could re-write the function so it dispatches on EDIT: I' played around with this and it looks like creating specialized structs does fix the type stability issues. However I haven't been able to have both the Is there something we can do to make this work? I don't really want to make users write EDIT 2: Maybe it's impossible to get type stability under any conditions. I'm not 100% sure.
|
|
Have you tried adding
|
I'm well past that point. It would be easier for me to follow if each of these methods had its own function name rather than dispatching on keyword arguments. |
|
@nalimilan this is ready for a review! The inference failures were due to a lack of function barriers. Adding I've filed an issue upstream for a lack of type inference with I also added
It's worth emphasizing again that if none of the inputs are vectors, then behavior is entirely unaffected. So no error is thrown if the inputs are not vectors and the output is a vector. This behavior is indeed complicated, but it's all in the docstring. |
|
@nalimilan Bumping this. Remember that we ultimately want to try this out in DataFramesMeta.jl first. I will port the PR over there eventually but am editing it here. |
nalimilan
left a comment
There was a problem hiding this comment.
Sorry. Here are more comments.
| function spread_missing( | ||
| res::AbstractVector{T}, | ||
| vecs::Tuple, | ||
| nonmissinginds::AbstractVector{<:Integer}, | ||
| nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T} |
There was a problem hiding this comment.
| function spread_missing( | |
| res::AbstractVector{T}, | |
| vecs::Tuple, | |
| nonmissinginds::AbstractVector{<:Integer}, | |
| nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T} | |
| function spread_missing(res::AbstractVector{T}, | |
| vecs::Tuple, | |
| nonmissinginds::AbstractVector{<:Integer}, | |
| nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T} |
Same below (for consistency with current style of the package).
| nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T} | ||
|
|
||
| if length(res) != length(nonmissinginds) | ||
| s = "When spreading a vector result with `spread=$(S)`, " * |
There was a problem hiding this comment.
S doesn't exist here. Apparently also needs testing.
| maybespread_missing(f, newargs, new_kwargs, vecs, nonmissinginds, nonmissingmask) | ||
| # There is at least one vector, but none of the vectors can contain missing | ||
| elseif any(x -> x isa AbstractVector, xs) | ||
| vecs = Base.filter(x -> x isa AbstractVector, xs) |
| julia> xmiss = [10, 20, 30, missing]; | ||
|
|
||
| julia> ymiss = [missing, 500, 400, 300]; |
There was a problem hiding this comment.
Why not reuse the previous objects?
| if !(spread isa AbstractSpread) | ||
| throw(ArgumentError("spread must be either SpreadDefault(), SpreadNonMissing(), or SpreadNone()")) | ||
| end |
There was a problem hiding this comment.
This cannot happen.
| if !(spread isa AbstractSpread) | |
| throw(ArgumentError("spread must be either SpreadDefault(), SpreadNonMissing(), or SpreadNone()")) | |
| end |
| 2. `fillmean_skip` returns a vector which does not allow for `missing`, while | ||
| `spreadmissings(fillmean)` does. | ||
|
|
||
| Use the keyword `spread = :all` to emulate the `skipmissing` behavior. |
| @@ -0,0 +1,227 @@ | |||
| using Test, SparseArrays, Missings | |||
| function right_vec(args...; kwargs...) | ||
| kwargs_vals = values(values(kwargs)) | ||
| xs = tuple(args..., kwargs_vals...) | ||
| vecs = Base.filter(x -> x isa AbstractVector, xs) |
There was a problem hiding this comment.
| vecs = Base.filter(x -> x isa AbstractVector, xs) | |
| vecs = filter(x -> x isa AbstractVector, xs) |
| @test t ≈ categorical([1, 2, 3, missing]) | ||
| @test typeof(t) == typeof(categorical([1, 2, 3, missing])) | ||
| end | ||
|
|
There was a problem hiding this comment.
Can you add tests that inference works?
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
| P = typeof(a) # Type of parent array | ||
| I = Tuple{typeof(nonmissinginds)} # Type of the non-missing indices | ||
| L = Base.IndexStyle(a) === IndexLinear # If the type supports fast linear indexing | ||
| SubArray{T, N, P, I, L}(a, (nonmissinginds,), 0, 1) |
There was a problem hiding this comment.
The docstring for SubArray makes no mention of contructors, and in fact says "Construct SubArrays using the view function." So this direct usage of a SubArray constructor is usage of undocumented Base internals that could change in any minor or patch release.
There was a problem hiding this comment.
Yeah, that's probably true.
I think it will be worthwhile to make our own array type that does this better... hopefully it's not too hard.
There was a problem hiding this comment.
Can be relevant – my SentinelViews.jl package does kinda the opposite. It defines a view type that propagates the sentinel value from indices to the results, sentinelview([10, 20, 30], [1, nothing, 3]) == [10, nothing, 30].
It was reasonably straightforward to implement, as should be the "typesubtract view" needed here. With the obvious downside that ::SubArray function dispatches won't be triggered.
Adds behavior described in #121
Implementation is easy enough to follow