add and benchmark typed_hvcat(SA, ::Val, ...) #811
simeonschaub wants to merge 1 commit into JuliaArrays:master from
Conversation
to explore the benefits of JuliaLang/julia#36719
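For reference, a brief sketch (my addition, not part of the PR) of the lowering this PR hooks into: `X[a b; c d]` lowers to `Base.typed_hvcat(X, (2, 2), a, b, c, d)`, where the tuple records the number of elements in each row. The PR's idea is to forward that runtime tuple into the type domain as a `Val`, so the layout becomes a compile-time constant. The base-Julia lowering can be checked directly:

```julia
# `Int[1 2; 3 4]` lowers to Base.typed_hvcat(Int, (2, 2), 1, 2, 3, 4);
# the (2, 2) tuple gives the number of elements per row.
a = Base.typed_hvcat(Int, (2, 2), 1, 2, 3, 4)
println(a == [1 2; 3 4])  # true
```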
Interesting, but I'd expect constant propagation to do the same here in most circumstances. Specifically, the […]. In fact, I thought I tested this exact thing in the original […]. Anyway, consider the following less abstract version of your test case for 3x3, which gives 0 allocations on julia-1.4 and for which the non-SA version seems to perform exactly the same:

julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9)
r = SA[0 0 0; 0 0 0; 0 0 0]
for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
r += SA[i1 i2 i3; i4 i5 i6; i7 i8 i9]
end
r
end
foo (generic function with 2 methods)
julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9)
r = SMatrix{3,3}((0,0,0, 0,0,0, 0,0,0))
for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
r += SMatrix{3,3}((i1,i4,i7, i2,i5,i8, i3,i6,i9))
end
r
end
bar (generic function with 1 method)
julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
695.789 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
768 768 768
768 768 768
768 768 768
julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
695.946 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
768 768 768
768 768 768
768 768 768

Can you reproduce this? How does this reconcile with the numbers you're getting?
Yes, I should probably have described my methodology here better. I am seeing the same result as you for 3x3, but for sizes larger than 3x4 I do see these improvements. I benchmarked various sizes in this gist: https://gist.github.com/simeonschaub/fb6eff0d212f8514ecec186a69712a82. (The […])
Right, 4x4 being slower makes some sense. I think we should figure out what's going on here and whether some minor rearrangement (e.g. careful use of […]) might help.

As pointed out by Jeff in JuliaLang/julia#36719, it would be preferable to avoid making things harder for the compiler, and I feel like this is a good case for relying on constant propagation. Let's see :)
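As an aside (my illustration, not from the thread): the alternative to relying on constant propagation is to lift the rows tuple into the type domain with `Val`, which guarantees specialization because the value becomes part of the method signature:

```julia
# A value wrapped in Val becomes a type parameter, so each distinct rows
# tuple gets its own method specialization regardless of constant propagation.
rows_from_val(::Val{rows}) where {rows} = rows

println(rows_from_val(Val((2, 2, 2))))  # (2, 2, 2)
println(Val((2, 2)) === Val{(2, 2)}())  # true
```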
A suspicious thing here is that both of your perf/hvcat_val.jl examples still allocate, even though they really shouldn't. I've got a suspicion that the thing that's hardest on the compiler here may be the use of Iterators.product:

julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
end
r
end
foo (generic function with 2 methods)
julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r = SMatrix{4,4}((0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0))
for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r += SMatrix{4,4}((i1, i2, i3, i4, i5,i6,i7,i8, i9,i10,i11,i12, i13,i14,i15,i16))
end
r
end
bar (generic function with 1 method)
julia> function baz(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
for i1=x1, i2=x2, i3=x3, i4=x4, i5=x5, i6=x6, i7=x7, i8=x8, i9=x9, i10=x10, i11=x11, i12=x12, i13=x13, i14=x14, i15=x15, i16=x16
r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
end
r
end
baz (generic function with 1 method)
julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
491.958 ms (524288 allocations: 69.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
209.902 ms (655360 allocations: 94.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
julia> @btime baz(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
143.984 μs (0 allocations: 0 bytes)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
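To make the Iterators.product point concrete (my example, not from the thread): a 16-way product iterates 2^16 = 65536 times and yields 16-element tuples, a much harder shape for inference than 16 nested loops, each of which introduces a single loop variable:

```julia
# The product's element type is a 16-tuple; its iteration state is similarly
# wide, which is what stresses the compiler compared with nested for loops.
p = Iterators.product(ntuple(_ -> 1:2, 16)...)
println(eltype(p))   # NTuple{16, Int64} on 64-bit systems
println(length(p))   # 65536
```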
Oh, interesting! Do you agree that the use of […]?

@inline function Base.typed_hvcat(::Type{SA}, alt::T, i...) where {T<:Tuple}
    Base.typed_hvcat(SA, Val{alt}(), i...)
end

I had pretty good success using something similar to this to call a specialized generated function in CoolTensors.jl. (forcing […])
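To illustrate why forwarding to `Val{alt}()` can pay off (my sketch, with a hypothetical helper name `nrows` — not code from CoolTensors.jl or this PR): once the tuple is a type parameter, a `@generated` function can do its work at compile time:

```julia
# With the rows tuple in the type domain, this computation happens once per
# specialization, at compile time, rather than on every call.
@generated function nrows(::Val{rows}) where {rows}
    return :($(length(rows)))
end

println(nrows(Val((4, 4, 4, 4))))  # 4
```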
Maybe, but I'm not sure why? Arguably, it might be safer if I didn't use […]
I didn't think this was meant to affect this case... but perhaps it does! I'd be interested if it makes a difference.
to explore the benefits of JuliaLang/julia#36719.
For constructing a 4x4 SMatrix in a loop, as shown here in
perf/hvcat_val.jl, I get the following timings on my machine: