add and benchmark typed_hvcat(SA, ::Val, ...) #811
simeonschaub wants to merge 1 commit into JuliaArrays:master from
Conversation
to explore the benefits of JuliaLang/julia#36719
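For reference, a brief sketch (my addition, not part of the PR) of the lowering this PR hooks into: `X[a b; c d]` lowers to `Base.typed_hvcat(X, (2, 2), a, b, c, d)`, where the tuple records the number of elements in each row. The PR's idea is to forward that runtime tuple into the type domain as a `Val`, so the layout becomes a compile-time constant. The base-Julia lowering can be checked directly:

```julia
# `Int[1 2; 3 4]` lowers to Base.typed_hvcat(Int, (2, 2), 1, 2, 3, 4);
# the (2, 2) tuple gives the number of elements per row.
a = Base.typed_hvcat(Int, (2, 2), 1, 2, 3, 4)
println(a == [1 2; 3 4])  # true
```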
Interesting, but I'd expect constant propagation to do the same here in most circumstances. Specifically, the […]. In fact, I thought I tested this exact thing in the original […]. Anyway, consider the following less abstract version of your test case for 3x3, which gives 0 allocations on julia-1.4 and for which the non-SA version seems to perform exactly the same:

julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9)
r = SA[0 0 0; 0 0 0; 0 0 0]
for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
r += SA[i1 i2 i3; i4 i5 i6; i7 i8 i9]
end
r
end
foo (generic function with 2 methods)
julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9)
r = SMatrix{3,3}((0,0,0, 0,0,0, 0,0,0))
for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
r += SMatrix{3,3}((i1,i4,i7, i2,i5,i8, i3,i6,i9))
end
r
end
bar (generic function with 1 method)
julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
695.789 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
768 768 768
768 768 768
768 768 768
julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
695.946 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
768 768 768
768 768 768
768 768 768

Can you reproduce this? How does this reconcile with the numbers you're getting?
Yes, I should probably have described my methodology here better. I am seeing the same result as you for 3x3, but for sizes larger than 3x4 I do see these improvements. I benchmarked various sizes in this gist: https://gist.github.com/simeonschaub/fb6eff0d212f8514ecec186a69712a82. (The […])
Right, 4x4 being slower makes some sense. I think we should figure out what's going on here and whether some minor rearrangement (e.g. careful use of […]) might help.

As pointed out by Jeff in JuliaLang/julia#36719, it would be preferable to avoid making things harder for the compiler, and I feel like this is a good case for relying on constant propagation. Let's see :)
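As an aside (my illustration, not from the thread): the alternative to relying on constant propagation is to lift the rows tuple into the type domain with `Val`, which guarantees specialization because the value becomes part of the method signature:

```julia
# A value wrapped in Val becomes a type parameter, so each distinct rows
# tuple gets its own method specialization regardless of constant propagation.
rows_from_val(::Val{rows}) where {rows} = rows

println(rows_from_val(Val((2, 2, 2))))  # (2, 2, 2)
println(Val((2, 2)) === Val{(2, 2)}())  # true
```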
A suspicious thing here is that both of your perf/hvcat_val.jl examples still allocate, even though they really shouldn't. I've got a suspicion that the thing that's hardest on the compiler here may be the use of Iterators.product:

julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
end
r
end
foo (generic function with 2 methods)
julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r = SMatrix{4,4}((0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0))
for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r += SMatrix{4,4}((i1, i2, i3, i4, i5,i6,i7,i8, i9,i10,i11,i12, i13,i14,i15,i16))
end
r
end
bar (generic function with 1 method)
julia> function baz(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
for i1=x1, i2=x2, i3=x3, i4=x4, i5=x5, i6=x6, i7=x7, i8=x8, i9=x9, i10=x10, i11=x11, i12=x12, i13=x13, i14=x14, i15=x15, i16=x16
r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
end
r
end
baz (generic function with 1 method)
julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
491.958 ms (524288 allocations: 69.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
209.902 ms (655360 allocations: 94.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
julia> @btime baz(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
143.984 μs (0 allocations: 0 bytes)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
98304 98304 98304 98304
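To make the Iterators.product point concrete (my example, not from the thread): a 16-way product iterates 2^16 = 65536 times and yields 16-element tuples, a much harder shape for inference than 16 nested loops, each of which introduces a single loop variable:

```julia
# The product's element type is a 16-tuple; its iteration state is similarly
# wide, which is what stresses the compiler compared with nested for loops.
p = Iterators.product(ntuple(_ -> 1:2, 16)...)
println(eltype(p))   # NTuple{16, Int64} on 64-bit systems
println(length(p))   # 65536
```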
Oh, interesting! Do you agree that the use of […]?

@inline function Base.typed_hvcat(::Type{SA}, alt::T, i...) where {T<:Tuple}
    Base.typed_hvcat(SA, Val{alt}(), i...)
end

I had pretty good success using something similar to this to call a specialized generated function in CoolTensors.jl. (forcing […])
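To illustrate why forwarding to `Val{alt}()` can pay off (my sketch, with a hypothetical helper name `nrows` — not code from CoolTensors.jl or this PR): once the tuple is a type parameter, a `@generated` function can do its work at compile time:

```julia
# With the rows tuple in the type domain, this computation happens once per
# specialization, at compile time, rather than on every call.
@generated function nrows(::Val{rows}) where {rows}
    return :($(length(rows)))
end

println(nrows(Val((4, 4, 4, 4))))  # 4
```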
Maybe, but I'm not sure why? Arguably, it might be safer if I didn't use […]
I didn't think this was meant to affect this case... but perhaps it does! I'd be interested if it makes a difference.
to explore the benefits of JuliaLang/julia#36719.
For constructing a 4x4 SMatrix in a loop, as shown here in
perf/hvcat_val.jl, I get the following timings on my machine: