When RTM was started, using rsqrt and rcp with Newton-Raphson iterations was faster due to their shorter instruction latencies and the shallow scheduler buffers of processors at the time (e.g. the AMD Jaguar era). Today, sqrt/div is typically better: fewer instructions and registers are used, which considerably improves the chances of inlining. Any extra cost from the longer-latency instructions can often be hidden by the processor, which can do other work in the meantime thanks to its deeper scheduler buffers.
rsqrt/rcp also have the issue of not being deterministic across vendors. Because they are reduced-precision operations, apparently the IEEE/SSE standards don't mandate an exact result for them, so different processors can return slightly different estimates for the same input.
As such, we should remove their usage in favor of sqrt/div where possible for their increased stability and leaner assembly.
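
To make the trade-off concrete, here is a minimal sketch of the two approaches for a scalar reciprocal square root. The function names `rsqrt_newton_raphson` and `rsqrt_sqrt_div` are hypothetical and are not RTM's API; the SSE path assumes an x86 target, with a plain fallback elsewhere so the snippet still compiles.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Older approach (hypothetical helper): hardware estimate refined by one
// Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y).
inline float rsqrt_newton_raphson(float x)
{
#if defined(SKETCH_HAS_SSE)
    // Reduced-precision estimate; the exact bits may differ between vendors.
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    // Fallback for non-SSE targets: no hardware estimate is used here.
    float y = 1.0f / std::sqrt(x);
#endif
    return y * (1.5f - 0.5f * x * y * y);
}

// Approach proposed here (hypothetical helper): one sqrt and one div, both
// correctly rounded IEEE 754 operations, so the result is the same everywhere.
inline float rsqrt_sqrt_div(float x)
{
    return 1.0f / std::sqrt(x);
}
```

The first version needs the estimate, several multiplies, a subtract and two constants, while the second is two instructions. This is the instruction and register saving described above. Results of the two versions generally agree only to within a small tolerance, not bit for bit.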