Do you have any benchmarks showing where the extra overhead of mx.matmul comes from, compared to a regular matmul? Is it dominated by the quantization step (computing per-block scales, rounding, etc.)? If so, do you know whether devices with native MX support will do this rounding in the hardware itself, and whether the overhead would become negligible there thanks to that hardware support?
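For what it's worth, one way to get a first-order answer without waiting on upstream benchmarks is to time a software-emulated quantize-then-matmul against a plain matmul. The sketch below is purely illustrative: `quantize_blocks`, the block size, the power-of-two scale, and the mantissa rounding are my assumptions about what MX-style quantization roughly does, not what mx.matmul actually implements internally.

```python
import time
import numpy as np

BLOCK = 32  # shared-scale block size, an assumption loosely modeled on MX formats

def quantize_blocks(x, block=BLOCK, mbits=7):
    """Hypothetical emulation: per-block power-of-two scale, then mantissa rounding.

    For simplicity this quantizes along the last axis of both operands, which is
    not necessarily the axis a real MX matmul would use for the second operand.
    """
    rows, cols = x.shape
    pad = (-cols) % block
    xp = np.pad(x, [(0, 0), (0, pad)])
    blocks = xp.reshape(rows, -1, block)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    # Power-of-two shared scale per block (the "calculating scales" step).
    scale = 2.0 ** np.floor(np.log2(np.maximum(amax, 1e-38)))
    # Round the scaled values to mbits of mantissa (the "rounding" step).
    q = np.round(blocks / scale * (1 << mbits)) / (1 << mbits) * scale
    return q.reshape(rows, -1)[:, :cols]

def bench(fn, *args, reps=5):
    fn(*args)  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - t0) / reps

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512)).astype(np.float32)
B = rng.standard_normal((512, 512)).astype(np.float32)

t_plain = bench(np.matmul, A, B)
t_mx = bench(lambda a, b: quantize_blocks(a) @ quantize_blocks(b), A, B)
print(f"plain matmul: {t_plain * 1e3:.3f} ms")
print(f"quantize + matmul: {t_mx * 1e3:.3f} ms")
```

Since the matmul itself is identical in both paths, the gap between the two timings is entirely the scale/round step, which is the part one would expect to disappear if MX-capable hardware performs the rounding itself.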