fix(onnx): honest doctor report + default to the CPU provider on Apple Silicon#30
Merged
Conversation
…antee doctor reported `runtime (auto): onnx` whenever onnxruntime was importable, but `embedding_runtime="auto"` only uses ONNX when the model's ONNX graph can also be materialized (cached, a prebuilt Hub snapshot, or an Optimum export), and otherwise falls back to PyTorch. On an offline, uncached box the report therefore overstated the runtime. Report the preferred runtime plus an explicit fallback note instead, and rename the JSON field auto_resolves_to to preferred. Testing: ruff check; pytest (test_onnx_embedding_runtime, test_local_backends) 23 passed.
…re ML opt-in) The default device "auto" resolves to "mps" on Apple Silicon, and the provider mapping requested the ONNX Core ML EP there. That path was never benchmarked (all #29 numbers and the parity test ran on the CPU EP). On the dynamic-shape preset graphs Core ML fragments into ~50 Core ML/CPU partitions and measured slower than the plain CPU provider for single-query embedding: about 16 ms vs 3 ms on an M-series CPU (torch is 7.9 ms), so the default Mac path was the slowest of the three despite the ONNX-default switch being motivated by lower query latency. Parity is unaffected (cosine 1.0 vs both torch and CPU-ONNX). Default the ONNX runtime to the CPU provider on Apple Silicon and gate Core ML behind LODEDB_ONNX_COREML=1, mirroring the off-by-default opt-in MPS vector scan. CUDA hosts still prefer the CUDA provider. Testing: ruff check; pytest 403 passed, 35 skipped.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two post-merge fixes found while auditing the #29 ONNX integration on Apple Silicon, where the default
device="auto"resolves tomps.1. Default to the CPU execution provider on Apple Silicon (Core ML is opt-in). The provider mapping requested the ONNX Core ML EP for
mps, but that path was never benchmarked (all #29 numbers and the parity test ran on the CPU EP). Measured on an M-series CPU, Core ML is the slowest option for single-query embedding, because the dynamic-shape preset graph fragments into ~50 Core ML/CPU partitions:So the default Mac path was ~2x slower than the torch it replaced, defeating the latency win that motivated the ONNX default. Parity is unaffected (cosine 1.0000 vs both torch and CPU-ONNX). Fix: ONNX defaults to the CPU provider on Apple Silicon; Core ML is opt-in via
LODEDB_ONNX_COREML=1, mirroring the off-by-default opt-in MPS vector scan (a slower-than-CPU Apple-GPU path). CUDA hosts still prefer the CUDA provider.2.
doctorreports the embedding runtime as a preference, not a guarantee. It printedruntime (auto): onnxwheneveronnxruntimewas importable, butautofalls back to torch when the model's ONNX graph can't be materialized (offline/uncached). Now reports the preferred runtime plus an explicit fallback note; JSON field renamedauto_resolves_totopreferred, addednote.Testing
ruff checkclean; fullpytest403 passed, 35 skipped.LODEDB_ONNX_COREMLopt-in) and the doctor test now asserts both the JSON and rendered-text surfaces.