59 changes: 59 additions & 0 deletions .github/workflows/install-and-import.yaml
@@ -0,0 +1,59 @@
name: install + import

# CI matrix: verify the package installs cleanly and core modules
# import on every supported (OS, Python) pair, per pkg pyproject.toml's
# requires-python = ">=3.10,<3.14".
on:
  pull_request:
  workflow_dispatch:  # allow manual reruns from the Actions tab

jobs:
  install-and-import:
    name: ${{ matrix.os }} / py${{ matrix.python-version }}
    runs-on: ${{ matrix.os }}
    strategy:
      # Don't abort the whole matrix on a single failure so we can see all platforms.
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        python-version: ['3.10', '3.11', '3.12', '3.13']
    steps:
      - uses: actions/checkout@v4

      # uv is our package manager of record, mirrors what users do locally.
      - uses: astral-sh/setup-uv@v6

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      # CPU-only install. gpu-cu12 / gpu-cu13 extras are skipped on CI runners
      # (no NVIDIA hardware, CUDA wheels are large and slow to resolve).
      - name: Install package
        run: uv pip install --system -e .

      # Smoke-test that the modules that get patched or accelerated import
      # cleanly, and that both Streamlit app entry points are importable.
      - name: Import smoke test
        run: |
          python -c "import shared.utils.clustering"
          python -c "import apps.precalculated.app"
          python -c "import apps.embed_explore.app"

      # On x86_64 / AMD64 sklearnex must install (the platform marker in
      # pyproject.toml ensures it). On macos-latest (arm64) it must NOT
      # install, which is the whole point of the marker.
      - name: Verify sklearnex matches platform marker
        shell: python
        run: |
          import platform
          import importlib.util
          on_x86 = platform.machine() in ('x86_64', 'AMD64')
          present = importlib.util.find_spec('sklearnex') is not None
          if on_x86:
              assert present, 'sklearnex should be installed on x86_64/AMD64'
              import sklearnex
              print(f'sklearnex {sklearnex.__version__} present on {platform.machine()}')
          else:
              assert not present, f'sklearnex must not install on {platform.machine()}'
              print(f'sklearnex correctly absent on {platform.machine()}')
4 changes: 3 additions & 1 deletion README.md
@@ -55,7 +55,7 @@ uv pip install -e ".[gpu-cu12]"
uv pip install -e ".[gpu-cu13]"
```

The app auto-detects GPU availability at runtime and falls back to CPU if anything goes wrong — no configuration needed. You can also manually select backends (cuML, FAISS, sklearn) in the sidebar.
The app auto-detects GPU availability at runtime and falls back to CPU if anything goes wrong — no configuration needed. The CPU sklearn path is auto-accelerated by [scikit-learn-intelex](https://github.com/uxlfoundation/scikit-learn-intelex)[^1]. You can also manually select backends (`cuML`, `sklearn`) in the sidebar.

## Usage

@@ -96,3 +96,5 @@ ssh -N -L 8501:<COMPUTE_NODE>:8501 <USER>@<LOGIN_NODE>
## Acknowledgements

[OpenCLIP](https://github.com/mlfoundations/open_clip) | [Streamlit](https://streamlit.io/) | [Altair](https://altair-viz.github.io/)

[^1]: [`sklearn-intelex`](https://github.com/uxlfoundation/scikit-learn-intelex) is powered by the [oneDAL](https://github.com/uxlfoundation/oneDAL) library. It accelerates `sklearn` on x86_64 Linux and Windows machines and silently falls back to vanilla `sklearn` on unsupported architectures such as Apple Silicon and ARM Linux. The package is maintained under the [UXL Foundation](https://github.com/uxlfoundation) (a Linux Foundation project), so cross-vendor support is a stated goal.
20 changes: 10 additions & 10 deletions docs/BACKEND_PIPELINE.md
@@ -12,16 +12,18 @@ Raw Embeddings (from parquet or model)
├─ L2 Normalize: project onto unit hypersphere
├─► Step 1: KMeans Clustering (high-dimensional)
│ Backend: cuML → FAISS → sklearn
│ Backend: cuML (GPU) → sklearn (CPU, auto-accelerated by `sklearn-intelex`)
├─► Step 2: Dimensionality Reduction to 2D
│ Method: PCA / t-SNE / UMAP
│ Backend: cuML → sklearn
│ Backend: cuML (GPU) → sklearn (CPU, auto-accelerated by `sklearn-intelex` for PCA/TSNE)
└─► Scatter Plot (Altair)
Color = cluster, position = 2D projection
```

Note that `sklearn-intelex` acceleration is used for CPU operations where available[^1].

## Step 0: Embedding Preparation

Before any computation, every embedding goes through `_prepare_embeddings()`:
@@ -46,10 +48,9 @@ feature space, not a lossy 2D projection.
| Backend | When It's Used | How It Works |
|---------|---------------|--------------|
| **cuML** | GPU available + >500 samples | GPU-accelerated KMeans via RAPIDS. Runs on CuPy arrays. Falls back to sklearn on any error. |
| **FAISS** | No GPU + >500 samples | Facebook's optimized CPU KMeans using L2 index. Fast for medium datasets. Falls back to sklearn on error. |
| **sklearn** | Small datasets or fallback | Standard scikit-learn KMeans. Always works, no special dependencies. |
| **sklearn** | CPU path (default on machines without a GPU) | Standard scikit-learn KMeans, auto-accelerated by [scikit-learn-intelex](https://github.com/uxlfoundation/scikit-learn-intelex) (Intel oneDAL) when installed[^1] — typically 10–17× faster than vanilla sklearn on CPU. Disable with `EMB_EXPLORER_DISABLE_SKLEARNEX=1`. |

**Auto-selection priority:** cuML > FAISS > sklearn. You can override in the sidebar.
**Auto-selection priority:** cuML > sklearn. You can override in the sidebar.

## Step 2: Dimensionality Reduction

@@ -96,8 +97,8 @@ When you select "auto" (the default), the app picks the fastest available backen

| Operation | Auto Logic |
|-----------|-----------|
| KMeans | cuML if GPU + >500 samples, else FAISS if available + >500 samples, else sklearn |
| Dim. Reduction | cuML if GPU + >5000 samples, else sklearn |
| KMeans | cuML if GPU + >500 samples, else sklearn (auto-accelerated by `sklearn-intelex` when installed[^1]) |
| Dim. Reduction | cuML if GPU + >5000 samples, else sklearn (auto-accelerated by sklearn-intelex for PCA / t-SNE) |

Any GPU error (architecture mismatch, missing libraries, out of memory (OOM)) triggers an
automatic retry with sklearn. OOM errors are surfaced to the user with guidance.
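
The auto logic in the table above can be sketched roughly as follows. This is a minimal illustration, not the app's code: `pick_backend` and its arguments are hypothetical names, and only the 500 and 5000 sample thresholds come from the table.

```python
def pick_backend(operation: str, n_samples: int, gpu_ok: bool) -> str:
    """Hypothetical helper mirroring the auto-logic table (not the app's real API)."""
    # Thresholds taken from the table: KMeans > 500, dim. reduction > 5000.
    thresholds = {"kmeans": 500, "reduction": 5000}
    if gpu_ok and n_samples > thresholds[operation]:
        return "cuml"
    # Every CPU path goes through sklearn (optionally accelerated by sklearnex).
    return "sklearn"
```

With FAISS removed there is no intermediate CPU tier, so the decision collapses to a single GPU-or-sklearn branch per operation.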
@@ -122,11 +123,10 @@ Check the log file for the full picture when debugging.
cuML (GPU)
│ error?
FAISS (CPU, optimized) ← KMeans only
│ error?
sklearn (CPU, always works)
```

The app is designed to *always produce a result*. GPU acceleration is a
nice-to-have, never a hard requirement.

[^1]: [`sklearn-intelex`](https://github.com/uxlfoundation/scikit-learn-intelex) is powered by the [oneDAL](https://github.com/uxlfoundation/oneDAL) library. It accelerates `sklearn` on x86_64 Linux and Windows machines and silently falls back to vanilla `sklearn` on unsupported architectures such as Apple Silicon and ARM Linux. The package is maintained under the [UXL Foundation](https://github.com/uxlfoundation) (a Linux Foundation project), so cross-vendor support is a stated goal.
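
The two-tier fallback chain described in this file (cuML first, sklearn on error) might look like the sketch below. `run_with_fallback` and its callables are hypothetical names chosen for illustration. The real clustering code is not part of this diff, and the OOM-specific user guidance mentioned in the docs is not modeled here.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def run_with_fallback(gpu_fn: Callable[[], T], cpu_fn: Callable[[], T]) -> T:
    """Try the GPU path first; on any error, retry once on the CPU path."""
    try:
        return gpu_fn()
    except Exception as exc:  # assumption: any GPU-side error triggers the retry
        logger.warning("GPU backend failed (%s); retrying with sklearn", exc)
        return cpu_fn()
```

Passing the two paths as zero-argument callables keeps the retry logic independent of which operation (KMeans or dimensionality reduction) is being run.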
15 changes: 7 additions & 8 deletions pyproject.toml
@@ -41,9 +41,11 @@ dependencies = [
"altair>=5.0.0",
# Machine learning
"scikit-learn>=1.0.0",
# Intel oneDAL acceleration for sklearn (PCA / TSNE / KMeans) auto-patched at runtime.
# Disable with EMB_EXPLORER_DISABLE_SKLEARNEX=1 if you need vanilla sklearn behavior for debugging.
"scikit-learn-intelex>=2025.0; platform_machine == 'AMD64' or platform_machine == 'x86_64'",
"umap-learn>=0.5.0",
"numba>=0.57.0",
"faiss-cpu>=1.7.0",
"numba>=0.57.0",
# Vision-language models
"open-clip-torch>=2.20.0",
# Custom inference package
@@ -69,20 +71,17 @@ gpu = [
]
gpu-cu12 = [
"torch>=2.0.0",
"cuml-cu12>=25.6",
"faiss-gpu-cu12>=1.11.0",
"cuml-cu12>=26.4", # 26.4 removed the sklearn upper bound (compatible with sklearn>=1.8)
"pynvml>=11.0.0",
]
gpu-cu13 = [
"torch>=2.0.0",
"cuml-cu13>=25.12",
"faiss-gpu-cu12>=1.11.0", # no cu13 build on PyPI; cu12 works via CUDA backward compat
"cuml-cu13>=26.4", # 26.4 removed the sklearn upper bound (compatible with sklearn>=1.8)
"pynvml>=11.0.0",
]
# Minimal GPU support (just PyTorch + FAISS GPU, no RAPIDS)
# Minimal GPU support for image embeddings generation (just PyTorch, no RAPIDS)
gpu-minimal = [
"torch>=2.0.0",
"faiss-gpu-cu12>=1.11.0",
]
all = [
"emb-explorer[dev,gpu]",
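
The comments in the `pyproject.toml` hunk above say the sklearnex patch is applied at runtime and can be disabled with `EMB_EXPLORER_DISABLE_SKLEARNEX=1`. The code that does this is not shown in the diff, so the following is only a sketch of one way such a gate could work. `maybe_patch_sklearn` is a hypothetical name, and treating only the literal value `"1"` as "disabled" is an assumption. `sklearnex.patch_sklearn()` is the library's public patching entry point.

```python
import importlib.util
import os


def maybe_patch_sklearn() -> bool:
    """Patch sklearn with sklearnex unless disabled or unavailable.

    Returns True if the patch was applied. Hypothetical helper: the diff does
    not show where or how the app actually applies the patch.
    """
    # Assumption: only the literal value "1" disables the patch.
    if os.environ.get("EMB_EXPLORER_DISABLE_SKLEARNEX") == "1":
        return False
    # The platform marker means sklearnex is absent on non-x86_64 machines.
    if importlib.util.find_spec("sklearnex") is None:
        return False
    from sklearnex import patch_sklearn  # public sklearnex entry point

    patch_sklearn()  # should run before sklearn estimators are imported
    return True
```

Checking `find_spec` before importing keeps the function safe on platforms where the marker excludes the package, such as Apple Silicon.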
9 changes: 2 additions & 7 deletions shared/components/clustering_controls.py
@@ -5,7 +5,7 @@
import streamlit as st
from typing import Tuple, Optional

from shared.utils.backend import HAS_FAISS_PACKAGE, HAS_CUML_PACKAGE, HAS_CUPY_PACKAGE
from shared.utils.backend import HAS_CUML_PACKAGE, HAS_CUPY_PACKAGE


def render_clustering_backend_controls():
@@ -19,9 +19,6 @@ def render_clustering_backend_controls():
dim_reduction_options = ["auto", "sklearn"]
clustering_options = ["auto", "sklearn"]

if HAS_FAISS_PACKAGE:
clustering_options.append("faiss")

if HAS_CUML_PACKAGE and HAS_CUPY_PACKAGE:
dim_reduction_options.append("cuml")
clustering_options.append("cuml")
@@ -73,7 +70,7 @@ def render_clustering_backend_controls():
max_value=64,
value=8,
step=1,
help="Number of parallel workers for CPU backends (sklearn, FAISS). Not used by cuML (GPU manages parallelization automatically)."
help="Number of parallel workers for CPU sklearn. Not used by cuML (GPU manages parallelization automatically)."
)


@@ -118,8 +115,6 @@ def render_kmeans_controls():
Tuple of (clustering_backend, n_workers, seed)
"""
clustering_options = ["auto", "sklearn"]
if HAS_FAISS_PACKAGE:
clustering_options.append("faiss")
if HAS_CUML_PACKAGE and HAS_CUPY_PACKAGE:
clustering_options.append("cuml")

4 changes: 2 additions & 2 deletions shared/services/clustering_service.py
@@ -38,8 +38,8 @@ def run_clustering(
n_clusters: Number of clusters
reduction_method: Dimensionality reduction method
n_workers: Number of workers for reduction
dim_reduction_backend: Backend for dimensionality reduction ("auto", "sklearn", "faiss", "cuml")
clustering_backend: Backend for clustering ("auto", "sklearn", "faiss", "cuml")
dim_reduction_backend: Backend for dimensionality reduction ("auto", "sklearn", "cuml")
clustering_backend: Backend for clustering ("auto", "sklearn", "cuml")
seed: Random seed for reproducibility (None for random)

Returns:
2 changes: 1 addition & 1 deletion shared/utils/__init__.py
@@ -2,7 +2,7 @@
Shared utilities for clustering, IO, models, and taxonomy.

Modules are imported lazily to avoid pulling in heavy dependencies
(sklearn, umap, faiss, cuml, torch, open_clip) at startup.
(sklearn, umap, cuml, torch, open_clip) at startup.
Use direct imports instead:

from shared.utils.clustering import reduce_dim, run_kmeans
21 changes: 2 additions & 19 deletions shared/utils/backend.py
@@ -20,7 +20,6 @@
# These are safe to call at module-load / render time — they only check
# whether the package is installed, without executing it.

HAS_FAISS_PACKAGE: bool = importlib.util.find_spec("faiss") is not None
HAS_CUML_PACKAGE: bool = importlib.util.find_spec("cuml") is not None
HAS_CUPY_PACKAGE: bool = importlib.util.find_spec("cupy") is not None
HAS_TORCH_PACKAGE: bool = importlib.util.find_spec("torch") is not None
@@ -84,42 +83,27 @@ def check_cuml_available() -> bool:
return False


def check_faiss_available() -> bool:
    """Check if FAISS is available (actual import, for runtime use)."""
    if not HAS_FAISS_PACKAGE:
        return False
    try:
        import faiss
        return True
    except ImportError:
        return False


def resolve_backend(backend: str, operation: str = "general") -> str:
    """
    Resolve 'auto' backend to actual backend based on available hardware.

    Args:
        backend: Requested backend ("auto", "sklearn", "cuml", "faiss")
        backend: Requested backend ("auto", "sklearn", "cuml")
        operation: Operation type for logging ("clustering", "reduction", "general")

    Returns:
        Resolved backend name
        Resolved backend name. CPU paths always go through sklearn.
    """
    if backend != "auto":
        logger.debug(f"Using explicitly requested backend: {backend}")
        return backend

    cuda_available, device_info = check_cuda_available()
    has_cuml = check_cuml_available()
    has_faiss = check_faiss_available()

    if cuda_available and has_cuml:
        resolved = "cuml"
        logger.info(f"Auto-resolved {operation} backend to cuML (GPU: {device_info})")
    elif has_faiss:
        resolved = "faiss"
        logger.info(f"Auto-resolved {operation} backend to FAISS (CPU)")
    else:
        resolved = "sklearn"
        logger.info(f"Auto-resolved {operation} backend to sklearn (CPU)")
@@ -140,7 +124,6 @@ def get_backend_info() -> dict:
"cuda_available": cuda_available,
"device_info": device_info,
"cuml_available": check_cuml_available(),
"faiss_available": check_faiss_available(),
}

