cuML Barnes-Hut t-SNE collapses to a straight line on homogeneous embeddings #40

@NetZissou

Description

cuML's t-SNE supports three algorithms: the original exact algorithm (O(N^2)), the Barnes-Hut approximation (O(N log N)), and the fast Fourier transform (FFT) interpolation approximation (roughly linear in N). The latter two are derived from CannyLabs' open-source CUDA code and produce extremely fast embeddings when n_components = 2. The exact algorithm is more accurate, but too slow to use on large datasets.

In the embed_explore / precalculated apps, the t-SNE projection with the cuML (GPU) backend renders as a straight 45° line for the Darwin's finches BioCLIP 2 embeddings. PCA and UMAP render correctly, and switching the backend to sklearn produces a correct t-SNE, so the problem is specific to cuML's Barnes-Hut t-SNE rather than the data or our pipeline.
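The backend comparison can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `tsne_kwargs` and `run_tsne` are hypothetical helper names, the perplexity cap and seed are assumptions, and the cuML branch needs a CUDA GPU with RAPIDS installed.

```python
import numpy as np


def tsne_kwargs(n_samples, backend="cuml", method="barnes_hut", seed=0):
    """Build constructor arguments for a 2D t-SNE on either backend (hypothetical helper)."""
    if backend not in ("cuml", "sklearn"):
        raise ValueError(f"unknown backend: {backend!r}")
    # Perplexity must stay below the number of samples (sklearn enforces this).
    perplexity = float(min(30, max(1, n_samples - 1)))
    kwargs = {"n_components": 2, "perplexity": perplexity, "random_state": seed}
    if backend == "cuml":
        # cuML accepts 'barnes_hut', 'fft' or 'exact'. sklearn has no 'fft'
        # option, so this sketch leaves sklearn on its own default method.
        kwargs["method"] = method
    return kwargs


def run_tsne(X, backend="cuml", method="barnes_hut", seed=0):
    """Project X to 2D with the chosen backend and return a NumPy array."""
    X = np.asarray(X, dtype=np.float32)
    kwargs = tsne_kwargs(X.shape[0], backend, method, seed)
    if backend == "cuml":
        from cuml.manifold import TSNE  # imported lazily: requires GPU + RAPIDS
    else:
        from sklearn.manifold import TSNE
    return np.asarray(TSNE(**kwargs).fit_transform(X))
```

Running `run_tsne(X, backend="cuml")`, `run_tsne(X, backend="cuml", method="exact")` and `run_tsne(X, backend="sklearn")` on the same embeddings gives the three projections to compare side by side.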

The app uses cuML t-SNE's default, method='barnes_hut', which collapses both output dimensions onto a single line for this data. Setting method='exact' fixes it. The Barnes-Hut degeneracy appears to be data-dependent: the finch embeddings are extremely homogeneous, with near-uniform pairwise structure after L2-normalization.
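One way to put a number on "homogeneous" is to look at the distribution of pairwise cosine similarities after L2-normalization: a mean near 1 with a very small spread indicates near-uniform pairwise structure. The sketch below is an assumption about how this could be measured, not something taken from the app; `cosine_similarity_stats` is a hypothetical name, and the full similarity matrix is O(N^2) in memory, so a random subsample may be needed for large N.

```python
import numpy as np


def cosine_similarity_stats(X):
    """Return (mean, std) of the off-diagonal pairwise cosine similarities."""
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)       # L2-normalize each row
    S = Xn @ Xn.T                              # cosine similarity matrix
    off_diag = S[~np.eye(len(S), dtype=bool)]  # drop self-similarities
    return float(off_diag.mean()), float(off_diag.std())
```

Comparing these two numbers for the finch embeddings against a dataset where Barnes-Hut behaves well would show whether homogeneity is a plausible trigger.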

Degenerate / collapsed embedding: the 2D output is not a real spread but lies on a single line (one principal direction carries almost all the variance, and the two coordinates become nearly perfectly correlated).
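The definition above can be turned into a simple automated check. The following is a minimal sketch using only NumPy; the function names and the 0.999 threshold are illustrative assumptions rather than values used by the app.

```python
import numpy as np


def collapse_report(Y):
    """Summarize how close a 2D embedding is to lying on a single line."""
    Y = np.asarray(Y, dtype=np.float64)
    Yc = Y - Y.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance along each principal direction.
    var = np.linalg.svd(Yc, compute_uv=False) ** 2
    total = var.sum()
    top_share = float(var[0] / total) if total > 0 else 1.0
    # Pearson correlation between the two output coordinates.
    pearson_r = float(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1])
    return {"top_direction_variance_share": top_share, "pearson_r": pearson_r}


def is_collapsed(Y, threshold=0.999):
    """Flag an embedding whose leading direction carries nearly all variance."""
    return collapse_report(Y)["top_direction_variance_share"] >= threshold
```

The variance share is the more robust of the two signals: a line parallel to one axis has an undefined correlation but is still caught by the principal-direction test.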

cuML t-SNE result

Image

cuML PCA result

Image

cuML UMAP result

Image

sklearn t-SNE result

Image

Labels: bug
