The Problem With t-SNE and UMAP
t-SNE and UMAP are the workhorses of high-dimensional visualisation. They produce beautiful, cluster-separated scatter plots that have become ubiquitous in machine learning papers, bioinformatics, and data exploration. But they share a fundamental limitation: O(N²) pairwise comparisons.
For a dataset of 10,000 points in 50 dimensions, that means 100 million distance calculations — before the iterative optimisation even begins. In a Jupyter notebook, this is a minor annoyance. In a browser-based tool where users expect immediate feedback, it is a hard wall.
Both methods are also stochastic: run them twice, get different layouts. They distort global geometry to preserve local neighbourhoods. And they provide no mechanism for embedding new points without re-running the entire computation.
The Core Idea: Landmarks, Not Pairwise
Instead of comparing every point to every other point, SLR compares each point to a small, fixed set of landmarks. For N points and k landmarks (typically ~50), this reduces complexity from O(N²) to O(N × k) — effectively linear when k is small.
The key insight is that you don't need all pairwise relationships to reconstruct a useful low-dimensional layout. If you know how far each point is from a well-placed set of reference points, you can triangulate its position analytically.
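To make the cost difference concrete, here is a minimal NumPy sketch (variable names are illustrative) that computes point-to-landmark squared distances with the standard ‖x‖² − 2x·ℓ + ‖ℓ‖² expansion. The largest array it ever materialises is N × k, not N × N:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))                     # N points, 50 dimensions
L = X[rng.choice(len(X), size=50, replace=False)]   # k = 50 landmarks

# ||x - l||^2 = ||x||^2 - 2 x.l + ||l||^2  -- one (N, k) matrix, no N x N table
D2 = (X**2).sum(axis=1)[:, None] - 2.0 * X @ L.T + (L**2).sum(axis=1)[None, :]
D2 = np.maximum(D2, 0.0)   # clip tiny negatives from floating-point error
```

An all-pairs distance table for the same data would need N² entries; here both the work and the memory scale with N × k.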
Two Landmark Strategies
SLR supports two approaches to choosing landmarks:
Synthetic Sine Skeleton. Generates deterministic landmarks using independent sine waves per dimension. Each landmark coordinate is defined by:

γ(t) = a·sin(ω·t + φ)

where the amplitude a, frequency ω, and phase φ are drawn uniformly per dimension, and t parameterises position along the skeleton. This produces smooth, well-distributed coverage through high-dimensional space, completely independent of the data. The same landmarks work for any dataset.
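A minimal sketch of the sine skeleton; the uniform sampling ranges here are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def sine_landmarks(n_landmarks, n_features, rng):
    """Deterministic sine-skeleton landmarks, shape (k, D) (sketch;
    the parameter ranges are assumptions)."""
    a = rng.uniform(0.5, 1.5, size=n_features)          # amplitude per dimension
    omega = rng.uniform(0.5, 4.0, size=n_features)      # frequency per dimension
    phi = rng.uniform(0.0, 2 * np.pi, size=n_features)  # phase per dimension
    t = np.linspace(0.0, 2 * np.pi, n_landmarks)        # position along the curve
    # gamma(t) = a * sin(omega * t + phi), one independent sine per dimension
    return a[None, :] * np.sin(omega[None, :] * t[:, None] + phi[None, :])

L = sine_landmarks(50, 20, np.random.RandomState(42))
```

Because the generator is seeded, the same landmark set is reproduced on every run, which is what makes the method deterministic.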
Data-Derived Landmarks. Selects k random points from the dataset itself. Uses hybrid normalisation: raw selection preserves cluster structure, normalised computation ensures fair feature weighting.
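The hybrid normalisation might be sketched like this: landmarks are picked from the raw rows, and both data and landmarks are then standardised with the same statistics (the details here are assumptions):

```python
import numpy as np

def data_landmarks(X, k, rng):
    """Pick k raw rows as landmarks (raw selection preserves cluster
    structure), then standardise data and landmarks with the same
    statistics so every feature is weighted fairly (sketch)."""
    idx = rng.choice(len(X), size=k, replace=False)
    landmarks = X[idx]                                # raw selection
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    return (X - mu) / sigma, (landmarks - mu) / sigma

Xn, Ln = data_landmarks(np.random.RandomState(0).randn(1000, 20), 50,
                        np.random.RandomState(42))
```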
The Algorithm
SLR runs in four phases. Each is non-iterative and has a closed-form solution.
Phase 1: Landmark Projection
Project the k high-dimensional landmarks into the target dimensionality (2D or 3D) using PCA. Then scale the projected coordinates so that the geometric spread in low-D matches the spread in high-D:

P ← P · (spread_highD / spread_lowD)

where the spread is, e.g., the mean pairwise distance between landmarks. This preserves the relative scale of the landmark constellation across dimensions.
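Phase 1 might be sketched as follows; using the mean pairwise landmark distance as the "spread" measure is an assumption:

```python
import numpy as np

def project_landmarks(L, m=2):
    """PCA-project landmarks L of shape (k, D) to m dimensions, then
    rescale so the low-D spread matches the high-D spread (sketch)."""
    Lc = L - L.mean(axis=0)
    _, _, Vt = np.linalg.svd(Lc, full_matrices=False)  # PCA via SVD
    P = Lc @ Vt[:m].T                                  # (k, m) projection

    def spread(A):  # mean pairwise distance
        return np.linalg.norm(A[:, None] - A[None, :], axis=2).mean()

    return P * (spread(Lc) / spread(P))

P = project_landmarks(np.random.RandomState(42).randn(50, 20))
```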
Phase 2: Linearised Trilateration

For each data point x, compute the squared distance δ²_j to each landmark j. The ideal embedding y satisfies:

‖y − ℓ_j‖² = δ²_j,  j = 0, …, k−1

This is non-linear in y. But subtracting the equation for landmark 0 from all the others eliminates the non-linear yᵀy term:

2(ℓ_j − ℓ_0)ᵀy = δ²_0 − δ²_j + ‖ℓ_j‖² − ‖ℓ_0‖²,  j = 1, …, k−1

This yields a linear system Ay = b, with row j of A equal to 2(ℓ_j − ℓ_0)ᵀ, solved via the Moore-Penrose pseudoinverse. The matrix A⁺ is precomputed once for all points, making the per-point cost a simple matrix multiply.
For a batch of N points:
- Compute k squared distances per point: O(N × k)
- Matrix multiply by precomputed A⁺: O(N × k × m)
- Total: linear in N
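Phase 2 as a sketch, with shapes assumed as follows: D2 is the (N, k) matrix of squared high-D distances, P the (k, m) projected landmarks:

```python
import numpy as np

def trilaterate(D2, P):
    """Linearised trilateration for a batch of points (sketch).
    Row j of A is 2*(l_j - l_0); b_j = d0^2 - dj^2 + ||l_j||^2 - ||l_0||^2."""
    A = 2.0 * (P[1:] - P[0])              # (k-1, m)
    A_pinv = np.linalg.pinv(A)            # precomputed once for all points
    n2 = (P**2).sum(axis=1)               # ||l_j||^2 for each landmark
    B = (D2[:, :1] - D2[:, 1:]) + (n2[1:] - n2[0])[None, :]   # (N, k-1)
    return B @ A_pinv.T                   # (N, m): one matrix multiply per batch
```

When the distances are exactly realisable in m dimensions this recovers the coordinates exactly; with distances measured in high-D it returns the least-squares solution.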
Phase 3: Alpha Refinement

A global scalar α corrects residual scale mismatches between the high-D distances and the low-D embedding. A least-squares fit of the embedded point-landmark distances d_j against the high-D distances δ_j gives:

α = (Σ δ_j·d_j) / (Σ δ²_j)

The trilateration is then recomputed with corrected distances:

δ²_corrected = α² × δ²

This is a single pass, no iteration.
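A sketch of the refinement step; treating α as the least-squares scale between the high-D distances and the current embedded distances is an assumption:

```python
import numpy as np

def alpha_refine(D2, P, Y):
    """One-pass scale correction (sketch): fit a single alpha matching
    high-D distances to the current low-D layout, then return
    delta^2_corrected = alpha^2 * delta^2 for re-trilateration."""
    d_low = np.linalg.norm(Y[:, None, :] - P[None, :, :], axis=2)  # (N, k)
    d_high = np.sqrt(np.maximum(D2, 0.0))
    alpha = (d_high * d_low).sum() / (d_high**2).sum()  # least-squares scale
    return alpha**2 * D2
```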
Phase 4: Distance Warping

A tunable parameter p interpolates between global geometry and local structure by raising distances to a power:

δ_warped = δ^p

At p = 1.0, the embedding preserves global distances faithfully. At p ≈ 0.33, it produces t-SNE-like cluster separation, compressing large distances and expanding small ones. The user controls the trade-off with a single slider.
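The warp as a sketch; normalising by the mean distance so the power law pivots around the typical scale is an assumption:

```python
import numpy as np

def warp(D2, p=1.0):
    """Power-law distance warp (sketch): delta_warped = delta**p,
    applied around the mean distance so p < 1 compresses large
    distances and expands small ones. Returns squared distances."""
    d = np.sqrt(np.maximum(D2, 0.0))
    scale = d.mean()
    return (scale * (d / scale) ** p) ** 2
```

At p = 1 the distances pass through unchanged; lowering p shrinks the dynamic range while preserving the ordering of distances.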
Implementation
The core class in Python:
```python
import numpy as np

class SineLandmarkReduction:
    def __init__(self, n_components=2, n_landmarks=50,
                 random_state=42, synthetic_landmarks=False):
        self.n_components = n_components
        self.k = n_landmarks
        self.synthetic_landmarks = synthetic_landmarks
        self.rng = np.random.RandomState(random_state)

    def _gamma(self, t, a, omega, phi):
        """Synthetic sine path: γ(t) = a·sin(ω·t + φ)"""
        return a[:, None] * np.sin(
            omega[:, None] * t + phi[:, None]
        )
```
The embedding pipeline:
- Standardise input features (zero mean, unit variance)
- Generate synthetic landmarks or select k random data points
- PCA-project landmarks to target dimensions (2D or 3D)
- Precompute A and its pseudoinverse A⁺
- For each batch: compute distances → build b → solve Ay = b
- Apply alpha refinement
- Return low-dimensional coordinates
Usage:
```python
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=5000, n_features=20,
                  centers=5, cluster_std=0.80,
                  random_state=42)

slr = SineLandmarkReduction(
    n_components=2, n_landmarks=50,
    random_state=42, synthetic_landmarks=False
)
Y = slr.fit_transform(X)
```
How It Compares
| Property | t-SNE | UMAP | SLR |
|---|---|---|---|
| Complexity | O(N²) or O(N log N) | O(N¹·⁷) | O(N × k) |
| Deterministic | No | No | Yes |
| Global geometry | Distorted | Partially preserved | Preserved (tunable) |
| Iterative | Yes (gradient descent) | Yes (SGD) | No (analytic) |
| Out-of-sample | No (requires re-run) | Partial (parametric UMAP) | Yes (precomputed A⁺) |
| Local clusters | Excellent | Excellent | Good (tunable via p) |
| Browser-native | Impractical | Impractical | Yes |
t-SNE produces stronger local cluster separation — that is its core strength. SLR trades some
local resolution for determinism, speed, and global geometry preservation. The warping parameter
p lets you slide between these trade-offs: lower values produce more t-SNE-like
layouts, higher values preserve faithful distances.
Why This Matters for the Browser
The entire algorithm requires only basic linear algebra and a pseudoinverse — operations that translate directly to JavaScript or WebAssembly. There are no gradient calculations, no learning rate schedules, no convergence criteria to tune.
This makes SLR uniquely suited for interactive data exploration in the browser. Users upload a CSV, and the visualisation appears in under two seconds — not after a loading spinner and a background computation.
SLR powers the 3D scatter plot in CSV X-Ray, Thingbook's drag-and-drop anomaly detection tool. When you upload a multi-feature dataset, the 3D embedding you see is computed client-side using this algorithm.
Limitations and Trade-offs
- Local resolution. For datasets where local neighbourhood structure is the primary concern (e.g., single-cell RNA-seq clustering), t-SNE or UMAP will produce tighter, more informative clusters.
- Landmark placement. The quality of the embedding depends on landmark coverage. For extremely non-uniform data distributions, data-derived landmarks may leave sparse regions under-represented.
- No learned representation. SLR is a geometric method, not a learned one. It does not capture non-linear manifold structure the way neural approaches (e.g., autoencoders) can.