Dimensionality Reduction: PCA, t-SNE, UMAP

Why Reduce Dimensions?

Visualization: You can't plot 768 dimensions directly, but you can plot 2 or 3
Efficiency: Smaller vectors take less memory and are faster to process
Noise reduction: Discard redundant or low-variance directions
Understanding: See clusters and patterns

Common Techniques

PCA (Principal Component Analysis)

Linear transformation
Preserves global structure
Fast, deterministic
Good for preprocessing

from sklearn.decomposition import PCA

# embeddings: array of shape (n_samples, n_features)
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)  # shape (n_samples, 2)
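
When PCA is used for preprocessing, the explained variance ratio can help decide how many components to keep. The following is a minimal sketch rather than a recommended recipe: the synthetic `embeddings` array (500 vectors of 64 dimensions built from 5 latent factors) and the 95% threshold are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for real embeddings: 500 vectors in 64 dimensions
# generated from 5 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 64))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(500, 64))

# A float n_components keeps the fewest components whose cumulative
# explained variance exceeds that fraction (used with svd_solver="full").
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(embeddings)

print("components kept:", pca.n_components_)
print("variance retained:", pca.explained_variance_ratio_.sum())
```

On real embeddings the variance is usually spread over many more components, so the number kept will be larger than in this toy setup.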

t-SNE

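Non-linear
Great for visualization
Preserves local structure (neighborhoods); distances between clusters and cluster sizes in the plot are not reliable
Slow on large datasets, and results vary between runs unless the random seed is fixed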

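from sklearn.manifold import TSNE

# perplexity must be less than the number of samples;
# random_state makes the layout reproducible
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
reduced = tsne.fit_transform(embeddings)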

UMAP

Non-linear, like t-SNE
Preserves local structure and usually more global structure than t-SNE
Usually faster than t-SNE on large datasets
A common default choice for visualization

import umap  # provided by the umap-learn package (pip install umap-learn)

reducer = umap.UMAP(n_components=2)
reduced = reducer.fit_transform(embeddings)

Comparison

Method	Speed	Global Structure	Local Structure	Best For
PCA	Fast	Good	Poor	Preprocessing
t-SNE	Slow	Poor	Excellent	Visualization
UMAP	Medium	Good	Good	General use
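
The "Local Structure" column can be checked numerically rather than by eye. One possible approach, sketched here under assumptions, uses scikit-learn's `trustworthiness` score to compare a PCA map and a t-SNE map of the same data. The clustered dataset from `make_blobs` and the choice of 10 neighbors are illustrative, and UMAP is left out only to keep the example dependent on scikit-learn alone.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# Hypothetical data: 300 points in 50 dimensions drawn from 4 clusters.
X, _ = make_blobs(n_samples=300, n_features=50, centers=4, random_state=0)

pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# trustworthiness lies in [0, 1]; higher means the neighbors seen in the
# 2D map were also neighbors in the original space.
scores = {
    "PCA": trustworthiness(X, pca_2d, n_neighbors=10),
    "t-SNE": trustworthiness(X, tsne_2d, n_neighbors=10),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The scores depend on the dataset and on `n_neighbors`, so treat them as a comparison for your own data rather than a general ranking.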

Visualization Example

import matplotlib.pyplot as plt
import umap

# Reduce to 2D
reducer = umap.UMAP(n_components=2)
coords = reducer.fit_transform(embeddings)

# Plot, colored by label (labels: one integer category ID per point)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title("Embedding Visualization")
plt.show()

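Caution: Reduced representations lose information, and 2D or 3D projections lose most of it. Use t-SNE/UMAP coordinates for visualization and analysis, not for similarity search; if you shrink vectors for efficiency, keep enough dimensions and check that retrieval quality holds up.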
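
How much neighbor information is lost depends on how far the vectors are reduced. The sketch below estimates this by counting how many of each point's nearest neighbors survive a PCA reduction. The helper `neighbor_recall`, the synthetic data (20 latent factors embedded in 64 dimensions), and the target sizes of 2 and 32 are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def neighbor_recall(original, reduced, k=10):
    """Mean fraction of each point's k nearest neighbors kept after reduction."""
    # Calling kneighbors() without a query returns neighbors of the indexed
    # points themselves, excluding each point from its own neighbor list.
    idx_orig = NearestNeighbors(n_neighbors=k).fit(original).kneighbors(
        return_distance=False)
    idx_red = NearestNeighbors(n_neighbors=k).fit(reduced).kneighbors(
        return_distance=False)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(idx_orig, idx_red)]
    return float(np.mean(overlaps))

# Hypothetical embeddings: 400 vectors in 64 dimensions from 20 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 20))
embeddings = latent @ rng.normal(size=(20, 64)) + 0.01 * rng.normal(size=(400, 64))

recall_2 = neighbor_recall(embeddings, PCA(n_components=2).fit_transform(embeddings))
recall_32 = neighbor_recall(embeddings, PCA(n_components=32).fit_transform(embeddings))

print(f"recall@10 with 2 dims:  {recall_2:.2f}")
print(f"recall@10 with 32 dims: {recall_32:.2f}")
```

Running the same check on a sample of real embeddings is a cheap way to see whether a given target dimensionality is safe for retrieval.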

Tips

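Use PCA first to reduce to ~50 dimensions, then apply t-SNE/UMAP; this cuts noise and run time
UMAP hyperparameters: n_neighbors trades local detail (small values) against global structure (large values); min_dist controls how tightly points are packed together
t-SNE perplexity is roughly the number of neighbors each point considers; low values emphasize very local structure, and it must be less than the number of samples
Color points by category to see whether the embeddings capture semantics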
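
The first tip, PCA followed by t-SNE, might look like the sketch below. The random `embeddings` array is only a hypothetical stand-in for real 768-dimensional vectors, and the perplexity and seed values are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical stand-in for real embeddings: 300 vectors of 768 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 768))

# Step 1: PCA down to 50 dimensions to cut noise and speed up t-SNE.
pca_50 = PCA(n_components=50, random_state=0).fit_transform(embeddings)

# Step 2: t-SNE from 50 dimensions down to 2 for plotting.
# perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(pca_50)

print(coords.shape)  # (300, 2)
```

The resulting `coords` array can be passed to a scatter plot in the same way as in the visualization example.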