Dimensionality Reduction

Reduce high-dimensional vectors for visualization and analysis.

Why Reduce Dimensions?

  • Visualization: Can't plot 768 dimensions
  • Efficiency: Smaller vectors = faster processing
  • Noise reduction: Remove redundant information
  • Understanding: See clusters and patterns

Common Techniques

PCA (Principal Component Analysis)

  • Linear transformation
  • Preserves global structure (the directions of greatest variance)
  • Fast, deterministic
  • Good for preprocessing

from sklearn.decomposition import PCA

# embeddings: array of shape (n_samples, n_features)
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)  # shape (n_samples, 2)
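
To judge how much a PCA projection keeps, one option is to inspect the fitted model's explained_variance_ratio_ attribute. The sketch below is illustrative only: the embeddings array is synthetic stand-in data, and the choice of 10 components is an assumption rather than a recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for real embeddings: 500 vectors of 64 dimensions,
# scaled so that the first few dimensions carry most of the variance.
rng = np.random.default_rng(0)
scales = np.linspace(5.0, 0.1, 64)
embeddings = rng.normal(size=(500, 64)) * scales

pca = PCA(n_components=10)
reduced = pca.fit_transform(embeddings)

# Fraction of total variance captured by each kept component, and overall.
per_component = pca.explained_variance_ratio_
retained = per_component.sum()

print(reduced.shape)
print(f"variance retained by 10 components: {retained:.2%}")
```

If the retained fraction is low, the projection is discarding much of the spread in the data, which is worth knowing before reading too much into a plot.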

t-SNE

  • Non-linear
  • Great for visualization
  • Preserves local structure
  • Slow on large datasets

from sklearn.manifold import TSNE

# t-SNE is stochastic; fix random_state to get a repeatable layout
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
reduced = tsne.fit_transform(embeddings)

UMAP

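  • Non-linear, like t-SNE
  • Preserves local structure and usually more global structure than t-SNE
  • Typically faster than t-SNE, especially on large datasets
  • A common default for visualizing embeddings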

import umap  # provided by the umap-learn package

reducer = umap.UMAP(n_components=2)
reduced = reducer.fit_transform(embeddings)

Comparison

Method   Speed    Global Structure   Local Structure   Best For
PCA      Fast     Good               Poor              Preprocessing
t-SNE    Slow     Poor               Excellent         Visualization
UMAP     Medium   Good               Good              General use

Visualization Example

import matplotlib.pyplot as plt
import umap

# Reduce to 2D
reducer = umap.UMAP(n_components=2)
coords = reducer.fit_transform(embeddings)

# Plot, coloring each point by its category
# (labels: one numeric category ID per embedding)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title("Embedding Visualization")
plt.show()

Caution: Reduced representations lose information. Use 2D or 3D projections only for visualization and analysis, not for similarity search.
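
One way to see this loss concretely is to check how many of each point's nearest neighbors survive the projection. The sketch below is a rough illustration under stated assumptions: neighbor_overlap is a hypothetical helper written for this example, the data is synthetic, and PCA stands in for whichever reducer is in use.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors


def neighbor_overlap(original, reduced, k=10):
    """Mean fraction of each point's k nearest neighbors shared by both spaces."""
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    idx_orig = NearestNeighbors(n_neighbors=k + 1).fit(original).kneighbors(
        original, return_distance=False
    )[:, 1:]
    idx_red = NearestNeighbors(n_neighbors=k + 1).fit(reduced).kneighbors(
        reduced, return_distance=False
    )[:, 1:]
    shared = [len(set(a) & set(b)) for a, b in zip(idx_orig, idx_red)]
    return float(np.mean(shared)) / k


# Synthetic stand-in for real embeddings: 300 vectors of 64 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 64))

coords_2d = PCA(n_components=2).fit_transform(embeddings)

overlap_2d = neighbor_overlap(embeddings, coords_2d)
overlap_same = neighbor_overlap(embeddings, embeddings)  # sanity check
print(f"2D overlap: {overlap_2d:.2f}, unreduced overlap: {overlap_same:.2f}")
```

An overlap well below 1.0 means a nearest-neighbor query in the 2D space would return different results from the same query on the full vectors.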

Tips

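  • For very high-dimensional embeddings, use PCA first to reduce to ~50 dimensions, then apply t-SNE/UMAP
  • UMAP hyperparameters: n_neighbors trades local detail against global structure; min_dist controls how tightly points pack together
  • t-SNE perplexity sets the effective neighborhood size (typical values are 5 to 50); lower values emphasize local structure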
  • Color points by category to see if embeddings capture semantics
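
The first tip, PCA followed by t-SNE, can be sketched as a two-step pipeline. This is a minimal example under assumptions: the clustered data is synthetic, and the values 50, 30, and random_state=0 are illustrative choices rather than tuned settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in: three well-separated clusters of 768-dimensional vectors.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 768))
labels = np.repeat(np.arange(3), 100)
embeddings = centers[labels] + rng.normal(size=(300, 768))

# Step 1: PCA down to 50 dimensions to cut the cost of the next step.
compact = PCA(n_components=50, random_state=0).fit_transform(embeddings)

# Step 2: t-SNE down to 2D. perplexity must be smaller than the sample count.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(compact)

print(compact.shape, coords.shape)
```

The resulting coords and labels can be passed straight to plt.scatter as in the visualization example to check whether the three groups separate.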