Why Reduce Dimensions?
- Visualization: Can't plot 768 dimensions
- Efficiency: Smaller vectors = faster processing
- Noise reduction: Remove redundant information
- Understanding: See clusters and patterns
Common Techniques
PCA (Principal Component Analysis)
- Linear transformation
- Preserves global structure
- Fast, deterministic
- Good for preprocessing
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)
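When PCA is used for preprocessing rather than plotting, the number of components is usually chosen from the explained variance instead of being fixed at 2. The sketch below shows one way to do that. The synthetic `embeddings` array and the 95% threshold are assumptions for illustration, not values from this guide.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for real embeddings: 500 vectors of 768 dimensions
# whose variance is concentrated in a 10-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 768))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(500, 768))

# Fit PCA with more components than we expect to need.
pca = PCA(n_components=50)
pca.fit(embeddings)

# Cumulative share of the total variance captured by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches the (assumed) 95% threshold.
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"{k} components capture {cumulative[k - 1]:.1%} of the variance")
```

On real embeddings the curve usually rises more gradually, so plotting `cumulative` and checking where it flattens can help more than relying on a single threshold.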
t-SNE
- Non-linear
- Great for visualization
- Preserves local structure
- Slow on large datasets
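from sklearn.manifold import TSNE
# perplexity must be smaller than the number of samples; random_state makes runs repeatable
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
reduced = tsne.fit_transform(embeddings)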
UMAP
- Non-linear, like t-SNE
- Preserves local structure and usually more global structure than t-SNE
- Typically faster than t-SNE on large datasets
- A common default choice
# Provided by the umap-learn package (pip install umap-learn)
import umap
reducer = umap.UMAP(n_components=2)
reduced = reducer.fit_transform(embeddings)
Comparison
| Method | Speed | Global Structure | Local Structure | Best For |
|---|---|---|---|---|
| PCA | Fast | Good | Poor | Preprocessing |
| t-SNE | Slow | Poor | Excellent | Visualization |
| UMAP | Medium | Good | Good | General use |
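The "Local Structure" column can be checked on your own data. A minimal sketch, assuming scikit-learn's `trustworthiness` score is an acceptable proxy for local structure and using synthetic clusters in place of real embeddings. It does not measure global structure.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# Hypothetical data: 300 points in 50 dimensions forming 5 clusters.
X, _ = make_blobs(n_samples=300, n_features=50, centers=5, random_state=0)

# Project the same data to 2D with a linear and a non-linear method.
pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Trustworthiness lies in [0, 1]; higher means the neighbors seen in the
# 2D map were also neighbors in the original space.
scores = {
    "PCA": trustworthiness(X, pca_2d, n_neighbors=10),
    "t-SNE": trustworthiness(X, tsne_2d, n_neighbors=10),
}
for name, score in scores.items():
    print(f"{name}: trustworthiness = {score:.3f}")
```

The scores depend on the dataset and on `n_neighbors`, so treat them as a comparison between methods on one dataset rather than as absolute quality numbers.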
Visualization Example
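import matplotlib.pyplot as plt
import umap
# Reduce to 2D (embeddings: array of shape (n_samples, n_dims))
reducer = umap.UMAP(n_components=2)
coords = reducer.fit_transform(embeddings)
# Plot, coloring each point by its category (labels: one integer per sample)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
plt.title("Embedding Visualization")
plt.show()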
Caution: Reduced representations lose information, and t-SNE and UMAP distort distances between points. Use 2D projections for visualization and analysis only, and run similarity search on the original embeddings.
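One way to see how much neighbor information is lost is to compare each point's nearest neighbors before and after reduction. The sketch below is illustrative only: `neighbor_overlap` is a hypothetical helper, the data is random, and PCA stands in for whichever reducer you use.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors


def neighbor_overlap(original, reduced, k=10):
    """Mean fraction of each point's k nearest neighbors kept after reduction."""
    def knn_indices(vectors):
        # Request k + 1 neighbors and drop the first column, which is
        # normally the query point itself (assumes no duplicate vectors).
        nn = NearestNeighbors(n_neighbors=k + 1).fit(vectors)
        return nn.kneighbors(vectors, return_distance=False)[:, 1:]

    before = knn_indices(original)
    after = knn_indices(reduced)
    shared = [len(set(a) & set(b)) for a, b in zip(before, after)]
    return float(np.mean(shared)) / k


# Hypothetical embeddings: 400 random vectors of 64 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(400, 64))

overlap_2d = neighbor_overlap(embeddings, PCA(n_components=2).fit_transform(embeddings))
overlap_32d = neighbor_overlap(embeddings, PCA(n_components=32).fit_transform(embeddings))
print(f"2 dims: {overlap_2d:.2f}, 32 dims: {overlap_32d:.2f}")
```

An overlap near 1.0 means search results would barely change. A 2D projection typically scores far lower than a moderate reduction, which is why it is a poor basis for similarity search.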
Tips
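- Use PCA first to reduce to ~50 dimensions, then apply t-SNE or UMAP; this speeds up the second step and suppresses noise
- UMAP hyperparameters: n_neighbors trades local detail for global structure (larger values favor global structure), and min_dist sets how tightly points are packed
- t-SNE perplexity roughly sets how many neighbors each point considers; typical values are 5 to 50, and it changes how tight clusters look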
- Color points by category to see if embeddings capture semantics
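The first tip can be sketched as a two-step reduction. The random `embeddings` array, its size, and the seeds below are assumptions for illustration; substitute your own vectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical embeddings: 300 vectors of 768 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 768))

# Step 1: linear reduction to 50 dimensions (fast, removes low-variance noise).
compact = PCA(n_components=50, random_state=0).fit_transform(embeddings)

# Step 2: non-linear reduction to 2D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(compact)

print(compact.shape, coords.shape)
```

The resulting `coords` array can be passed to `plt.scatter` and colored by category, as in the visualization example.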