Cosine Similarity

The go-to metric for measuring how similar two vectors are.

What is Cosine Similarity?

Cosine similarity is the cosine of the angle between two vectors. It depends only on the direction the vectors point in, not on their magnitude, and it is undefined when either vector has zero length.

\[\text{similarity}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| \cdot |\vec{b}|} = \frac{\sum a_i b_i}{\sqrt{\sum a_i^2} \cdot \sqrt{\sum b_i^2}}\]
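
As a worked example of the formula, take \(\vec{a} = (1, 2, 3)\) and \(\vec{b} = (2, 3, 4)\), the same vectors used in the code examples:

\[\vec{a} \cdot \vec{b} = 1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4 = 20, \qquad |\vec{a}| = \sqrt{1 + 4 + 9} = \sqrt{14}, \qquad |\vec{b}| = \sqrt{4 + 9 + 16} = \sqrt{29}\]

\[\text{similarity}(\vec{a}, \vec{b}) = \frac{20}{\sqrt{14} \cdot \sqrt{29}} = \frac{20}{\sqrt{406}} \approx 0.9926\]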

Value Range

  • 1: Same direction (maximally similar; the vectors need not be identical in length)
  • 0: Orthogonal (no similarity)
  • -1: Opposite direction
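
The three landmark values can be checked with a minimal sketch. The helper `cosine_similarity` below is written for illustration in plain Python (it is not the scikit-learn function of the same name) and assumes equal-length, non-zero vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: approximately 1
same = cosine_similarity([1, 2], [2, 4])
# Perpendicular vectors: 0
orthogonal = cosine_similarity([1, 0], [0, 3])
# Opposite direction: approximately -1
opposite = cosine_similarity([1, 2], [-1, -2])

print(same, orthogonal, opposite)
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the last few digits, so compare with a tolerance rather than with `==`.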

Why Cosine for Embeddings?

Cosine similarity is preferred for embeddings because:

  • Magnitude invariant: Focus on direction, not length
  • Bounded output: Always between -1 and 1
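  • Works in high dimensions: The cost grows linearly with the number of dimensions, so it stays practical for vectors with 1000+ dimensions
  • Standard: The metric that many embedding models are trained and evaluated with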
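
Magnitude invariance is the easiest of these properties to see in code. The sketch below scales one vector by 10 and compares the effect on cosine similarity and on Euclidean distance; the helper functions and the scale factor are illustrative choices, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]
b_scaled = [10 * y for y in b]  # same direction as b, 10x the length

cos_original = cosine_similarity(a, b)
cos_scaled = cosine_similarity(a, b_scaled)      # unchanged up to rounding
dist_original = euclidean_distance(a, b)         # sqrt(3), about 1.73
dist_scaled = euclidean_distance(a, b_scaled)    # much larger

print(cos_original, cos_scaled)
print(dist_original, dist_scaled)
```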

Cosine vs Euclidean Distance

Aspect              Cosine Similarity    Euclidean Distance
Measures            Angle                Straight-line distance
Range               -1 to 1              0 to infinity
Magnitude matters?  No                   Yes
Best for            Text, embeddings     Spatial data
Note: For normalized vectors (length = 1), cosine similarity and Euclidean distance give equivalent rankings.
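
The reason is that for unit vectors the squared Euclidean distance equals 2 - 2 × (cosine similarity), so a higher cosine always means a smaller distance. The sketch below checks both the identity and the ranking on a few made-up vectors; the helper names are illustrative.

```python
import math

def normalize(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = normalize([1.0, 2.0, 3.0])
candidates = [
    normalize(v)
    for v in ([2.0, 3.0, 4.0], [3.0, 1.0, 0.0], [-1.0, 0.5, 2.0])
]
indices = range(len(candidates))

# For unit vectors, the dot product is the cosine similarity.
by_cosine = sorted(indices, key=lambda i: -dot(query, candidates[i]))
by_distance = sorted(indices, key=lambda i: euclidean_distance(query, candidates[i]))

# Identity check: distance^2 == 2 - 2 * cosine (up to rounding)
identity_holds = all(
    math.isclose(euclidean_distance(query, c) ** 2, 2 - 2 * dot(query, c), abs_tol=1e-12)
    for c in candidates
)

print(by_cosine, by_distance, identity_holds)
```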

Code Examples

Python

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

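# cosine_similarity expects 2D arrays of shape (n_samples, n_features)
a = np.array([[1, 2, 3]])
b = np.array([[2, 3, 4]])

# The result is a 1x1 similarity matrix; take its single entry
similarity = cosine_similarity(a, b)[0][0]
# 0.9926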

JavaScript

function cosineSimilarity(a, b) {
    const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
    const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
    const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
    return dot / (magA * magB);
}

cosineSimilarity([1, 2, 3], [2, 3, 4]); // 0.9926

Applications in AI

  • Semantic search: Find documents similar to query
  • Recommendations: Find items similar to user preferences
  • Deduplication: Find near-duplicate content
  • Clustering: Group similar items together
  • RAG: Retrieve relevant context for LLMs
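
Most of these applications reduce to the same step: score every stored vector against a query and keep the best matches. A minimal sketch of that step for semantic search follows. The document IDs and 3-dimensional "embeddings" are made up for illustration, `top_k` is a hypothetical helper, and a real system would take vectors from an embedding model and typically use a vector index instead of a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, documents, k=2):
    """Return the k (score, doc_id) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query, vector), doc_id)
        for doc_id, vector in documents.items()
    ]
    return sorted(scored, reverse=True)[:k]

# Toy embeddings, invented for this example
documents = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_taxes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, documents, k=2)
for score, doc_id in results:
    print(f"{doc_id}: {score:.4f}")
```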

Typical Thresholds

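Rough rules of thumb for what similarity scores mean in practice (for text embeddings). Score distributions vary by model, so calibrate these cutoffs on your own data: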

  • > 0.9: Very similar / near duplicate
  • 0.7 - 0.9: Related content
  • 0.5 - 0.7: Somewhat related
  • < 0.5: Likely unrelated
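
The bands above can be turned into a small labeling helper. This is only a sketch: `describe_similarity` is a hypothetical function, and the cutoffs are the rule-of-thumb values from the list, which should be treated as adjustable defaults.

```python
def describe_similarity(score):
    """Map a cosine similarity score to the rough bands listed above."""
    if score > 0.9:
        return "very similar / near duplicate"
    if score >= 0.7:
        return "related content"
    if score >= 0.5:
        return "somewhat related"
    return "likely unrelated"

for score in (0.95, 0.8, 0.6, 0.2):
    print(score, describe_similarity(score))
```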