Question 1

What is Euclidean distance?

Accepted Answer

Euclidean distance is the straight-line distance between two points in space - the length of the line segment connecting them. It extends the Pythagorean theorem to any number of dimensions. For two vectors, take the difference in each dimension, square each difference, sum the squares, and take the square root of the sum. It's the most intuitive distance metric and is also known as L2 distance, because it equals the L2 norm of the difference between the two vectors.
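The steps in this answer can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `euclidean_distance` is chosen here for the example and is not from any particular package.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Difference per dimension, squared, summed, then square-rooted.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The classic 3-4-5 right triangle: the distance is the hypotenuse.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

The same function works unchanged for vectors with hundreds of dimensions, since it never assumes a particular length.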

Question 2

What is the difference between Euclidean and Manhattan distance?

Accepted Answer

Euclidean distance measures the shortest straight-line path between two points (like flying), while Manhattan distance measures the path along grid lines (like walking city blocks). Euclidean squares the per-dimension differences, sums them, and takes a square root; Manhattan sums the absolute differences directly. Manhattan is often preferred for sparse or high-dimensional data: because differences are not squared, a single outlier dimension contributes proportionally instead of dominating the total, and distances tend to keep somewhat more contrast as dimensionality grows.
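A small sketch may make the contrast concrete. It computes both distances for the same pair of points, then shows how much of each total comes from one outlier dimension. The helper names and sample vectors are made up for this illustration.

```python
import math

def manhattan_distance(a, b):
    # Sum of absolute per-dimension differences (L1).
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Square root of the sum of squared differences (L2).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

start, end = (0, 0), (3, 4)
print(manhattan_distance(start, end))  # 7   (3 blocks east + 4 blocks north)
print(euclidean_distance(start, end))  # 5.0 (the diagonal)

# One dimension differs by 10, the other three by 1 each.
a, b = [0, 0, 0, 0], [1, 1, 1, 10]
l1_share = 10 / manhattan_distance(a, b)       # share of the L1 total
l2_share = 10 ** 2 / sum((x - y) ** 2 for x, y in zip(a, b))  # share of the squared L2 total
print(round(l1_share, 2), round(l2_share, 2))  # 0.77 0.97
```

Squaring lets the single large difference account for almost the entire Euclidean total, which is the sense in which Manhattan is less sensitive to outlier dimensions.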

Question 3

When should I use cosine similarity?

Accepted Answer

Use cosine similarity when the direction of vectors matters more than their magnitude. This is ideal for text similarity (comparing documents of different lengths), recommendation systems, semantic search with embeddings, and any scenario where you care about the relationship or pattern in the data rather than absolute values. It's the go-to metric for NLP applications and vector embeddings from language models.
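As a rough illustration of why magnitude drops out, the sketch below compares a "short" and a "long" document whose term counts are in the same proportions. The vectors and the `cosine_similarity` helper are invented for the example, not taken from a specific library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

short_doc = [1, 2, 0]    # hypothetical term counts
long_doc = [10, 20, 0]   # same proportions, ten times longer
other_doc = [0, 0, 5]    # shares no terms with the others

print(cosine_similarity(short_doc, long_doc))   # approximately 1.0
print(cosine_similarity(short_doc, other_doc))  # 0.0
```

Euclidean distance would place `short_doc` and `long_doc` far apart purely because of length, while cosine similarity treats them as pointing the same way.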

Question 4

What is L1 vs L2 norm?

Accepted Answer

L1 norm (also called Manhattan norm or taxicab norm) is the sum of the absolute values of a vector's components. L2 norm (also called Euclidean norm) is the square root of the sum of the squared components. Applied to the difference between two vectors, L1 gives Manhattan distance and L2 gives Euclidean distance. In machine learning, an L1 loss is more robust to outliers and an L1 penalty promotes sparse solutions (many weights driven to exactly zero), while L2 penalizes large values more heavily and tends to give smoother solutions with many small, non-zero weights.
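The relationship between norms and distances can be sketched as follows: a norm measures one vector, and the matching distance is that norm applied to the difference of two vectors. The function names are illustrative only.

```python
import math

def l1_norm(v):
    # Sum of absolute component values.
    return sum(abs(x) for x in v)

def l2_norm(v):
    # Square root of the sum of squared components.
    return math.sqrt(sum(x * x for x in v))

def difference(a, b):
    return [x - y for x, y in zip(a, b)]

v = [3, -4]
print(l1_norm(v))  # 7
print(l2_norm(v))  # 5.0

# Distance between two vectors = norm of their difference.
a, b = [1, 2], [4, -2]
print(l1_norm(difference(a, b)))  # 7   (Manhattan distance)
print(l2_norm(difference(a, b)))  # 5.0 (Euclidean distance)
```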

Question 5

What is the difference between cosine similarity and cosine distance?

Accepted Answer

Cosine similarity measures how closely two vectors point in the same direction, ranging from -1 to 1, where 1 means identical direction, 0 means orthogonal, and -1 means opposite direction. Cosine distance is simply 1 minus the cosine similarity, giving a dissimilarity score from 0 to 2, where 0 means identical direction. They contain the same information on complementary scales - use similarity when you want to rank the most similar items, and distance when an algorithm such as k-NN expects smaller values to mean closer. Note that cosine distance is not a true metric in the strict mathematical sense, because it can violate the triangle inequality.
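A brief sketch of the conversion, under the common definition of cosine distance as 1 minus cosine similarity. The last few lines also check the triangle inequality on three small vectors, which cosine distance does not always satisfy; the vectors are chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def cosine_distance(a, b):
    # 0 = same direction, 1 = orthogonal, 2 = opposite direction.
    return 1.0 - cosine_similarity(a, b)

print(cosine_distance([1, 0], [1, 0]))   # 0.0
print(cosine_distance([1, 0], [0, 1]))   # 1.0
print(cosine_distance([1, 0], [-1, 0]))  # 2.0

# Going a -> b -> c is "shorter" than going a -> c directly,
# so the triangle inequality fails for this triple.
a, b, c = [1, 0], [1, 1], [0, 1]
direct = cosine_distance(a, c)
via_b = cosine_distance(a, b) + cosine_distance(b, c)
print(direct > via_b)  # True
```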

Question 6

Which distance metric is best for high-dimensional data?

Accepted Answer

For high-dimensional data, cosine similarity or Manhattan distance often work better than Euclidean distance, though no single metric is best for every dataset. The reason is the 'curse of dimensionality' - in high dimensions, Euclidean distances tend to become similar for all pairs of points, so the nearest and farthest neighbors are hard to tell apart. Cosine similarity ignores magnitude and compares direction only, which often works well for embeddings, although it is not immune to the same concentration effect. Manhattan distance tends to retain more contrast because it doesn't square the differences, making it less sensitive to large deviations in individual dimensions.
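The loss of discriminative power can be observed with a small experiment. The sketch below assumes uniformly random points in a unit hypercube (a simplification; real embeddings are not uniform) and reports the relative contrast, meaning (farthest - nearest) / nearest distance from one query point. The function name, point count, and seed are arbitrary choices for this example.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A contrast near zero means the nearest neighbor is barely closer than the farthest one, which is what makes nearest-neighbor search less meaningful in very high dimensions.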

Question 7

How do vector databases use distance metrics?

Accepted Answer

Vector databases use distance metrics to find the stored vectors most similar to a query vector. When you perform a similarity search, the database scores stored vectors against your query using the configured metric, typically through an approximate nearest-neighbor index instead of an exhaustive scan. Most vector databases (Pinecone, Weaviate, Qdrant, Milvus) support multiple metrics. Cosine similarity is most common for text embeddings, while Euclidean is often used for image features. The choice of metric significantly impacts search results and should match the metric your embedding model was trained with.
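Conceptually, a similarity search is "score every stored vector against the query and keep the closest k". The sketch below is a brute-force toy with a pluggable metric, not the API of any real vector database; the `search` function, the `store` dictionary, and the sample vectors are all hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def search(query, store, metric, k=2):
    """Return the ids of the k stored vectors closest to query under metric."""
    ranked = sorted(store, key=lambda item_id: metric(query, store[item_id]))
    return ranked[:k]

store = {
    "a": [1.0, 0.0],
    "b": [10.0, 1.0],  # similar direction to the query, much larger magnitude
    "c": [0.0, 1.0],
}
query = [2.0, 0.3]

print(search(query, store, euclidean))        # ['a', 'c']
print(search(query, store, cosine_distance))  # ['b', 'a']
```

The same data returns different neighbors depending on the configured metric: Euclidean pushes "b" away because of its magnitude, while cosine ranks it first because its direction is closest to the query.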

Question 8

What is dot product distance and when is it equivalent to cosine?

Accepted Answer

Dot product distance is the negative of the dot product between two vectors; the sign is flipped so that a larger dot product (more similar) becomes a smaller distance. When vectors are normalized (have unit length), dot product becomes equivalent to cosine similarity because the denominator in the cosine formula equals 1. Many embedding models output normalized vectors, making dot product a faster alternative to cosine because it skips computing the two norms. Always check whether your embeddings are normalized - if so, use dot product for speed; if not, use cosine when you want magnitude ignored.
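The equivalence can be checked directly. In this sketch, `normalize` scales a vector to unit length, after which the plain dot product matches the cosine similarity of the original vectors. The helper names and sample vectors are assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def dot_product_distance(a, b):
    # Negated so that "more similar" means "smaller distance".
    return -dot(a, b)

a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine_similarity(a, b))          # approximately 0.96
print(dot(normalize(a), normalize(b)))  # approximately 0.96
print(dot(a, b))                        # 24.0 (raw dot product, not a cosine value)
```

A quick way to test whether embeddings are already normalized is to check that `dot(v, v)` is close to 1 for a sample of vectors.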

Question 9

Can I use different distance metrics for the same data?

Accepted Answer

Yes, you can use different metrics on the same data, and it often makes sense to experiment. Different metrics can rank similarities differently and may surface different nearest neighbors. However, the metric should match your use case and ideally match what the embedding model was trained with. For example, OpenAI recommends cosine similarity for its embeddings. Because those embeddings are documented as normalized to unit length, Euclidean distance and dot product produce the same ranking as cosine on them; on unnormalized vectors, switching metrics can change the results.
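Whether switching metrics changes the ranking depends on the vectors. For unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so the two metrics order neighbors identically. The sketch below checks this on a few made-up vectors; all names and values are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

raw = {"x": [5.0, 0.0], "y": [1.0, 1.0], "z": [0.0, 2.0], "w": [-3.0, 1.0]}
vectors = {name: normalize(v) for name, v in raw.items()}
query = normalize([1.0, 0.2])

# On unit vectors the dot product is the cosine similarity.
by_cosine = sorted(vectors, key=lambda n: -dot(query, vectors[n]))
by_euclidean = sorted(vectors, key=lambda n: squared_euclidean(query, vectors[n]))
print(by_cosine == by_euclidean)  # True

# The identity behind it: ||q - v||^2 = 2 - 2 * cos(q, v) for unit vectors.
for v in vectors.values():
    assert math.isclose(squared_euclidean(query, v), 2 - 2 * dot(query, v))
```

Without the normalization step the two orderings are free to diverge, which is when experimenting with metrics is most informative.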

Question 10

What is Hamming distance and when is it used?

Accepted Answer

Hamming distance counts the number of positions where corresponding elements differ between two equal-length vectors. It's primarily used for binary vectors or strings. In machine learning, it's common for binary hash codes and for comparing binary feature vectors; it is also central to error detection and correction in communications. For example, comparing two binary strings '1010' and '1001' gives a Hamming distance of 2 (the last two positions differ). For binary data it's extremely fast to compute: XOR the two values and count the set bits.
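Both approaches mentioned in this answer can be sketched briefly: a position-by-position comparison that works for any equal-length sequences, and the XOR-and-count variant for integers treated as bit patterns. The function names are chosen for this example.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def hamming_distance_bits(x, y):
    """Hamming distance between two non-negative ints viewed as bit patterns."""
    # XOR leaves a 1 wherever the bits differ; count those 1s.
    return bin(x ^ y).count("1")

print(hamming_distance("1010", "1001"))       # 2
print(hamming_distance_bits(0b1010, 0b1001))  # 2
```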

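Metric	Magnitude-Sensitive	Best For
Euclidean	Yes	Physical measurements
Manhattan	Yes	Grid data, sparse vectors
Cosine	No	Text, embeddings
Dot Product	Yes*	Normalized vectors

* On vectors normalized to unit length, dot product equals cosine similarity, so magnitude no longer plays a role.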
