Sentence Embeddings: Beyond Words

From Words to Sentences

While Word2Vec embeddings represent individual words, sentence embeddings map an entire sentence, paragraph, or document to a single vector that captures its overall meaning.

Popular Models

Model	Dimensions	Notes
all-MiniLM-L6-v2	384	Fast, good quality
all-mpnet-base-v2	768	Higher quality than MiniLM, slower
text-embedding-3-small	1536	OpenAI API, cheaper of the two
text-embedding-3-large	3072	OpenAI API, highest quality
Cohere embed-v3	1024	Cohere API, English and multilingual variants
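
The dimension column matters in practice because it sets the raw storage cost of a vector index. A rough sketch of that arithmetic, assuming each value is stored as an uncompressed 32-bit float (4 bytes) and ignoring any index overhead. The helper name index_size_bytes and the one-million-sentence corpus are illustrative, not part of any library:

```python
# Dimensions taken from the table above.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "Cohere embed-v3": 1024,
}

BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 storage


def index_size_bytes(num_vectors: int, dims: int) -> int:
    """Raw size of num_vectors embeddings, excluding index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32


corpus_size = 1_000_000  # hypothetical corpus of one million sentences
for name, dims in MODEL_DIMS.items():
    gigabytes = index_size_bytes(corpus_size, dims) / 1e9
    print(f"{name}: {gigabytes:.2f} GB")
```

Under these assumptions, one million sentences take about 1.5 GB at 384 dimensions and about 12.3 GB at 3072, which is one reason smaller models stay attractive for large corpora.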

How They Work

Built on a transformer encoder (BERT and similar)
Trained on sentence pairs labeled as similar or dissimilar
Contrastive learning pulls similar sentences together and pushes dissimilar ones apart in vector space
Token embeddings are pooled (typically averaged) into a single vector
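
The pooling and contrastive steps can be sketched in plain Python. This is a toy illustration, not the training code of any model listed here: it assumes mean pooling over non-padding tokens and an InfoNCE-style loss on cosine similarity, and the function names, the temperature of 0.05, and the 2-dimensional vectors are all made up for the example.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dims = len(token_vectors[0])
    total = [0.0] * dims
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: small when anchor is closer to positive
    than to any of the negatives (assumed loss, for illustration)."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    return log_sum_exp - logits[0]


# Toy token vectors; the last token is padding (mask 0) and is ignored.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)

# The loss is low when the positive is near the anchor, high otherwise.
good = contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
print(sentence_vector, good < bad)
```

Minimizing a loss of this shape during training is what moves paraphrases close together in the embedding space. Some models pool differently (for example, by taking a dedicated first-token vector), so treat mean pooling as one common choice.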

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "I love programming",
    "Coding is my passion",
    "The weather is nice"
]

embeddings = model.encode(sentences)
# embeddings has shape (3, 384); embeddings[0] and embeddings[1]
# should be closer to each other than either is to embeddings[2]
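
"Similar" here usually means high cosine similarity. A minimal sketch of comparing and ranking vectors, using made-up 3-dimensional stand-ins for the three sentences above so that it runs without downloading a model. The helper names and the toy values are assumptions for illustration only:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Made-up stand-ins for model.encode(sentences); real all-MiniLM-L6-v2
# vectors have 384 dimensions and different values.
toy_embeddings = [
    [0.9, 0.1, 0.0],  # "I love programming"
    [0.8, 0.2, 0.1],  # "Coding is my passion"
    [0.0, 0.1, 0.9],  # "The weather is nice"
]

# Rank the other two sentences against the first one.
order = rank_by_similarity(toy_embeddings[0], toy_embeddings[1:])
print(order)  # indices into toy_embeddings[1:]
```

The same functions accept the rows of the array returned by model.encode. The library also provides sentence_transformers.util.cos_sim, which computes all pairwise scores in one call.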

Key Advantages

Captures the meaning of the full sentence, not just individual words
Handles synonyms and paraphrases
Works across languages (with multilingual models)
Fixed-size output regardless of input length, though each model caps how much text it can read

Choosing a Model

Speed critical: all-MiniLM-L6-v2
Higher quality (open-source): all-mpnet-base-v2
Hosted API: OpenAI text-embedding-3 or Cohere embed-v3
Multilingual: multilingual-e5-large