Vector Databases

Purpose-built databases for storing and searching vector embeddings at scale.


What is a Vector Database?

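A vector database is a specialized database designed to store, index, and query high-dimensional vectors efficiently. Unlike traditional databases, which excel at exact matches, a vector database returns the items whose vectors lie closest to a query vector under a distance metric such as cosine similarity or Euclidean distance.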

Why Do We Need Them?

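Traditional database indexes, such as B-trees and hash indexes, are built for equality and range lookups, not for similarity between vectors. Without a specialized index, finding the nearest neighbors among millions of 1536-dimensional vectors means comparing the query against every stored vector, which is too slow for interactive use.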
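To make the cost concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search using cosine similarity. The function names and the toy 3-dimensional vectors are illustrative assumptions, not part of any particular database; the point is that every query touches every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k):
    """Score every stored vector against the query: O(n * d) work per query."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(vec_id, score) for score, vec_id in scored[:k]]

# Toy data (hypothetical); real embeddings usually have hundreds or thousands of dimensions.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "car": [0.0, 0.2, 0.9],
}
top = exact_top_k([1.0, 0.0, 0.0], vectors, k=2)
print([vec_id for vec_id, _ in top])  # ['cat', 'dog']
```

With n stored vectors of d dimensions, each query costs on the order of n × d multiplications, which is the work a vector index is designed to avoid.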

Traditional DB

Exact match: WHERE id = 123

Fast for equality, useless for similarity

Vector DB

Similarity search: Find top 10 nearest

Optimized for approximate nearest neighbor

How They Work

Indexing Algorithms

Vector databases use specialized indexes:

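  • HNSW (Hierarchical Navigable Small World) - Graph-based, the most common choice
  • IVF (Inverted File Index) - Cluster-based; searches only the clusters nearest the query
  • PQ (Product Quantization) - Compression-based; shrinks vectors to short codes, often combined with IVF
  • Annoy - Tree-based library, open-sourced by Spotify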
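As one illustration of how such an index narrows the search, the following is a simplified sketch of the cluster-based (IVF) idea. It assumes the centroids are already known (in practice they are typically learned with k-means), and the names `build_ivf`, `ivf_search`, and `nprobe` are chosen here for illustration rather than taken from a specific library.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [vec_id for i in probe for vec_id in lists[i]]
    candidates.sort(key=lambda vec_id: l2(query, vectors[vec_id]))
    return candidates[:k]

# Toy 2-dimensional data forming two obvious clusters (hypothetical values).
vectors = {
    "a": [0.0, 0.1], "b": [0.2, 0.0], "c": [0.1, 0.3],
    "d": [5.0, 5.1], "e": [5.2, 4.9], "f": [4.8, 5.0],
}
centroids = [[0.1, 0.1], [5.0, 5.0]]  # assumed precomputed, e.g. by k-means
lists = build_ivf(vectors, centroids)
hits = ivf_search([5.0, 5.0], vectors, centroids, lists, k=2, nprobe=1)
print(hits)  # ['d', 'f']
```

With `nprobe=1` only three of the six vectors are compared against the query. Raising `nprobe` scans more clusters, which costs more time but reduces the chance of missing a true neighbor that sits just across a cluster boundary.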

ANN vs Exact Search

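Most vector databases use Approximate Nearest Neighbor (ANN) search. ANN trades a small amount of accuracy for much lower query latency: it may miss some of the true nearest neighbors, and most indexes expose parameters that let you tune that trade-off.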
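The accuracy loss of an approximate search is commonly reported as recall@k: the fraction of the true top-k neighbors that the approximate search also returned. A minimal sketch follows; the id lists are made-up toy data standing in for the output of an exact search and an ANN search over the same query.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    true_set = set(exact_ids[:k])
    found = sum(1 for vec_id in approx_ids[:k] if vec_id in true_set)
    return found / k

# Hypothetical results for one query.
exact = ["v7", "v2", "v9", "v4", "v1"]    # ground truth from a full scan
approx = ["v7", "v9", "v2", "v5", "v1"]   # what an ANN index returned

recall = recall_at_k(approx, exact, k=5)
print(recall)  # 0.8
```

Order within the top k does not affect this metric; only membership does. Here the approximate search missed `v4`, so recall@5 is 4 out of 5.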

Database    Type          Best For
Pinecone    Managed       Production, ease of use
Weaviate    Open Source   Hybrid search, GraphQL
Qdrant      Open Source   Performance, filtering
Milvus      Open Source   Scale, enterprise
Chroma      Open Source   Local dev, Python
pgvector    Extension     PostgreSQL users

Basic Operations

1. Insert Vectors

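# Pinecone example. Assumes the `pinecone` Python client and an existing index;
# the API key and index name are placeholders.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")

# Each record has an id, the embedding values, and optional metadata.
index.upsert(vectors=[
    {"id": "vec1", "values": [0.1, 0.2, ...], "metadata": {"text": "..."}},
    {"id": "vec2", "values": [0.3, 0.4, ...], "metadata": {"text": "..."}}
])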

2. Query Similar Vectors

results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=10,
    include_metadata=True
)

# The hits are in results["matches"], sorted by score:
# [{"id": "vec1", "score": 0.95, ...}, ...]

3. Filter Results

results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=10,
    filter={"category": "science"}
)
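To show what these three operations do conceptually, here is a toy in-memory store that mirrors the shape of the calls above. It is a sketch for illustration only: `ToyVectorStore` is a hypothetical class, it scans every record instead of using an index, and it supports only exact-equality metadata filters. Real databases typically apply filters inside the index search rather than with a linear scan.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector index (no ANN, full scan)."""

    def __init__(self):
        self._rows = {}  # id -> (values, metadata)

    def upsert(self, items):
        # Insert new ids and overwrite existing ones.
        for item in items:
            self._rows[item["id"]] = (item["values"], item.get("metadata", {}))

    def query(self, vector, top_k=10, filter=None):
        matches = []
        for vec_id, (values, metadata) in self._rows.items():
            # Skip records whose metadata does not match every filter key.
            if filter and any(metadata.get(key) != want for key, want in filter.items()):
                continue
            matches.append({"id": vec_id, "score": _cosine(vector, values), "metadata": metadata})
        matches.sort(key=lambda m: m["score"], reverse=True)
        return matches[:top_k]

store = ToyVectorStore()
store.upsert([
    {"id": "vec1", "values": [1.0, 0.0], "metadata": {"category": "science"}},
    {"id": "vec2", "values": [0.9, 0.1], "metadata": {"category": "sports"}},
    {"id": "vec3", "values": [0.0, 1.0], "metadata": {"category": "science"}},
])

unfiltered = store.query([1.0, 0.0], top_k=2)
filtered = store.query([1.0, 0.0], top_k=2, filter={"category": "science"})
print([m["id"] for m in unfiltered])  # ['vec1', 'vec2']
print([m["id"] for m in filtered])    # ['vec1', 'vec3']
```

Note how the filter changes the result: `vec2` is the second-closest vector overall, but it is excluded because its category is not "science", so the less similar `vec3` takes its place.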

How to Choose

Choose Managed (Pinecone) if:

  • You want zero infrastructure management
  • You need enterprise support
  • Budget allows for managed services

Choose Open Source (Qdrant, Weaviate) if:

  • You want to self-host
  • You need full control
  • Cost is a concern at scale

Choose pgvector if:

  • You already use PostgreSQL
  • You want vectors alongside relational data
  • Scale is moderate (as a rough guide, under about 1M vectors)
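When judging scale, a quick back-of-the-envelope estimate of raw vector storage can help. The sketch below assumes 32-bit floats (4 bytes per value) and counts only the vectors themselves; index structures, metadata, and replication add to this and vary by database.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming fixed-width values (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value

size = raw_vector_bytes(1_000_000, 1536)
print(f"{size / 1e9:.2f} GB")  # 6.14 GB
```

So one million 1536-dimensional float32 vectors take roughly 6 GB before any index overhead, which is a useful baseline when comparing hosting options.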

Next Steps