What We'll Build
A simple semantic search engine over a small collection of documents, built with Python, sentence-transformers (to compute embeddings), and Chroma (to store and query them).
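Before installing anything, it may help to see the core idea in plain Python. The sketch below ranks documents by cosine similarity between vectors. The three-dimensional vectors and the names `cosine_similarity`, `rank`, `toy_vectors`, and `toy_query` are made up for illustration. Real embeddings have hundreds of dimensions, and a vector store uses an index rather than a full scan like this one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs, k=2):
    """Return the names of the k vectors most similar to query_vec."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hand-made vectors standing in for real embeddings (illustrative only)
toy_vectors = {
    "css": [0.9, 0.1, 0.0],
    "javascript": [0.8, 0.3, 0.1],
    "machine learning": [0.1, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]  # pretend this encodes "web development"

top = rank(toy_query, toy_vectors)
print(top)  # ['css', 'javascript']
```

The rest of the tutorial swaps the toy vectors for real embeddings and the full scan for a Chroma collection.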
Step 1: Install Dependencies
pip install chromadb sentence-transformers
Step 2: Create the Search Engine
import chromadb
from sentence_transformers import SentenceTransformer
# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Initialize Chroma (in-memory for demo)
client = chromadb.Client()
collection = client.create_collection("docs")
# Sample documents
documents = [
    "Python is a programming language",
    "JavaScript runs in the browser",
    "Machine learning uses neural networks",
    "CSS styles web pages",
    "Vectors represent data as numbers"
]
# Add documents to collection
collection.add(
    documents=documents,
    ids=[f"doc{i}" for i in range(len(documents))],
    embeddings=model.encode(documents).tolist()
)
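For a larger collection, one possible approach is to embed and add the documents in batches instead of in a single call. The helper below is a sketch, not part of Chroma: `add_in_batches`, `batch_size`, and `prefix` are hypothetical names, and it assumes `collection` and `model` behave like the objects created above.

```python
def add_in_batches(collection, model, documents, batch_size=100, prefix="doc"):
    """Embed and add documents in fixed-size batches; return the ids used."""
    all_ids = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # Ids follow each document's position in the full list, so they stay unique
        ids = [f"{prefix}{start + i}" for i in range(len(batch))]
        collection.add(
            documents=batch,
            ids=ids,
            embeddings=model.encode(batch).tolist(),
        )
        all_ids.extend(ids)
    return all_ids
```

Calling `add_in_batches(collection, model, documents)` on the five sample documents would produce the same ids as the single `collection.add` call above (`doc0` to `doc4`).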
Step 3: Search
# Search function
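def search(query, k=3):
    """Return the k documents most similar to the query string."""
    query_embedding = model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    # One result list per query embedding; we sent exactly one query
    return results['documents'][0]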
# Try it!
print(search("web development"))
# ['CSS styles web pages', 'JavaScript runs in the browser', ...]
print(search("AI and data"))
# ['Machine learning uses neural networks', 'Vectors represent...', ...]
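The search function above returns only the document texts. The query result also carries ids and distances, each as a list of lists with one inner list per query. The sketch below flattens a single-query result into one record per hit. `unpack_results` is a hypothetical helper, the `example` dict is hand-written with made-up distances, and the code assumes the result includes `ids`, `documents`, and `distances`, which is what a default query returns as far as I know.

```python
def unpack_results(results):
    """Turn a single-query result dict into a list of per-hit dicts."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    distances = results["distances"][0]
    return [
        {"id": i, "document": d, "distance": dist}
        for i, d, dist in zip(ids, docs, distances)
    ]

# Hand-written dict shaped like a query response (distances are made up)
example = {
    "ids": [["doc3", "doc1"]],
    "documents": [["CSS styles web pages", "JavaScript runs in the browser"]],
    "distances": [[0.9, 1.1]],
}

hits = unpack_results(example)
for hit in hits:
    print(f"{hit['distance']:.2f}  {hit['id']}  {hit['document']}")
```

A smaller distance means a closer match, so printing it next to each hit makes it easier to judge whether a result is actually relevant or just the least bad option.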
Step 4: Add Metadata
# Add with metadata
collection.add(
    documents=["React is a UI library"],
    ids=["doc5"],
    embeddings=model.encode(["React is a UI library"]).tolist(),
    metadatas=[{"category": "frontend", "year": 2024}]
)
# Filter by metadata
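results = collection.query(
    query_embeddings=[model.encode("UI frameworks").tolist()],
    n_results=5,
    where={"category": "frontend"}
)
# Only documents added with matching metadata can be returned;
# the five documents from Step 2 were added without any
print(results['documents'][0])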
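Filters can also combine several conditions using Chroma's `$and` and `$gte` operators. The sketch below builds a `where` dict from optional arguments. `build_where`, `category`, and `min_year` are hypothetical names chosen to match the metadata keys used above.

```python
def build_where(category=None, min_year=None):
    """Build a `where` filter dict from optional conditions (hypothetical helper)."""
    conditions = []
    if category is not None:
        conditions.append({"category": category})
    if min_year is not None:
        conditions.append({"year": {"$gte": min_year}})
    if not conditions:
        return None  # no filter at all
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no $and wrapper
    return {"$and": conditions}

where = build_where(category="frontend", min_year=2023)
print(where)
```

The resulting dict can be passed as `where=` to `collection.query`, for example `collection.query(query_embeddings=[...], n_results=5, where=build_where("frontend", 2023))`.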
Full Working Example
import chromadb
from sentence_transformers import SentenceTransformer
# Setup
model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.Client()
collection = client.create_collection("my_docs")
# Your documents
docs = [
    "How to make pasta carbonara",
    "Best practices for React hooks",
    "Introduction to vector databases",
    "Guide to machine learning basics"
]
# Index
collection.add(
    documents=docs,
    ids=[f"d{i}" for i in range(len(docs))],
    embeddings=model.encode(docs).tolist()
)
# Search
query = "cooking Italian food"
results = collection.query(
    query_embeddings=[model.encode(query).tolist()],
    n_results=2
)
print(f"Query: {query}")
for doc in results['documents'][0]:
    print(f" - {doc}")
Next Steps
- Try different embedding models
- Add more documents
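- Extend the metadata filters from Step 4 (for example, combine several conditions)
- Persist the index to disk instead of keeping it in memory (for example, with chromadb.PersistentClient)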
- Build a web API around it