Semantic search transforms how users find information by moving beyond simple keyword matching to understanding intent through vector embeddings. To make this practical, you need a fast, scalable vector store—Pinecone fits the bill. The following walkthrough turns a FastAPI service into a full semantic search engine, from embedding generation to live similarity queries.
Why vector search beats keyword matching
Traditional search relies on exact matches, synonyms, or fuzzy text rules, which often miss the user’s true intent. Semantic search encodes documents and queries into dense vectors so that “how does machine learning work?” and “explain neural networks” land closer together than unrelated phrases. Pinecone stores these embeddings and supports approximate nearest-neighbor lookups in milliseconds, making real-time semantic search viable at scale.
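To make "land closer together" concrete, here is a minimal sketch that ranks documents by cosine similarity. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the names cosine_similarity, rank, toy_docs, and toy_query are illustrative. It scans every document exactly, whereas Pinecone uses approximate nearest-neighbor search to avoid a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, docs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hand-made toy vectors, NOT real model output (assumption for illustration).
toy_docs = {
    "neural-networks": [0.9, 0.1, 0.0],
    "cooking-pasta": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a machine-learning question

ranking = rank(toy_query, toy_docs)
print(ranking[0][0])  # neural-networks
```

A real embedding model does the same thing in 384 dimensions instead of three; only the source of the vectors changes.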
Modern embedding models like all-MiniLM-L6-v2 compress sentences into 384-dimensional vectors while keeping quality high. On a standard CPU, encoding a 1 KB passage takes roughly 5 ms, though the exact figure depends on hardware and batch size. That is fast enough for most web services and small compared with Pinecone's sub-100 ms query latency on a properly configured index.
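A quick back-of-the-envelope calculation shows what 384 dimensions mean for storage. This sketch assumes each component is stored as a 32-bit float and ignores metadata and index overhead, so treat the result as a lower bound rather than a figure Pinecone reports.

```python
DIMENSIONS = 384        # output size of all-MiniLM-L6-v2, as stated above
BYTES_PER_FLOAT32 = 4   # assumption: components stored as 32-bit floats


def raw_vector_bytes(num_vectors, dimensions=DIMENSIONS):
    """Raw size of the vectors alone, excluding metadata and index structures."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


per_vector = raw_vector_bytes(1)
one_million_mb = raw_vector_bytes(1_000_000) // 1_000_000  # decimal megabytes

print(per_vector)       # 1536
print(one_million_mb)   # 1536
```

Under these assumptions, a million documents need about 1.5 GB of raw vector data.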
Setting up a reproducible Python environment
A clean workspace ensures the tutorial runs the same on every machine. Start by creating an isolated Python 3.11 virtual environment:
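    python3 -m venv venv
    source venv/bin/activate
    python -V

With the environment active, install the four core packages:

    pip install "fastapi[all]" uvicorn "pinecone-client<3" sentence-transformers

After installation, verify the versions:

    pip list | grep -E 'fastapi|uvicorn|pinecone|sentence-transformers'

Each package is installed from PyPI as an official release, which keeps development and deployment environments consistent. The pinecone-client pin matters: the code in this walkthrough calls pinecone.init, which belongs to the 2.x client and was removed in version 3.0. The quotes around fastapi[all] stop shells such as zsh from treating the brackets as a glob pattern.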
Designing the FastAPI service with Pydantic models
The service exposes three endpoints: a health probe, a document ingestion route, and a search route. Pydantic models handle JSON validation and OpenAPI schema generation automatically.
Define two models in a new file, say models.py:
    from pydantic import BaseModel


    class Document(BaseModel):
        id: str
        text: str


    class Query(BaseModel):
        query: str
        top_k: int = 5

These models ensure incoming payloads are validated before any processing begins, reducing boilerplate and future bugs.
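To see what validation buys you, the sketch below parses a few payloads outside FastAPI. BoundedQuery and try_parse are hypothetical names, and the 1 to 100 range on top_k is an assumption added for illustration; the models above accept any integer.

```python
from pydantic import BaseModel, Field, ValidationError


class BoundedQuery(BaseModel):
    """Hypothetical variant of Query with an assumed 1..100 limit on top_k."""
    query: str
    top_k: int = Field(default=5, ge=1, le=100)


def try_parse(payload: dict):
    """Return a validated model, or None if the payload is rejected."""
    try:
        return BoundedQuery(**payload)
    except ValidationError:
        return None


ok = try_parse({"query": "What is deep learning?"})  # top_k falls back to 5
too_small = try_parse({"query": "x", "top_k": 0})    # rejected: below the bound
missing = try_parse({"top_k": 3})                    # rejected: query is required
```

When a model like this is used as an endpoint parameter, FastAPI performs the same check and answers invalid requests with a 422 response before the handler runs.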
Building the core API and wiring Pinecone
Create main.py and import the dependencies:
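    from fastapi import FastAPI, HTTPException
    from sentence_transformers import SentenceTransformer
    import pinecone

    from models import Document, Query

    app = FastAPI(title="Semantic Search Service")
    # Load the embedding model once at startup; it produces 384-dimensional vectors.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    # pinecone.init is the 2.x client API; replace the placeholder with your own key.
    pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="us-west1-gcp")
    index = pinecone.Index("semantic-demo")

Add the endpoints: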
    @app.get("/health")
    def health():
        return {"status": "ok"}


    @app.post("/ingest")
    def ingest(doc: Document):
        # Embed the text and store it alongside the original text as metadata.
        vector = model.encode(doc.text).tolist()
        upsert_response = index.upsert(vectors=[(doc.id, vector, {"text": doc.text})])
        if upsert_response['upserted_count'] != 1:
            raise HTTPException(status_code=500, detail="Failed to upsert")
        return {"result": "ingested"}


    @app.post("/search")
    def search(q: Query):
        query_vec = model.encode(q.query).tolist()
        result = index.query(vector=query_vec, top_k=q.top_k, include_metadata=True)
        # Convert client objects to plain dicts so FastAPI can serialize them as JSON.
        matches = [
            {"id": m["id"], "score": m["score"], "metadata": m["metadata"]}
            for m in result["matches"]
        ]
        return {"matches": matches}

Spin up the server with Uvicorn:

    uvicorn main:app --reload

The /health endpoint confirms the service is alive, /ingest stores embeddings with metadata, and /search returns the top k matches along with their similarity scores.
Managing your Pinecone index for performance
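Pinecone indexes are partitioned collections optimized for fast vector lookups. The index must exist before the service starts, so create one named semantic-demo with 384 dimensions and the cosine metric. With the 2.x client this is a single call, run once after pinecone.init:

    pinecone.create_index("semantic-demo", dimension=384, metric="cosine")

Calling pinecone.list_indexes() afterwards should include the new index. The cosine metric normalizes vectors before the inner-product calculation, so scores depend on direction rather than magnitude, which suits semantic similarity tasks. Each partition handles a subset of the vectors, which helps keep upserts by document ID fast as the index grows.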
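The claim about normalization can be checked by hand. This minimal sketch, using made-up two-dimensional vectors, shows that cosine similarity equals the inner product of the length-normalized vectors and that rescaling an input leaves the score unchanged. The helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(vec, vec))
    return [x / length for x in vec]


a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine computed directly from its definition.
cosine_direct = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Same value obtained as an inner product of unit-length vectors.
cosine_via_normalization = dot(normalize(a), normalize(b))

# Multiplying a vector by 10 changes its magnitude but not the score.
cosine_scaled = dot(normalize([30.0, 40.0]), normalize(b))
```

All three values come out at 0.96 up to floating-point rounding.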
To test the pipeline, send a sample document to the running service (Uvicorn listens on http://127.0.0.1:8000 by default):

    curl -X POST http://127.0.0.1:8000/ingest \
      -H "Content-Type: application/json" \
      -d '{"id":"doc1","text":"Machine learning enables computers to learn from data"}'

Then run a semantic query:

    curl -X POST http://127.0.0.1:8000/search \
      -H "Content-Type: application/json" \
      -d '{"query":"What is deep learning?","top_k":3}'

The response includes the most relevant document IDs, their cosine similarity scores, and the original text stored as metadata.
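The same two calls can be prepared from Python with only the standard library. This is a sketch, not part of the service: BASE_URL assumes Uvicorn's default host and port, and build_post and send are hypothetical helper names. Building the requests works offline; send needs the server running.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: Uvicorn's default host and port


def build_post(path, payload):
    """Build (but do not send) a JSON POST request for the service."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response (server must be up)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


ingest_req = build_post(
    "/ingest",
    {"id": "doc1", "text": "Machine learning enables computers to learn from data"},
)
search_req = build_post("/search", {"query": "What is deep learning?", "top_k": 3})

# With the service running: send(ingest_req), then send(search_req)["matches"]
```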
Scaling and securing the service
For production workloads:
- Use a managed Pinecone index with auto-scaling to handle traffic spikes.
- Inject the Pinecone API key via an environment variable instead of hard-coding it.
- Place FastAPI behind a reverse proxy like Nginx to enable HTTPS and rate limiting.
- Cache frequent queries with Redis if your use case supports it.
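As one way to follow the API key advice above, the sketch below reads the settings from the environment and fails fast when the key is missing. The variable names PINECONE_API_KEY and PINECONE_ENVIRONMENT and the helper load_pinecone_settings are assumptions chosen for this example, not names the client requires.

```python
import os


def load_pinecone_settings(env=os.environ):
    """Read Pinecone settings from a mapping (the process environment by default)."""
    api_key = env.get("PINECONE_API_KEY")  # assumed variable name
    if not api_key:
        # Failing at startup is easier to debug than a failed request later.
        raise RuntimeError("PINECONE_API_KEY is not set")
    # Fall back to the environment used earlier in this walkthrough.
    environment = env.get("PINECONE_ENVIRONMENT", "us-west1-gcp")
    return {"api_key": api_key, "environment": environment}


# Example with an explicit mapping, so the sketch runs without real credentials.
settings = load_pinecone_settings({"PINECONE_API_KEY": "example-key"})
```

In main.py the returned dictionary could be unpacked into the existing call, for example pinecone.init(**load_pinecone_settings()), so the placeholder key never appears in source control.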
The combination of FastAPI’s async features, Pinecone’s managed vector store, and lightweight embedding models delivers a semantic search service that remains fast, reliable, and easy to extend.
The next step is to refine the ranking logic, experiment with larger embedding models, or integrate real-time data pipelines for continuous ingestion.