Build a FastAPI semantic search service with Pinecone in 30 minutes

Semantic search changes how users find information by moving beyond keyword matching to capturing intent through vector embeddings. To make this practical, you need a fast, scalable vector store, and Pinecone fits the bill. The following walkthrough turns a FastAPI service into a working semantic search engine, from embedding generation to live similarity queries.

Why vector search beats keyword matching

Traditional search relies on exact matches, synonyms, or fuzzy text rules, which often miss the user’s true intent. Semantic search encodes documents and queries into dense vectors so that “how does machine learning work?” and “explain neural networks” land closer together than unrelated phrases. Pinecone stores these embeddings and supports approximate nearest-neighbor lookups in milliseconds, making real-time semantic search viable at scale.
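To make the idea of vectors "landing closer together" concrete, here is a minimal sketch of exact nearest-neighbor ranking by cosine similarity. The three-dimensional vectors and document IDs are made up for illustration; a real embedding model produces hundreds of dimensions, and Pinecone uses an approximate index rather than this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Rank every corpus entry against the query and keep the k best (brute force)."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings", chosen by hand for this example.
corpus = {
    "ml-intro": [0.9, 0.1, 0.0],
    "neural-nets": [0.8, 0.3, 0.1],
    "pasta-recipe": [0.0, 0.2, 0.9],
}
query = [0.85, 0.2, 0.05]

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The two machine-learning documents rank well ahead of the unrelated one, even though no keywords are compared at any point.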

Modern embedding models like all-MiniLM-L6-v2 compress sentences into 384-dimensional vectors while keeping quality high. On a typical CPU, encoding a 1 KB passage takes roughly 5 ms, though the exact figure varies with hardware and batch size. That is fast enough for most web services, and it adds little on top of the Pinecone query itself, which usually returns in under 100 ms.
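A quick back-of-the-envelope calculation shows what these numbers imply for capacity planning. This sketch assumes vectors are stored as 32-bit floats (an assumption, not something Pinecone's billing or storage is guaranteed to match) and takes the rough 5 ms encoding figure at face value.

```python
DIMENSIONS = 384        # output size of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4   # assumption: each component stored as a 32-bit float
ENCODE_MS = 5           # rough per-passage CPU estimate quoted above

def vector_bytes(dimensions=DIMENSIONS, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of one embedding, ignoring IDs and metadata."""
    return dimensions * bytes_per_value

def corpus_megabytes(num_vectors):
    """Raw embedding storage for a corpus, in megabytes (10^6 bytes)."""
    return num_vectors * vector_bytes() / 1_000_000

def passages_per_second(encode_ms=ENCODE_MS):
    """Single-threaded encoding throughput implied by a per-passage latency."""
    return 1000 / encode_ms

print(vector_bytes())               # 1536
print(corpus_megabytes(1_000_000))  # 1536.0
print(passages_per_second())        # 200.0
```

So one million passages need about 1.5 GB of raw vector storage, and a single core would take a little under an hour and a half to embed them at that rate.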

Setting up a reproducible Python environment

A clean workspace ensures the tutorial runs the same on every machine. Start by creating an isolated Python 3.11 virtual environment:

python3 -m venv venv
source venv/bin/activate
python -V

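With the environment active, install the four core packages. The quotes stop shells such as zsh from interpreting the square brackets, and the pinecone-client pin matches the pinecone.init API used later in this walkthrough, which was removed in version 3 of the client:

pip install "fastapi[all]" uvicorn "pinecone-client<3" sentence-transformers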

After installation, verify the versions:

pip list | grep -E 'fastapi|uvicorn|pinecone|sentence-transformers'

Each package comes straight from PyPI, so you’re pulling the official releases published by their maintainers. Record the versions you tested in a requirements.txt file to avoid surprises during deployment.
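If you prefer to check versions from Python rather than with grep, the standard library's importlib.metadata can do it. This is a small sketch: the helper names are hypothetical, and packages that are not installed are reported as None instead of raising.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["fastapi", "uvicorn", "pinecone-client", "sentence-transformers"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

def requirements_lines(versions):
    """Format installed packages as pinned requirements.txt lines."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]

versions = installed_versions(PACKAGES)
for line in requirements_lines(versions):
    print(line)
```

Redirecting that output to requirements.txt gives you a pinned list that other machines can install from.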

Designing the FastAPI service with Pydantic models

The service exposes three endpoints: a health probe, a document ingestion route, and a search route. Pydantic models handle JSON validation and OpenAPI schema generation automatically.

Define two models in a new file, say models.py:

from pydantic import BaseModel

class Document(BaseModel):
    id: str
    text: str

class Query(BaseModel):
    query: str
    top_k: int = 5

FastAPI validates each incoming payload against these models before your handler runs and rejects malformed requests with a 422 response, which removes hand-written validation code and a whole class of bugs.

Building the core API and wiring Pinecone

Create main.py and import the dependencies:

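from fastapi import FastAPI, HTTPException
from sentence_transformers import SentenceTransformer
import pinecone

from models import Document, Query  # request schemas defined in models.py

app = FastAPI(title="Semantic Search Service")
# Load the embedding model once at startup; it outputs 384-dimensional vectors.
model = SentenceTransformer('all-MiniLM-L6-v2')
# pinecone.init is the pinecone-client 2.x API; substitute your own key and environment.
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("semantic-demo")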

Add the endpoints:

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/ingest")
def ingest(doc: Document):
    vector = model.encode(doc.text).tolist()
    upsert_response = index.upsert(vectors=[(doc.id, vector, {"text": doc.text})])
    if upsert_response['upserted_count'] != 1:
        raise HTTPException(status_code=500, detail="Failed to upsert")
    return {"result": "ingested"}

@app.post("/search")
def search(q: Query):
    query_vec = model.encode(q.query).tolist()
    result = index.query(vector=query_vec, top_k=q.top_k, include_metadata=True)
    # Convert the client's response object to plain dicts so FastAPI can serialize it.
    return {"matches": result.to_dict()["matches"]}

Spin up the server with Uvicorn:

uvicorn main:app --reload

The /health endpoint confirms the service is alive, /ingest stores embeddings with metadata, and /search returns the top k matches along with their similarity scores.
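The /ingest route handles one document per request. For bulk loads it is usually more efficient to group vectors and upsert each group in a single call. The helper below is a sketch of that grouping step only: the batch size of 100 and the placeholder records are assumptions for illustration, and the actual upsert call is left as a comment because it needs a live index.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Placeholder (id, vector, metadata) tuples in the same shape /ingest sends to Pinecone.
records = [(f"doc{i}", [0.0] * 384, {"text": f"passage {i}"}) for i in range(250)]

batches = list(batched(records, 100))
print([len(batch) for batch in batches])  # [100, 100, 50]

# In main.py, each batch could then be sent with: index.upsert(vectors=batch)
```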

Managing your Pinecone index for performance

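A Pinecone index is a managed collection of vectors optimized for fast similarity lookups. The service expects one named semantic-demo with 384 dimensions and the cosine metric. Create it once, from a Python session after calling pinecone.init with your credentials, and confirm that it appears in the list:

pinecone.create_index("semantic-demo", dimension=384, metric="cosine")
pinecone.list_indexes()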

Cosine similarity is equivalent to the inner product of length-normalized vectors, so it compares direction rather than magnitude, which suits sentence embeddings well. Upserting by document ID overwrites any existing vector with the same ID, so re-ingesting a document updates it instead of creating a duplicate.
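The relationship between cosine similarity and the inner product of normalized vectors can be checked numerically. This is a small self-contained sketch with hand-picked two-dimensional vectors, not a description of how Pinecone computes scores internally.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

direct = cosine(a, b)
via_normalized = dot(normalize(a), normalize(b))
scaled = cosine(a, [30.0, 40.0])  # same direction as a, ten times longer

print(round(direct, 6), round(via_normalized, 6), round(scaled, 6))
```

Both routes give the same score, and scaling a vector leaves its similarity unchanged, which is why only the direction of an embedding matters under this metric.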

To test the pipeline, send a sample document. The URL assumes Uvicorn’s default address, http://localhost:8000:

curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"id":"doc1","text":"Machine learning enables computers to learn from data"}'

Then run a semantic query:

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query":"What is deep learning?","top_k":3}'

The response includes the most relevant document IDs, their cosine similarity scores, and the original text stored as metadata.
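The same two requests can be issued from Python using only the standard library. This is a sketch that assumes the server is listening on Uvicorn's default address, http://localhost:8000; the helper names are hypothetical. Building the requests works offline, while send requires the service to be running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: Uvicorn's default host and port

def build_post(path, payload):
    """Create a JSON POST request for one of the service's routes."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

ingest_request = build_post(
    "/ingest",
    {"id": "doc1", "text": "Machine learning enables computers to learn from data"},
)
search_request = build_post("/search", {"query": "What is deep learning?", "top_k": 3})

# With the server running:
#   print(send(ingest_request))
#   print(send(search_request))
```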

Scaling and securing the service

For production workloads:

Use a managed Pinecone index with auto-scaling to handle traffic spikes.
Inject the Pinecone API key via an environment variable instead of hard-coding it.
Place FastAPI behind a reverse proxy like Nginx to enable HTTPS and rate limiting.
Cache frequent queries with Redis if your use case supports it.
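As one way to keep the API key out of source code, the settings can be read from the environment at startup. This is a minimal sketch: the variable names PINECONE_API_KEY, PINECONE_ENVIRONMENT, and PINECONE_INDEX are hypothetical choices for this example, and the defaults simply mirror the values used earlier in main.py.

```python
import os

def load_pinecone_settings(env=os.environ):
    """Read Pinecone settings from a mapping, failing fast if the key is missing."""
    api_key = env.get("PINECONE_API_KEY")
    if not api_key:
        raise RuntimeError("PINECONE_API_KEY is not set")
    return {
        "api_key": api_key,
        # Defaults mirror the values hard-coded in main.py earlier.
        "environment": env.get("PINECONE_ENVIRONMENT", "us-west1-gcp"),
        "index_name": env.get("PINECONE_INDEX", "semantic-demo"),
    }

# Demonstration with an explicit mapping instead of the real environment.
demo = load_pinecone_settings({"PINECONE_API_KEY": "example-key"})
print(demo["environment"], demo["index_name"])
```

In main.py, the returned api_key and environment values would replace the literals passed to pinecone.init. Failing at startup when the key is missing is usually preferable to discovering the problem on the first request.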

The combination of FastAPI’s async-capable request handling, Pinecone’s managed vector store, and lightweight embedding models delivers a semantic search service that remains fast, reliable, and easy to extend.

From here, you can refine the ranking logic, experiment with larger embedding models, or integrate real-time data pipelines for continuous ingestion.
