RAG (Retrieval-Augmented Generation) — embed documents with models like OpenAI text-embedding-3-large or Cohere, upsert into Pinecone, and retrieve top-K chunks at query time to inject into LLM context windows.
Multi-tenant agent memory — use one index with up to 1.7M namespaces to provide isolated, per-agent vector stores without provisioning separate infrastructure.
Billion-scale ANN search — run approximate nearest-neighbor queries across 1B+ dense vectors at 31ms p50 with automatic index rebalancing and no manual HNSW/IVF tuning.
Metadata-filtered vector search — apply structured filters (e.g., category == "legal" AND date > 2024-01-01) evaluated inside the query engine to avoid post-filtering latency overhead.
Semantic cache layer — store past LLM prompt-response pairs as vectors; incoming queries that exceed a similarity threshold are served from cache, cutting token consumption by 70-95%.
Hybrid search — combine dense and sparse vector indexes (BM25 + embeddings) for lexical + semantic retrieval in a single query.

Pinecone