Production vector database setup: choose by scale, latency requirements, and infrastructure preferences. The main options are Pinecone (managed, serverless, scales to billions of vectors), Weaviate (open-source, supports hybrid search natively), Qdrant (Rust-based, strong p99 latency for real-time applications), and pgvector (a PostgreSQL extension, ideal when your data already lives in Postgres and you want to avoid an additional infrastructure component).
Embedding model selection: OpenAI text-embedding-3-large (3072 dimensions, best accuracy, $0.13/million tokens), text-embedding-3-small (1536 dimensions, roughly 95% of the accuracy at about 6.5x lower cost, $0.02/million tokens at list price), Cohere embed-multilingual-v3.0 for multilingual corpora, or open-source alternatives (BGE-M3, E5-large) for environments with data residency requirements where documents cannot leave your infrastructure.
Chunking strategy is content-type specific: documentation benefits from semantic chunking at section boundaries; contracts benefit from clause-level chunking with parent document metadata; conversational transcripts benefit from overlapping sliding windows (512 tokens with a 50-token overlap).
Hybrid search combines dense vector retrieval with BM25 keyword search. Dense retrieval captures semantic similarity, BM25 captures exact-match terms, and Reciprocal Rank Fusion (RRF) merges the two rankings into one.
The reranking pipeline uses Cohere Rerank or a cross-encoder model (ms-marco-MiniLM-L-6-v2) to score the top 50 retrieved candidates against the query for actual relevance; only the top 5 pass to the LLM context.
Retrieval quality is evaluated before deployment on a labelled test set of question-document pairs, measuring Mean Reciprocal Rank (MRR) and Hit@k.
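The sliding-window chunking, RRF merge, and MRR/Hit@k metrics described above are simple enough to sketch in plain Python. This is a minimal illustration, not a production implementation: the function names are hypothetical, the RRF constant k=60 is a commonly used default rather than something specified here, tokens are assumed to be pre-split, and each test question is assumed to have exactly one relevant document.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def sliding_windows(tokens: Sequence[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


def reciprocal_rank_fusion(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> list[Hashable]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def mean_reciprocal_rank(results: Sequence[Sequence[Hashable]], relevant: Sequence[Hashable]) -> float:
    """Average of 1/rank of the relevant document per question (0 when it is not retrieved)."""
    total = 0.0
    for ranked, target in zip(results, relevant):
        if target in ranked:
            total += 1.0 / (list(ranked).index(target) + 1)
    return total / len(results) if results else 0.0


def hit_at_k(results: Sequence[Sequence[Hashable]], relevant: Sequence[Hashable], k: int = 5) -> float:
    """Fraction of questions whose relevant document appears in the top k results."""
    hits = sum(1 for ranked, target in zip(results, relevant) if target in ranked[:k])
    return hits / len(results) if results else 0.0


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # hypothetical dense-retrieval ranking
    bm25 = ["b", "d", "a"]    # hypothetical BM25 ranking
    print(reciprocal_rank_fusion([dense, bm25]))
    print(len(sliding_windows(["tok"] * 1000)))
```

In a real pipeline the fused list would be truncated to the top 50 before reranking, and the same two metrics can be computed before and after the reranker to check that it actually improves the ordering.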