Tags: search, algorithms, vector-search, technical

Hybrid Search: Why Combining Vector and Keyword Search Beats Either Alone

Learn how Praevia's hybrid search approach combines semantic understanding with precise keyword matching for superior context retrieval.

Praevia Engineering Team
November 15, 2024


When building Praevia's context selection engine, we faced a critical question: Should we use vector search for semantic understanding, or keyword search for precision? The answer: both.

The Search Problem

Effective context retrieval requires answering two questions simultaneously:

  1. What does this query mean? (Semantic understanding)
  2. What exact terms matter? (Lexical precision)

Neither pure vector search nor pure keyword search excels at both.
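To make the two requirements concrete, here is a toy sketch (illustrative vectors and scoring, not Praevia's implementation): the semantic signal compares embedding directions, while the lexical signal counts exact term overlap. Each captures something the other misses.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Semantic signal: angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def term_overlap(query: str, doc: str) -> int:
    """Lexical signal: number of exact query terms the doc contains."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Two texts can be near-identical semantically (similar embedding
# directions) while sharing zero exact terms, and vice versa.
print(cosine_similarity([1.0, 0.9], [0.9, 1.0]))        # high (close to 1)
print(term_overlap("reduce LLM expenses", "lower your AI costs"))  # 0 shared terms
```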

Vector Search: Strengths and Blind Spots

Vector embeddings capture semantic meaning beautifully, but they have blind spots.

Example: Technical Queries

Query: "How do I configure the API timeout parameter?"

A pure vector search might return:

  • "Setting up API timeouts" (Good)
  • "Configuring network settings" (Too broad)
  • "API setup guide" (Missing specific parameter)

The model understands "configuration" and "API" semantically, but may miss the exact term "timeout parameter."

Vector Search in Practice

# Vector search example
query = "What is the default timeout?"
query_embedding = model.encode(query)

# Cosine similarity search using pgvector (<=> is the cosine distance operator)
results = db.query("""
    SELECT content,
           1 - (embedding <=> $1::vector) AS similarity
    FROM chunks
    ORDER BY embedding <=> $1::vector
    LIMIT 10
""", query_embedding)

# Results might include semantically similar but irrelevant docs
# about "default settings" in general

Keyword Search: Precise but Literal

Keyword search (BM25, TF-IDF) excels at exact matches but struggles with synonyms and paraphrasing.

Example: Natural Language Queries

Query: "How can I reduce my LLM expenses?"

Keyword search might miss documents containing:

  • "Lower your AI costs" (synonym: reduce → lower, expenses → costs)
  • "Optimize token usage" (related concept, different words)
  • "Save money on language models" (paraphrase)

Scoring Limitations

from typing import List

# Simple keyword scoring
def keyword_score(text: str, query_terms: List[str]) -> float:
    score = 0.0
    text_lower = text.lower()
    
    for term in query_terms:
        # Count occurrences
        count = text_lower.count(term.lower())
        score += count
    
    return score

# Problem: "reduce" won't match "reduction" or "reducing"
# Problem: Doesn't understand "LLM" == "language model"

The Hybrid Solution

Praevia combines both approaches with weighted scoring.

Architecture

async def hybrid_search(
    query: str,
    vector_weight: float = 0.6,
    keyword_weight: float = 0.4,
    top_k: int = 50
) -> List[SearchResult]:
    """
    Hybrid search combining vector and keyword approaches.
    
    Args:
        query: User query string
        vector_weight: Weight for vector similarity (0-1)
        keyword_weight: Weight for keyword matching (0-1)
        top_k: Number of results to return
    
    Returns:
        List of ranked search results
    """
    # Step 1: Extract keywords and generate embedding
    keywords = extract_keywords(query)
    query_embedding = await generate_embedding(query)
    
    # Step 2: Parallel search
    vector_results = await vector_search(query_embedding, limit=100)
    all_chunks = await get_candidate_chunks()
    
    # Step 3: Score all chunks with both methods
    final_scores = {}
    
    for chunk in all_chunks:
        # Vector score (0-1)
        vector_score = get_vector_similarity(chunk, vector_results)
        
        # Keyword score (normalized 0-1)
        keyword_score = calculate_keyword_score(
            chunk.content, 
            keywords
        )
        
        # Weighted combination
        combined_score = (
            vector_weight * vector_score +
            keyword_weight * keyword_score
        )
        
        final_scores[chunk.id] = combined_score
    
    # Step 4: Rank and return
    return rank_by_score(final_scores, top_k)
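The sketch above assumes `calculate_keyword_score` returns values already on a 0-1 scale, but raw BM25 scores are unbounded. One hedged way to bridge that gap (the helper name is ours, not Praevia's API) is min-max normalization over the candidate set before the weighted combination:

```python
from typing import Dict

def normalize_scores(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize raw keyword scores into [0, 1] so they can
    be combined with vector similarities on the same scale."""
    if not raw_scores:
        return {}
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    if hi == lo:
        # All candidates tied: assign a neutral mid-range score
        return {key: 0.5 for key in raw_scores}
    return {key: (value - lo) / (hi - lo) for key, value in raw_scores.items()}
```

Normalizing per query keeps the two weights meaningful: a 0.6/0.4 split only behaves as intended when both signals live on the same scale.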

Smart Keyword Extraction

Not all query terms are equally important:

from typing import List

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_keywords(query: str, max_keywords: int = 10) -> List[str]:
    """
    Extract important keywords from query.
    Prioritizes: named entities, technical terms, rare words.
    """
    doc = nlp(query)
    keywords = []
    
    # Named entities (highest priority)
    for ent in doc.ents:
        keywords.append(ent.text.lower())
    
    # Technical terms (capitalized, acronyms)
    for token in doc:
        if token.text.isupper() and len(token.text) > 1:
            keywords.append(token.text.lower())
    
    # Nouns and verbs (exclude stop words)
    for token in doc:
        if token.pos_ in ['NOUN', 'VERB'] and not token.is_stop:
            keywords.append(token.lemma_.lower())
    
    # Remove duplicates, return top N
    return list(dict.fromkeys(keywords))[:max_keywords]

BM25 Scoring Implementation

We use a simplified BM25 variant for keyword scoring:

import math
from collections import Counter
from typing import Dict, List

class BM25Scorer:
    """
    BM25 scoring for keyword relevance.
    """
    
    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1  # Term frequency saturation
        self.b = b    # Length normalization
        self.avg_doc_length = 500  # Calibrated for our corpus
    
    def score(
        self, 
        document: str, 
        query_terms: List[str],
        idf_scores: Dict[str, float]
    ) -> float:
        """
        Calculate BM25 score for document given query terms.
        """
        doc_length = len(document.split())
        term_freqs = Counter(document.lower().split())
        
        score = 0.0
        for term in query_terms:
            if term not in term_freqs:
                continue
            
            # Term frequency component
            tf = term_freqs[term]
            
            # Length normalization
            norm = 1 - self.b + self.b * (doc_length / self.avg_doc_length)
            
            # BM25 formula
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * norm
            
            # Include IDF weight
            idf = idf_scores.get(term, 0)
            score += idf * (numerator / denominator)
        
        return score
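To sanity-check the formula, here is a worked example with illustrative numbers: a term appearing twice (tf = 2) with an IDF weight of 2.0, in a document exactly at the average length, so the length-normalization term reduces to 1.

```python
k1, b = 1.5, 0.75
tf = 2       # term appears twice in the document
idf = 2.0    # illustrative precomputed IDF weight
norm = 1.0   # doc_length == avg_doc_length, so 1 - b + b * 1

# BM25 contribution for this single term
score = idf * (tf * (k1 + 1)) / (tf + k1 * norm)
print(round(score, 3))  # 2.0 * 5.0 / 3.5, roughly 2.857
```

Note the saturation: doubling tf to 4 would raise the score to about 3.64, not 5.71, which is exactly the diminishing-returns behavior k1 controls.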

Tuning the Weights

The ratio between vector and keyword weights depends on your use case:

| Use Case | Vector Weight | Keyword Weight | Reasoning |
|----------|---------------|----------------|-----------|
| General Q&A | 0.7 | 0.3 | Semantic understanding is key |
| Technical docs | 0.5 | 0.5 | Exact terms matter |
| Code search | 0.4 | 0.6 | Function and variable names |
| Legal docs | 0.3 | 0.7 | Precise language is critical |

Dynamic Weight Adjustment

import re
from typing import Tuple

def determine_weights(query: str) -> Tuple[float, float]:
    """
    Automatically adjust weights based on query characteristics.
    """
    # Detect query type
    has_code = bool(re.search(r'[(){}\[\];]', query))
    has_quotes = '"' in query or "'" in query
    has_technical_terms = any(term.isupper() for term in query.split())
    
    # Start with default
    vector_weight = 0.6
    keyword_weight = 0.4
    
    # Adjust for code queries
    if has_code:
        vector_weight -= 0.2
        keyword_weight += 0.2
    
    # Adjust for exact phrase queries
    if has_quotes:
        vector_weight -= 0.15
        keyword_weight += 0.15
    
    # Adjust for technical queries
    if has_technical_terms:
        vector_weight -= 0.1
        keyword_weight += 0.1
    
    # Ensure weights sum to 1
    total = vector_weight + keyword_weight
    return vector_weight / total, keyword_weight / total

Real-World Performance

We benchmarked hybrid search against pure approaches on 10,000 queries:

| Metric | Vector Only | Keyword Only | Hybrid |
|--------|-------------|--------------|--------|
| Precision@10 | 0.72 | 0.68 | 0.84 |
| Recall@50 | 0.65 | 0.61 | 0.78 |
| NDCG@20 | 0.69 | 0.64 | 0.82 |
| Avg Latency | 25 ms | 8 ms | 18 ms |

Hybrid search outperforms both approaches while maintaining acceptable latency.

Implementation Tips

1. Cache Embeddings

from functools import lru_cache
from typing import List

@lru_cache(maxsize=10000)
def get_embedding(text: str) -> List[float]:
    """
    Cache embeddings to avoid redundant API calls.
    lru_cache keys on the text string itself.
    """
    return embedding_model.encode(text)
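An in-process `lru_cache` resets on every restart. If you want a hash-keyed variant, useful when backing the cache with a persistent store such as Redis or a database table, a minimal sketch looks like this; `get_embedding_cached` and `compute_embedding` are our illustrative names, not Praevia's API:

```python
import hashlib
from typing import Callable, Dict, List

# Stand-in for a persistent store (e.g. Redis, a DB table)
_cache: Dict[str, List[float]] = {}

def cache_key(text: str) -> str:
    """Stable key: SHA-256 of the normalized text."""
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def get_embedding_cached(
    text: str,
    compute_embedding: Callable[[str], List[float]],
) -> List[float]:
    """Compute the embedding only on a cache miss."""
    key = cache_key(text)
    if key not in _cache:
        _cache[key] = compute_embedding(text)
    return _cache[key]
```

Normalizing before hashing means trivially different inputs ("Hello" vs " hello ") share one cache entry, which matters when the same query arrives with cosmetic variations.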

2. Precompute IDF Scores

import math
from collections import defaultdict
from typing import Dict, List

def precompute_idf(all_documents: List[str]) -> Dict[str, float]:
    """
    Compute IDF scores once at startup.
    Store in memory for fast keyword scoring.
    """
    n_docs = len(all_documents)
    term_doc_freq = defaultdict(int)
    
    for doc in all_documents:
        unique_terms = set(doc.lower().split())
        for term in unique_terms:
            term_doc_freq[term] += 1
    
    idf_scores = {}
    for term, doc_freq in term_doc_freq.items():
        idf_scores[term] = math.log((n_docs + 1) / (doc_freq + 1))
    
    return idf_scores
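With the smoothed formula above, a term's weight shrinks as it appears in more documents. A quick worked example on a toy corpus of three documents:

```python
import math

n_docs = 3
# "timeout" appears in 1 of 3 docs; "the" appears in all 3
idf_rare = math.log((n_docs + 1) / (1 + 1))    # log(2), about 0.693
idf_common = math.log((n_docs + 1) / (3 + 1))  # log(1) = 0.0

print(round(idf_rare, 3), idf_common)
```

A term present in every document contributes nothing to the score, which is why BM25 naturally ignores stop words without an explicit stop list.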

3. Use Approximate Vector Indexes

For large datasets, exact cosine similarity search is slow. Use an approximate index:

-- Create an IVFFlat index for faster approximate search
CREATE INDEX ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Query with approximate search (<=> is pgvector's cosine distance)
SET ivfflat.probes = 10;
SELECT * FROM chunks
ORDER BY embedding <=> query_embedding
LIMIT 50;

Conclusion

Hybrid search isn't just a nice-to-have—it's essential for production-grade retrieval systems. By combining the semantic understanding of vector search with the precision of keyword matching, Praevia delivers superior context selection that translates directly to better LLM responses and lower token costs.

The key is finding the right balance for your use case and implementing smart optimizations to keep latency low.


Interested in implementing hybrid search in your application? Get started with Praevia or read our API documentation.
