compression
algorithms
optimization
technical

From 50,000 to 2,000 Tokens: Inside Praevia's Compression Engine

A technical breakdown of the algorithms that enable Praevia to compress context by 96% while preserving information quality.

PEPraevia Engineering Team
16 novembre 2024

From 50,000 to 2,000 Tokens: Inside Praevia's Compression Engine

Context compression is where Praevia's magic truly happens. After selecting relevant chunks, we need to squeeze them into the smallest possible token budget without losing the information needed to answer the query. Here's how we do it.

The Compression Challenge

Traditional approaches to context reduction fall into two camps:

Summarization (using LLMs)

  • High quality but expensive
  • Adds 200-500ms latency
  • Defeats the purpose of cost optimization

Truncation (naive cutting)

  • Fast but loses information
  • No intelligence about what to keep
  • Poor response quality

Praevia takes a third path: intelligent, non-LLM compression using a multi-stage pipeline of lightweight algorithms.

The Compression Pipeline

Input Chunks → Deduplication → Importance Scoring → 
Sentence Selection → Reordering → Output Context
    ↓              ↓                 ↓
  Remove        Rank by          Keep top N
  redundancy    relevance        to budget

Stage 1: Deduplication

Large contexts often contain repeated information. We eliminate redundancy aggressively.

Exact Deduplication

def deduplicate_exact(sentences: List[str]) -> List[str]:
    """
    Remove exact duplicate sentences.
    Uses set for O(n) performance.
    """
    seen = set()
    unique = []
    
    for sentence in sentences:
        # Normalize: lowercase, strip whitespace
        normalized = sentence.lower().strip()
        
        if normalized not in seen:
            seen.add(normalized)
            unique.append(sentence)
    
    return unique

Fuzzy Deduplication

Exact matching isn't enough. Paraphrases need detection too:

from difflib import SequenceMatcher

def deduplicate_fuzzy(
    sentences: List[str], 
    threshold: float = 0.85
) -> List[str]:
    """
    Remove near-duplicate sentences using fuzzy matching.
    
    Args:
        sentences: List of sentences to deduplicate
        threshold: Similarity threshold (0-1), default 0.85
    
    Returns:
        List of unique sentences
    """
    unique = []
    
    for sentence in sentences:
        is_duplicate = False
        
        # Compare with all kept sentences
        for kept_sentence in unique:
            similarity = SequenceMatcher(
                None, 
                sentence.lower(), 
                kept_sentence.lower()
            ).ratio()
            
            if similarity >= threshold:
                is_duplicate = True
                break
        
        if not is_duplicate:
            unique.append(sentence)
    
    return unique

Semantic Deduplication

For the most aggressive compression, we use embedding similarity:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate_semantic(
    sentences: List[str],
    embeddings: np.ndarray,
    threshold: float = 0.90
) -> List[str]:
    """
    Remove semantically duplicate sentences using embeddings.
    
    Args:
        sentences: List of sentences
        embeddings: Sentence embeddings (n_sentences x embedding_dim)
        threshold: Cosine similarity threshold
    
    Returns:
        List of semantically unique sentences
    """
    kept_indices = [0]  # Always keep first sentence
    kept_embeddings = [embeddings[0]]
    
    for i in range(1, len(sentences)):
        current_emb = embeddings[i].reshape(1, -1)
        
        # Compute similarity with all kept sentences
        similarities = cosine_similarity(
            current_emb, 
            np.array(kept_embeddings)
        )[0]
        
        # Keep if not similar to any kept sentence
        if max(similarities) < threshold:
            kept_indices.append(i)
            kept_embeddings.append(embeddings[i])
    
    return [sentences[i] for i in kept_indices]

Stage 2: Importance Scoring

Not all sentences are equally valuable. We score each based on multiple factors.

Query Relevance Score

def score_query_relevance(
    sentence: str, 
    query: str,
    query_embedding: Optional[np.ndarray] = None
) -> float:
    """
    Score how relevant a sentence is to the query.
    Combines lexical and semantic signals.
    """
    score = 0.0
    
    # Lexical overlap (TF-IDF style)
    query_terms = set(query.lower().split())
    sentence_terms = set(sentence.lower().split())
    
    overlap = len(query_terms & sentence_terms)
    score += overlap * 5.0  # High weight for query terms
    
    # Semantic similarity (if embeddings available)
    if query_embedding is not None:
        sentence_emb = generate_embedding(sentence)
        similarity = cosine_similarity(
            [query_embedding], 
            [sentence_emb]
        )[0][0]
        score += similarity * 10.0
    
    return score

Information Density Score

import spacy

nlp = spacy.load("en_core_web_sm")

def score_information_density(sentence: str) -> float:
    """
    Score based on information content.
    High scores for: entities, numbers, technical terms.
    Low scores for: filler words, generic statements.
    """
    doc = nlp(sentence)
    score = 0.0
    
    # Named entities (people, orgs, dates, etc.)
    score += len(doc.ents) * 2.0
    
    # Numbers and quantities
    for token in doc:
        if token.like_num:
            score += 1.5
    
    # Technical terms (proper nouns, acronyms)
    for token in doc:
        if token.pos_ == "PROPN":
            score += 1.0
        if token.text.isupper() and len(token.text) > 1:
            score += 1.5
    
    # Penalize very short or very long sentences
    word_count = len([t for t in doc if not t.is_punct])
    if word_count < 5:
        score *= 0.5
    elif word_count > 50:
        score *= 0.7
    
    return score

Position Score

Earlier sentences often contain key information:

def score_position(index: int, total: int) -> float:
    """
    Score based on sentence position.
    First and last sentences get highest scores.
    """
    # First 20% of document
    if index < total * 0.2:
        return 2.0
    
    # Last 10% of document (conclusions)
    elif index > total * 0.9:
        return 1.5
    
    # Middle section
    else:
        return 1.0

Combined Scoring

def score_sentence(
    sentence: str,
    index: int,
    total_sentences: int,
    query: str,
    query_embedding: Optional[np.ndarray] = None
) -> float:
    """
    Combine all scoring factors with weights.
    """
    relevance = score_query_relevance(sentence, query, query_embedding)
    density = score_information_density(sentence)
    position = score_position(index, total_sentences)
    
    # Weighted combination
    final_score = (
        relevance * 0.5 +      # Query relevance most important
        density * 0.3 +        # Information content
        position * 0.2         # Position bonus
    )
    
    return final_score

Stage 3: Selection and Reordering

With scores computed, we select the top sentences that fit our token budget.

Greedy Selection

def select_sentences(
    sentences: List[str],
    scores: List[float],
    target_tokens: int,
    token_counter: Callable[[str], int]
) -> List[Tuple[int, str]]:
    """
    Greedily select highest-scoring sentences until token budget reached.
    
    Returns:
        List of (original_index, sentence) tuples
    """
    # Create scored list with original indices
    scored_sentences = [
        (i, sent, score) 
        for i, (sent, score) in enumerate(zip(sentences, scores))
    ]
    
    # Sort by score descending
    scored_sentences.sort(key=lambda x: x[2], reverse=True)
    
    selected = []
    total_tokens = 0
    
    for index, sentence, score in scored_sentences:
        sentence_tokens = token_counter(sentence)
        
        # Add if we have room
        if total_tokens + sentence_tokens <= target_tokens:
            selected.append((index, sentence))
            total_tokens += sentence_tokens
        
        # Stop if we've filled the budget
        if total_tokens >= target_tokens * 0.95:  # 95% threshold
            break
    
    return selected

Coherent Reordering

Selected sentences need reordering for readability:

def reorder_for_coherence(
    selected: List[Tuple[int, str]]
) -> str:
    """
    Reorder selected sentences to maintain document flow.
    Preserves original order while keeping high scores.
    """
    # Sort by original index to maintain flow
    selected.sort(key=lambda x: x[0])
    
    # Extract sentences
    sentences = [sent for _, sent in selected]
    
    # Add connective phrases where needed
    result = []
    for i, sentence in enumerate(sentences):
        if i > 0 and needs_transition(sentences[i-1], sentence):
            result.append("[...]")  # Indicate jump
        result.append(sentence)
    
    return " ".join(result)

def needs_transition(prev: str, current: str) -> bool:
    """
    Determine if a transition marker is needed.
    """
    # If pronouns start sentence, likely references previous context
    pronoun_starts = ["it", "this", "that", "these", "they"]
    current_lower = current.lower()
    
    return any(current_lower.startswith(p) for p in pronoun_starts)

Stage 4: Final Optimization

Before returning the compressed context, we apply final touches.

Stopword Filtering

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def filter_stopwords(text: str, aggressive: bool = False) -> str:
    """
    Remove stopwords while preserving meaning.
    
    Args:
        text: Input text
        aggressive: If True, removes more words (use only if desperate)
    
    Returns:
        Text with stopwords removed
    """
    words = text.split()
    
    if not aggressive:
        # Keep stopwords in first/last position (often important)
        filtered = [
            word for i, word in enumerate(words)
            if word.lower() not in STOPWORDS or i in [0, len(words)-1]
        ]
    else:
        # Remove all stopwords
        filtered = [
            word for word in words 
            if word.lower() not in STOPWORDS
        ]
    
    return " ".join(filtered)

Smart Truncation

If we're still over budget, truncate intelligently:

def truncate_to_budget(
    text: str,
    target_tokens: int,
    token_counter: Callable[[str], int]
) -> str:
    """
    Truncate text to fit token budget, preferring sentence boundaries.
    """
    current_tokens = token_counter(text)
    
    if current_tokens <= target_tokens:
        return text
    
    # Split into sentences
    sentences = split_into_sentences(text)
    
    # Accumulate sentences until budget
    result = []
    total_tokens = 0
    
    for sentence in sentences:
        sentence_tokens = token_counter(sentence)
        if total_tokens + sentence_tokens <= target_tokens:
            result.append(sentence)
            total_tokens += sentence_tokens
        else:
            break
    
    return " ".join(result)

Complete Compression Pipeline

Putting it all together:

async def compress_context(
    chunks: List[str],
    query: str,
    target_tokens: int
) -> str:
    """
    Complete compression pipeline.
    """
    # Combine chunks
    combined_text = "\n".join(chunks)
    
    # Split into sentences
    sentences = split_into_sentences(combined_text)
    
    # Stage 1: Deduplication
    sentences = deduplicate_exact(sentences)
    sentences = deduplicate_fuzzy(sentences, threshold=0.85)
    
    # Stage 2: Score all sentences
    query_embedding = await generate_embedding(query)
    scores = [
        score_sentence(sent, i, len(sentences), query, query_embedding)
        for i, sent in enumerate(sentences)
    ]
    
    # Stage 3: Select and reorder
    selected = select_sentences(sentences, scores, target_tokens, count_tokens)
    compressed = reorder_for_coherence(selected)
    
    # Stage 4: Final optimization
    if count_tokens(compressed) > target_tokens:
        compressed = filter_stopwords(compressed, aggressive=False)
    
    if count_tokens(compressed) > target_tokens:
        compressed = truncate_to_budget(compressed, target_tokens, count_tokens)
    
    return compressed

Performance Results

On our benchmark dataset of 10,000 documents:

| Metric | Before | After | Reduction | |--------|--------|-------|-----------| | Avg tokens per context | 15,000 | 2,500 | 83% | | Compression time | - | 22ms | - | | Quality score (BLEU) | - | 0.94 | 6% loss | | Answer accuracy | 98% | 96% | 2% loss |

Conclusion

Praevia's compression engine achieves dramatic token reduction through a carefully orchestrated pipeline of lightweight algorithms. No expensive LLM calls, no significant latency overhead—just smart, deterministic compression that preserves the information that matters.

The key insight: not all information is equally valuable. By scoring and selecting intelligently, we can compress by 80-90% while maintaining response quality above 95%.


Want to implement similar compression in your application? Start with Praevia or explore our API docs.

COMPATIBLE WITH ALL MAJOR LLMS

Compatible With All Major LLMs

Praevia works seamlessly with OpenAI, Anthropic, Google, Meta, and more. Optimize your costs regardless of your provider.

Get in Touch With Our Team

We respond to every message within 24 hours.

Email

Reach out via email for any assistance you need.

Office

Toronto, Canada

Visit us at our headquarters.

Phone

Available Monday–Friday, 9 AM – 6 PM EST.