Four important lessons about context engineering
Thursday, November 27, 2025, 10:00 AM, from InfoWorld
Context engineering has emerged as one of the most critical skills in working with large language models (LLMs). While much attention has been paid to prompt engineering, the art and science of managing context—i.e., the information the model has access to when generating responses—often determines the difference between mediocre and exceptional AI applications.
After years of building with LLMs, we've learned that context isn't just about stuffing as much information as possible into a prompt. It's about strategic information architecture that maximizes model performance within technical constraints.

The technical reality of context windows

Modern LLMs operate with context windows ranging from 8K to 200K+ tokens, with some models claiming even larger windows. However, several technical realities shape how we should think about context:

- Lost-in-the-middle effect: Research has consistently shown that LLMs experience attention degradation in the middle portions of long contexts. Models perform best with information placed at the beginning or end of the context window. This isn't a bug; it's an artifact of how transformer architectures process sequences.
- Effective vs. theoretical capacity: A model with a 128K context window doesn't process all 128K tokens with equal fidelity. Beyond certain thresholds (often around 32K to 64K tokens), accuracy degrades measurably. Think of it like human working memory: we can technically hold many things in mind, but we work best with a focused subset.
- Computational costs: Context length impacts latency and cost quadratically in many architectures. A 100K-token context doesn't cost 10x a 10K-token context; it can cost 100x in compute terms, even if providers don't pass all of those costs on to users.

What we learned about context engineering

Our experience building an AI CRM taught us four important lessons about context engineering:

1. Recency and relevance trump volume
2. Structure matters as much as content
3. Context hierarchy creates better retrieval
4. Stateless is a feature, not a bug

I will unpack each of those below, then share some practical tips, useful patterns, and common antipatterns to avoid.

Lesson 1: Recency and relevance trump volume

The most important insight: more context isn't better context. In production systems, we've seen dramatic improvements by reducing context size and increasing relevance.

Example: When extracting deal details from Gmail, sending every email with a contact performs worse than sending only the emails semantically related to the active opportunity. We've seen models hallucinate close dates by pulling information from a different, unrelated deal mentioned six months ago because they couldn't distinguish signal from noise.

Lesson 2: Structure matters as much as content

LLMs respond better to structured context than unstructured dumps. XML tags, markdown headers, and clear delimiters help models parse and attend to the right information.

Poor context structure:

Here's some info about the user: John Smith, age 35, from New York, likes pizza, works at Acme Corp, signed up in 2020, last login yesterday…

Better context structure:

```xml
<user_profile>
  <name>John Smith</name>
  <age>35</age>
  <location>New York</location>
  <signup_date>2020-03-15</signup_date>
  <last_login>2024-10-16</last_login>
  <preferences>Pizza</preferences>
</user_profile>
```

The structured version helps the model quickly locate relevant information without parsing natural language descriptions.

Lesson 3: Context hierarchy creates better retrieval

Organize context by importance and relevance, not chronologically or alphabetically. Place critical information early and late in the context window.

Optimal ordering:

1. System instructions (beginning)
2. Current user query (beginning)
3. Most relevant retrieved information (beginning)
4. Supporting context (middle)
5. Examples and edge cases (middle-end)
6. Final instructions or constraints (end)

Lesson 4: Stateless is a feature, not a bug

Each LLM call is stateless. This isn't a limitation to overcome, but an architectural choice to embrace. Rather than trying to maintain massive conversation histories, implement smart context management (a minimal sketch follows the list below):

- Store full conversation state in your application
- Send only relevant history per request
- Use semantic chunking to identify what matters
- Implement conversation summarization for long interactions
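To make Lesson 4 concrete, here is a minimal Python sketch of that pattern. The ConversationStore and Message classes, the verbatim_turns setting, and the summarize callable are illustrative assumptions rather than any particular library's API; summarize would typically be another LLM call that condenses older turns.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class ConversationStore:
    """Owns the full conversation; the model only ever sees a trimmed view."""
    messages: list = field(default_factory=list)
    summary: str = ""          # rolling summary of older turns
    summarized_upto: int = 0   # index of the first message not yet summarized
    verbatim_turns: int = 6    # how many recent messages to send verbatim

    def add(self, role, content):
        self.messages.append(Message(role, content))

    def build_context(self, summarize):
        """Build the message list for the next request.

        `summarize` is an illustrative callable (typically another LLM call)
        that folds older messages into a short running summary.
        """
        cutoff = max(len(self.messages) - self.verbatim_turns, 0)
        if cutoff > self.summarized_upto:
            # Fold newly "old" messages into the rolling summary.
            self.summary = summarize(self.summary,
                                     self.messages[self.summarized_upto:cutoff])
            self.summarized_upto = cutoff

        context = []
        if self.summary:
            context.append({"role": "system",
                            "content": "Conversation so far: " + self.summary})
        # Recent turns go through verbatim.
        context.extend({"role": m.role, "content": m.content}
                       for m in self.messages[cutoff:])
        return context
```

The application owns the full history; each request carries only a short rolling summary plus the last few verbatim turns, so request size stays roughly constant as the conversation grows.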
Practical tips for production systems

Tip 1: Implement semantic chunking

Don't send entire documents. Chunk content semantically (by topic, section, or concept) and use embeddings to retrieve only the relevant chunks.

Implementation pattern:

Query → Generate embedding → Similarity search → Retrieve top-k chunks → Rerank if needed → Construct context → LLM call

Typical improvement: 60% to 80% reduction in context size with a 20% to 30% improvement in response quality. A sketch of this pipeline follows.
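As a rough illustration of that pipeline, here is a sketch that assumes chunk embeddings have been precomputed and L2-normalized; embed and retrieve_context are placeholder names for whatever embedding model and retrieval function you use, and the rerank step is indicated only as a comment.

```python
import numpy as np

def retrieve_context(query: str, chunks: list, chunk_embeddings: np.ndarray,
                     embed, top_k: int = 5) -> str:
    """Build a compact context from only the chunks most similar to the query."""
    q = embed(query)                        # placeholder embedding call
    q = q / np.linalg.norm(q)               # normalize so dot product = cosine similarity
    scores = chunk_embeddings @ q           # one similarity score per chunk
    top = np.argsort(scores)[::-1][:top_k]  # indices of the best-matching chunks
    # An optional rerank step (e.g., a cross-encoder) would go here.
    selected = [chunks[i] for i in top]
    return "\n\n".join(f"<chunk>\n{c}\n</chunk>" for c in selected)
```

The same shape works with a vector database in place of the in-memory dot product; the point is that only the top-k chunks, not whole documents, reach the model.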
Tip 2: Use progressive context loading

For complex queries, start with minimal context and progressively add more if needed:

- First attempt: core instructions plus the query
- If uncertain: add relevant documentation
- If still uncertain: add examples and edge cases

This reduces average latency and cost while maintaining quality for complex queries.

Tip 3: Context compression techniques

Three techniques can compress context without losing information:

- Entity extraction: Instead of full documents, extract and send key entities, relationships, and facts.
- Summarization: For historical conversations, summarize older messages rather than sending verbatim text. Use LLMs themselves to create these summaries.
- Schema enforcement: Use structured formats (JSON, XML) to minimize token usage compared to natural language.

Tip 4: Implement context windows

For conversation systems, maintain sliding windows of different sizes:

- Immediate window (last three to five turns): full verbatim messages
- Recent window (last 10 to 20 turns): summarized key points
- Historical window (older): high-level summary of topics discussed

Tip 5: Cache smartly

Many LLM providers now offer prompt caching. Structure your context so stable portions (system instructions, reference documents) appear first and can be cached, while dynamic portions (user queries, retrieved context) come after the cache boundary.

Typical savings: 50% to 90% reduction in input token costs for repeated contexts.

Tip 6: Measure context utilization

Instrument your system to track:

- Average context size per request
- Cache hit rates
- Retrieval relevance scores
- Response quality vs. context size

This data reveals optimization opportunities. We've found that many production systems use 2x to 3x more context than optimal.

Tip 7: Handle context overflow gracefully

When context exceeds limits:

- Prioritize the user query and critical instructions
- Truncate middle sections first
- Implement automatic summarization
- Return clear errors rather than silently truncating

Advanced patterns

Multi-turn context management

For agentic systems that make multiple LLM calls, maintain a context accumulator that grows with each turn, but implement smart summarization after N turns to prevent unbounded growth:

- Turn 1: Full context
- Turn 2: Full context + Turn 1 result
- Turn 3: Full context + Summarized(Turns 1-2) + Turn 3

Hierarchical context retrieval

For retrieval-augmented generation (RAG) systems, implement multi-level retrieval:

1. Retrieve relevant documents
2. Within documents, retrieve relevant sections
3. Within sections, retrieve relevant paragraphs

Each level narrows focus and improves relevance.

Context-aware prompt templates

Create templates that adapt based on available context:

```python
# detailed_template, standard_template, and minimal_template are
# predefined prompt templates for different context budgets.
if context_size < 4000:
    template = detailed_template   # Room for examples
elif context_size < 8000:
    template = standard_template   # Concise instructions
else:
    template = minimal_template    # Just essentials
```

Common antipatterns to avoid

- Antipattern 1: Sending entire conversation histories verbatim. This wastes tokens on greetings, acknowledgments, and off-topic banter.
- Antipattern 2: Dumping database records without filtering. Send only the fields relevant to the query.
- Antipattern 3: Repeating instructions in every message. Use system prompts or cached prefixes instead.
- Antipattern 4: Ignoring the lost-in-the-middle effect. Don't bury critical information in long contexts.
- Antipattern 5: Over-relying on maximum context windows. Just because you can use 128K tokens doesn't mean you should.

Looking forward

Context engineering will remain critical as models evolve. Emerging patterns include:

- Infinite context models: Techniques for handling arbitrarily long contexts through retrieval augmentation
- Context compression models: Specialized models that compress context for primary LLMs
- Learned context selection: Machine learning models that predict optimal context for queries
- Multi-modal context: Integrating images, audio, and structured data seamlessly

Effective context engineering requires understanding both the technical constraints of LLMs and the information architecture of your application. The goal isn't to maximize context. It's to provide the right information, in the right format, at the right position.

Start by measuring your current context utilization, implement semantic retrieval, structure your context clearly, and iterate based on quality metrics. The systems that win aren't those that send the most context. They're those that send the most relevant context. The future of LLM applications is less about bigger context windows and more about smarter context engineering.

—

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
https://www.infoworld.com/article/4085355/four-important-lessons-about-context-engineering.html







