Beyond the Prompt: Why Your RAG System May Be Underperforming
Faced with the question “What is the capital of the Netherlands?” you have a few possible responses: answer from memory if you know it, look it up, or take a guess.
Large Language Models (LLMs) face the same challenge. They excel when a question falls inside their training data, but when it doesn’t, they may “hallucinate,” producing an answer that sounds plausible but is wrong.
The key difference is that LLMs don’t have direct access to your enterprise data or knowledge bases without additional retrieval methods. That’s where Retrieval-Augmented Generation (RAG) comes in.
RAG in a Nutshell
RAG is the process of giving an LLM access to relevant, external information so it can answer queries more accurately. The typical RAG workflow looks like this:
- User query: A user asks a question.
- Retrieval: A separate system searches a knowledge base for relevant documents or data.
- Augmentation: The retrieved content is combined with the query and sent to the LLM, which generates a response.
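The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the retriever is a toy keyword-overlap scorer standing in for vector search over embeddings, and the final prompt is what would be sent to the LLM.

```python
# Minimal sketch of the retrieve-augment-generate loop.
# The keyword-overlap retriever is a stand-in for real vector search.

def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query: str, docs: list[str]) -> str:
    """Combine retrieved documents with the user query into one prompt."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

kb = [
    "The capital of the Netherlands is Amsterdam.",
    "The Hague is the seat of the Dutch government.",
    "Rotterdam has the largest port in Europe.",
]
query = "What is the capital of the Netherlands?"
prompt = augment(query, retrieve(query, kb))
# `prompt` would then be passed to the LLM for generation.
```

Because the retrieved facts travel inside the prompt, the model can answer accurately even if the knowledge was never in its training data.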
The value of RAG is that it allows models of any size to deliver high-quality, context-aware answers, whether it’s the latest company policy, current product details, or niche industry knowledge. But RAG doesn’t operate in isolation. For RAG to deliver consistently, it needs to be part of a well-designed information environment, also known as context engineering.
The Shift from Prompt to Context Engineering
In the early days, “prompt engineering” was the art of crafting the right wording to get the right answer. But as AI systems have grown more complex, the industry has realized that the quality of the context matters more than the cleverness of the prompt.
Context engineering builds the full information environment around the LLM, not just the immediate instruction, but also system settings, past conversation history, retrieved documents, tools, and output formats.
RAG is a critical part of context engineering, ensuring that the model’s “world” includes the exact information needed for the task.
It’s Not Your RAG, It’s Your Context
In real-world deployments, many RAG systems disappoint, and the issue is almost never the model. It’s bad context engineering. Common pitfalls include:
- Irrelevant retrieval: Pulling the wrong documents wastes tokens and distracts the model.
- Excessive retrieval: Overloading the context window with too much data.
- Token limits and truncation: Cutting off content can cause the model to miss critical context.
- Incomplete context: Missing critical information like user profiles or prior steps.
Imagine an AI system reviewing legal contracts that confidently reports a key clause is missing. In reality, the clause exists, but the retrieval process never pulled it into the model’s context. This kind of gap shows why careful retrieval design is essential.
Engineering Retrieval for Success
Preventing these failures starts with designing retrieval around the business use case:
- Score for relevance: Don’t just match keywords; ensure retrieved content truly answers the question.
- Chunk intelligently: Break documents into logical, searchable segments.
- Compress when needed: Summarize or strip redundancy to avoid token waste.
- Preserve essentials: Keep high-priority context like instructions and user state intact.
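Two of these practices can be sketched concretely: paragraph-aware chunking and a context budget that trims lower-ranked retrieved chunks while preserving high-priority content intact. This is an illustrative simplification in which word counts stand in for tokens; a real system would use the model's tokenizer.

```python
# Sketch: chunk on paragraph boundaries, then fit retrieved chunks into a
# fixed budget without ever truncating essential context (instructions,
# user state). Word counts stand in for tokens for simplicity.

def chunk_by_paragraph(text: str, max_words: int = 120) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words."""
    chunks, current = [], []
    for para in text.split("\n\n"):
        if current and len(" ".join(current + [para]).split()) > max_words:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def fit_to_budget(essential: str, retrieved: list[str], budget: int) -> str:
    """Keep essential context intact; add retrieved chunks until full."""
    used = len(essential.split())
    kept = []
    for chunk in retrieved:  # assumed already sorted by relevance score
        cost = len(chunk.split())
        if used + cost > budget:
            break  # drop lower-ranked chunks rather than truncate essentials
        kept.append(chunk)
        used += cost
    return "\n\n".join([essential] + kept)
```

The key design choice is the order of operations: relevance ranking happens before budgeting, so when space runs out it is the least relevant material that gets dropped, never the instructions or user state.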
Done well, RAG produces grounded, fresh, scalable, and personalized AI outputs. But in many real-world environments, not all the information you need is text. From images and videos to audio clips and charts, handling different content formats introduces new retrieval challenges — and that’s where multi-modal context comes in.
Handling Multi-Modal Context
Most embedding models are optimized for a single type of data, and text models usually outperform others. Multi-modal embeddings (for example, image plus text models) often underdeliver in production.
A surprisingly effective solution is to convert all content to text before retrieval.
For example:
- Images: Use a vision-language model to generate captions.
- Videos with speech: Transcribe audio using a tool like Whisper.
- Videos without speech: Extract keyframes and caption them.
By indexing text representations, retrieval accuracy for non-text content improves dramatically.
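The convert-to-text strategy above amounts to a simple dispatch on content type. In this sketch, the captioning and transcription functions are hypothetical placeholders passed in as callables; in practice they would wrap a vision-language model and a speech model such as Whisper.

```python
# Sketch: normalize every content type to text before indexing.
# caption_image and transcribe_audio are hypothetical stand-ins for
# real vision-language and speech-to-text models.

from typing import Callable

def to_text(
    item: dict,
    caption_image: Callable[[str], str],
    transcribe_audio: Callable[[str], str],
) -> str:
    """Convert a content item to a text representation for indexing."""
    kind = item["type"]
    if kind == "text":
        return item["content"]
    if kind == "image":
        return caption_image(item["path"])
    if kind == "video":
        if item.get("has_speech"):
            return transcribe_audio(item["path"])
        # Videos without speech: caption extracted keyframes instead.
        return " ".join(caption_image(f) for f in item["keyframes"])
    raise ValueError(f"unsupported content type: {kind}")

# Usage with stand-in models:
doc = to_text(
    {"type": "image", "path": "chart.png"},
    caption_image=lambda p: f"caption of {p}",
    transcribe_audio=lambda p: f"transcript of {p}",
)
```

Once every item is text, the same embedding model and index serve all content types, which is what makes retrieval over mixed-media corpora tractable.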
RAG in the Real World
OneSix built an AI-powered chatbot for a higher education client to help students get answers faster.
By applying RAG, the chatbot summarized thousands of unstructured documents, giving students accurate answers instantly and helping the university better serve its community.
Real-world RAG success comes from context engineering, feeding models the right information to deliver accurate, reliable, business-ready answers.
Ready to unlock the full potential of RAG?
At OneSix, we design and deploy Retrieval-Augmented Generation systems built for the real world. We engineer context, optimize retrieval, and integrate AI into your workflows so your models deliver accurate, reliable results.
Let’s talk about how we can turn your AI ideas into measurable results.
Co-written by
Matt Altberg, Lead ML Engineer
Francisco Gonzalez, Sr. Architect
Published
August 19, 2025