Lessons from building retrieval systems for AI assistants
Tuesday, May 27, 2025, 11:00 AM, from InfoWorld
Retrieval augmented generation (RAG) has quickly risen to become one of the most popular architectures for building AI assistants, especially in scenarios where combining the power of language models with proprietary information is key. Integrating an external knowledge base with transformer models through a RAG architecture allows generative AI systems not only to provide more accurate and factual responses, but also to cite their sources when responding—a significant upgrade for many applications that addresses one of the biggest weaknesses of large language models (LLMs), namely, hallucinations.

That being said, designing and implementing a production-ready RAG architecture that reliably returns accurate, high-quality content from enormous external knowledge bases comes with its own set of engineering challenges. Success depends on a deep understanding of the various components involved, along with a framework through which one can quickly iterate and evaluate how the system responds to updates in the design. This article walks through the key components of such a system and offers a few core lessons learned from building and deploying RAG-based systems in production. It is intended for engineers designing and implementing production-ready AI assistants.

Embedding models and similarity metrics

An embedding model creates dense vector representations of text, ideally capturing the underlying meaning of that text. This allows the documents from the external knowledge base and the queries asked by the user to share the same vector space. Once two pieces of text share the same vector space, we can use a similarity metric of our choosing to decide how similar or different they are, and to find the n most similar texts for any given text.

The entire foundation of a RAG-based architecture is this two-part process:

1. Creating a vector representation for a given text (done by the embedding model).
2. Evaluating how “close” two vectors are in a given vector space (calculated by the similarity metric).

Given this, it’s clear that the choice of embedding model and similarity metric are two important decisions that can have a major impact on the effectiveness of the application.

Embedding models

The choice of an embedding model depends on a few key factors, such as the domain in which it will be applied, compute and cost limitations, the tradeoff between accuracy and latency, and whether self-hosting is a requirement. Currently, one of the most popular embedding models is the text-embedding-3-large model from OpenAI. It’s a solid choice with robust performance for general-purpose RAG applications, albeit proprietary and without the option to self-host. On the self-hosting side, there are various options, with Nvidia’s NV-Embed-V2 being a popular choice.

My recommendation when choosing an embedding model is to look at Hugging Face’s MTEB leaderboard, which ranks embedding models on multiple benchmarks and is usually kept up to date. Then try out a few different embedding models with data sampled from your application to evaluate whether a specific model outperforms others in your target domain. This is crucial because there isn’t a “one size fits all” model, and the context of the actual queries is essential when evaluating a model’s performance.

Similarity metrics

When it comes to similarity metrics, the three most popular are cosine similarity, dot product, and Euclidean distance.

Cosine similarity measures the angle between two vectors, capturing how similar their direction is in the high-dimensional space. It ignores the magnitude of the vectors and simply asks, “Are the two vectors pointing in the same direction?” This is why cosine similarity is often a good choice when you want the RAG application to return texts that are about the same thing, even if one text is significantly more detailed than the other.

By contrast, the dot product takes both direction and magnitude into account. If, for example, the magnitude of a vector encodes information about specificity, then the dot product will treat “more specific” vectors as more similar even if their direction is slightly off. While this can sometimes be helpful, it can reduce the effectiveness of the application if the magnitude of the vectors is unrelated to the goals of your application.

Lastly, Euclidean distance, or L2 distance, measures the length of the straight line between two vectors: the closer the two vectors are in space, the more similar they are. However, this is often less useful in high-dimensional spaces, where distances tend to be less meaningful than direction.

Your choice of a similarity metric will most likely come down to cosine similarity versus the dot product. As with the choice of an embedding model, it is often most effective to experiment with both similarity metrics and evaluate the results in your specific application.
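To make the two-part process concrete, here is a minimal sketch that embeds a query and a few documents and compares them with all three metrics. The sentence-transformers library, the all-MiniLM-L6-v2 model, and the sample texts are illustrative assumptions; any embedding model could be substituted.

```python
# Minimal sketch: embed a query and a few documents, then compare them with
# cosine similarity, dot product, and Euclidean (L2) distance.
# Assumes the sentence-transformers package; the model and texts are examples only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The API rate limit is 100 requests per minute per key.",
    "Quarterly revenue grew 12% year over year.",
]
query = "How long do customers have to return an item?"

doc_vecs = model.encode(documents)   # shape: (n_docs, dim)
query_vec = model.encode(query)      # shape: (dim,)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a, b):
    return float(np.dot(a, b))

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

for doc, vec in zip(documents, doc_vecs):
    print(f"cosine={cosine(query_vec, vec):.3f}  "
          f"dot={dot(query_vec, vec):.3f}  "
          f"l2={euclidean(query_vec, vec):.3f}  | {doc}")
```

Note that if embeddings are normalized to unit length, cosine similarity and the dot product produce identical rankings, which is worth keeping in mind when comparing the two.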
Chunking

When integrating an external knowledge base into a RAG system, one of the choices you will need to make is how to split it up when creating the vector representations. One obvious choice is to embed each document separately. However, you might need to further split a single document into multiple subparts, due to model context length limits and/or because a single document can span many different topics.

The naive approach to chunking is to define a specific chunk size, say 500 characters, and simply split each document into chunks of that size. This approach is often not very effective, because it leads to awkward splits in the middle of sentences, and the resulting chunks may no longer accurately capture the underlying meaning of the original text. A more robust approach is to use the inherent structure of language and chunk on boundaries such as paragraphs, section headers, and sentences. This is usually significantly more effective than the naive approach. It’s also common to include some amount of overlap between chunks in order to preserve continuity.

It’s worth mentioning that you could also task a language model with determining how to most effectively chunk your documents. While this will probably yield the most effective chunks, it is computationally very expensive and often infeasible for moderate to large external knowledge stores.
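Below is a rough sketch of the structure-aware approach: split on paragraph boundaries, pack paragraphs into chunks up to a character budget, and carry a small overlap into the next chunk. The chunk_document helper, the size and overlap values, and the sample text are illustrative assumptions, not a prescribed implementation.

```python
# Sketch of structure-aware chunking: split on blank lines (paragraphs), pack
# paragraphs into chunks up to a rough character budget, and repeat the tail of
# each chunk at the start of the next to preserve continuity.
# The budget and overlap values are placeholders to tune for your own data.

def chunk_document(text: str, max_chars: int = 1000, overlap_chars: int = 200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""

    for para in paragraphs:
        # If adding this paragraph would overflow the budget, close the chunk.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one as overlap.
            current = current[-overlap_chars:]
        current = f"{current}\n\n{para}" if current else para

    if current:
        chunks.append(current)
    return chunks


# Example usage with a small invented document.
doc = (
    "RAG systems pair a retriever with a language model.\n\n"
    "The retriever embeds documents and queries into a shared vector space.\n\n"
    "A similarity metric then selects the closest chunks.\n\n"
    "The language model answers the query using those chunks as context."
)
for i, chunk in enumerate(chunk_document(doc, max_chars=120, overlap_chars=40)):
    print(i, repr(chunk))
```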
Hybrid search

While semantic search using vector embeddings performs well at capturing rephrased or paraphrased meanings, it may not do well on searches that involve rare terms or jargon. In these cases, combining semantic search with more traditional sparse retrieval techniques (BM25 or TF-IDF), which incorporate signals like keyword frequency, often helps improve the retrieval process. To incorporate both types of retrieval, you could assign each chunk both scores and make the final score a weighted combination of the two, or you could use sparse retrieval as a first-pass filter followed by semantic search.
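As an illustration of the weighted-combination option, the sketch below scores a handful of chunks with BM25 and with embedding similarity, normalizes both, and blends them. It assumes the rank_bm25 and sentence-transformers packages; the 0.5 weight and the sample texts are placeholders to tune and replace with your own data.

```python
# Sketch of hybrid retrieval: score chunks with BM25 (sparse) and with cosine
# similarity of embeddings (dense), min-max normalize each, and blend them
# with a tunable weight. Assumes rank_bm25 and sentence-transformers.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = [
    "Error E1042 indicates an expired OAuth token.",
    "Tokens can be refreshed by calling the /auth/refresh endpoint.",
    "Our office is closed on public holidays.",
]
query = "How do I fix error E1042?"

# Sparse scores: sensitive to rare terms and exact keywords.
bm25 = BM25Okapi([c.lower().split() for c in chunks])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense scores: sensitive to paraphrase and meaning.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
dense = doc_vecs @ query_vec  # cosine similarity, since vectors are unit length

def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # weight on the dense score; tune on your own data
hybrid = alpha * minmax(dense) + (1 - alpha) * minmax(sparse)

for idx in np.argsort(-hybrid):
    print(f"{hybrid[idx]:.3f}  {chunks[idx]}")
```

Min-max normalization is only one way to put the two score scales on a comparable footing; rank-based fusion schemes such as reciprocal rank fusion are another common choice.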
Reranking – the final step

Once you have run the initial search to retrieve relevant chunks, a final step of ranking these results helps ensure that the most useful information is presented to the user. The reason is that chunks which are technically similar to the query might still not contain the most helpful answer to it. There are a few different ways in which reranking is done in practice.

One approach is to use heuristics based on chunk metadata, such as the author, date, and source reliability. A benefit of this approach is that it is usually computationally inexpensive and fast. A second approach is to use cross-encoders—transformer models that take in both the document and the query to compute relevance. This is often very accurate, but it is computationally expensive, so it is usually applied only after the potential candidates have been narrowed down to a reasonably small set. A third approach, something of a middle ground between heuristics and cross-encoders, is to use shallow classifiers. These are simpler machine learning models that use pre-extracted features (like word frequency, length, and click-through rates from logs) and apply lighter-weight algorithms such as decision trees to calculate relevance.

Choosing the right reranking approach involves weighing the tradeoffs between compute, latency, and accuracy requirements. Here again, it is often most useful to experiment with different approaches using data from your application domain to make a more informed decision.

Evaluating the results

One of the most challenging aspects of implementing and maintaining a RAG-based system is evaluating its performance in the real world. In most cases, labeled relevance judgments are prohibitively expensive to collect at scale. As a result, developers often must rely on proxy metrics to iterate on and improve the system’s performance. A few key proxy metrics are Precision@k (the proportion of the top-k retrieved documents that are actually relevant), Recall@k (the proportion of all relevant documents that were returned), and answer overlap (whether the retrieved documents contribute directly to the generated answer). Additionally, in production settings, it is essential to incorporate implicit feedback—in the form of click-through rates, query rephrasing, and user selections—to help monitor and refine quality over time.

When designing evaluation metrics, it’s crucial to align the system with the business goals. For instance, in a customer support assistant, resolution time probably matters a lot more than a metric like Recall@100.
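When even a small labeled set is available, the proxy metrics themselves are simple to compute. The sketch below shows Precision@k and Recall@k over hypothetical retrieved and relevant document IDs.

```python
# Minimal sketch of Precision@k and Recall@k, computed from a hypothetical
# list of retrieved document IDs and a labeled set of relevant IDs per query.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical example: the retriever returned five chunks, two of which
# the labelers judged relevant to the query (a third relevant one was missed).
retrieved = ["doc_12", "doc_07", "doc_33", "doc_02", "doc_51"]
relevant = {"doc_07", "doc_02", "doc_44"}

print(precision_at_k(retrieved, relevant, k=5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 2/3 ≈ 0.67
```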
Building a retrieval system for an AI assistant involves making many different decisions for the components that all come together to create a useful assistant. There’s no one right answer to the architectural decisions that need to be made in the process; the domain and business needs should guide your decision-making. Finally, keep in mind that the tools available in this field are changing quickly, with theory and documentation often lagging behind. This is why it’s essential to be able to prototype things quickly and see what works best in practice for your application.

Shaurya Pednekar is founding back-end engineer at Undermind.

—

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.

https://www.infoworld.com/article/3982253/lessons-from-building-retrieval-systems-for-ai-assistants....