RAG Explained: Reranking for Better Answers


How reranking improves retrieval-augmented generation by surfacing the most relevant results
In my last post, we took a look at how the retrieval mechanism of a RAG pipeline works. In a RAG pipeline, relevant documents from a knowledge base are identified and retrieved based on how similar they are to the user’s query. More specifically, the similarity of each text chunk to the query is quantified using a retrieval metric, like cosine similarity, L2 distance, or dot product; the text chunks are then ranked by their similarity scores, and finally, we pick the top chunks that are most similar to the user’s query.
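For reference, the three retrieval metrics mentioned above can be sketched in a few lines of NumPy (a toy example with made-up vectors, not embeddings from an actual model):

```python
import numpy as np

q = np.array([0.1, 0.3, 0.5])  # toy query embedding
d = np.array([0.2, 0.1, 0.4])  # toy chunk embedding

cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))  # higher = more similar
l2_distance = np.linalg.norm(q - d)                       # lower = more similar
dot_product = q @ d                                       # higher = more similar
```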
Unfortunately, high similarity scores don’t always guarantee relevance. In other words, the retriever may return a text chunk that has a high similarity score, but is in fact not that useful – just not what we need to answer our user’s question 🤷🏻‍♀️. And this is where reranking is introduced, as a way to refine the results before feeding them into the LLM.
As in my previous posts, I will once again be using the War and Peace text as an example, licensed as Public Domain and easily accessible through Project Gutenberg.
• • •
Text chunks retrieved solely based on a retrieval metric – that is, raw retrieval – may not be that useful, for several different reasons.
Back to my favorite question from the ‘War and Peace’ example: if we ask ‘Who is Anna Pávlovna?’ and use a very small k (like k = 2), the retrieved chunks may not contain enough information to comprehensively answer the question. Conversely, if we allow a large number of chunks to be retrieved (say k = 20), we will most probably also retrieve some irrelevant text chunks where ‘Anna Pávlovna’ is merely mentioned but isn’t the topic of the chunk. The meaning of those chunks will be unrelated to the user’s query and useless for answering it. Therefore, we need a way to distinguish the truly relevant text chunks from all the retrieved ones.
Here, it is worth clarifying that one straightforward solution to this issue would be to just retrieve everything and pass it all to the generation step (to the LLM). Unfortunately, this cannot be done for a bunch of reasons: LLMs have finite context windows, and their performance gets worse when they are overstuffed with information.
So, this is the issue we try to tackle by introducing the reranking step. In essence, reranking means re-evaluating the chunks retrieved via the cosine similarity scores with a more accurate, yet also more expensive and slower, method.
There are various methods for doing this – for instance, cross-encoders, employing an LLM to do the reranking, or using heuristics. Ultimately, by introducing this extra reranking step, we essentially implement what is called two-stage retrieval with reranking, which is a standard industry approach. This allows us to improve the relevance of the retrieved text chunks and, as a result, the quality of the generated responses.
So, let’s take a more detailed look… 🔍
• • •
Cross-encoders are the standard models used for reranking in a RAG framework. Unlike the initial retrieval step, which just compares independently computed embeddings of the text chunks and the query, cross-encoders perform a more in-depth comparison of each retrieved text chunk with the user’s query. More specifically, a cross-encoder processes the document and the user’s query jointly, in a single pass, and directly produces a relevance score. By contrast, in cosine similarity-based retrieval, the document and the user’s query are embedded separately from one another, and then their similarity is calculated. As a result, some information from the original texts is lost when the embeddings are created separately, whereas more of it is preserved when the texts are processed jointly. Consequently, a cross-encoder can better assess the relevance between two texts (that is, the user’s query and a document).
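To make the difference concrete, here is a minimal sketch contrasting the two scoring approaches with the sentence-transformers library (the bi-encoder model name and the example document are illustrative choices of mine; the cross-encoder is the ms-marco-MiniLM-L-6-v2 model mentioned below):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "Who is Anna Pávlovna?"
doc = "Anna Pávlovna Schérer was giving one of her soirées."  # a stand-in chunk

# Bi-encoder: query and document are embedded separately, then compared
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, d_emb = bi_encoder.encode([query, doc], normalize_embeddings=True)
cosine_score = float(q_emb @ d_emb)

# Cross-encoder: the (query, document) pair goes through the model together
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
relevance_score = cross_encoder.predict([(query, doc)])[0]
```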
So why not use a cross-encoder in the first place? The answer is that cross-encoders are very slow. For instance, a cosine similarity search over about 1,000 passages takes less than a millisecond. On the contrary, using solely a cross-encoder (like ms-marco-MiniLM-L-6-v2) to score the same set of 1,000 passages against a single query would be orders of magnitude slower!
This is to be expected if you think about it, since using a cross-encoder means that we have to pair each chunk of the knowledge base with the user’s query and run the model on the spot, for each and every new query. On the contrary, with cosine similarity-based retrieval, we get to create all the embeddings of the knowledge base beforehand, and just once; then, when the user submits a query, we only need to embed the query and calculate the pairwise cosine similarities.
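A rough way to see this cost asymmetry for yourself is the sketch below (exact timings will vary with hardware and models, and the corpus here is a made-up stand-in):

```python
import time
from sentence_transformers import SentenceTransformer, CrossEncoder

passages = [f"This is passage number {i} of the corpus." for i in range(1000)]
query = "Who is Anna Pávlovna?"

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = bi_encoder.encode(passages, normalize_embeddings=True)  # done once, offline

start = time.perf_counter()
q_emb = bi_encoder.encode([query], normalize_embeddings=True)[0]
scores = embeddings @ q_emb  # the search itself is one matrix-vector product
print("vector search:", time.perf_counter() - start, "s")

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
start = time.perf_counter()
ce_scores = cross_encoder.predict([(query, p) for p in passages])  # 1,000 forward passes
print("cross-encoder:", time.perf_counter() - start, "s")
```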
For that reason, we adjust our RAG pipeline appropriately and get the best of both worlds: first, we narrow down the candidate relevant chunks with the cosine similarity search, and then, in a second step, we assess the relevance of the retrieved chunks more accurately with a cross-encoder.
• • •
So now let’s see how all these play out in the ‘War and Peace’ example by answering one more time my favorite question – ‘Who is Anna Pávlovna?’.
My code so far looks something like this:
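Roughly, a minimal sketch of that retrieval pipeline (the embedding model and function names here are my assumptions, not the exact code from the previous posts):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks):
    """Embed all knowledge-base chunks once, up front."""
    return embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query, chunks, chunk_embeddings, k=2):
    """Return the k chunks most similar to the query by cosine similarity."""
    q_emb = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_embeddings @ q_emb
    top_idx = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_idx]
```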
For k = 2, we get the following top chunks retrieved.
But, if we set k = 6, we get the following chunks retrieved, and a somewhat more informative answer, containing additional details on our question, like the fact that she’s ‘maid of honor and favorite of the Empress Márya Fëdorovna’.
Now, let’s adjust our code to rerank those 6 chunks and see if the top 2 remain the same. To do this, we will be using a cross-encoder model to rerank the top-k retrieved documents before passing them to our LLM. More specifically, I will be utilizing the cross-encoder/ms-marco-TinyBERT-L2 cross-encoder, which is a simple, pre-trained cross-encoder model running on top of PyTorch. To do so, we also need to import the torch and transformers libraries.
Then we can initialize the cross-encoder and define a function for reranking the top k chunks retrieved from the vector search:
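A sketch of what this might look like with the transformers API (the function name and details below are my assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cross-encoder/ms-marco-TinyBERT-L2"  # model id as named in the text
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reranker = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
reranker.eval()

def rerank(query, chunks, top_n=2):
    """Score each (query, chunk) pair with the cross-encoder and keep the top_n chunks."""
    inputs = tokenizer([query] * len(chunks), chunks,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = reranker(**inputs).logits.squeeze(-1)  # one relevance score per pair
    ranked = scores.argsort(descending=True)
    return [chunks[int(i)] for i in ranked[:top_n]]
```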
… and also adjust our function as follows:
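The adjusted function might look something like this (a sketch building on the retrieve and rerank helpers above; the llm argument is a placeholder for whatever generation call the pipeline uses):

```python
def answer_question(query, chunks, chunk_embeddings, llm, k=6, top_n=2):
    """Two-stage retrieval: vector search narrows down, the cross-encoder reranks."""
    candidates = retrieve(query, chunks, chunk_embeddings, k=k)  # stage 1: fast, approximate
    best_chunks = rerank(query, candidates, top_n=top_n)         # stage 2: slow, accurate
    context = "\n\n".join(best_chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)
```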
… and finally, these are the top 2 chunks, and the respective answer we get, after re-ranking with the cross-encoder:
Notice how these 2 chunks are different from the top 2 chunks we got from the vector search.
Thus, the importance of the reranking step becomes clear. We use the vector search to narrow down the possibly relevant chunks out of all the available documents in the knowledge base, and then use the reranking step to accurately identify the most relevant ones.
We can imagine the two-step retrieval as a funnel: the first stage pulls in a wide set of candidate chunks, and the reranking stage filters out the irrelevant ones. What’s left is the most useful context, leading to clearer and more accurate answers.
• • •
So, it becomes apparent that reranking is an essential step for building a robust RAG pipeline. Fundamentally, it allows us to bridge the gap between the quick but not so precise vector search and accurate, context-aware answers. By performing a two-step retrieval, with the vector search as the first step and the reranking as the second, we get the best of both worlds: efficiency at scale and higher quality responses. In practice, this two-stage approach is what makes modern RAG pipelines both practical and powerful.
✨Thank you for reading!✨