How To Use RAG To Improve Your LLM's Reasoning Skills
Introduction
Retrieval Augmented Generation (RAG) typically finds its place in enhancing document-based question answering (QnA), effectively leveraging extensive databases to provide contextually relevant information for Large Language Models (LLMs) to formulate precise answers. Traditionally, when looking to boost the reasoning capabilities of LLMs, the go-to strategy has been fine-tuning these models with additional data. However, fine-tuning is not only resource-intensive but also presents scalability challenges.
Interestingly, RAG could potentially offer a more efficient pathway to enhance LLMs' reasoning skills without the hefty costs of fine-tuning. This intriguing premise is explored in depth in Enhancing LLM Intelligence with ARM-RAG: Auxiliary Rationale Memory for Retrieval Augmented Generation by Eric Melz, which proposes a novel use of RAG beyond its conventional application, aiming to refine and expand the problem-solving prowess of LLMs efficiently.
This blog post will take a deep dive into the mechanics of ARM-RAG, specifically focusing on how it utilizes RAG to craft prompts that refine the reasoning skills of LLMs. By walking through an example, we aim to vividly illustrate this process. Further, we'll discuss the results and examine the limitations of this approach.
Setup
Let's say we want the LLM to answer a maths question like the following:
Ray buys a pack of hamburger meat for $5.00, a box of crackers for $3.50, 4 bags
of frozen vegetables at $2.00 per bag and a pack of cheese for $3.50 at the
grocery store. Because he is a store rewards member, he gets 10% off of his
purchase. What does his total grocery bill come to?
The answer is $18.00 because:

1. List the Prices and Quantities of Each Item:
   - Hamburger meat: $5.00
   - Box of crackers: $3.50
   - Bags of frozen vegetables: $2.00 per bag (4 bags)
   - Pack of cheese: $3.50
2. Calculate the Total Cost of the Vegetables:
   - Since the vegetables are $2.00 per bag and there are 4 bags: 4 x $2.00 = $8.00
3. Add Up the Costs of All Items to Get the Subtotal:
   - Combine the cost of the meat, crackers, vegetables, and cheese: $5.00 + $3.50 + $8.00 + $3.50 = $20.00
4. Calculate the Discount for Being a Store Rewards Member:
   - The discount is 10%. Convert this to a decimal (0.10) and multiply by the subtotal: 10% of $20.00 = 0.10 x $20.00 = $2.00
5. Subtract the Discount from the Subtotal to Find the Total Cost:
   - To find the total amount Ray pays, subtract the discount from the subtotal: $20.00 - $2.00 = $18.00

Conclusion: After applying the 10% discount, Ray's total grocery bill is $18.00.
This is a classic word problem that requires a series of mathematical reasoning steps to solve and is part of the GSM8k dataset.
The rest of this article will use gpt-3.5-turbo as the LLM.
Mechanics
Step 1: Obfuscate The Question
The core concept behind ARM-RAG hinges on pulling structurally similar questions from a knowledge base—here, the GSM8k dataset—and leveraging the answers to these questions to craft a prompt that aids the LLM in addressing the presented query.
To accomplish this, the process involves a creative step of obscuring the original question. This is crucial to avoid retrieving questions that are merely semantically similar, such as those concerning grocery stores or involving individuals named Ray. By randomly substituting names with less common alternatives and non-name nouns with placeholders, we hope to make the retriever put more emphasis on the structure of the question rather than its surface content.
This adjustment transforms the original question into a version that looks like this:
Vero buys a zibberflap of crinklethorp for $5.00, a dinglefrap of plumbuzzle for
$3.50, 4 quarkleflop of spiggotwhap at $2.00 per flibberjig and a zibberflap of
floopernoodle for $3.50 at the snickerblast. Because he is a store rewards
member, he gets 10% off of his quizzletoe. What does his slippleshank come to?
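To make this concrete, here is a minimal sketch of what such an obfuscation step could look like. The word lists and the `obfuscate` helper are hypothetical; the paper does not publish its exact substitution vocabulary, and detecting which tokens are names or nouns (e.g. with a POS tagger) is left out of scope.

```python
import random
import re

# Hypothetical word lists: the exact substitution vocabulary is an assumption,
# not taken from the paper.
RARE_NAMES = ["Vero", "Quilla", "Zephron"]
NONSENSE_NOUNS = ["zibberflap", "crinklethorp", "plumbuzzle", "spiggotwhap"]

def obfuscate(question: str, names: list[str], nouns: list[str]) -> str:
    """Swap detected names for rare ones and common nouns for placeholders.

    `names` and `nouns` are the tokens detected in the question (e.g. via a
    POS tagger); detection itself is out of scope for this sketch.
    """
    for name in names:
        question = re.sub(rf"\b{re.escape(name)}\b", random.choice(RARE_NAMES), question)
    for noun in nouns:
        question = re.sub(rf"\b{re.escape(noun)}\b", random.choice(NONSENSE_NOUNS), question)
    return question

print(obfuscate(
    "Ray buys a pack of hamburger meat at the grocery store.",
    names=["Ray"],
    nouns=["pack", "hamburger meat", "grocery store"],
))
```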
Step 2: Retrieve Structurally Similar Questions
We use the obfuscated question to query the knowledge base and retrieve 5 structurally similar questions.
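A sketch of what this retrieval could look like is below; the embedding model (`all-MiniLM-L6-v2`), the FAISS index, and the `retrieve` helper are assumptions for illustration, not necessarily what the paper used.

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# (question, reasoning_chain) pairs from the GSM8k training set; two toy
# entries stand in for the full knowledge base here.
corpus = [
    ("Zoe goes to the store to buy soda and pizza for herself and her 5 "
     "family members. How much money did Zoe take?",
     "Zoe needs 6 sodas and 6 slices: 6*0.5 + 6*1 = $9. Answer: \\boxed{9}."),
    ("Bob buys nose spray. He buys 10 of them for a buy one get one free "
     "promotion. They each cost $3. How much does he pay?",
     "He pays for 10/2=5 at $3 each, so 5*3=$15. Answer: \\boxed{15}."),
]

embeddings = model.encode([q for q, _ in corpus], normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine here
index.add(embeddings)

def retrieve(obfuscated_question: str, k: int = 5):
    """Return the k most similar (question, reasoning_chain) pairs."""
    query = model.encode([obfuscated_question], normalize_embeddings=True)
    _, idx = index.search(query, min(k, len(corpus)))
    return [corpus[i] for i in idx[0]]
```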
For our example, the retrieved questions are:
1. Question: Zoe goes to the store to buy soda and pizza for herself and her 5 family members. Each bottle of soda costs half a dollar and each slice of pizza costs $1. Before she leaves her house she takes just enough money to pay for her purchase. How much money did Zoe take?
Zoe has to buy 1 soda for each of the 6 family members for a total of 1*6 = <<1*6=6>>6 sodas
Each bottle of soda cost half a dollar and she has to buy 6 so 6*.50 = $<<6*.50=3.00>>3.00 worth of soda
She also has to buy pizza for herself and her 5 family members which will cost 6*1 = $<<6*1=6.00>>6.00
Altogether Zoe took 3+6 = $<<3+6=9.00>>9.00. Answer: \boxed{9}.
2. Question: Zoe goes to the store to buy soda and pizza for herself and her 5 family members. Each bottle of soda costs half a dollar and each slice of pizza costs $1. Before she leaves her house she takes just enough money to pay for her purchase. How much money did Zoe take?
Zoe needs to buy 6 sodas, which would cost 6 * $0.50 = $<<6*0.50=3>>3.
Zoe needs to buy 6 slices of pizza, which would cost 6 * $1 = $<<6*1=6>>6.
Zoe needs to take $3 + $6 = $<<3+6=9>>9. Answer: \boxed{9}.
3. Question: Bob buys nose spray. He buys 10 of them for a "buy one get one free" promotion. They each cost $3. How much does he pay?
Normally they would cost 10*3=$<<10*3=30>>30
But since they were buy one get one free, we only charge him for 10/2=<<10/2=5>>5
So he pays 5*3=$<<5*3=15>>15. Answer: \boxed{15}.
4. Question: Bob buys nose spray. He buys 10 of them for a "buy one get one free" promotion. They each cost $3. How much does he pay?
He bought 10/2=<<10/2=5>>5 pairs
So he paid 5*3=$<<5*3=15>>15. Answer: \boxed{15}.
5. Question: Zoe goes to the store to buy soda and pizza for herself and her 5 family members. Each bottle of soda costs half a dollar and each slice of pizza costs $1. Before she leaves her house she takes just enough money to pay for her purchase. How much money did Zoe take?
Zoe and her family members make a total of 5+1 = <<5+1=6>>6 people.
The six of them will all prefer drink a can of soda each, which is 6*0.5 =<<6*0.5=3>>3 dollars.
The family will also take some pizza, as each person will have a slice for themself, they will take a total of 6*1 = <<6*1=6>>6 dollars.
In total Zoe needed 3+6 = <<3+6=9>>9 dollars. Answer: \boxed{9}.
Did you notice that there are only two unique questions, but they have different reasoning chains, i.e. different computations used to arrive at the answer? This is because the author used an LLM to go over the training set and generate reasoning chains for each question five times. The question and answer pairs you see here are the responses the LLM got right.
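Here is a rough sketch of how such a knowledge base could be built; the `generate` callable and the `boxed_answer` helper are hypothetical stand-ins for the paper's actual pipeline.

```python
import re

def boxed_answer(text: str) -> str | None:
    """Extract the contents of the final \\boxed{...} tag, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

def build_knowledge_base(training_set, generate, n_samples: int = 5):
    """Keep only the reasoning chains whose boxed answer matches the gold one.

    `generate(question)` is an assumed callable wrapping the LLM; it returns
    one sampled reasoning chain. `training_set` yields (question, gold_answer).
    """
    knowledge_base = []
    for question, gold_answer in training_set:
        for _ in range(n_samples):
            chain = generate(question)
            if boxed_answer(chain) == gold_answer:  # keep only correct chains
                knowledge_base.append((question, chain))
    return knowledge_base
```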
Anyway, are these questions structurally similar to the original question?
Well, it's hard to say, really. As it stands, the idea of "structural similarity" is a bit vague, so let's just go with it for now.
Step 3: Build The Prompt
The next step is to build a prompt that uses the retrieved question, reasoning chain, and answer triplets as context.
The prompt for our example looks like this:
Given a math problem, generate an answer with a rationale.
Question / answer pairs have the form
Question: ${question}
${answer}
Examples:
Question: Zoe goes to the store to buy soda and pizza for herself and her 5 family members. Each bottle of soda costs half a dollar and each slice of pizza costs $1. Before she leaves her house she takes just enough money to pay for her purchase. How much money did Zoe take?
Zoe has to buy 1 soda for each of the 6 family members for a total of 1*6 = <<1*6=6>>6 sodas
Each bottle of soda cost half a dollar and she has to buy 6 so 6*.50 = $<<6*.50=3.00>>3.00 worth of soda
She also has to buy pizza for herself and her 5 family members which will cost 6*1 = $<<6*1=6.00>>6.00
Altogether Zoe took 3+6 = $<<3+6=9.00>>9.00. Answer: \boxed{9}.
Question: Zoe goes to the store to buy soda and pizza for herself and her 5 family members. Each bottle of soda costs half a dollar and each slice of pizza costs $1. Before she leaves her house she takes just enough money to pay for her purchase. How much money did Zoe take?
Zoe needs to buy 6 sodas, which would cost 6 * $0.50 = $<<6*0.50=3>>3.
Zoe needs to buy 6 slices of pizza, which would cost 6 * $1 = $<<6*1=6>>6.
Zoe needs to take $3 + $6 = $<<3+6=9>>9. Answer: \boxed{9}.
Question: Bob buys nose spray. He buys 10 of them for a "buy one get one free" promotion. They each cost $3. How much does he pay?
Normally they would cost 10*3=$<<10*3=30>>30
But since they were buy one get one free, we only charge him for 10/2=<<10/2=5>>5
So he pays 5*3=$<<5*3=15>>15. Answer: \boxed{15}.
Question: Bob buys nose spray. He buys 10 of them for a "buy one get one free" promotion. They each cost $3. How much does he pay?
He bought 10/2=<<10/2=5>>5 pairs
So he paid 5*3=$<<5*3=15>>15. Answer: \boxed{15}.
Question: Zoe goes to the store to buy soda and pizza for herself and her 5 family members. Each bottle of soda costs half a dollar and each slice of pizza costs $1. Before she leaves her house she takes just enough money to pay for her purchase. How much money did Zoe take?
Zoe and her family members make a total of 5+1 = <<5+1=6>>6 people.
The six of them will all prefer drink a can of soda each, which is 6*0.5 =<<6*0.5=3>>3 dollars.
The family will also take some pizza, as each person will have a slice for themself, they will take a total of 6*1 = <<6*1=6>>6 dollars.
In total Zoe needed 3+6 = <<3+6=9>>9 dollars. Answer: \boxed{9}.
Question: Ray buys a pack of hamburger meat for $5.00, a box of crackers for $3.50, 4 bags of frozen vegetables at $2.00 per bag and a pack of cheese for $3.50 at the grocery store. Because he is a store rewards member, he gets 10% off of his purchase. What does his total grocery bill come to?
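In code, assembling this prompt could look something like the following sketch; the `build_prompt` helper is hypothetical, but it reproduces the header and example layout shown above.

```python
def build_prompt(retrieved: list[tuple[str, str]], question: str) -> str:
    """Assemble the few-shot prompt from retrieved (question, chain) pairs."""
    header = (
        "Given a math problem, generate an answer with a rationale.\n"
        "Question / answer pairs have the form\n"
        "Question: ${question}\n"
        "${answer}\n\n"
        "Examples:\n\n"
    )
    examples = "\n\n".join(f"Question: {q}\n{a}" for q, a in retrieved)
    return f"{header}{examples}\n\nQuestion: {question}"
```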
Step 4: Generate And Extract The Answer
The prompt is then fed into the LLM to generate an answer. The answer is then extracted and used as the final answer to the original question.
This is the answer generated by the LLM:
Ray's total grocery bill comes to $5.00 (hamburger) + $3.50 (crackers) + 4 * $2.00 (frozen vegetables) + $3.50 (cheese) = $5.00 + $3.50 + $8.00 + $3.50 = $<<5+3.50+8+3.50=20.00>>20.00 before the discount.
His store rewards gives him a 10% discount, which is $20.00 * 0.10 = $<<20*0.10=2>>2.00 discount.
So, his total grocery bill comes to $20.00 - $2.00 = $<<20-2=18.00>>18.00 after the discount. Answer: \boxed{18.00}.
Notice that the LLM has generated a response that is similar to the reasoning chain in the retrieved questions. And, just like the retrieved questions, the final answer is enclosed in a \boxed{}
tag to make it easier to extract.
In this case, the answer generated by the LLM, 18.00, is the correct answer to the original question.
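Putting generation and extraction together, a minimal sketch might look like this, assuming the official `openai` Python client and reusing the `boxed_answer` helper from the knowledge-base sketch earlier.

```python
from openai import OpenAI  # assumes the official openai Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_question(prompt: str) -> str | None:
    """Send the prompt to gpt-3.5-turbo and pull out the boxed answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    completion = response.choices[0].message.content
    # Reuse the boxed_answer helper defined in the knowledge-base sketch.
    return boxed_answer(completion)
```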
Results
Here are the results reported in the paper:
| Experiment | Accuracy |
|---|---|
| Baseline | 73.2% |
| ARM-RAG Test | 75.3% |
| Obfuscated ARM-RAG Test | 77.4% |
The Baseline experiment simply has the LLM answer each question from the test set once, without any retrieved context.
The difference between ARM-RAG Test and Obfuscated ARM-RAG Test is that the former uses the original question to retrieve similar questions while the latter uses the obfuscated question. In other words, the former skips Step 1 and goes straight to Step 2 in the mechanics described in the preceding section.
The performance of the three systems shows a clear trend. The baseline, using GPT alone, starts off at 73.2% accuracy. Introducing ARM-RAG nudges that number to 75.3%, while the obfuscated ARM-RAG variant further increases accuracy to 77.4%. Each step up represents a methodical improvement in the model's ability to reason through math problems.
Limitations
The most glaring limitations of this approach are two-fold:
- Obfuscating the question is a bit of a hack. It's not clear how effective it is in practice, nor how to measure its effectiveness.
- The quality of the retrieved questions is hard to assess. As with obfuscation, there is no obvious way to measure how structurally similar the retrieved questions are to the original question.
The paper points toward potential future research directions to address these limitations. One of the most promising avenues is to categorize questions within a specific taxonomy or classification system. This approach could refine how questions are interpreted and matched, enhancing the retrieval process and, by extension, the model's reasoning performance.
Conclusion
In conclusion, ARM-RAG introduces an interesting approach to augmenting the problem-solving abilities of LLMs by using RAG to construct better prompts. This method shows a discernible improvement in accuracy over the baseline model, indicating its potential as a tool for more complex reasoning tasks. Although it currently faces challenges with producing and measuring structurally similar questions, it lays the groundwork for future advancements.