RAFT: Integrating RAG with Fine-Tuning

To improve a language model’s performance on specific topics, you can supplement its general pre-training with more targeted information. Two main methods are used for this: Retrieval Augmented Generation (RAG) and Fine-Tuning. RAG injects knowledge from external sources into the model’s prompts, while Fine-Tuning trains the model directly on additional data. Each method has its advantages and disadvantages, and the choice depends on the project’s requirements. Traditionally, people chose one method or the other, but Retrieval Augmented Fine-Tuning (RAFT) combines both approaches. RAFT offers a new way to train language models, improving their ability to answer questions using external information. This post explains what RAFT is and how it enhances language model training.

RAG vs. Fine-Tuning

Large language models (LLMs) start by learning from vast amounts of public data, which makes them great at general knowledge tasks. However, they’re now being used more in specialized areas, like providing coding assistance for specific software or answering questions about legal or medical documents. In these cases, accuracy with specific documents is crucial.

To adapt LLMs to these specialized areas, we use two main strategies: retrieval-augmented generation (RAG) and supervised Fine-Tuning.

RAG

RAG allows the model to look up documents to answer questions, similar to having an open book during a test. However, this approach doesn’t fully leverage the opportunity to learn from the specific types of questions it will encounter.

Fine-Tuning

On the other hand, supervised Fine-Tuning focuses on learning the general themes in the documents, improving task performance, and better meeting user needs. However, traditional methods either miss the opportunity to use documents when answering real questions or don’t effectively address mistakes in selecting the right documents to learn from.

Think of it like an open-book test. Current methods with RAG are like taking the test without studying first, while Fine-Tuning methods are like studying by either cramming information or practicing questions without actually using the book they’ll have during the test. Although these methods aim to learn from the specific area, they don’t prepare the model for the reality of having resources available during the actual test.

A recent research paper from UC Berkeley introduces a new approach called Retrieval Augmented Fine-Tuning (RAFT), which combines supervised Fine-Tuning (SFT) with retrieval augmented generation (RAG).

First, we’ll begin with a look at SFT, and then we’ll dive deeper into RAFT.

Supervised Fine-Tuning (SFT)

In supervised Fine-Tuning (SFT), we use a dataset of questions and answers to train the model. The goal is to improve the model’s ability to provide accurate answers based on its existing knowledge from earlier training or new information gained during Fine-Tuning. Once trained, this model can also use additional documents to assist in finding answers, incorporating the RAG approach. Here’s a simple way to understand how it works:

  • Training: The model learns to go from a question to an answer (Q → A).

  • Testing without extra info (0-shot Inference): It uses what it learned to answer new questions (Q → A).

  • Testing with RAG (RAG Inference): It gets extra documents to help answer questions (Q+D → A), as sketched below.
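To make the distinction concrete, here is a minimal sketch of how prompts might be assembled for the two inference modes: 0-shot inference uses only the question, while RAG inference prepends retrieved documents. The function names and prompt format are illustrative assumptions, not the paper’s code.

```python
# Minimal sketch (illustrative, not from the RAFT paper) of the two inference modes
# for a fine-tuned model.

def sft_prompt(question: str) -> str:
    # 0-shot inference: the model answers from what it learned during fine-tuning (Q -> A).
    return f"Question: {question}\nAnswer:"

def rag_prompt(question: str, documents: list[str]) -> str:
    # RAG inference: retrieved documents are prepended to the question (Q + D -> A).
    context = "\n\n".join(f"Document {i + 1}:\n{doc}" for i, doc in enumerate(documents))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

if __name__ == "__main__":
    docs = ["RAFT combines supervised fine-tuning with retrieval augmented generation."]
    print(sft_prompt("What does RAFT combine?"))
    print(rag_prompt("What does RAFT combine?", docs))
```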

Retrieval Augmented Fine-Tuning (RAFT)

Retrieval Augmented Fine-Tuning (RAFT) introduces a new method for preparing training data, particularly for scenarios where models need to handle domain-specific, open-book situations, similar to in-domain RAG. Here’s how it works:

Training Data Setup

  • Components: Each training example includes a question ($Q$), several documents ($D_k$), and a detailed chain-of-thought (CoT) answer ($A^*$) based on information from one specific document ($D^*$).
  • Document Types: There are two types of documents:
    • Oracle Documents ($D^*$): Contain the necessary information to answer the question.
    • Distractor Documents ($D_i$): Do not contribute to the answer.

Training Process

  • With Relevant and Distractor Documents: The model is given a question along with the oracle document and several distractor documents, for example: $Q + D^* + D_2 + \ldots + D_k \to A^*$.
  • With Only Distractor Documents: The model receives only a question and distractor documents, which pushes it to rely on knowledge internalized during fine-tuning rather than just the documents provided, for example: $Q + D_1 + D_2 + \ldots + D_k \to A^*$. A small data-construction sketch follows this list.
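As a rough illustration of this setup, the sketch below builds a single RAFT-style training example: with some probability the oracle document is kept alongside the distractors, and otherwise only distractors are used. The function name, the `oracle_fraction` parameter, and the prompt format are assumptions made for illustration; the paper does not prescribe this exact code.

```python
import random

def make_raft_example(question, oracle_doc, distractor_docs, cot_answer,
                      oracle_fraction=0.8):
    """Assemble one RAFT-style training example (illustrative sketch)."""
    if random.random() < oracle_fraction:
        # Q + D* + D_2 + ... + D_k -> A*: oracle document plus distractors.
        docs = [oracle_doc] + list(distractor_docs)
    else:
        # Q + D_1 + ... + D_k -> A*: distractors only, so the model must rely
        # on knowledge internalized during fine-tuning.
        docs = list(distractor_docs)
    random.shuffle(docs)  # avoid the oracle always appearing in the same position
    context = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": cot_answer}
```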

Testing

  • During testing, the model is given a question and the top-k documents retrieved by the RAG pipeline. RAFT works effectively regardless of the specific tool used for document retrieval; a brief test-time sketch follows.
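At test time the same prompt format can be reused with whatever retriever is available. The sketch below assumes a generic `retrieve(question, k)` function returning the top-k documents; the retriever itself is not part of RAFT, and this helper is hypothetical.

```python
def build_test_prompt(question, retrieve, k=5):
    # `retrieve` is any function returning the top-k documents for the question;
    # RAFT does not depend on a particular retrieval tool.
    docs = retrieve(question, k)
    context = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```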

A key part of the training is teaching the model to explain its answers step by step, similar to showing your work in math homework. The model is given all the necessary context and information and is then asked to explain its answer in detail, referring back to the original documents. In the datasets discussed in the paper, this method was used to ensure answers included reasoning. Some datasets, like Gorilla APIBench, already have answers with explanations. The paper’s experiments show that adding these detailed explanations helps the model perform better.
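One way to obtain such chain-of-thought training answers is to prompt a capable model with the question and the oracle document and ask it to reason step by step while quoting its source. The template below is an illustrative assumption, not the paper’s exact prompt.

```python
# Illustrative template (not the paper's exact prompt) for generating CoT training
# answers that cite the oracle document.
COT_TEMPLATE = """\
Context:
{oracle_document}

Question: {question}

Instructions: Answer the question using only the context above. First reason
step by step, quoting the exact supporting sentences from the context, then
give the final answer on the last line in the form "Answer: <answer>".
"""

def cot_generation_prompt(question: str, oracle_document: str) -> str:
    return COT_TEMPLATE.format(question=question, oracle_document=oracle_document)
```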

Experiment

The results for the RAFT model, using selected datasets and comparison models, are summarized below. RAFT consistently outperforms other models. For instance, compared to the instruction-tuned Llama-2 model, RAFT, especially when combined with RAG, proves to be much more effective at extracting relevant information from the provided documents and ignoring irrelevant ones. The performance improvement can be as high as 35.25% on HotpotQA and 76.35% on Torch Hub.

When comparing Retrieval Augmented Fine-Tuning (RAFT) to Domain-Specific Fine-Tuning (DSF) on various datasets, RAFT performs better at utilizing the provided context to answer questions. It shows notable improvements on the HotpotQA and HuggingFace datasets, with increases of 30.87% and 31.41% respectively. However, for PubMed QA, which focuses on yes/no questions, RAFT doesn’t show as significant an improvement over DSF combined with RAG.

Even when compared to a much larger model like GPT-3.5, RAFT demonstrates clear advantages. The LLaMA-7B model, whether used with or without RAG, didn’t perform as well because its answering approach didn’t fit the task requirements. RAFT, with its domain-specific tuning, showed better results. This tuning helps the model learn how to answer questions more effectively.

Simply adding RAG to a domain-specific fine-tuned model doesn’t always improve performance, indicating that the model might need additional training to use context effectively and extract useful information. RAFT addresses this by training the model to better align its answers and enhance its document-processing capabilities, leading to superior performance compared to other methods.

Reference