> For the complete documentation index, see [llms.txt](https://blog.fusionbase.com/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://blog.fusionbase.com/ai-research/beyond-cot-how-fusion-guide-elevates-ai-reasoning-with-a-two-model-system.md).

# Beyond CoT: How fusion-guide Elevates AI Reasoning with a Two-Model System

Currently, reasoning and chain-of-thought (CoT) processes are among the most significant advancements in improving models' abilities to tackle complex tasks. With the release of OpenAI's "o1," the reasoning capabilities and response quality of AI have seen a substantial boost. However, the inner workings of this system remain largely unknown. There are indications that "o1" may not be a single model but rather an agentic system. Speculation supporting this includes the missing system prompt, the model's "thinking" process, the absence of streaming in the API, and a lack of clarity regarding its parameters.

In this article, we aim to further enhance a model's reasoning capabilities. Rather than training the model to generate reasoning autonomously, we will introduce a two-model approach. One model, which we call the "Guide," is designed specifically to provide step-by-step guidance to solve a given task. This guidance is then incorporated as context for the primary model responsible for generating the final response.

#### Why a 2 model approach?

There are several reasons why this dual-model approach may yield decent results. Firstly, a model trained exclusively to break down problems into smaller sub-tasks could be significantly more effective at this specific task than a general-purpose model of a similar size. Previous experiments have demonstrated that smaller, fine-tuned models can outperform larger, generalized models when focused on a single specific use case. Since the Guide and the response model do not need to be the same size, this setup allows flexibility—for instance, using a small response model alongside a mid-sized Guide.

In a way, this approach mirrors real-world problem-solving: some experts devise the plans, while others execute them. In our experiments, we will demonstrate how even a small model can outperform much larger models when provided with precise guidance.

Anyway - this is an experiment.

### Algorithm

The process begins by sending the initial prompt to the Guide model. The Guide's task is to generate a step-by-step plan that will help the response model craft an accurate and comprehensive answer. This plan, or "guidance," is then incorporated into the original prompt as follows:

1. **Initial Prompt → Guide Model → Guidance Steps**
   * The user's original prompt is fed into the Guide model.
   * The Guide model generates detailed guidance steps tailored to address the prompt.
2. **Injecting Guidance into the Prompt:**
   * The guidance generated by the Guide model is then injected into the initial prompt. The resulting structure looks like this:

     ```markup
     {PROMPT}

     Using the guidance below, focus on the primary objective: to create a perfect and readable response to the given task. Ensure that each section is tailored specifically to the task at hand, not to the guidance provided.
     The following instructions are solely intended to help achieve the main goal. Think about each step to make sure the key points are covered and the goal is achieved:
     {GUIDANCE}
     ```
   * Here, the original prompt remains the central focus, but the guidance provides a structured pathway for the response model to follow.
3. **Guiding the Response Model:**
   * The enhanced prompt, now containing both the original task and the step-by-step guidance, is then sent to the response model.
   * The response model uses this guidance to generate a final answer. It is directed not to focus on the guidance itself but to leverage it to craft a comprehensive and tailored response to the task.

This two-step process effectively guides the response model on how to approach and solve the problem, ensuring that the final output is both accurate and highly relevant to the initial prompt.

{% hint style="danger" %}
As of now, fusion-guide does not support system prompts. Neither for itself nor for the response model!
{% endhint %}

### The model and data

Download: [Model Card on Huggingface](https://huggingface.co/fusionbase/fusion-guide-12b-0.1)

fusion-guide is built upon the Mistral-Nemo 12bn architecture and has been fine-tuned using a custom-designed training dataset. This dataset consists of task-based prompts, with approximately 90% of them in English and 10% in German. The tasks vary in complexity, skewing towards more challenging scenarios, including some problems that intentionally do not have solutions. This range is designed to enhance the model's ability to handle complex and ambiguous situations.

The training dataset follows a specific structure, formatted as `prompt => guidance`. In this setup, each prompt is paired with a corresponding piece of guidance, which outlines step-by-step instructions similar to what the model is expected to generate in its responses. This approach ensures that the model learns not just to provide answers, but to break down complex tasks systematically.

fusion-guide is available on Hugging Face under the same license as the original Mistral-Nemo 12bn model. While the model is currently in a research phase, it holds promise for practical applications. However, if it is to be deployed in real-world scenarios, further robustness testing is recommended to ensure reliability and safety.

{% hint style="info" %}
The computational cost for inference increases when using guidance tokens compared to not using them. If the guidance model is larger than the response model, the inference computation typically exceeds that of single-model Chain-of-Thought (CoT) approaches.
{% endhint %}

#### Data Structure

The guidance is currently a step-by-step description using Markdown formatting.

<table><thead><tr><th width="205"></th><th>guidance</th></tr></thead><tbody><tr><td>&#x3C;guidance_prompt>Count the number of 'r's in the word 'strawberry,' and then write a Python script that checks if an arbitrary word contains the same number of 'r's.&#x3C;/guidance_prompt></td><td><p></p><p><strong>Step 1:</strong></p><ul><li><strong>Goal:</strong> Count the number of 'r's in the word 'strawberry'.</li><li><p><strong>Key Points to Cover:</strong></p><ul><li>Identify the word 'strawberry' as the target for counting the letter 'r'.</li><li>Use a simple counting method to determine the number of 'r's in the word.</li><li>The word 'strawberry' contains three 'r's: one in 'straw', one in 'berry', and another in 'berry'.</li></ul></li><li><strong>Separate Section in Final Answer:</strong> Yes</li><li><strong>Section Title (if applicable):</strong> Counting 'r's in 'strawberry'</li></ul><hr><p><strong>Step 2:</strong></p><ul><li><strong>Goal:</strong> Write a Python script that checks if an arbitrary word contains the same number of 'r's as 'strawberry'.</li><li><p><strong>Key Points to Cover:</strong></p><ul><li>Explain the need for a Python script to automate the process of checking for the number of 'r's in a word.</li><li>Introduce the concept of string manipulation in Python, specifically using the <code>count()</code> method to count occurrences of a character.</li><li><p>Provide a step-by-step guide to writing the script:</p><ol><li>Define a function that takes a word as an argument.</li><li>Use the <code>count()</code> method to count the number of 'r's in the input word.</li><li>Compare the count to the number of 'r's in 'strawberry' (which is 3).</li><li>Return a boolean value indicating whether the word has the same number of 'r's as 'strawberry'.</li></ol></li><li>Include a sample usage of the function with a test word.</li></ul></li><li><strong>Separate Section in Final Answer:</strong> Yes</li><li><strong>Section Title (if applicable):</strong> Python Script to Check for 'r's</li></ul><hr><p><strong>Step 3:</strong></p><ul><li><strong>Goal:</strong> Provide the complete Python script and demonstrate its usage.</li><li><p><strong>Key Points to Cover:</strong></p><ul><li>Present the final version of the Python script, ensuring it is clear and well-commented.</li><li>Include a sample test case where the script is used to check a word.</li><li>Ensure the script is ready to be copied and run by the user.</li></ul></li><li><strong>Separate Section in Final Answer:</strong> Yes</li><li><strong>Section Title (if applicable):</strong> Complete Python Script and Usage Example</li></ul></td></tr></tbody></table>

The complete training dataset consists of several thousand similar samples. To utilize fusion-guide inference, the prompt must be structured in the following format:

{% hint style="info" %}

```markup
<guidance_prompt>Count the number of 'r's in the word 'strawberry,' and then write a Python script that checks if an arbitrary word contains the same number of 'r's.</guidance_prompt>
```

{% endhint %}

It is required to put the prompt into the `<guidance_prompt>{PROMPT}</guidance_promt>` tags

### Example

Let's break down how the guidance injection works. First, we'll use Llama 3.1 8bn via the AWS Bedrock API with the following inference parameters:

```python
import boto3

brt = boto3.client("bedrock-runtime")
conversation = [...]
response = brt.converse(
    modelId="meta.llama3-1-8b-instruct-v1:0",
    messages=conversation,
    inferenceConfig={"maxTokens": 2048, "temperature": 0, "topP": 0.95},
)
```

Now lets call the API and check the response

Prompt:

{% hint style="info" %}
Count the number of 'r's in the word 'strawberry,' and then write a Python script that checks if an arbitrary word contains the same number of 'r's.
{% endhint %}

Response without guidance:

{% hint style="info" %}

```python
def count_r(word):
    return word.count('r')

word = 'strawberry'
print(f'The word "{word}" contains {count_r(word)} "r"s.')

# Test the function with an arbitrary word
test_word = input('Enter a word: ')

if count_r(test_word) == count_r('strawberry'):
    print(f'"{test_word}" contains the same number of "r"s as "strawberry".')
else:
    print(f'"{test_word}" does not contain the same number of "r"s as "strawberry".')
```

{% endhint %}

The response largely addresses the prompt. There is no extraneous text, and the number of 'r's in "strawberry" isn't mentioned. However, the code is correct and functions as expected.

Now, let's compare this result to the guidance-injected prompt.

Guidance injected prompt:

{% hint style="info" %}
Count the number of 'r's in the word 'strawberry,' and then write a Python script that checks if an arbitrary word contains the same number of 'r's.\
\
Using the guidance below, focus on the primary objective: to create a perfect and readable response to the given task. Ensure that each section is tailored specifically to the task at hand, not to the guidance provided.

The following instructions are solely intended to help achieve the main goal. Think about each step to make sure the key points are covered and the goal is achieved:

**Step 1:**

* **Goal:** Count the number of 'r's in the word 'strawberry'.
* **Key Points to Cover:**
  * Identify the word 'strawberry' as the target for counting the letter 'r'.
  * Use a simple counting method to determine the number of 'r's in the word.
  * The word 'strawberry' contains three 'r's: one in 'straw', one in 'berry', and another in 'berry'.
* **Separate Section in Final Answer:** Yes
* **Section Title (if applicable):** Counting 'r's in 'strawberry'

***

**Step 2:**

* **Goal:** Write a Python script that checks if an arbitrary word contains the same number of 'r's as 'strawberry'.
* **Key Points to Cover:**
  * Explain the need for a Python script to automate the process of checking for the number of 'r's in a word.
  * Introduce the concept of string manipulation in Python, specifically using the `count()` method to count occurrences of a character.
  * Provide a step-by-step guide to writing the script:
    1. Define a function that takes a word as an argument.
    2. Use the `count()` method to count the number of 'r's in the input word.
    3. Compare the count to the number of 'r's in 'strawberry' (which is 3).
    4. Return a boolean value indicating whether the word has the same number of 'r's as 'strawberry'.
  * Include a sample usage of the function with a test word.
* **Separate Section in Final Answer:** Yes
* **Section Title (if applicable):** Python Script to Check for 'r's

***

**Step 3:**

* **Goal:** Provide the complete Python script and demonstrate its usage.
* **Key Points to Cover:**
  * Present the final version of the Python script, ensuring it is clear and well-commented.
  * Include a sample test case where the script is used to check a word.
  * Ensure the script is ready to be copied and run by the user.
* **Separate Section in Final Answer:** Yes
* **Section Title (if applicable):** Complete Python Script and Usage Example
  {% endhint %}

Response with guidance:

{% hint style="info" %}

**Counting 'r's in 'strawberry'**

The word 'strawberry' contains three 'r's: one in 'straw', one in 'berry', and another in 'berry'.

**Python Script to Check for 'r's**

To automate the process of checking for the number of 'r's in a word, we can use a Python script. We will use the `count()` method to count the occurrences of the character 'r' in the input word.

Here is a step-by-step guide to writing the script:

1. Define a function that takes a word as an argument.
2. Use the `count()` method to count the number of 'r's in the input word.
3. Compare the count to the number of 'r's in 'strawberry' (which is 3).
4. Return a boolean value indicating whether the word has the same number of 'r's as 'strawberry'.

Here is the Python script:

```python
def has_three_rs(word):
    # Count the number of 'r's in the input word
    r_count = word.count('r')
    
    # Compare the count to the number of 'r's in 'strawberry'
    return r_count == 3

# Test the function with a sample word
print(has_three_rs('strawberry'))  # Should return True
print(has_three_rs('grape'))  # Should return False
```

**Complete Python Script and Usage Example**

Here is the complete Python script, including a sample test case:

```python
def has_three_rs(word):
    # Count the number of 'r's in the input word
    r_count = word.count('r')
    
    # Compare the count to the number of 'r's in 'strawberry'
    return r_count == 3

# Test the function with a sample word
print(has_three_rs('strawberry'))  # Should return True
print(has_three_rs('grape'))  # Should return False
```

You can copy and run this script to test it with different words.
{% endhint %}

The response is far more detailed than the one without guidance. It not only answers the prompt but also correctly identifies that "strawberry" contains three 'r's and pinpoints their locations. However, this comparison is essentially between an 8 billion parameter model and an 8+12 billion parameter model, which are not directly comparable in terms of computational requirements. Therefore, for a more accurate evaluation, we will test it against the 70 billion and even the 405 billion parameter models.

### Evaluation

For the evaluation, we utilized LLama 3.1 through AWS Bedrock, testing three model sizes: 8 billion, 70 billion, and 405 billion parameters. The experimental setup involved comparing the 8 billion parameter model enhanced with a fusion-guide (effectively bringing it to 20 billion parameters) against the 70 billion and 405 billion parameter models. Directly comparing the raw 8 billion model would be less meaningful, as the guidance model is already more robust than the 8 billion parameter LLama model. Thus, we chose to only compare it to models with a higher parameter count.

Evaluating textual responses, particularly those involving creativity and readability, presents a challenge due to their inherently subjective nature. While blind-tested human expert evaluation might be the ideal method to assess the quality of responses, for this experiment, we used a stronger model as the judge: GPT-4o. This approach is similar to the one used in WildBench. However, we acknowledge that using an evaluation model has its limitations. For instance, we observed that the order in which samples were provided affected the ranking. To mitigate this bias, we randomized the order of the samples and conducted the test six times.

#### Limitation&#x20;

The test dataset consists of only 20 samples, which is relatively small and should be considered a "preview" rather than a comprehensive evaluation. Additionally, the prompts used are fairly short, single-query questions. We also experimented with longer prompts, including some with code samples, and observed promising results. However, it is important to note that the current fusion-guide model is not specifically trained for long, multi-turn conversations.

Another limitation is the accuracy of the judging model. On occasion, it produced incorrect rankings of the model responses, such as misjudging simple facts—e.g., the classic mistake of stating that "strawberry" has two 'r's instead of three.

#### Results

In all but one run, the guided 8 billion parameter model consistently outperformed both the 70 billion and 405 billion parameter models in terms of response quality. Interestingly, the 405 billion model often underperformed compared to the 70 billion model, a result that contradicts our expectations. The reasons for this discrepancy remain unclear and warrant further investigation. It could be due to the nature of the prompts used or possibly because AWS Bedrock employs a quantized version of the 405 billion model. However, we have not yet been able to verify this.

<figure><img src="/files/1QNlkBeNYiXrbCxLk3pA" alt=""><figcaption><p>Evaluation Run 1</p></figcaption></figure>

<div><figure><img src="/files/QFGBd7opefOLh7oV0Qbf" alt=""><figcaption><p>Evaluation Run 2</p></figcaption></figure> <figure><img src="/files/smhc3HI5LQzZXyzUF7Xo" alt=""><figcaption><p>Evaluation Run 3</p></figcaption></figure> <figure><img src="/files/ahEibiRYCnJHgUp9T9Cf" alt=""><figcaption><p>Evaluation Run 4</p></figcaption></figure> <figure><img src="/files/CtQIplm4XgV9T7KIP0Kg" alt=""><figcaption><p>Evaluation Run 5</p></figcaption></figure> <figure><img src="/files/XoWLWkpXQ23vCwuvGHIp" alt=""><figcaption><p>Evaluation Run 6</p></figcaption></figure></div>

The full evaluation scripts are available on Github:

<https://github.com/FusionbaseHQ/fusion-guide-evaluation-script>

### Future Outlook

Currently, fusion-guide supports a context length of up to 8k tokens but struggles with long-context and long-prompt scenarios. Additionally, the guidance is currently injected directly into the user's prompt. While this approach is not a major issue, we believe it could be improved by handling guidance similarly to function calling—using separate tags and tokens to distinguish the guidance clearly. The format of the guidance also requires enhancement; at present, it is provided in markdown format, chosen for its compatibility with most LLMs.

Another limitation is the lack of support for system prompts, and the model has not been trained or tested for multi-turn conversations. Currently, the guidance is simply appended to the prompt, but with some architectural changes, this process could be made more seamless and happen "under the hood," abstracted from the user.

We are open to additional ideas, comments, and criticisms. If you have thoughts on how to improve this further, feel free to reach out—we welcome the discussion!

<patrick@fusionbase.com>


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://blog.fusionbase.com/ai-research/beyond-cot-how-fusion-guide-elevates-ai-reasoning-with-a-two-model-system.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
