Getting Started with Mistral Small

Feb 05, 2025 • 5 minutes to read

Mistral Small 3 is a groundbreaking 24-billion-parameter model designed to deliver high-performance AI with low latency. Released under the Apache 2.0 license, it stands out in the AI landscape for its ability to compete with much larger models such as Llama 3.3 70B and Qwen 2.5 32B, while being more than three times faster on the same hardware. The model is particularly tailored for agentic tasks: those requiring robust language understanding, tool use, and instruction-following capabilities.

With over 81% accuracy on the MMLU benchmark and an impressive throughput of 150 tokens per second, Mistral Small 3 is an excellent open-source alternative to proprietary models such as GPT-4o-mini, especially for developers and organizations seeking a fast and cost-effective solution for generative AI tasks.

In this article, we will cover how to run and interact with Mistral Small 3 on your own edge device. We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.

Run Mistral-Small-24B-Instruct

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.14.1
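If the wasmedge command is not found after the installer finishes, source the environment file it creates (by default at $HOME/.wasmedge/env) and verify the installation.

source $HOME/.wasmedge/env
wasmedge --version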

Step 2: Download the quantized Mistral-Small-24B-Instruct-2501-GGUF model file. It may take a long time, since the model file is 16.3 GB.

curl -LO https://huggingface.co/second-state/Mistral-Small-24B-Instruct-2501-GGUF/resolve/main/Mistral-Small-24B-Instruct-2501-Q5_K_M.gguf

Step 3: Download the LlamaEdge API server app. It is a cross-platform, portable Wasm app that runs on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Step 4: Download the chatbot UI to interact with the Mistral-Small-24B model in the browser.

curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Next, use the following command to start a LlamaEdge API server for the model. LlamaEdge provides an OpenAI-compatible API, so you can connect any chatbot client or agent to it!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Mistral-Small-24B-Instruct-2501-Q5_K_M.gguf \
llama-api-server.wasm \
--prompt-template mistral-small-tool \
--ctx-size 8096 \
--model-name Mistral-Small-24B-Instruct-2501

Since Mistral Small 3 is a 24B-parameter model, your machine should have at least 24 GB of RAM. If you don’t have such a machine, you can experience a live chat here.
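Once the server reports that it is listening on port 8080, you can run a quick smoke test. Since the API is OpenAI compatible, a GET request to the models endpoint should list the model name you registered with --model-name.

curl http://localhost:8080/v1/models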

Chat

Open your browser to http://localhost:8080 to chat with the Mistral Small 3 model!

Use the API

The LlamaEdge API server is fully compatible with the OpenAI API specification. You can send an API request to the model.

curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the capital of France?"}], "model": "Mistral-Small-24B-Instruct-2501"}'

{
  "id":"chatcmpl-c624ce9b-6025-4330-aca9-fd169b57d37d",
  "object":"chat.completion",
  "created":1738748461,
  "model":"Mistral-Small-24B-Instruct-2501-Q5_K_M",
  "choices":[
    {
      "index":0,
      "message":{
        "content":"The capital of France is Paris. Known for iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral, Paris is also a global center for art, fashion, gastronomy, and culture. It is located on the Seine River in northern France.",
        "role":"assistant"
      },
      "finish_reason":"stop",
      "logprobs":null
    }
  ],
  "usage":{
    "prompt_tokens":19,
    "completion_tokens":56,
    "total_tokens":75
  }
}
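The server also understands the OpenAI streaming convention: set "stream": true in the request body, and the reply should come back incrementally as server-sent events instead of a single JSON object.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user", "content": "Write a haiku about Paris."}], "model": "Mistral-Small-24B-Instruct-2501", "stream": true}'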

Tool calls

An important feature of the Mistral Small 3 model is its support for tool calls (or function calls). You can give it a set of external tools or APIs to use, and it will select the tools to call in its response. Here is an example that asks about the current weather in San Francisco.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  --data-binary @tooluse.json

The content of the tooluse.json file is as follows. It defines the tools the model can call to query weather conditions for any city.

{
    "messages": [
        {
            "role": "user",
            "content": "What is the weather like in San Francisco in Celsius?"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": [
                                "celsius",
                                "fahrenheit"
                            ],
                            "description": "The temperature unit to use. Infer this from the users location."
                        }
                    },
                    "required": [
                        "location",
                        "unit"
                    ]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "predict_weather",
                "description": "Predict the weather in 24 hours",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": [
                                "celsius",
                                "fahrenheit"
                            ],
                            "description": "The temperature unit to use. Infer this from the users location."
                        }
                    },
                    "required": [
                        "location",
                        "unit"
                    ]
                }
            }
        }
    ],
    "tool_choice": "auto",
    "stream": false
}

The response shows a tool call to get the weather for San Francisco.

{
  "id":"chatcmpl-486bd2a8-bf9b-427f-af0c-778a2efed64b",
  "object":"chat.completion",
  "created":1738826914,
  "model":"Mistral-Small-24B-Instruct-2501-Q5_K_M",
  "choices":[
    {
      "index":0,
      "message":{
        "content":"[TOOL_CALLS][{\"name\":\"get_current_weather\",\"arguments\":{\"location\":\"San Francisco, CA\",\"unit\":\"celsius\"}}]",
        "tool_calls":[
          {
            "id":"call_abc123",
            "type":"function",
            "function":{
              "name":"get_current_weather",
              "arguments":"{\"location\":\"San Francisco, CA\",\"unit\":\"celsius\"}"
            }
          }
        ],
        "role":"assistant"
      },
      "finish_reason":"tool_calls",
      "logprobs":null
    }
  ],
  "usage":{
    "prompt_tokens":394,
    "completion_tokens":26,
    "total_tokens":420
  }
}

The agent application will execute the tool call to get the current weather in San Francisco, and then send the result back to the LLM. That allows the LLM to answer the question in the next response.
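As an illustration, here is a minimal sketch of that follow-up request, assuming the server accepts OpenAI-style tool messages. The agent appends the assistant's tool call and a tool role message carrying a made-up weather reading, then posts the whole conversation again so the model can answer in plain language.

{
    "messages": [
        {
            "role": "user",
            "content": "What is the weather like in San Francisco in Celsius?"
        },
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [
                {
                    "id": "call_abc123",
                    "type": "function",
                    "function": {
                        "name": "get_current_weather",
                        "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"celsius\"}"
                    }
                }
            ]
        },
        {
            "role": "tool",
            "tool_call_id": "call_abc123",
            "content": "{\"temperature\": 18, \"unit\": \"celsius\", \"condition\": \"partly cloudy\"}"
        }
    ],
    "model": "Mistral-Small-24B-Instruct-2501"
}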

Learn more about how to use tool calls in LLMs.

RAG and embeddings

Finally, if you are using this model to create agentic or RAG applications, you will likely need an API to compute vector embeddings for the user request text. That can be done by adding an embedding model to the LlamaEdge API server. Learn how this is done.
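As a rough sketch of what that looks like once an embedding model is loaded alongside the chat model, you can call the OpenAI-style embeddings endpoint. The model name nomic-embed-text-v1.5 below is only a placeholder for whichever embedding model you add.

curl -X POST http://localhost:8080/v1/embeddings \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text-v1.5", "input": ["What is the capital of France?"]}'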

Gaia network

Alternatively, the Gaia network software allows you to stand up the Mistral LLM, an embedding model, and a vector knowledge base with a single command. Try it with Mistral Small 3!


Join the WasmEdge Discord to share insights. Any questions about getting this model running? Please go to second-state/LlamaEdge to raise an issue, or book a demo with us to enjoy your own LLMs across devices!
