The mistral-7b-instruct-v0.1 model is a 7B instruction-tuned LLM released by Mistral AI. It is a true open-source model licensed under Apache 2.0. It has a context length of 8,000 tokens and performs on par with 13B Llama 2 models. It is well suited to generating prose, summarizing documents, and writing code.
In this article, we will cover:
- How to run mistral-7b-instruct-v0.1 on your own device
- How to create an OpenAI-compatible API service for mistral-7b-instruct-v0.1
We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.
Run the model on your own device
Step 1: Install WasmEdge via the following command line.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
Step 2: Download the model GGUF file. It may take a long time, since the model is several GB in size.
curl -LO https://huggingface.co/second-state/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf
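Because the file is several GB, an interrupted transfer or an HTML error page saved under the model's filename is easy to miss. The following is a minimal sanity check, assuming a POSIX-style shell where `head -c` is available. GGUF files begin with the four ASCII bytes `GGUF`, so the sketch only inspects the start of the file. The function name `is_gguf` is our own and is not part of WasmEdge.

```shell
# Return success if the file starts with the GGUF magic bytes.
# Note: this only checks the header; a truncated download still passes.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example usage (not executed here):
#   is_gguf mistral-7b-instruct-v0.1.Q5_K_M.gguf && echo "looks like a GGUF file"
```

If the check fails, the simplest remedy is usually to delete the file and download it again.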
Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
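When these steps are repeated, for example while scripting a setup, it helps to skip files that are already present so the multi-GB model is not fetched twice. A small sketch under that assumption follows; `fetch_if_missing` is a hypothetical helper name, and it only checks that the target file exists and is non-empty, not that it is complete.

```shell
# Download a URL only if the target file is missing or empty.
# $1 = URL, $2 = optional output filename (defaults to the last URL segment).
fetch_if_missing() {
  url=$1
  file=${2:-${url##*/}}
  if [ -s "$file" ]; then
    echo "skip: $file already exists"
    return 0
  fi
  # -L follows redirects; --fail makes HTTP errors return a non-zero status.
  curl -L --fail -o "$file" "$url"
}

# Example usage (not executed here):
#   fetch_if_missing https://huggingface.co/second-state/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf
#   fetch_if_missing https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
```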
That's it. You can chat with the model in the terminal by entering the following command.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:mistral-7b-instruct-v0.1.Q5_K_M.gguf llama-chat.wasm -p mistral-instruct-v0.1
The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.
On my M1 Mac with 32 GB of memory, it clocks in at about 20.71 tokens per second.
[USER]:
What is the capital of France?
[ASSISTANT]:
The capital of France is Paris.
[USER]:
Create an OpenAI-compatible API service
An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain, and LlamaIndex.
Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Then, use the following command line to start an API server for the model.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:mistral-7b-instruct-v0.1.Q5_K_M.gguf llama-api-server.wasm -p mistral-instruct
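Loading a model of this size can take a moment, so a script that starts the server and sends a request right away may hit a connection error. A minimal sketch of a readiness check follows, assuming the server listens on port 8080 as in the request example in this article. `wait_for_server` is a hypothetical helper, and it treats any HTTP response from the address as "ready", which is an assumption rather than a documented health check.

```shell
# Poll a URL until the server answers or the attempts run out.
# $1 = URL to probe, $2 = optional number of attempts (default 30, about 1s apart).
wait_for_server() {
  url=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Without --fail, curl exits 0 on any HTTP response, including 404.
    if curl --silent --output /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example usage (not executed here):
#   wait_for_server http://0.0.0.0:8080/ && echo "API server is up"
```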
From another terminal, you can interact with the API server using curl.
curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"What is the capital of France?"}], "model":"mistral-7b-instruct-v0.1"}'
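For repeated questions, the request above can be wrapped in a couple of shell functions. This is a sketch, not part of LlamaEdge: `json_escape`, `build_chat_payload`, and `ask` are hypothetical helper names, and the escaping handles only backslashes and double quotes, so prompts containing newlines or control characters would need a real JSON tool.

```shell
# Escape backslashes and double quotes so a string can sit inside JSON quotes.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the chat request body.
# $1 = system prompt, $2 = user message, $3 = model name.
build_chat_payload() {
  printf '{"messages":[{"role":"system", "content":"%s"}, {"role":"user", "content":"%s"}], "model":"%s"}' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

# Send one user message to the local API server and print the raw JSON reply.
ask() {
  build_chat_payload "You are a helpful AI assistant" "$1" "mistral-7b-instruct-v0.1" |
    curl -s -X POST http://0.0.0.0:8080/v1/chat/completions \
      -H 'accept:application/json' -H 'Content-Type: application/json' -d @-
}

# Example usage (not executed here):
#   ask "What is the capital of France?"
```

In the OpenAI chat completions format, the reply text is found under `choices[0].message.content`. Assuming `jq` is installed and the server follows that format, `ask "What is the capital of France?" | jq -r '.choices[0].message.content'` should print just the answer.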
That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!
Join the WasmEdge discord to ask questions or share insights.
No time to DIY? Book a Demo with us to enjoy your own LLMs across devices!