Zephyr-7B is a fine-tuned Mistral-7B-v0.1 language model released by the Hugging Face team. Removing the in-built alignment of the training datasets boosted its performance on MT Bench.
In this article, we will cover:
- How to run Zephyr-7B on your own device
- How to create an OpenAI-compatible API service for Zephyr-7B
We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.
Run the model on your own device
Step 1: Install WasmEdge via the following command line.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
Step 2: Download the model GGUF file. It may take a long time, since the size of the model is several GBs.
curl -LO https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q3_K_M.gguf
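Note that Hugging Face `blob/` URLs return an HTML page rather than the file itself (you need `resolve/` for the raw file), so a quick sanity check on the download is worthwhile. A minimal sketch, assuming the file name above; `is_gguf` is a hypothetical helper, not part of LlamaEdge:

```shell
# GGUF model files begin with the 4-byte ASCII magic "GGUF".
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example usage after the download finishes:
if is_gguf zephyr-7b-alpha.Q3_K_M.gguf; then
  echo "GGUF magic found"
fi
```

If the check fails, you most likely downloaded an HTML error page; re-check the URL.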
Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
That's it. You can chat with the model in the terminal by entering the following command.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:zephyr-7b-alpha.Q3_K_M.gguf llama-chat.wasm -p chatml
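The long `--nn-preload` option packs four colon-separated fields. A minimal sketch of how they compose (field meanings per the WasmEdge WASI-NN plugin; the variable names are just for illustration):

```shell
# alias:backend:target:model-file
ALIAS=default    # name the Wasm app uses to look up the model
BACKEND=GGML     # WASI-NN backend (llama.cpp's GGML/GGUF loader)
TARGET=AUTO      # execution target: picks GPU when available, else CPU
MODEL=zephyr-7b-alpha.Q3_K_M.gguf

PRELOAD="$ALIAS:$BACKEND:$TARGET:$MODEL"
echo "$PRELOAD"
```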
The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) you have on the device.
[USER]:
Who are the shareholders of the Federal Reserve?
[ASSISTANT]:
The Federal Reserve is a government-owned central bank, not a private corporation. Its shareholders are the member banks, but they do not have any ownership stake or voting rights in the Fed's decision-making process. The Fed is an independent agency of the U.S. government, and its board of governors is appointed by the President and confirmed by the Senate.
[USER]:
Create an OpenAI-compatible API service
An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain, and LlamaIndex.
Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Then, use the following command line to start an API server for the model.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:zephyr-7b-alpha.Q3_K_M.gguf llama-api-server.wasm -p chatml
From another terminal, you can interact with the API server using curl.
curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"What is the capital of France?"}], "model":"Zephyr-7B"}'
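The response comes back as standard OpenAI chat-completion JSON. As a rough sketch, you can pull the assistant's reply out on the command line; the `extract_content` helper below is hypothetical, assumes the content field contains no escaped quotes, and is no substitute for a real JSON parser such as `jq`:

```shell
# Print the "content" field of an OpenAI-style chat completion response.
extract_content() {
  sed -n 's/.*"content":"\([^"]*\)".*/\1/p'
}

# Example with a canned response:
echo '{"choices":[{"message":{"role":"assistant","content":"Paris is the capital of France."}}]}' | extract_content
# prints: Paris is the capital of France.
```

Pipe the `curl` output from above into `extract_content` to see just the model's answer.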
That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!
Join the WasmEdge discord. Discuss, learn, and share your insights.