Getting Started with Qwen2-7B-Instruct

Jun 11, 2024 • 4 minutes to read

Meet Qwen2-7B-Instruct, a powerhouse language model from Alibaba! It's part of the next generation of Qwen models, bringing serious smarts to a wide range of tasks. Qwen2-7B-Instruct blows past most open-source models and even holds its own against proprietary ones.

This isn't your average language model either. Qwen2-7B-Instruct can handle massive amounts of information, crunching through text up to 131,072 tokens long. That's like tackling a whole book at once! Whether you're working with complex code, trying to solve a mind-bending math problem, or just need some serious language skills, Qwen2-7B-Instruct is ready to impress.

In this article, we will cover

  • How to run Qwen2-7B-Instruct on your own device
  • How to create an OpenAI-compatible API service for Qwen2-7B-Instruct

We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.

Run Qwen2-7B-Instruct on your own device

Step 1: Install WasmEdge via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s
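
The installer places WasmEdge under $HOME/.wasmedge by default and writes an env file there. To make the wasmedge command available in your current shell and confirm the install succeeded:

source $HOME/.wasmedge/env
wasmedge --version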

Step 2: Download the Qwen2-7B-Instruct model GGUF file. Since the model file is 5.44 GB, it could take a while to download.

curl -LO https://huggingface.co/second-state/Qwen2-7B-Instruct-GGUF/resolve/main/Qwen2-7B-Instruct-Q5_K_M.gguf
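
Once the download finishes, it is worth confirming the file was not truncated; it should be roughly 5.44 GB:

ls -lh Qwen2-7B-Instruct-Q5_K_M.gguf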

Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is available in the LlamaEdge repository.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

That's it. You can chat with the model in the terminal by entering the following command.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Qwen2-7B-Instruct-Q5_K_M.gguf \
  llama-chat.wasm \
  --prompt-template chatml \
  --ctx-size 4096
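
Here, --prompt-template chatml tells the app to format the conversation the way Qwen2 models expect, and --ctx-size sets the context window. Since the model supports contexts up to 131,072 tokens, you can raise that limit if your machine has enough memory. A sketch (larger contexts need considerably more RAM or VRAM):

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Qwen2-7B-Instruct-Q5_K_M.gguf \
  llama-chat.wasm \
  --prompt-template chatml \
  --ctx-size 32768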

The portable Wasm app automatically takes advantage of any hardware accelerators (e.g., GPUs) on your device. Here is a trick question I asked it.

[You]:
There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room? Explain your reasoning step by step.

[Bot]:
Initially, there are three killers in the room. When someone enters and kills one of them, two scenarios can occur based on the interpretation:

1. **If "someone" refers to an external entity not part of the initial count:**
   - Initially: 3 killers.
   - After: An external entity kills 1, leaving 2 killers + 1 new entity = 3 entities in total, but only 2 are killers.

2. **If "someone" is understood as one of the original three killers acting among themselves:**
   - Initially: 3 killers.
   - After: One of the original killers kills another, leaving 2 killers.

Given no explicit clarification on whether "someone" includes an external entity or refers to one of the original killers, both interpretations are valid. However, in common understanding without further context, it's typically assumed that "someone" does not include an external entity but rather one of the existing individuals. Thus, under this assumption:

**Answer:** There are 2 killers left in the room.

The Qwen2-7B-Instruct model has great logical reasoning capability.

Create an OpenAI-compatible API service for Qwen2-7B-Instruct

An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.

Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Then, download the chatbot web UI so you can interact with the model from your browser.

curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
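
The archive should unpack into a chatbot-ui directory in the current folder, which is the web front end the API server looks for by default. A quick check:

ls chatbot-ui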

Next, use the following command to start an API server for the model. Once it is running, open your browser to http://localhost:8080 to start the chat!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Qwen2-7B-Instruct-Q5_K_M.gguf \
  llama-api-server.wasm \
  --model-name Qwen2-7B-Instruct \
  --prompt-template chatml \
  --ctx-size 4096
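
Before connecting any tools, you can check that the server is up. Like other OpenAI-compatible servers, it should answer the standard model-listing endpoint (assuming the default port 8080):

curl http://localhost:8080/v1/models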

From another terminal, you can interact with the API server using curl.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user", "content": "write a hello world in Rust"}], "model":"Qwen2-7B-Instruct"}'

That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!

Talk to us!

Join the WasmEdge discord to ask questions and share insights.

Any questions getting this model running? Please go to second-state/LlamaEdge to raise an issue or book a demo with us to enjoy your own LLMs across devices!
