Getting Started with QwQ-32B

Mar 17, 2025 • 3 minutes to read

Qwen/QwQ-32B is the latest release in the Qwen series. It is the medium-sized reasoning model of the family, designed to excel at complex tasks through deep thinking and advanced problem-solving abilities.

Unlike traditional instruction-tuned models, QwQ combines extensive pretraining with a reinforcement learning stage during post-training. With 32.5 billion total parameters, it delivers significantly enhanced performance, especially on challenging problems.

In this article, we will cover how to run and interact with QwQ-32B-GGUF on your own edge device.

We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.

Run QwQ-32B-GGUF

Step 1: Install WasmEdge via the following command line.

Make sure to install the latest version:

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.14.1
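
Once the installer finishes, load the WasmEdge environment into your current shell and confirm the version. The env file path below is the installer's default location; the installer prints the exact command to run:

source $HOME/.wasmedge/env
wasmedge --version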

Step 2: Download the QwQ-32B-GGUF Model File

Download the quantized QwQ-32B-GGUF model file. The Q5_K_M quantization used here is 23.3 GB, so the download will take a while. If you want to run a different quantized model, change the download link below accordingly.

curl -LO https://huggingface.co/second-state/QwQ-32B-GGUF/resolve/main/QwQ-32B-Q5_K_M.gguf
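
If you prefer a smaller download at some cost in output quality, the same Hugging Face repository also hosts other quantizations. For example, the Q4_K_M file name below assumes the repository's usual naming scheme:

curl -LO https://huggingface.co/second-state/QwQ-32B-GGUF/resolve/main/QwQ-32B-Q4_K_M.gguf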

Step 3: Download the LlamaEdge API Server

Download the LlamaEdge API server Wasm app, which is cross-platform and runs on many CPU and GPU devices:

curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm

Step 4: Download the Chatbot UI

Download and unpack the chatbot UI, which lets you interact with the QwQ-32B-GGUF model in the browser:

curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Step 5: Start the LlamaEdge API Server

Use the following command line to start the LlamaEdge API server for QwQ-32B-GGUF. LlamaEdge provides an OpenAI compatible API, and you can connect any chatbot client or agent to it!

wasmedge --dir .:. --nn-preload default:GGML:AUTO:QwQ-32B-Q5_K_M.gguf \
  llama-api-server.wasm \
  --model-name QwQ-32B \
  --prompt-template chatml \
  --ctx-size 128000
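
Once the server reports that it is listening on port 8080, you can confirm it is up via the OpenAI-style models endpoint, which lists the loaded model:

curl http://localhost:8080/v1/models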

Chat with QwQ-32B-GGUF

Once the server is running, open your browser and visit http://localhost:8080 to chat with the QwQ-32B-GGUF model!

Use the API

The LlamaEdge API server is fully compatible with OpenAI API specifications. You can send API requests to the model. For example:

curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Can a person be at the North Pole and the South Pole at the same time?"}
  ],
  "model": "QwQ-32B-Q5_K_M"
}'

The API will return a response in a structure similar to the OpenAI chat completion API.
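
The server can also stream tokens back as they are generated, which is handy for a long-thinking model like QwQ. Here is a minimal sketch; the stream flag follows the standard OpenAI chat completion spec:

curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "messages": [
    {"role": "user", "content": "Why is the sky blue?"}
  ],
  "model": "QwQ-32B",
  "stream": true
}'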

RAG and Embeddings

If you plan to build agentic or Retrieval-Augmented Generation (RAG) applications with QwQ-32B-GGUF, you might need an API to compute vector embeddings for user input. This can be done by adding an embedding model to your LlamaEdge API server. Learn how to do this.
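
As a sketch, you can preload a chat model and an embedding model side by side and serve both from one LlamaEdge API server; the server then also answers /v1/embeddings requests. The embedding model file and the comma-separated names below are assumptions, so substitute the embedding model you actually use:

wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:QwQ-32B-Q5_K_M.gguf \
  --nn-preload embedding:GGML:AUTO:nomic-embed-text-v1.5.f16.gguf \
  llama-api-server.wasm \
  --model-name QwQ-32B,nomic-embed-text-v1.5 \
  --prompt-template chatml,embedding \
  --ctx-size 128000,8192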

Gaia Integration

Alternatively, the Gaia network software lets you deploy the LLM, an embedding model, and a vector knowledge base in a single command. Try it with QwQ-32B-GGUF.
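
A rough sketch of that flow is below. The installer URL is Gaia's published one, but the node config URL for QwQ-32B is left as a placeholder; check the Gaia docs for the exact config to pass:

curl -sSfL 'https://github.com/GaiaNet-AI/gaianet-node/releases/latest/download/install.sh' | bash
gaianet init --config <url-of-a-QwQ-32B-node-config>
gaianet start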

For further assistance or to share insights, join the WasmEdge Discord or visit the LlamaEdge GitHub repository to raise issues.

Happy deploying and exploring the power of QwQ-32B-GGUF!
