In our previous article, we took a peek at the latest interview with the founder of DeepSeek.
DeepSeek R1 is a powerful and versatile open-source LLM that challenges established players like OpenAI with its advanced reasoning capabilities, cost-effectiveness, and open availability. While it has some limitations, its innovative approach and strong performance make it a valuable tool for developers, researchers, and businesses alike. For those interested in exploring its capabilities, the model and its distilled versions are readily accessible on platforms like Hugging Face and GitHub.
Trained by a Chinese team under GPU sanctions, it is surprisingly good at math, coding, and even some fairly complex reasoning. What I find most interesting is that the version we run below is a “distilled” model, meaning it is smaller and more efficient than the giant model it was distilled from. This is huge because it makes the model much more practical to actually use and build upon.
In this article, we will cover:
- How to run open source DeepSeek models on your own device
- How to create an OpenAI-compatible API service with the newest DeepSeek models
We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.
Run the DeepSeek-R1-Distill-Llama-8B model on your own device
Step 1: Install WasmEdge via the following command line.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.14.1
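Once the installer finishes, a quick sanity check never hurts. Assuming the default install location, source the environment file the installer creates and print the version:

source $HOME/.wasmedge/env
wasmedge --version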
Step 2: Download the quantized DeepSeek-R1-Distill-Llama-8B-GGUF model file. It may take a while, since the model is 5.73 GB.
curl -LO https://huggingface.co/second-state/DeepSeek-R1-Distill-Llama-8B-GGUF/resolve/main/DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf
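When the download completes, you can confirm the file arrived intact; it should show a size of roughly 5.7 GB:

ls -lh DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf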
Step 3: Download the LlamaEdge API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Step 4: Download the chatbot UI for interacting with the DeepSeek-R1-Distill-Llama-8B model in the browser.
curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
Next, use the following command line to start a LlamaEdge API server for the model.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf \
llama-api-server.wasm \
--prompt-template llama-3-chat \
--ctx-size 8192
Then, open your browser to http://localhost:8080 to start the chat!
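If the browser UI does not come up, you can first check that the server is alive. Since the server speaks the OpenAI API, its models endpoint should list the loaded model (a quick sanity check, assuming the default port 8080):

curl http://localhost:8080/v1/models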
Or you can send an API request to the model.
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the capital of France?"}], "model": "DeepSeek-R1-Distill-Llama-8B"}'
{"id":"chatcmpl-68158f69-8577-4da2-a24b-ae8614f88fea","object":"chat.completion","created":1737533170,"model":"default","choices":[{"index":0,"message":{"content":"The capital of France is Paris.\n</think>\n\nThe capital of France is Paris.<|end▁of▁sentence|>","role":"assistant"},"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":34,"completion_tokens":18,"total_tokens":52}}
Create an OpenAI-compatible API service for DeepSeek-R1-Distill-Llama-8B
LlamaEdge is lightweight and does not require a daemon or sudo to run. It can be easily embedded into your own apps! With support for both chat and embedding models, LlamaEdge can be an OpenAI API replacement right inside your app on the local computer!
Next, we will show you how to start a full API server for the DeepSeek-R1 model along with an embedding model. The API server will have both chat/completions and embeddings endpoints. In addition to the steps in the previous section, we will also need to:
Step 5: Download an embedding model.
curl -LO https://huggingface.co/second-state/Nomic-embed-text-v1.5-Embedding-GGUF/resolve/main/nomic-embed-text-v1.5.f16.gguf
Then, we can use the following command line to start the LlamaEdge API server with both chat and embedding models. For a more detailed explanation, check out the doc on starting a LlamaEdge API service.
wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:DeepSeek-R1-Distill-Llama-8B-Q5_K_M.gguf \
--nn-preload embedding:GGML:AUTO:nomic-embed-text-v1.5.f16.gguf \
llama-api-server.wasm -p llama-3-chat,embedding \
--model-name DeepSeek-R1-Distill-Llama-8B,nomic-embed-text-v1.5.f16 \
--ctx-size 8192,8192 \
--batch-size 128,8192 \
--log-prompts --log-stat
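With the server up, the embeddings endpoint can be exercised the same way as the chat endpoint. The request below follows the standard OpenAI embeddings API shape, using the embedding model name we passed via --model-name:

curl -X POST http://localhost:8080/v1/embeddings \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text-v1.5.f16", "input": ["Paris is the capital of France."]}'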
Finally, you can follow these tutorials to integrate the LlamaEdge API server as a drop-in replacement for OpenAI in other agent frameworks. Specifically, use the following values in your app or agent configuration to replace the OpenAI API.
| Config option | Value |
|---|---|
| Base API URL | http://localhost:8080/v1 |
| Model Name (for LLM) | DeepSeek-R1-Distill-Llama-8B |
| Model Name (for Text embedding) | nomic-embed-text-v1.5.f16 |
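Many apps built on the official OpenAI SDKs read the base URL and API key from environment variables, so in those cases the switch can be as simple as exporting the values below before launching the app (the API key is not checked by the local LlamaEdge server, so any placeholder should do):

export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=sk-no-key-required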
That’s it! Access the LlamaEdge repo and build your first agent today! If you have fun building and exploring, be sure to star the repo.
Join the WasmEdge Discord to ask questions and share insights. Any questions about getting this model running? Please go to second-state/LlamaEdge to raise an issue, or book a demo with us to enjoy your own LLMs across devices!