Meta has just released the next generation of its open-source LLM, Meta Llama 3. Meta presents it as state of the art among openly available models of its size, with performance that rivals some of the most capable closed-source LLMs. Currently, the Llama 3 8B and 70B models are available, and a massive 400B model is expected in the next several months. The Llama 3 models were trained on a significantly larger dataset than their predecessor, Llama 2, resulting in improved capabilities such as reasoning and code generation. Learn more about the Meta Llama 3 release here.
In this article, taking Llama-3-8B as an example, we will cover:
- How to run Llama-3-8B on your own device
- How to create an OpenAI-compatible API service for Llama-3-8B
We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.
To get started quickly, you can use the following command to run Llama-3-8B on your device. It downloads the required software, including the LLM runtime, the model, and the LLM inference app.
bash <(curl -sSfL 'https://raw.githubusercontent.com/LlamaEdge/LlamaEdge/main/run-llm.sh') --model llama-3-8b-instruct
Run Llama-3-8B on your own device
Step 1: Install WasmEdge via the following command line.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml
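Once the installer finishes, it is worth confirming that the `wasmedge` binary is reachable before downloading several gigabytes of model weights. The snippet below is a minimal sketch; `require_cmd` is a hypothetical helper name, not part of WasmEdge.

```shell
# Hypothetical helper (not part of WasmEdge): succeed only if a command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "not found on PATH: $1" >&2
    return 1
  fi
}
```

For example, `require_cmd wasmedge && wasmedge --version` prints the installed version only when the binary is found. If it is not, opening a new terminal so that the installer's PATH changes take effect is a reasonable first thing to try.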
Step 2: Download the Llama-3-8B model GGUF file. Since the size of the model is 5.73 GB, it could take a while to download.
curl -LO https://huggingface.co/second-state/Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q5_K_M.gguf
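Because the file is large, an interrupted download can leave a truncated or very small file behind. A rough size check catches that early. This is a sketch only: `check_model_file` is a hypothetical helper, and the threshold you pass is an arbitrary lower bound that you choose, not an official checksum or size.

```shell
# Hypothetical helper: succeed only if the file exists and is at least a given size.
check_model_file() {
  # $1 = path to the downloaded file, $2 = minimum acceptable size in bytes
  if [ ! -f "$1" ]; then
    echo "missing: $1" >&2
    return 1
  fi
  # wc -c counts bytes; tr strips the padding some platforms add.
  size_bytes=$(wc -c < "$1" | tr -d ' ')
  if [ "$size_bytes" -lt "$2" ]; then
    echo "too small: $1 ($size_bytes bytes)" >&2
    return 1
  fi
  echo "ok: $1 ($size_bytes bytes)"
}
```

With the 5.73 GB size quoted above, something like `check_model_file Meta-Llama-3-8B-Instruct-Q5_K_M.gguf 5000000000` would flag an obviously incomplete file. It does not prove the file is intact.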
Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
That's it. You can chat with the model in the terminal by entering the following command.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3-8B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat
From wasmedge (@realwasmedge), April 19, 2024: "Llama3 running on LlamaEdge 🚀 Run it on your own computer or edge device on any GPU, CPU or accelerator. Extend it with your own code for prompt management, RAG and function calling. https://t.co/3lz2A9kpld"
The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.
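The chat command used in this section has only two parts that change from model to model: the GGUF file name and the prompt template. The sketch below makes that explicit with a hypothetical `chat_command` helper that prints the command instead of running it. The field labels in the comments reflect our reading of the `--nn-preload` option and should be checked against the WasmEdge documentation.

```shell
# Hypothetical helper: print (not run) the chat command for a given GGUF file.
chat_command() {
  # $1 = path to the GGUF model file
  # $2 = prompt template passed to -p (llama-3-chat in this article)
  # --nn-preload takes four colon-separated fields (our reading, an assumption):
  #   alias (default) : backend (GGML) : target device (AUTO) : model file
  printf 'wasmedge --dir .:. --nn-preload default:GGML:AUTO:%s llama-chat.wasm -p %s\n' \
    "$1" "$2"
}
```

Calling `chat_command Meta-Llama-3-8B-Instruct-Q5_K_M.gguf llama-3-chat` reproduces the command shown earlier, which makes it easy to review the exact invocation before running it with a different model file.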
Create an OpenAI-compatible API service for Llama-3-8B
An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.
Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Then, download the chatbot web UI so that you can interact with the model in your browser.
curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
Next, use the following command lines to start an API server for the model. Then, open your browser to http://localhost:8080 to start the chat!
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3-8B-Instruct-Q5_K_M.gguf \
llama-api-server.wasm \
--prompt-template llama-3-chat \
--ctx-size 4096 \
--model-name Llama-3-8B
From another terminal, you can interact with the API server using curl.
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user", "content": "write a hello world in Rust"}], "model":"Llama-3-8B"}'
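A hand-written JSON body like the one in this request breaks as soon as the prompt contains a double quote or a backslash. The sketch below builds the same body from shell variables. `json_escape` and `chat_payload` are hypothetical helper names, and the escaping covers only backslashes and double quotes, so prompts with newlines or control characters need a proper JSON tool instead.

```shell
# Hypothetical helper: escape backslashes first, then double quotes.
# Only suitable for single-line text without control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build a chat completion request body.
# $1 = user prompt, $2 = model name (the value given to --model-name)
chat_payload() {
  printf '{"messages":[{"role":"user", "content": "%s"}], "model":"%s"}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}
```

The result can be passed to the same endpoint with `-d "$(chat_payload 'write a hello world in Rust' Llama-3-8B)"` in place of the literal JSON string.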
That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!
Talk to us!
Join the WasmEdge Discord to ask questions and share insights.
Any questions getting this model running? Please go to second-state/LlamaEdge to raise an issue or book a demo with us to enjoy your own LLMs across devices!