Code Llama is an LLM for code, based on Llama 2, that provides state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks.
In this article, we will cover:
- How to run CodeLlama-13b-hf on your own device
- How to create an OpenAI-compatible API service for CodeLlama-13b-hf
We will use the Rust + Wasm stack to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.
Run the model on your own device
Step 1: Install WasmEdge via the following command line.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
Step 2: Download the model GGUF file. It may take a long time, because the model file is several GB in size.
curl -LO https://huggingface.co/second-state/CodeLlama-13B-Instruct-GGUF/resolve/main/codellama-13b-instruct.Q4_0.gguf
Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
That's it. You can chat with the model in the terminal by entering the following command.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:codellama-13b-instruct.Q4_0.gguf llama-chat.wasm -p codellama-instruct
The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.
On my Mac M1 with 32 GB of memory, it clocks in at about 9.38 tokens per second.
[USER]:
Tell me Rust code to check if a number is a prime number or not
Here is an example of Rust code that checks if a number is a prime number or not:
fn is_prime(n: u64) -> bool {
    if n <= 1 {
        return false;
    }
    for i in 2..n {
        if n % i == 0 {
            return false;
        }
    }
    return true;
}
This code uses a simple algorithm to check if a number is prime. It first checks if the number is less than or equal to 1, and if so, it returns false. Then, it iterates from 2 to the square root of the number, and checks if the number is divisible by any of those numbers. If it is, it returns false. If it is not, it returns true.
You can use this code by calling the `is_prime` function with the number you want to check as an argument. For example:
let n = 17;
if is_prime(n) {
    println!("{} is a prime number", n);
} else {
    println!("{} is not a prime number", n);
}
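Note that the model's explanation describes iterating up to the square root of the number, while the function it generated loops over every value in `2..n`. Both give correct answers, but the loop does more work than described. A minimal sketch of a variant that matches the explanation follows; the name `is_prime_sqrt` is hypothetical and not part of the model's output.

```rust
/// Trial division bounded by the square root of `n`.
/// If `n` has a divisor larger than sqrt(n), it also has one smaller
/// than sqrt(n), so checking beyond that bound is unnecessary.
/// (`is_prime_sqrt` is an illustrative name, not from the transcript.)
fn is_prime_sqrt(n: u64) -> bool {
    if n <= 1 {
        return false;
    }
    let mut i: u64 = 2;
    // `i <= n / i` is equivalent to `i * i <= n` for integers,
    // but it cannot overflow for large `n`.
    while i <= n / i {
        if n % i == 0 {
            return false;
        }
        i += 1;
    }
    true
}

fn main() {
    for n in [1u64, 2, 17, 21, 97] {
        println!("{} -> {}", n, is_prime_sqrt(n));
    }
}
```

For a number near the top of the `u64` range, this reduces the number of loop iterations from roughly `n` to roughly the square root of `n`.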
Create an OpenAI-compatible API service
An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.
Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Then, use the following command to start an API server for the model.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:codellama-13b-instruct.Q4_0.gguf llama-api-server.wasm -p codellama-instruct
From another terminal, you can interact with the API server using curl.
curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"Tell me Rust code to check if a number is a prime number or not"}], "model":"codellama-13b-instruct"}'
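Since the article uses a Rust stack, the same request can also be sketched as a small Rust client that uses only the standard library. This is an illustration under stated assumptions, not part of LlamaEdge: the helper names (`json_escape`, `build_body`, `build_request`, `send`) are hypothetical, the server is assumed to be reachable at `127.0.0.1:8080` over plain HTTP, and the raw response (headers included) is printed without parsing.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Escape a string for use inside a JSON string literal (minimal set).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the same JSON payload as the curl example above.
fn build_body(system: &str, user: &str, model: &str) -> String {
    format!(
        r#"{{"messages":[{{"role":"system","content":"{}"}},{{"role":"user","content":"{}"}}],"model":"{}"}}"#,
        json_escape(system),
        json_escape(user),
        json_escape(model)
    )
}

/// Build a raw HTTP/1.1 POST request for the chat completions endpoint.
fn build_request(host: &str, body: &str) -> String {
    format!(
        "POST /v1/chat/completions HTTP/1.1\r\nHost: {}\r\nAccept: application/json\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        host,
        body.len(), // byte length, as Content-Length requires
        body
    )
}

/// Send the request and return the raw response text.
fn send(addr: &str, body: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(build_request(addr, body).as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

fn main() {
    // Assumption: the API server started above listens on this address.
    let addr = "127.0.0.1:8080";
    let body = build_body(
        "You are a helpful AI assistant",
        "Tell me Rust code to check if a number is a prime number or not",
        "codellama-13b-instruct",
    );
    match send(addr, &body) {
        Ok(response) => println!("{}", response),
        Err(e) => eprintln!("request failed (is the API server running?): {}", e),
    }
}
```

In a real application, an HTTP client crate and a JSON library would be a more robust choice than hand-built requests; the standard-library version is used here only to keep the sketch dependency-free.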
That's all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!
Join the WasmEdge discord. Discuss, learn, and share your insights.