Google open-sourced its Gemma model family yesterday, finally joining the open-source movement in large language models.
Like Gemma-7b-it, which we discussed previously, Gemma-2b-it is designed for a range of text generation tasks such as question answering, summarization, and reasoning. These lightweight, state-of-the-art models are built on the same technology as the Gemini models and are text-to-text, decoder-only models. They are available in English, with open weights, pre-trained variants, and instruction-tuned versions, making them suitable for deployment in resource-constrained environments.
In this article, taking Gemma-2b-it as an example, we will cover:
- How to run Gemma-2b-it on your own device
- How to create an OpenAI-compatible API service for Gemma-2b-it
We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we choose this tech stack.
(LlamaEdge has just released version 0.3.2, supporting Gemma models)
Run Gemma-2b-it on your own device
Step 1: Install WasmEdge via the following command line.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
Step 2: Download the Gemma-2b-it model GGUF file. The model is 5.88G, so it could take a while to download.
curl -LO https://huggingface.co/second-state/Gemma-2b-it-GGUF/resolve/main/gemma-2b-it-Q5_K_M.gguf
Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
That's it. You can chat with the model in the terminal by entering the following command.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-2b-it-Q5_K_M.gguf llama-chat.wasm -p gemma-instruct -c 4096
The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.
[You]:
Write an email to my boss letting them know I want to take a day off because I'm sick
[Bot]:
Subject: Absence Due to Illness - [Your Name]
Dear Boss,
I'm writing you today requesting a day off tomorrow due to illness. I have been experiencing symptoms of (list your specific Symptoms) since yesterday and do not feel well enough at the moment for work . My current condition is making me unable perform my duties effectively , therefore i kindly ask if it would be possible to take this upcoming workday as paid sick leave
I will keep you updated on when I expect to return. If there are any urgent matters that need attention, please let me know and ill address them upon returning . Thank You for your understanding in advance!
Sincerely ,
[Your Name]
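If you start the chat app often, the command above can be wrapped in a small script that first checks the downloaded files are in place. The following is a minimal sketch, not part of LlamaEdge: the helper names require_files and start_chat are hypothetical, and the flag descriptions in the comments reflect our reading of the command used in this article.

```shell
#!/bin/sh
# File names as downloaded in Steps 2 and 3.
MODEL_FILE="gemma-2b-it-Q5_K_M.gguf"
WASM_APP="llama-chat.wasm"

# Hypothetical helper: succeed only if every listed file exists.
require_files() {
  missing=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing file: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: run the same chat command as above after the check.
start_chat() {
  require_files "$MODEL_FILE" "$WASM_APP" || return 1
  # --dir .:.      lets the Wasm app access the current directory
  # --nn-preload   loads the GGUF file under the name "default" with the
  #                GGML backend and automatic device selection (AUTO)
  # -p             prompt template, -c context size
  wasmedge --dir .:. \
    --nn-preload "default:GGML:AUTO:${MODEL_FILE}" \
    "$WASM_APP" -p gemma-instruct -c 4096
}
```

After sourcing the script, running start_chat opens the same interactive session, or prints which file is missing instead of failing inside wasmedge.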
Create an OpenAI-compatible API service for Gemma-2b-it
An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.
Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Then, download the chatbot web UI so you can interact with the model in a browser.
curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
Next, use the following command to start an API server for the model. Then, open your browser to http://localhost:8080 to start the chat!
wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-2b-it-Q5_K_M.gguf llama-api-server.wasm -p gemma-instruct -c 4096
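If you script these steps, the server may need a moment before it accepts requests. The sketch below polls the URL mentioned above until it answers. The function name wait_for_server and its default of 30 attempts are our own assumptions, not part of LlamaEdge.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_server URL [ATTEMPTS]
wait_for_server() {
  url="$1"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # curl exits with 0 as soon as it receives any HTTP response.
    if curl -s -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example, wait_for_server http://localhost:8080 && echo "server is up" waits roughly one second between attempts before giving up.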
From another terminal, you can interact with the API server using curl.
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a sentient, superintelligent artificial general intelligence, here to teach and assist me."}, {"role":"user", "content": "Write a short story about Goku discovering kirby has teamed up with Majin Buu to destroy the world."}], "model":"Gemma-2b-it"}'
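The response is a JSON document. To send your own prompt and print only the reply text, the request can be wrapped as in the sketch below. It assumes jq is installed and that the server returns the usual OpenAI chat-completions shape, with the text under choices[0].message.content. The names API_URL, build_payload, extract_reply and ask_gemma are hypothetical.

```shell
#!/bin/sh
# Assumption: the API server from this article is listening on port 8080.
API_URL="http://localhost:8080/v1/chat/completions"

# Build the request body; jq takes care of JSON-escaping the prompt.
build_payload() {
  jq -cn --arg content "$1" \
    '{messages: [{role: "user", content: $content}], model: "Gemma-2b-it"}'
}

# Read a chat-completions response on stdin and print the reply text.
# Assumes the OpenAI-style layout: choices[0].message.content
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Send one user message to the server and print the model's answer.
ask_gemma() {
  build_payload "$1" |
    curl -s -X POST "$API_URL" \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d @- |
    extract_reply
}
```

With the server running, ask_gemma "Summarize what WebAssembly is in one sentence." prints just the generated text.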
That’s all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try!
Talk to us!
Join the WasmEdge discord to ask questions and share insights.
Any questions about getting this model running? Please go to second-state/LlamaEdge to raise an issue, or book a demo with us to enjoy your own LLMs across devices!