For a quick start, you can run TinyLlama-1.1B-Chat-v1.0 on your own device with a single command. The command-line tool automatically downloads and installs the WasmEdge runtime, the model files, and the portable Wasm apps for inference.
The TinyLlama team recently launched TinyLlama-1.1B-Chat version 1.0. Compared with previous versions, TinyLlama-1.1B-Chat-v1.0 builds on a 1.1B-parameter Llama model pretrained on 3 trillion tokens. It uses the Llama 2 architecture and tokenizer and has been fine-tuned for text generation, which makes it suitable for generating conversational responses. With only 1.1B parameters, the model is compact enough for applications with limited compute and memory.
We will cover:
- Run TinyLlama-1.1B-Chat-v1.0 on your own device
- Create an OpenAI-compatible API service for TinyLlama-1.1B-Chat-v1.0
We will use LlamaEdge (the Rust + Wasm stack) to develop and deploy applications for this model. There are no complex Python packages or C++ toolchains to install! See why we chose this tech stack.
Run TinyLlama-1.1B-Chat-v1.0 on your own device
Step 1: Install WasmEdge via the following command line.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
Step 2: Download the TinyLlama-1.1B-Chat-v1.0 GGUF file. It won't take long, since the model is under 1 GB.
curl -LO https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf
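Before moving on, it can help to confirm that the download actually produced a model file rather than, say, an HTML error page. GGUF files begin with the four ASCII bytes `GGUF`, so a quick check is possible. The sketch below is only a sanity check, not part of the official setup; the helper name `is_gguf` is our own.

```shell
# Return success if the file exists and starts with the 4-byte GGUF magic.
# is_gguf is a hypothetical helper name, not part of WasmEdge or LlamaEdge.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

MODEL=tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf
if is_gguf "$MODEL"; then
  echo "$MODEL looks like a GGUF file"
else
  echo "$MODEL is missing or not a GGUF file; try downloading it again" >&2
fi
```

The check only inspects the header, so it will not catch a truncated download.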
Step 3: Download a cross-platform portable Wasm file for the chat app. The application allows you to chat with the model on the command line. The Rust source code for the app is here.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
That's it. You can chat with the model in the terminal by entering the following command.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf llama-chat.wasm -p chatml
The portable Wasm app automatically takes advantage of the hardware accelerators (e.g., GPUs) available on the device.
[You]:
Where is singapore?
[Bot]:
Singapore is located in southeast Asia and is a city-state. It has a total land area of 721 square kilometers (280 sq mi) and an area of about 635 square kilometers (249 sq mi). The official language is English, but Chinese is also widely spoken. Singapore is divided into several islands, including Sentosa, which serves as a popular tourist destination. Some notable landmarks in Singapore include the Marina Bay Sands Hotel and Resort, the iconic Marina Square shopping mall, the iconic Gardens by the Bay flower garden complex, the historical St. Andrew's Cathedral, and the modern Istana.
[You]:
Is there any famous universities?
[Bot]:
Yes, many top universities are located in Singapore. Some notable ones include:
1) Nanyang Technological University (NTU) - a leading research-intensive institution with campuses throughout the island city-state
2) National University of Singapore (NUS) - a world-renowned institution with campuses across the country and numerous international partnerships
3) Singapore Management University (SMU) - a business school with locations in the city and the island of Sentosa
4) University of Singapore (Singapore) - a public research university with multiple campuses throughout the city-state and international partnerships
5) Nanyang Polytechnic - a polytechnic institution that offers both undergraduate and postgraduate programs, as well as professional training courses
6) Temasek Polytechnic - a polytechnic institution that focuses on providing vocational education and training, with campuses in Singapore and other parts of Asia. These universities offer excellent academic opportunities for students who are interested in pursuing further study at graduate or doctoral levels.
Create an OpenAI-compatible API service for TinyLlama-1.1B-Chat-v1.0
An OpenAI-compatible web API allows the model to work with a large ecosystem of LLM tools and agent frameworks such as flows.network, LangChain and LlamaIndex.
Download an API server app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices. The Rust source code for the app is here.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Then, download the chatbot web UI so you can interact with the model in the browser.
curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
Next, use the following command to start an API server for the model. Then, open your browser to http://localhost:8080 to start chatting!
wasmedge --dir .:. --nn-preload default:GGML:AUTO:tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf llama-api-server.wasm -p chatml -c 4096
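The command above packs several settings into one line. As we understand it, the `--nn-preload` value has four colon-separated fields (a model alias, the inference backend, the execution target, and the model file), `-p` selects the prompt template, and `-c` sets the context size. The sketch below assembles the same command from named parts and prints it rather than running it; the helper `nn_preload_arg` is a hypothetical name used only for illustration.

```shell
# Build the value passed to --nn-preload: alias:backend:target:model-file.
# nn_preload_arg is a hypothetical helper, not a WasmEdge command.
nn_preload_arg() {
  model_file="$1"
  alias_name="${2:-default}"   # name under which the model is registered
  backend="${3:-GGML}"         # inference backend
  target="${4:-AUTO}"          # execution target; AUTO lets the runtime choose
  printf '%s:%s:%s:%s\n' "$alias_name" "$backend" "$target" "$model_file"
}

MODEL_FILE=tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf

# Print the full command instead of running it, so it can be inspected first.
echo wasmedge --dir .:. --nn-preload "$(nn_preload_arg "$MODEL_FILE")" \
  llama-api-server.wasm -p chatml -c 4096
```

To try a different quantization of the model, only `MODEL_FILE` needs to change.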
You can also interact with the API server using curl from another terminal.
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are an AI programming assistant."}, {"role":"user", "content": "What is the capital of France?"}], "model":"TinyLlama-1.1B-Chat-v1.0"}'
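The server replies with a JSON document. Assuming the response follows the OpenAI chat completion shape, the assistant's text sits at `choices[0].message.content`. The sketch below pulls that field out with `jq`, which is an extra tool we assume is installed; `extract_reply` is a hypothetical helper name, and the sample response is hand-written for illustration, not real server output.

```shell
# Read a chat completion response on stdin and print only the reply text.
# Assumes jq is installed and the response uses the OpenAI response shape.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Offline demonstration with a hand-written response (not real model output).
SAMPLE='{"choices":[{"message":{"role":"assistant","content":"Paris"}}]}'
printf '%s\n' "$SAMPLE" | extract_reply
```

With the server running, the same helper can be placed at the end of the curl command above by appending `| extract_reply` (adding `-s` to curl keeps the progress meter out of the terminal).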
That's all. WasmEdge is the easiest, fastest, and safest way to run LLM applications. Give it a try! Join the WasmEdge Discord to ask questions or share insights.