Large Language Model (LLM) AI is the hottest thing in tech today. With the advancement of open-source LLMs, new fine-tuned and domain-specific LLMs are emerging every day in areas as diverse as coding, education, medical QA, content summarization, writing assistance, and role playing. Wouldn't you like to chat with those LLMs on your own computers and even IoT devices?
But Python and PyTorch, which are traditionally required to run those models, consist of 3+ GB of fragile, interdependent packages. Even if you manage to get them installed, they are often incompatible with your GPU or other hardware accelerators, resulting in very poor performance.
No more! With WasmEdge, you can create and deploy very fast and very lightweight LLM inference applications. It is a very small install (a few MBs of simple binary executables), and its applications are entirely portable across a wide range of CPUs, GPUs, and OSes. Best of all, it has ZERO Python dependency. LFG!
Let's dive into how to:
- Get started with running LLMs, including Llama2 7b chat and a variety of others, on your everyday macOS machine, or virtually any platform
- Build a lightweight AI agent in minutes
- Use WasmEdge for AI beyond LLMs
The Simplest Way to Run LLMs and AI Models
1. Install WasmEdge with LLM Support
You begin with a single command that installs the WasmEdge runtime, complete with LLM support.
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s -- -v 0.14.1
Alternatively, you can download and copy the WasmEdge install files manually, following the installation guide here.
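You can verify the installation by checking the runtime version:

wasmedge --version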
2. Download an LLM Chat App in Wasm
Next, get the ultra-small 2MB cross-platform binary: the LLM chat application. It's a testament to efficiency, requiring no other dependencies and offering seamless operation across various environments. This small Wasm file is compiled from Rust. To build your own, check out the LlamaEdge repo.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
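Curious what the Rust source of such an app looks like? Below is a minimal, simplified sketch using the wasmedge-wasi-nn crate, the same WASI-NN interface the real chat app is built on. It is an illustration, not the actual llama-chat source; a production app would add a chat loop, prompt templating, and proper error handling.

```rust
// Minimal single-turn LLM inference in Rust, compiled to Wasm.
// Assumes the model is preloaded under the alias "default", as in
// `--nn-preload default:GGML:AUTO:<model>.gguf`.
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the preloaded GGML model by its alias.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")?;
    let mut ctx = graph.init_execution_context()?;

    // Feed the prompt to the model as a UTF-8 byte tensor at input index 0.
    let prompt = "[INST] Who is Robert Oppenheimer? [/INST]";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())?;

    // Run inference, then read the generated text back from output index 0.
    ctx.compute()?;
    let mut buf = vec![0u8; 4096];
    let len = ctx.get_output(0, &mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf[..len]));
    Ok(())
}
```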
3. Download the Llama2 7b Chat Model
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf
In addition to Llama2, WasmEdge supports Llama 3.1, Llama 3.2, Qwen, Yi, CodeLlama, Codeshell, Mistral, MistralLite, TinyLlama, Baichuan, BELLE, Alpaca, Vicuna, OpenChat, StarCoder, OpenBuddy, and many more! Check out the model files here.
4. Chat with Llama2 7b model on CLI
Now that you have everything set up, you can start chatting with the Llama2 7b chat model on the command line.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf llama-chat.wasm
After that, you can ask the Llama2 7b chat model any question. For example:
[USER]:
Who is Robert Oppenheimer
[ASSISTANT]:
Robert Oppenheimer was an American theoretical physicist and director of the Manhattan Project, which developed the atomic bomb during World War II. He is widely regarded as one of the most important physicists of the 20th century and is known for his contributions to the development of quantum mechanics and the theory of the atomic nucleus. Oppenheimer was also a prominent figure in the post-war nuclear weapons debate and was a strong advocate for international cooperation on nuclear weapons control.
That's it! You can use the same llama-chat.wasm file to run other LLMs, like OpenChat, CodeLlama, Mistral, etc.
Resources:
Explore a variety of LLMs that can run on WasmEdge with these additional resources:
- Video: Run other LLMs: A comprehensive video guide to other available LLMs.
- Docs: Run LLM inference in WasmEdge
- Run LLMs in containerd and Docker: Learn how to run and manage LLM apps within Docker + Wasm.
Build a Super-Lightweight AI Agent
1. Create an OpenAI Compatible API Service
When you fine-tune a model with your own domain knowledge, or self-host a Llama2 model, running it from the CLI is not enough. Next, let's set up an OpenAI-compatible API service for the open source model, so that we can integrate the fine-tuned model into other workflows.
This assumes you have already installed WasmEdge with the GGML plugin and downloaded the model you need.
First, download the Wasm file to build the API server via the terminal.
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-api-server.wasm
Next, run the API server with the model.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf llama-api-server.wasm -p llama-2-chat -s 0.0.0.0:8080
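If the server starts successfully, it listens on port 8080. As a quick sanity check, you can query the models endpoint, which is part of the OpenAI-compatible API:

curl http://localhost:8080/v1/models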
Then you can use the following command to chat with your model.
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant. Answer each question in one sentence."}, {"role":"user", "content": "Who is Robert Oppenheimer?"}], "model":"llama-2-chat"}'
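Because the API is OpenAI-compatible, any HTTP client works. As an illustration, here is the same request made from a native Rust program. The reqwest and serde_json crates used here are assumptions for this sketch, not part of LlamaEdge.

```rust
// Sketch: call the local OpenAI-compatible endpoint from Rust.
// Assumes reqwest (with the "blocking" and "json" features) and
// serde_json in Cargo.toml; neither ships with LlamaEdge itself.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "llama-2-chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. Answer each question in one sentence."},
            {"role": "user", "content": "Who is Robert Oppenheimer?"}
        ]
    });

    // POST to the API server started in the previous step.
    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;

    // Print the assistant's reply from the standard OpenAI response shape.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```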
2. Build RAG Bots and Agents around the model
With an OpenAI-compatible API server, you can leverage a large ecosystem of OpenAI tools, such as LangChain and LlamaIndex, to build applications and agents.
Flows.network is a serverless platform for building LLM agents and bots in Rust and Wasm. It allows developers to connect LLMs to external SaaS. With the OpenAI-compatible API server, we can build agents and chatbots by connecting the model service to SaaS products such as Telegram, Slack, and Discord. Check out how to build a RAG-based LLM agent!
Try out a RAG bot that’s built this way to help you learn the Rust language 👉 https://flows.network/learn-rust
Beyond Language AI
AI with WasmEdge is not limited to LLM tasks; it extends to vision and audio as well. Besides the GGML backend, the WasmEdge Runtime supports the PyTorch, TensorFlow, and OpenVINO AI frameworks. To make AI inference pipelines smoother, WasmEdge also supports the popular OpenCV and FFmpeg libraries for image and video processing.
Discover how you can apply vision and audio AI with projects like mediapipe-rs (see the short example after this list), and get hands-on with WASI NN examples.
- Mediapipe-rs: a Rust library for MediaPipe tasks
- Source code: Mediapipe-rs GitHub
- Tutorial: Mediapipe solutions
- More WASI NN examples: Explore more AI inference examples with WasmEdge and Rust.
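To give a flavor of what a mediapipe-rs program looks like, here is an image-classification example lightly adapted from the project's README. The model and image file names are placeholders you supply yourself.

```rust
// Image classification with mediapipe-rs, adapted from the project README.
// The model file (a TFLite image classifier) and the input image are
// placeholders; the `image` crate is also required as a dependency.
use mediapipe_rs::tasks::vision::ImageClassifierBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classification_result = ImageClassifierBuilder::new()
        .max_results(3) // keep only the top 3 categories
        .build_from_file("image_classifier.tflite")? // load the model
        .classify(&image::open("input.jpg")?)?; // run inference on one image

    // Print the formatted classification result.
    println!("{}", classification_result);
    Ok(())
}
```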
Conclusion
WasmEdge is reshaping the landscape of AI by providing a lightweight, secure, and efficient environment for running the most advanced LLMs, including your own fine-tuned Llama 2 models. Whether you're a developer looking to deploy AI applications or an organization seeking to integrate intelligent agents into your operations, WasmEdge is your gateway to a future of endless possibilities. Join us in this revolution and make your mark in the AI world!
Join the conversation on the WasmEdge Discord. Discuss, learn, and share your insights.