Getting Started with Mistral-7b-Instruct-v0.1
The mistral-7b-instruct-v0.1 model is a 7B instruction-tuned LLM released by Mistral AI. It is a true open source model licensed under Apache 2.0. It has a context length of 8,000 tokens and performs on par with 13B llama2 models. It is great for generating prose, summarizing documents, and writing code. In this article, we will cover:

- How to run mistral-7b-instruct-v0.1 on your own device
- How to create an OpenAI-compatible API service for mistral-7b-instruct-v0.…
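As a quick taste of what an OpenAI-compatible API service lets you do, here is a minimal sketch of querying such a server with `curl`. The port, endpoint path, and model name below are assumptions for illustration; the actual values depend on how the server is started (see the full article):

```shell
# Assumed: an OpenAI-compatible server is already running on localhost:8080.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mistral-7b-instruct-v0.1",
    "messages": [{"role": "user", "content": "Write a haiku about Rust."}]
  }'
```

Because the service speaks the standard OpenAI chat-completions wire format, existing OpenAI client libraries can be pointed at it by changing only the base URL.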
Getting Started with Dolphin-2.2-yi-34b
The dolphin-2.2-yi-34b model is based on Yi, a 34B LLM released by the 01.AI team. Yi was converted to the llama2 format by Charles Goddard and then further fine-tuned by Eric Hartford. In this article, we will cover:

- How to run dolphin-2.2-yi-34b on your own device
- How to create an OpenAI-compatible API service for dolphin-2.2-yi-34b

We will use the Rust + Wasm stack to develop and deploy applications for this model.…
Fast and Portable Llama2 Inference on the Heterogeneous Edge
The Rust+Wasm stack provides a strong alternative to Python in AI inference. Compared with Python, Rust+Wasm apps can be 1/100 of the size and 100x the speed, and, most importantly, they run securely on any device with full hardware acceleration without any change to the binary code. Rust is the language of AGI. We created a very simple Rust program to run inference on llama2 models at native speed. When compiled to Wasm, the binary application (only 2MB) is completely portable across devices with heterogeneous hardware accelerators.…
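The portable Wasm binary described above is typically executed with the WasmEdge runtime. A minimal sketch of running it, assuming WasmEdge and its WASI-NN GGML plugin are installed; the app and model file names here are illustrative (the article links to the actual downloads):

```shell
# Assumed: the compiled inference app (llama-chat.wasm) and a quantized
# GGUF model (llama-2-7b-chat.Q5_K_M.gguf) are in the current directory.
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
  llama-chat.wasm
```

The same 2MB Wasm binary runs unchanged on x86, ARM, and GPU-accelerated machines; WasmEdge's `--nn-preload` flag binds the model file to the app at startup, so swapping models requires no recompilation.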
Wasm as the runtime for LLMs and AGI
Large Language Model (LLM) AI is the hottest thing in tech today. With the advancement of open source LLMs, new fine-tuned and domain-specific LLMs are emerging every day in areas as diverse as coding, education, medical QA, content summarization, writing assistance, and role playing. Don't you want to try and chat with those LLMs on your own computers and even IoT devices? But Python / PyTorch, which are traditionally required to run those models, consist of 3+GB of fragile, inter-dependent packages.…
AI Inference for Real-time Data Streams with WasmEdge and YoMo
YoMo is a programming framework that enables developers to build geo-distributed cloud systems. YoMo's communication layer is built on top of the QUIC protocol, which brings high-speed data transmission. In addition, it has a built-in streaming serverless "streaming function" model, which significantly improves the development experience of distributed cloud systems. A distributed cloud system built with YoMo provides an ultra-high-speed communication mechanism between near-field computing power and terminals. It has a wide range of use cases in the Metaverse, VR/AR, IoT, and more.…