How do I create a GGUF model file?
The Llama 2 family of LLMs is typically trained and fine-tuned in PyTorch, so these models are usually distributed as PyTorch projects on Hugging Face. When it comes to inference, however, we are much more interested in the GGUF model format, for three reasons.
First, Python is not a great stack for AI inference: we would like to eliminate the PyTorch and Python dependencies from production systems. GGUF supports very efficient zero-Python inference using tools like llama.…
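To get a feel for what a GGUF file looks like on disk, here is a minimal sketch that writes and reads back just the GGUF v3 header (magic bytes, version, tensor count, metadata key/value count), following the field layout in the GGUF specification. This is an illustration only, not a real converter; actual GGUF files are produced by conversion tools, and the file name `demo.gguf` is an arbitrary choice.

```python
import struct

# Hypothetical demo, not a real converter: serialize and parse only the
# GGUF v3 header, whose layout (per the GGUF spec) is:
#   4 bytes  magic "GGUF"
#   uint32   format version (little-endian)
#   uint64   number of tensors
#   uint64   number of metadata key/value pairs
GGUF_MAGIC = b"GGUF"
GGUF_VERSION = 3

def write_gguf_header(path, tensor_count=0, metadata_kv_count=0):
    """Write a bare GGUF header (no metadata or tensor data follows)."""
    with open(path, "wb") as f:
        f.write(GGUF_MAGIC)
        f.write(struct.pack("<I", GGUF_VERSION))       # uint32, little-endian
        f.write(struct.pack("<Q", tensor_count))       # uint64
        f.write(struct.pack("<Q", metadata_kv_count))  # uint64

def read_gguf_header(path):
    """Read the header fields back; raises if the magic does not match."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError("not a GGUF file")
        (version,) = struct.unpack("<I", f.read(4))
        (tensor_count,) = struct.unpack("<Q", f.read(8))
        (kv_count,) = struct.unpack("<Q", f.read(8))
    return magic, version, tensor_count, kv_count

write_gguf_header("demo.gguf")
print(read_gguf_header("demo.gguf"))  # (b'GGUF', 3, 0, 0)
```

A real GGUF file continues after this header with the metadata key/value pairs (architecture, tokenizer, hyperparameters) and then the tensor data, which is what makes single-file, zero-Python loading possible.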