Llama CPP
Llama CPP (llama.cpp) provides an efficient way to run language models locally on consumer hardware. With support for models in the GGUF format, you can run lightweight models for chat and code generation tasks without relying on external APIs.
Supported Models
Some of the recommended models include:
- TinyLlama 1.1B Chat. A small chat model ideal for conversational tasks.
- DeepSeek-R1-Distill-Qwen-1.5B. An instruct-tuned model offering decent performance with various quantizations (e.g., Q3_K_M).
- Gemma-2-2b-it. A 2B model fine-tuned for general text tasks; available in a Q4_K_M quantization for balanced quality and efficiency.
- Qwen2.5-Coder-3B-Instruct. A coding-focused instruct model used in the code generation example below.
- Any other GGUF model.
Installation and Setup
Prerequisites
- Install IntelliNode via pip (with the optional llama-cpp extra):
pip install intelli[llamacpp]
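After installation, a quick import check confirms that the optional llama-cpp-python dependency is available. A minimal sketch; the version attribute is exposed by llama-cpp-python itself:

# Sanity check: both packages import and the llama.cpp binding reports its version.
import intelli
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)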
Downloading Models
You can download the models using the Hugging Face Hub. For example, to download TinyLlama 1.1B Chat:
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir ./models
To download DeepSeek-R1-Distill-Qwen-1.5B:
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf --local-dir ./models
And for gemma-2-2b-it:
huggingface-cli download bartowski/gemma-2-2b-it-GGUF gemma-2-2b-it-Q4_K_M.gguf --local-dir ./models
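You can also fetch the files programmatically with the huggingface_hub library. The sketch below mirrors the CLI commands above; the Qwen2.5-Coder repository and file name are assumptions for the coding example later in this guide, so verify them on Hugging Face and adjust the model path accordingly:

from huggingface_hub import hf_hub_download

# Same TinyLlama file as the CLI example above.
hf_hub_download(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    local_dir="./models",
)

# Illustrative only: a GGUF build of Qwen2.5-Coder-3B-Instruct; confirm the
# repo_id and filename before running.
hf_hub_download(
    repo_id="Qwen/Qwen2.5-Coder-3B-Instruct-GGUF",
    filename="qwen2.5-coder-3b-instruct-q4_0.gguf",
    local_dir="./models",
)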
Using the Chatbot with Llama CPP
Importing the Chatbot
Import the necessary classes:
from intelli.function.chatbot import Chatbot, ChatProvider
from intelli.model.input.chatbot_input import ChatModelInput
Initializing the Chatbot
Set up the model options for llama.cpp by providing the local model path and necessary parameters. For example, to initialize with TinyLlama 1.1B Chat:
options = {
    "model_path": "./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    "model_params": {
        "n_ctx": 512,
        "embedding": False,
        "verbose": False
    }
}
llama_bot = Chatbot(provider=ChatProvider.LLAMACPP, options=options)
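The keys under model_params are forwarded to the underlying llama.cpp model. The variant below is illustrative and assumes they map to llama-cpp-python's Llama constructor arguments (for example n_threads and n_gpu_layers); check which options your installed versions accept:

# Illustrative variant (assumes model_params map to llama_cpp.Llama kwargs).
options_tuned = {
    "model_path": "./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    "model_params": {
        "n_ctx": 2048,        # larger context window
        "n_threads": 4,       # CPU threads used for inference
        "n_gpu_layers": 0,    # keep all layers on the CPU
        "embedding": False,
        "verbose": False
    }
}
tuned_bot = Chatbot(provider=ChatProvider.LLAMACPP, options=options_tuned)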
Generating a Chat Response
Prepare your conversation using ChatModelInput and then call the chatbot:
chat_input = ChatModelInput("You are a helpful assistant.", model="llamacpp", max_tokens=64, temperature=0.7)
chat_input.add_user_message("What is the capital of France?")
response = llama_bot.chat(chat_input)
print("Chat Response:", response["result"][0])
Streaming Chat Responses
For streaming output (token-by-token), use:
import asyncio

async def stream_chat():
    chat_input = ChatModelInput("You are a helpful assistant.", model="llamacpp", max_tokens=64, temperature=0.7)
    chat_input.add_user_message("Tell me a joke.")
    output = ""
    async for token in llama_bot.stream(chat_input):
        output += token
    print("Streaming Chat Response:", output)

asyncio.run(stream_chat())
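To display output incrementally instead of collecting it into a single string, the same stream call can print each token as it arrives:

import asyncio

async def stream_to_console():
    chat_input = ChatModelInput("You are a helpful assistant.", model="llamacpp", max_tokens=64, temperature=0.7)
    chat_input.add_user_message("Tell me a joke.")
    # Flush after every token so the reply appears in real time.
    async for token in llama_bot.stream(chat_input):
        print(token, end="", flush=True)
    print()

asyncio.run(stream_to_console())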
Using Qwen2.5-Coder
Initialize the chatbot with the downloaded Qwen2.5-Coder model for code generation tasks:
options = {
    "model_path": "./models/qwen2.5-3b-coder-instruct-q4_0.gguf",
    "model_params": {
        "n_ctx": 1024,
        "embedding": False,
        "verbose": False
    }
}
qwen_bot = Chatbot(provider=ChatProvider.LLAMACPP, options=options)
Then, to generate code:
chat_input = ChatModelInput("You are a coding assistant.", model="llamacpp", max_tokens=128, temperature=0.3)
chat_input.add_user_message("Write a Python function to reverse a string.")
response = qwen_bot.chat(chat_input)
print("Code Generation Output:", response["result"][0])
Note: You can suppress noisy logs from llama.cpp by redirecting stderr during model loading.
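One way to do this is with a small context manager that temporarily redirects the process-level stderr file descriptor. The helper below (suppress_stderr, a name introduced here for illustration) is a sketch; llama.cpp writes its logs from native code, so redirecting the file descriptor is more reliable than contextlib.redirect_stderr:

import os
import contextlib

@contextlib.contextmanager
def suppress_stderr():
    # Redirect file descriptor 2 (stderr) to /dev/null so native llama.cpp
    # logs emitted during model loading are discarded, then restore it.
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    saved_fd = os.dup(2)
    try:
        os.dup2(devnull_fd, 2)
        yield
    finally:
        os.dup2(saved_fd, 2)
        os.close(saved_fd)
        os.close(devnull_fd)

# Usage: load the model with native logs silenced.
with suppress_stderr():
    quiet_bot = Chatbot(provider=ChatProvider.LLAMACPP, options=options)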
Conclusion
By following these steps, you can run an offline chatbot with llama.cpp directly through IntelliNode. The GGUF format keeps memory usage low and inference fast on typical consumer hardware.