Model Management
This guide covers how models are managed, selected, and configured in Chat Linux Client.
Chat Linux Client uses a sophisticated model management system (core/model_manager.py) to:
- Track available models from all providers
- Provide model metadata and capabilities
- Enable intelligent model selection
- Support custom model configurations
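The examples in this guide call methods on a shared model_manager object. Inferred from those examples, its interface looks roughly like the sketch below; this is illustrative only, not the actual class in core/model_manager.py:

```python
from typing import List

class ModelManager:
    """Rough interface sketch, inferred from the examples in this guide."""

    def get_all_models(self) -> List["ModelInfo"]:
        """Return every registered model."""
        ...

    def get_models(self, strategy: str = "offline_first") -> List["ModelInfo"]:
        """Return models ordered according to a routing strategy."""
        ...

    def get_model(self, model_id: str) -> "ModelInfo":
        """Look up a single model by its 'provider/name' identifier."""
        ...

    def filter_models(self, **criteria) -> List["ModelInfo"]:
        """Return models matching criteria such as is_local or min_context."""
        ...

model_manager = ModelManager()
```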
Each model has the following properties:
- Name: Model identifier (e.g., llama3.2:1b)
- Provider: AI provider (e.g., ollama, groq)
- Context Window: Maximum input tokens
- Max Output: Maximum output tokens
- Cost: Cost per 1K tokens (if applicable)
- Speed Rating: Relative speed (1-10)
- Quality Rating: Relative quality (1-10)
- Is Local: Whether model runs locally
The system includes pre-configured models for each provider:
Ollama (local):

| Model | Size | Context | Speed | Quality |
|---|---|---|---|---|
| llama3.2:1b | 1.3GB | 8K | 10 | 6 |
| qwen2.5:3b | 1.9GB | 32K | 8 | 7 |
| phi3.5:3.8b | 2.2GB | 12K | 7 | 8 |
| mistral:7b | 4.4GB | 32K | 5 | 9 |
Groq:

| Model | Context | Speed | Quality | Cost |
|---|---|---|---|---|
| llama-3.1-8b-instant | 8K | 10 | 8 | Free tier |
| llama-3.1-70b-versatile | 8K | 7 | 9 | Free tier |
| mixtral-8x7b-32768 | 32K | 6 | 9 | Free tier |
OpenAI:

| Model | Context | Speed | Quality | Cost |
|---|---|---|---|---|
| gpt-4o | 128K | 8 | 10 | $5/1M input |
| gpt-4-turbo | 128K | 7 | 9 | $10/1M input |
| gpt-3.5-turbo | 16K | 9 | 7 | $0.5/1M input |
Hugging Face:

| Model | Context | Speed | Quality | Cost |
|---|---|---|---|---|
| meta-llama/Llama-2-7b-chat-hf | 4K | 6 | 8 | Free tier |
| mistralai/Mistral-7B-Instruct-v0.2 | 8K | 7 | 9 | Free tier |
OpenRouter:

| Model | Context | Speed | Quality | Cost |
|---|---|---|---|---|
| anthropic/claude-3-opus | 200K | 5 | 10 | $15/1M input |
| openai/gpt-4-turbo | 128K | 7 | 9 | $10/1M input |
Models are represented as data classes:
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelInfo:
    name: str
    provider: str
    context_window: int
    max_output: int
    cost_per_1k_tokens: Optional[float] = None
    speed_rating: int = 5
    quality_rating: int = 5
    is_local: bool = False
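As an example, the llama3.2:1b entry from the Ollama table above could be expressed as a ModelInfo. The max_output value here is an assumption; it is not listed in the table:

```python
llama_1b = ModelInfo(
    name="llama3.2:1b",
    provider="ollama",
    context_window=8192,        # 8K, per the table above
    max_output=2048,            # assumed; not listed in the table
    cost_per_1k_tokens=0.0,     # local model, no per-token cost
    speed_rating=10,
    quality_rating=6,
    is_local=True,
)
```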
The system can automatically select models based on routing strategy:

Prioritizes local models, falls back to cloud:
models = model_manager.get_models(strategy="offline_first")
# Returns: [local models first, then cloud models]

Prioritizes fast models:
models = model_manager.get_models(strategy="speed_optimal")
# Returns: [Groq, fast local models, others]

Prioritizes free/cheap models:
models = model_manager.get_models(strategy="cost_optimal")
# Returns: [free local models, free cloud models, paid]

Prioritizes high-quality models:
models = model_manager.get_models(strategy="quality_optimal")
# Returns: [GPT-4, Claude, capable local models]
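How a strategy translates into an ordering is not shown here, but it can be pictured as a sort over the ModelInfo ratings. The sketch below is illustrative and is not the actual implementation in core/model_manager.py:

```python
def order_by_strategy(models, strategy):
    """Illustrative ordering over ModelInfo fields for each routing strategy."""
    if strategy == "offline_first":
        # Local models first, best quality within each group
        return sorted(models, key=lambda m: (not m.is_local, -m.quality_rating))
    if strategy == "speed_optimal":
        return sorted(models, key=lambda m: -m.speed_rating)
    if strategy == "cost_optimal":
        # Free (or local) models first, then by price
        return sorted(models, key=lambda m: (m.cost_per_1k_tokens or 0.0, not m.is_local))
    if strategy == "quality_optimal":
        return sorted(models, key=lambda m: -m.quality_rating)
    return models
```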
Users can manually select models from the dropdown:

model = model_manager.get_model("ollama/llama3.2:1b")

Filter models by criteria:
# Get only local models
local_models = model_manager.filter_models(is_local=True)
# Get models with large context
large_context = model_manager.filter_models(min_context=32000)
# Get fast models
fast_models = model_manager.filter_models(min_speed=8)
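These criteria map onto ModelInfo fields. A plain-Python sketch of that filtering (illustrative only, not the actual implementation) might look like:

```python
def filter_models(models, is_local=None, min_context=None, min_speed=None):
    """Keep only the models matching the given criteria (sketch)."""
    selected = []
    for m in models:
        if is_local is not None and m.is_local != is_local:
            continue
        if min_context is not None and m.context_window < min_context:
            continue
        if min_speed is not None and m.speed_rating < min_speed:
            continue
        selected.append(m)
    return selected
```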
Add custom models in core/model_manager.py:

# In get_all_models() method
ModelInfo(
name="custom-model",
provider="ollama",
context_window=4096,
max_output=2048,
cost_per_1k_tokens=0.0,
speed_rating=7,
quality_rating=8,
is_local=True
)Define profiles for specific use cases:
profiles = {
"fast": {
"model": "ollama/llama3.2:1b",
"max_tokens": 64,
"char_budget": 900
},
"balanced": {
"model": "ollama/mistral:7b",
"max_tokens": 512,
"char_budget": 3000
}
}The system manages context windows to prevent exceeding model limits:
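A profile can then be resolved into a model plus generation limits before a request is sent. The resolve_profile helper below is hypothetical, not part of the client:

```python
def resolve_profile(name):
    """Look up a profile and the model it points at (hypothetical helper)."""
    profile = profiles[name]
    model = model_manager.get_model(profile["model"])
    return model, profile["max_tokens"], profile["char_budget"]

model, max_tokens, char_budget = resolve_profile("fast")
```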
The system manages context windows to prevent exceeding model limits:

# Calculate available context
available = model.context_window - used_tokens
# Truncate if necessary
if len(prompt) > available:
    prompt = prompt[-available:]
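Note that the snippet above compares a character count (len(prompt)) against a token budget; a rough characters-per-token factor keeps the two units consistent. The factor of 4 below is an assumed heuristic for English text, not a value taken from the client:

```python
CHARS_PER_TOKEN = 4  # assumed rough heuristic for English text

available_tokens = model.context_window - used_tokens
char_budget = available_tokens * CHARS_PER_TOKEN

if len(prompt) > char_budget:
    prompt = prompt[-char_budget:]  # keep the most recent part of the prompt
```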
Pull new models from Ollama:

ollama pull custom-model-name

Then register in model_manager.py:
ModelInfo(
name="custom-model-name",
provider="ollama",
context_window=4096,
max_output=2048,
is_local=True
)For custom providers, implement model listing:
async def list_models(self):
"""List available models for custom provider."""
response = await self._get("/models")
return [ModelInfo(...) for model in response["models"]]Track model performance:
metrics = {
"response_time": 1.5, # seconds
"tokens_per_second": 50,
"total_tokens": 75,
"context_used": 1000
}- Use lightweight models (llama3.2:1b)
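These numbers could be collected around an actual request. The sketch below assumes the provider response exposes a usage dictionary with a total_tokens count, which may differ per provider:

```python
import time

async def timed_completion(prompt, model):
    """Run one completion and collect the performance metrics above (sketch)."""
    start = time.time()
    response = await client.chat_completion(prompt, model)
    elapsed = time.time() - start
    usage = response.get("usage", {})  # assumed response shape
    return {
        "response_time": elapsed,
        "tokens_per_second": usage.get("total_tokens", 0) / elapsed if elapsed > 0 else 0,
        "total_tokens": usage.get("total_tokens", 0),
        "context_used": len(prompt),  # rough proxy for context consumed
    }
```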
To optimize for speed:

- Use lightweight models (llama3.2:1b)
- Reduce max tokens
- Use Groq for cloud models
- Enable streaming
To optimize for quality:

- Use capable models (GPT-4, mistral:7b)
- Increase temperature slightly
- Provide more context
- Use larger context windows
To optimize for cost:

- Use local models (Ollama)
- Use free tier cloud models
- Limit max tokens
- Use cost-optimal routing
Compare model performance:
import time

# Test response time
start = time.time()
response = await client.chat_completion(prompt, model)
duration = time.time() - start

# Test quality
quality_score = evaluate_quality(response)

When providers add new models, update model_manager.py:
- Check provider documentation
- Add new model to model list
- Update model metadata
- Test new model
- Update documentation
Remove deprecated models, or mark them as deprecated (this requires adding a deprecated field to ModelInfo):
ModelInfo(
name="old-model",
provider="ollama",
context_window=2048,
max_output=1024,
is_local=True,
deprecated=True # Mark as deprecated
)If a model doesn't appear in the dropdown:
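Deprecated entries can then be hidden wherever models are listed, for example when building the dropdown. This is a sketch, assuming the deprecated field shown above:

```python
# Only show models that are not marked as deprecated (sketch)
active_models = [
    m for m in model_manager.get_all_models()
    if not getattr(m, "deprecated", False)
]
```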
If a model doesn't appear in the dropdown:

- Verify model is registered in model_manager.py
- Check provider is enabled
- Ensure Ollama model is pulled (for local models)
- Check API key is valid (for cloud models)
- Run system checks
If a model performs poorly:
- Check system resources
- Try a lighter model
- Reduce context window
- Check network connection (for cloud models)
- Update model version
If you get context window errors:
- Reduce input length
- Use model with larger context
- Enable context truncation
- Clear chat history
Match the model to the task:

- Simple queries: Use fast, lightweight models
- Complex tasks: Use capable, larger models
- Privacy-sensitive: Use local models
- Cost-sensitive: Use free models
- Quality-critical: Use premium models
Manage context carefully:

- Keep prompts concise
- Use relevant context only
- Summarize long conversations
- Use models with appropriate context windows
Control costs:

- Monitor token usage
- Set max tokens limits
- Use cost-optimal routing
- Prefer local models when possible