Create a persistent, context-aware AI agent system where:
- AI Agent handles routine operations, diagnostics, and documentation
- Human makes strategic decisions, handles physical tasks, and provides domain expertise
- Together we create a continuously learning, self-improving homelab
Scenario: Upgrading 1GbE to 2.5GbE network
AI Responsibilities:
- Inventory current network topology from documentation
- Run discovery commands to validate actual vs documented state
- Identify incompatible devices and required hardware changes
- Create migration plan with minimal downtime windows
- Update network diagrams and documentation
- Monitor post-upgrade performance and flag issues
Human Responsibilities:
- Physical cable replacement and switch installation
- Strategic decisions (budget, timeline, priority order)
- Validate AI's migration plan and approve execution
- Handle unexpected hardware compatibility issues
Collaboration Flow:
Human: "I want to upgrade to 2.5GbE network"
AI: *analyzes current topology* "I see 12 devices on 1GbE. Here's my upgrade plan..."
AI: *creates migration checklist* "Steps 1-5 are automated. You'll need to handle step 6 (physical)"
Human: *reviews and approves* "Looks good, but delay the k3s nodes until next weekend"
AI: *updates plan* "Modified timeline. I'll prep configs now, execute automation Friday night"
Scenario: Deploying local LLM server with GPU acceleration
AI Responsibilities:
- Check GPU availability across nodes (
nvidia-smion each host) - Determine K3s vs orchestrator deployment based on requirements
- Generate appropriate manifests/configs following established patterns
- Monitor deployment progress and validate functionality
- Create service documentation and update inventory
- Set up monitoring and alerting
Human Responsibilities:
- Strategic service decisions (model selection, resource allocation)
- Approve resource-intensive deployments
- Handle authentication and security policy decisions
- Validate AI's service architecture recommendations
Scenario: Services losing IP addresses after reboots (recent real issue)
AI Responsibilities:
- Detect pattern from monitoring data and logs
- Identify root cause (missing MAAS registration, DHCP lease expiry)
- Design solution (orchestrator enhancement for MAAS integration)
- Implement automated remediation
- Update runbooks and prevention procedures
- Generate RCA document following established format
Human Responsibilities:
- Approve infrastructure changes that affect critical services
- Validate AI's root cause analysis
- Make policy decisions about service persistence strategies
Scenario: Monitoring dashboard showing inconsistent data
AI Responsibilities:
- Run diagnostic commands across all monitoring components
- Compare expected vs actual metric collection
- Identify missing monitors, configuration drift, or connectivity issues
- Execute standard troubleshooting procedures from runbooks
- Update Uptime Kuma configurations idempotently
- Document findings and create/update troubleshooting guides
Human Responsibilities:
- Interpret business impact of monitoring gaps
- Decide priority of fixes vs new feature requests
- Validate AI's proposed monitoring architecture changes
Human: "Check why k3s-vm-still-fawn seems slow"
AI: *runs diagnostics* "CPU at 95%, found 3 runaway processes. Standard fix applied.
Also noticed this pattern 3x this month - should we add resource limits?"
Human: "I want to deploy Ollama for local LLM testing"
AI: *checks GPU availability* "still-fawn has RTX 3070 available. Trial deployment via orchestrator
or production K3s/GitOps? Estimated resources: 8GB GPU, 16GB RAM"
Human: "Trial first, then production if it works well"
AI: *deploys via orchestrator* "Deploying... DNS: ollama-trial.maas. Will create GitOps PR after validation"
- INFRASTRUCTURE.md - Current state, topology, resource inventory
- SERVICES.md - Service catalog, deployment patterns, decision matrix
- PROCEDURES.md - Standard operating procedures and command patterns
- TROUBLESHOOTING.md - Known issues, diagnostic flows, resolution patterns
- sessions/YYYY-MM-DD.md - Daily interaction logs and decisions made
K3s (GitOps):
- Production services requiring high availability
- Services needing persistent storage/networking
- Multi-replica deployments
- Services with complex dependencies
Orchestrator:
- Trial/experimental services
- Single-instance utilities
- Services requiring rapid deployment/removal
- Development and testing workloads
Manual:
- Infrastructure-level changes (networking, storage)
- Security-sensitive configurations
- Hardware troubleshooting
- One-time maintenance operations
- Create foundational documentation optimized for AI consumption
- Enhance orchestrator with decision-making capabilities
- Build session continuity system for persistent context
- Implement learning loop for continuous improvement
This creates a homelab where human expertise and AI automation complement each other, resulting in both higher reliability and reduced manual overhead.