On-premise OpenCode plugin with local inference support for air-gapped environments.
oh-my-onpremcode is a fork of oh-my-opencode designed for enterprise on-premise deployments. It enables AI-assisted development in air-gapped environments without internet access.
- Local Inference: Connect to vLLM instances running in your Kubernetes cluster
- Ollama Support: Use local models via Ollama as a fallback
- Air-Gapped Ready: All internet-dependent features auto-disable in on-prem mode
- Model Mapping: Map cloud model names to your local model endpoints
- Custom Branding: Display your company logo and welcome message
- K8s Service Discovery: Auto-discover vLLM services in your cluster
- Quick Start
- Configuration
- On-Prem Mode Behavior
- Agent Configuration
- Full Configuration Reference
- Deployment Guide
- Troubleshooting
- Migration from oh-my-opencode
Add the plugin to your OpenCode config (`~/.config/opencode/opencode.json`):

```json
{
  "plugin": ["oh-my-onpremcode"]
}
```

Create `~/.config/opencode/oh-my-onpremcode.json`:

```json
{
  "$schema": "https://raw.githubusercontent.com/sionic-ai/oh-my-onpremcode/master/assets/oh-my-onpremcode.schema.json",
  "onprem": {
    "mode": "onprem",
    "vllm": {
      "endpoints": [
        {
          "url": "http://vllm-service.default.svc.cluster.local:8000/v1",
          "model": "meta-llama/Llama-3.1-70B-Instruct"
        }
      ]
    },
    "model_mapping": {
      "anthropic/claude-opus-4-5": "meta-llama/Llama-3.1-70B-Instruct",
      "anthropic/claude-sonnet-4-5": "meta-llama/Llama-3.1-70B-Instruct",
      "openai/gpt-5.2": "meta-llama/Llama-3.1-70B-Instruct"
    }
  }
}
```

The plugin runs in one of two modes, controlled by `onprem.mode`:

| Mode | Description |
|---|---|
| `cloud` | Default mode. Uses cloud APIs (Anthropic, OpenAI, Google) |
| `onprem` | On-premise mode. Uses local inference, disables internet features |
Connect to vLLM instances serving an OpenAI-compatible API:
```json
{
  "onprem": {
    "mode": "onprem",
    "vllm": {
      "endpoints": [
        {
          "url": "http://vllm-service:8000/v1",
          "model": "meta-llama/Llama-3.1-70B-Instruct",
          "api_key": "optional-api-key",
          "max_tokens": 4096,
          "temperature": 0.7
        }
      ],
      "k8s_namespace": "ml-inference",
      "k8s_service_name": "vllm-service",
      "k8s_port": 8000,
      "auto_discover": true
    }
  }
}
```

| Option | Type | Description |
|---|---|---|
| `url` | string | Full URL to the vLLM OpenAI-compatible endpoint |
| `model` | string | Model name to use (or `"auto"` to use the requested model) |
| `api_key` | string | Optional API key for authentication |
| `max_tokens` | number | Maximum tokens for completion |
| `temperature` | number | Sampling temperature (0-2) |
When `auto_discover` is enabled, the plugin automatically constructs the vLLM endpoint URL using the K8s service DNS name:

```
http://{k8s_service_name}.{k8s_namespace}.svc.cluster.local:{k8s_port}/v1
```
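For example, with the `k8s_service_name`, `k8s_namespace`, and `k8s_port` values shown above, the discovered endpoint would be `http://vllm-service.ml-inference.svc.cluster.local:8000/v1`. A quick way to sanity-check it from inside the cluster is a throwaway curl pod (the pod name and image below are arbitrary examples):

```bash
# Launch a temporary pod and query the auto-discovered endpoint.
# "vllm-probe" and curlimages/curl are placeholders; any image with curl works.
kubectl run vllm-probe --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://vllm-service.ml-inference.svc.cluster.local:8000/v1/models
```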
Use Ollama as a secondary local inference option:
```json
{
  "onprem": {
    "mode": "onprem",
    "ollama": {
      "enabled": true,
      "host": "localhost",
      "port": 11434,
      "models": ["llama3.1:70b", "codellama:34b"]
    }
  }
}
```

| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable Ollama integration |
| `host` | string | `"localhost"` | Ollama server hostname |
| `port` | number | `11434` | Ollama server port |
| `models` | string[] | `[]` | List of available models |
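Ollama can only serve models that are already present on the host, so each entry in `models` should be pulled (or imported) before the environment is cut off from the internet, for example:

```bash
# Pull the models referenced in the config while registry access is still
# available, then confirm they are installed locally.
ollama pull llama3.1:70b
ollama pull codellama:34b
ollama list
```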
Map cloud model identifiers to your local models:
```json
{
  "onprem": {
    "model_mapping": {
      "anthropic/claude-opus-4-5": "meta-llama/Llama-3.1-70B-Instruct",
      "anthropic/claude-sonnet-4-5": "meta-llama/Llama-3.1-8B-Instruct",
      "openai/gpt-5.2": "Qwen/Qwen2.5-72B-Instruct",
      "google/gemini-3-pro-high": "meta-llama/Llama-3.1-70B-Instruct"
    }
  }
}
```

This allows existing agent configurations to work seamlessly with your local models.
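As a rough illustration (the exact request the plugin builds may differ), a completion requested for `anthropic/claude-opus-4-5` would reach the vLLM endpoint from the earlier example carrying the mapped local model name:

```bash
# Approximate shape of the remapped request that lands on vLLM.
curl http://vllm-service:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```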
Display your company branding in the terminal:
```json
{
  "onprem": {
    "branding": {
      "company_name": "Sionic AI",
      "welcome_message": "Welcome to Sionic AI Development Environment",
      "primary_color": "#0066CC",
      "logo_ascii": " _____ _ _ \n / ____(_) (_) \n| (___ _ ___ _ __ _ ___ \n \\___ \\| |/ _ \\| '_ \\| |/ __| \n ____) | | (_) | | | | | (__ \n|_____/|_|\\___/|_| |_|_|\\___| "
    }
  }
}
```

| Option | Description |
|---|---|
| `company_name` | Displayed in startup toast |
| `welcome_message` | Custom welcome message |
| `primary_color` | Hex color for UI elements |
| `logo_ascii` | ASCII art logo for terminal |
| `logo_path` | Path to logo file |
When mode is set to "onprem", the following features are automatically disabled:
| Feature | Status | Reason |
|---|---|---|
| context7 MCP | Disabled | Requires internet |
| websearch_exa MCP | Disabled | Requires internet |
| grep_app MCP | Disabled | Requires internet |
| Google OAuth | Disabled | Requires internet |
| Auto-update checker | Limited | Requires npm registry |
| Claude Code MCP loader | Disabled | External MCPs may require internet |
All local features continue to work normally:
- LSP tools (hover, goto definition, references, rename, etc.)
- AST-grep search and replace
- File operations (grep, glob)
- Background agents
- Todo management
- Session recovery
Override agent models to use your local endpoints:
```json
{
  "agents": {
    "Sisyphus": {
      "model": "vllm/meta-llama/Llama-3.1-70B-Instruct"
    },
    "oracle": {
      "model": "vllm/Qwen/Qwen2.5-72B-Instruct"
    },
    "librarian": {
      "model": "vllm/meta-llama/Llama-3.1-70B-Instruct"
    },
    "explore": {
      "model": "ollama/llama3.1:8b"
    },
    "frontend-ui-ux-engineer": {
      "model": "vllm/meta-llama/Llama-3.1-70B-Instruct"
    },
    "document-writer": {
      "model": "vllm/meta-llama/Llama-3.1-8B-Instruct"
    },
    "multimodal-looker": {
      "disable": true
    }
  }
}
```

| Agent | Purpose | Recommended Model Size |
|---|---|---|
| Sisyphus | Primary orchestrator | 70B+ |
| oracle | Architecture & debugging | 70B+ |
| librarian | Documentation & research | 70B+ |
| explore | Fast codebase exploration | 8B-34B |
| frontend-ui-ux-engineer | UI generation | 70B+ |
| document-writer | Technical writing | 8B-34B |
| multimodal-looker | Image/PDF analysis | Disable if no multimodal model |
Full configuration reference:

```json
{
  "$schema": "https://raw.githubusercontent.com/sionic-ai/oh-my-onpremcode/master/assets/oh-my-onpremcode.schema.json",
  "onprem": {
    "mode": "onprem",
    "vllm": {
      "endpoints": [
        {
          "url": "http://vllm-primary:8000/v1",
          "model": "meta-llama/Llama-3.1-70B-Instruct"
        },
        {
          "url": "http://vllm-secondary:8000/v1",
          "model": "Qwen/Qwen2.5-72B-Instruct"
        }
      ],
      "k8s_namespace": "ml-inference",
      "k8s_service_name": "vllm-gateway",
      "k8s_port": 8000,
      "auto_discover": true,
      "default_endpoint": "http://vllm-primary:8000/v1"
    },
    "ollama": {
      "enabled": true,
      "host": "ollama-service",
      "port": 11434,
      "models": ["llama3.1:70b", "codellama:34b", "qwen2.5:32b"]
    },
    "model_mapping": {
      "anthropic/claude-opus-4-5": "meta-llama/Llama-3.1-70B-Instruct",
      "anthropic/claude-sonnet-4-5": "meta-llama/Llama-3.1-8B-Instruct",
      "openai/gpt-5.2": "Qwen/Qwen2.5-72B-Instruct",
      "google/gemini-3-pro-high": "meta-llama/Llama-3.1-70B-Instruct",
      "google/gemini-3-flash": "meta-llama/Llama-3.1-8B-Instruct"
    },
    "branding": {
      "company_name": "Your Company",
      "welcome_message": "AI-Powered Development Environment",
      "primary_color": "#0066CC"
    },
    "disable_internet_features": true
  },
  "agents": {
    "Sisyphus": { "model": "vllm/meta-llama/Llama-3.1-70B-Instruct" },
    "oracle": { "model": "vllm/Qwen/Qwen2.5-72B-Instruct" },
    "librarian": { "model": "vllm/meta-llama/Llama-3.1-70B-Instruct" },
    "explore": { "model": "ollama/llama3.1:8b" },
    "frontend-ui-ux-engineer": { "model": "vllm/meta-llama/Llama-3.1-70B-Instruct" },
    "document-writer": { "model": "vllm/meta-llama/Llama-3.1-8B-Instruct" },
    "multimodal-looker": { "disable": true }
  },
  "disabled_hooks": [],
  "disabled_agents": [],
  "sisyphus_agent": { "disabled": false }
}
```

Prerequisites:

- OpenCode >= 1.0.150 installed
- vLLM or Ollama running in your environment
- Network access from OpenCode to inference endpoints
Example vLLM deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: ml-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-3.1-70B-Instruct
            - --tensor-parallel-size=4
            - --max-model-len=32768
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 4
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-inference
spec:
  selector:
    app: vllm-llama
  ports:
    - port: 8000
      targetPort: 8000
```

Example Docker Compose setup:

```yaml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: --model meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
volumes:
  ollama-data:
```

To troubleshoot connection issues, test the inference endpoints directly:

```bash
# Test vLLM endpoint
curl http://vllm-service:8000/v1/models
# Test chat completion
curl http://vllm-service:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"messages": [{"role": "user", "content": "Hello"}]
}'
# Check K8s service
kubectl get svc -n ml-inference
kubectl describe svc vllm-service -n ml-inference
# Check pod logs
kubectl logs -n ml-inference deployment/vllm-llama
```

```bash
# Test Ollama
curl http://localhost:11434/api/tags
# List available models
ollama list
# Pull a model
ollama pull llama3.1:70b
# Test chat
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1:70b",
"messages": [{"role": "user", "content": "Hello"}],
"stream": false
}'
```

```bash
# Enable debug mode
export COMMENT_CHECKER_DEBUG=1
# Check logs
tail -f /tmp/oh-my-onpremcode.log
```

| Issue | Solution |
|---|---|
| "No vLLM endpoint configured" | Add at least one endpoint to vllm.endpoints |
| Connection timeout | Check network access to inference service |
| Model not found | Verify model name matches vLLM/Ollama model |
| Out of memory | Reduce max_tokens or use smaller model |
If migrating from the original oh-my-opencode:
- Rename the config file:

  ```bash
  mv ~/.config/opencode/oh-my-opencode.json ~/.config/opencode/oh-my-onpremcode.json
  ```

- Update the plugin reference in `opencode.json`:

  ```json
  { "plugin": ["oh-my-onpremcode"] }
  ```

- Add the onprem configuration:

  ```json
  { "onprem": { "mode": "onprem", "vllm": { ... } } }
  ```

- Update agent model references if using cloud models.
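If you want the rename step scripted with a backup copy, a minimal sketch using the default config paths above is:

```bash
# Keep a backup of the old config, then move it to the new plugin's filename.
CONFIG_DIR=~/.config/opencode
cp "$CONFIG_DIR/oh-my-opencode.json" "$CONFIG_DIR/oh-my-opencode.json.bak"
mv "$CONFIG_DIR/oh-my-opencode.json" "$CONFIG_DIR/oh-my-onpremcode.json"
```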
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
Based on oh-my-opencode by @code-yeongyu.
Maintained by Sionic AI
