Operations Runbook

Edited by Lawrence Beckwith on February 22, 2026 at 4:40 AM UTC

Daily Health Checks

# Full dependency status
curl http://127.0.0.1:8080/dependencies/health

# Coordinator recent logs
docker logs otheru-coordinator --since=1h

# JetKVM bridge live check (frame_age should be < 100ms)
curl http://127.0.0.1:8005/stats

# GSD loop status (if running)
curl http://127.0.0.1:8090/gsd/report
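
The frame_age threshold in the bridge check can be tested mechanically instead of read by eye. The following is a minimal sketch, assuming the /stats response is a JSON object with a numeric frame_age field in milliseconds. The field name comes from the comment above; the unit and the response shape are assumptions to verify against a real response.

```python
import json

FRAME_AGE_LIMIT_MS = 100  # threshold from the health-check comment


def frame_age_ok(stats_json: str, limit_ms: float = FRAME_AGE_LIMIT_MS) -> bool:
    """Return True when the bridge reports a fresh frame.

    Assumes the body is a JSON object with a numeric `frame_age` in
    milliseconds. A missing or non-numeric field counts as unhealthy.
    """
    stats = json.loads(stats_json)
    if not isinstance(stats, dict):
        return False
    age = stats.get("frame_age")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(age, bool) or not isinstance(age, (int, float)):
        return False
    return age < limit_ms
```

Pass it the body returned by curl http://127.0.0.1:8005/stats and alert when it returns False.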

Coordinator

# Restart
docker restart otheru-coordinator

# Follow logs
docker logs otheru-coordinator -f --since=5m

# Verify routing for a prompt
docker exec otheru-coordinator python3 -c "
import sys; sys.path.insert(0, '/app')
from routing import needs_tools
print(needs_tools('your prompt here'))
"

Hardware Bridge (JetKVM)

# Bridge status
curl http://127.0.0.1:8005/stats

# Capture a screenshot
curl http://127.0.0.1:8005/screenshot | python3 -c \
  "import sys,json,base64; d=json.load(sys.stdin); \
   open('/tmp/desktop.jpg','wb').write(base64.b64decode(d['image_base64']))"

# Start a screen recording
curl -X POST http://127.0.0.1:8005/record \
  -H 'Content-Type: application/json' \
  -d '{"output_path":"/tmp/recording.mp4","duration":60,"fps":15}'

# Patch bridge.py
docker cp /path/to/bridge.py hardware-bridge:/app/bridge.py
docker restart hardware-bridge

Agent Memory Management

# List all agent states + memory usage
curl http://127.0.0.1:8080/agents/status

# Load an agent on demand
curl -X POST http://127.0.0.1:8080/agents/fara/load

# Unload to free memory
curl -X POST http://127.0.0.1:8080/agents/fara/unload

GSD Loop

curl -X POST http://127.0.0.1:8090/gsd/start
curl -X POST http://127.0.0.1:8090/gsd/pause
curl http://127.0.0.1:8090/gsd/report
curl http://127.0.0.1:8090/gsd/status
curl -X POST http://127.0.0.1:8090/gsd/set-objective \
  -H 'Content-Type: application/json' -d '{"objective":"..."}'

Incident Triage

  1. Identify scope — which tier or service is failing?
  2. Check logs: docker logs <container> --since=10m
  3. Validate dependencies: curl http://127.0.0.1:8080/dependencies/health
  4. Apply targeted fix — restart container, patch config, reload model
  5. Verify recovery — re-run health checks
  6. Document — update runbook with root cause and fix
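
The command-driven triage steps (2, 3 and 5) can be generated per container so they are run the same way every time. This is only a sketch: the function name is hypothetical, and step 5 is reduced to re-running the dependency health check, whereas a full recovery check may also repeat the other daily health checks.

```python
import shlex

HEALTH_URL = "http://127.0.0.1:8080/dependencies/health"  # from step 3


def triage_commands(container: str, since: str = "10m") -> list[str]:
    """Return the shell commands for triage steps 2, 3 and 5, in order."""
    name = shlex.quote(container)  # guard against odd container names
    return [
        f"docker logs {name} --since={since}",  # step 2: check logs
        f"curl {HEALTH_URL}",                   # step 3: validate dependencies
        f"curl {HEALTH_URL}",                   # step 5: re-run after the fix
    ]
```

For example, triage_commands("otheru-coordinator") yields the log command for the coordinator followed by the health check before and after the fix.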

Common Issues

KVM requests not routing to Fara

Check that KVM-related keywords are present in routing_policy.json under tool_intent_keywords.

docker exec otheru-coordinator python3 -c "
import sys; sys.path.insert(0, '/app')
from routing import needs_tools
print(needs_tools('use the kvm to open notepad'))  # should be True
"

If it prints False, add the missing keyword patterns to core/config/routing_policy.json and restart the coordinator.
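
Before editing the policy, it can help to see which expected keywords are actually missing. The sketch below assumes tool_intent_keywords is a flat JSON array of strings. The real file may hold patterns or a nested structure, so treat both the shape and the example keyword list as hypothetical.

```python
import json


def missing_keywords(policy_text: str, required: list[str]) -> list[str]:
    """List the required keywords absent from tool_intent_keywords.

    Assumes `tool_intent_keywords` is a flat array of strings. The
    comparison is literal and case-insensitive; it does not evaluate
    patterns.
    """
    policy = json.loads(policy_text)
    present = {str(k).lower() for k in policy.get("tool_intent_keywords", [])}
    return [k for k in required if k.lower() not in present]
```

Read core/config/routing_policy.json on the host and call, for instance, missing_keywords(text, ["kvm", "jetkvm"]). An empty list means every literal keyword is present.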

Display goes dark after EDID change

Custom low-resolution EDIDs may lack fallback timing modes. Reset to the native display EDID:

curl -X POST http://127.0.0.1:8005/edid \
  -H 'Content-Type: application/json' \
  -d '{"edid_name": "T749-fHD720"}'

The call may time out even though the change has been applied. Restart the bridge (docker restart hardware-bridge), then re-check http://127.0.0.1:8005/stats to confirm.
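
Because a timeout here does not mean the change failed, a scripted EDID reset should report a timeout as unconfirmed rather than as an error. This is a minimal sketch using only the standard library. The function name, return values, 10-second timeout and injectable opener parameter are all illustrative assumptions, not part of the bridge API.

```python
import json
import socket
import urllib.error
import urllib.request

EDID_URL = "http://127.0.0.1:8005/edid"


def set_edid(edid_name: str, timeout_s: float = 10.0,
             opener=urllib.request.urlopen) -> str:
    """POST the EDID change and return "confirmed" or "timed_out".

    A timeout is returned, not raised, because the change can still
    apply. `opener` is injectable so the function can be tested offline.
    """
    body = json.dumps({"edid_name": edid_name}).encode("utf-8")
    request = urllib.request.Request(
        EDID_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout_s) as response:
            response.read()
        return "confirmed"
    except (TimeoutError, socket.timeout):
        # Read timeouts surface directly.
        return "timed_out"
    except urllib.error.URLError as exc:
        # Connect timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return "timed_out"
        raise
```

On "timed_out", follow the manual path: restart the bridge and check its stats before assuming the reset failed.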

Memory pressure / model load failures

Unload agents that aren't currently needed:

curl -X POST http://127.0.0.1:8080/agents/fara/unload
curl -X POST http://127.0.0.1:8080/agents/reasoner/unload

Check overall memory with free -h and docker stats --no-stream.
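
To decide what to unload first, the docker stats output can be ranked by memory share. The sketch below parses the output of docker stats --no-stream --format '{{json .}}', which prints one JSON object per container. The helper name is hypothetical, and it reports container memory only, not per-agent usage inside a container.

```python
import json


def rank_by_memory(stats_lines: str) -> list[tuple[str, float]]:
    """Parse `docker stats --no-stream --format '{{json .}}'` output.

    Returns (container name, memory percent) pairs, highest first.
    Rows whose MemPerc is not a number (e.g. "--") are skipped.
    """
    rows = []
    for line in stats_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        try:
            percent = float(entry["MemPerc"].rstrip("%"))
        except ValueError:
            continue
        rows.append((entry["Name"], percent))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Pipe the command output into a script that calls rank_by_memory and start with the containers at the top of the list.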

Coordinator doesn't start after a code change

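The usual cause is a syntax error in a bind-mounted source file. Validate the changed file first, and restart only if it compiles:

cd otheru-core/core/coordinator
# && ensures the restart runs only when compilation succeeds
python3 -c "import py_compile; py_compile.compile('changed_file.py', doraise=True)" \
  && docker restart otheru-coordinator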
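
When several files changed, the same py_compile check can be run over all of them at once. This is a small sketch with a hypothetical helper name. Note that py_compile writes a .pyc into __pycache__ next to each file that compiles.

```python
import py_compile


def syntax_errors(paths: list[str]) -> dict[str, str]:
    """Compile each file and return {path: error message} for failures."""
    failures = {}
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as exc:
            failures[path] = exc.msg
    return failures
```

Restart the coordinator only when the returned dict is empty; otherwise the messages point at the offending file and line.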

Release Hygiene

  • Keep compose and env changes version-controlled
  • Stage risky changes with a controlled rollout
  • Coordinator source is bind-mounted — edit on host, never inside the container
  • Bridge changes require docker cp + restart
  • routing_policy.json changes require docker restart otheru-coordinator

WMMA Ops Profiling

Standard workflow

  1. Run timing benchmarks for kernel variants
  2. Capture HIP traces for launch behavior and synchronization overhead
  3. Use rocprofv3 where available to compare instruction and memory patterns

gfx1151 limitation

Many hardware performance counters are unavailable on consumer gfx1151 (aqlprofile-backed counters fail). Prioritize:

  • Wall-clock benchmark stability
  • PyTorch profiler traces
  • HIP API trace analysis

Rule: Promote a kernel variant only when it passes both correctness checks and repeatable timing benchmarks. Keep the adaptive fallback enabled.
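
The promotion rule can be expressed as a simple gate over benchmark samples. This sketch treats "repeatable" as a low coefficient of variation across runs. The 5% threshold and both function names are assumptions for illustration, not values from the profiling workflow.

```python
import statistics


def timing_is_repeatable(samples_ms: list[float], max_cv: float = 0.05) -> bool:
    """True when stdev / mean of the samples is at or below max_cv."""
    if len(samples_ms) < 2:
        return False  # a single run says nothing about stability
    mean = statistics.fmean(samples_ms)
    if mean <= 0:
        return False
    return statistics.stdev(samples_ms) / mean <= max_cv


def should_promote(correct: bool, samples_ms: list[float]) -> bool:
    """Promote only when correctness passes and timing is repeatable."""
    return correct and timing_is_repeatable(samples_ms)
```

Since hardware counters are limited on gfx1151, wall-clock samples like these carry most of the decision, so collect enough runs for the spread to be meaningful.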