Skip to content

LLM Benchmark Results

Benchmarks compare LLM models for HomeBotAI agent-style tasks. Tests were run on a Mac Mini (M4, 64GB RAM). Automated suites live under tests/llm/.

Result files

  • Task quality (benchmark suite): tests/llm/results/benchmark_2026-03-22_18-47-09.json
  • Tool calling: tests/llm/results/tool_calling_2026-03-22_19-48-02.json

Benchmark suite (task quality)

Configuration: 2 iterations per task, 12 runs per model.

Model Pass rate Avg latency Min / max Total tokens Runs
gemini-2.5-flash 100.0% 1.7s 1.2s / 2.2s 2428 12
qwen3.5:9b 75.0% 76.5s 25.1s / 109.0s 9495 12
qwen3.5:4b 33.3% 54.3s 28.0s / 72.5s 11607 12
qwen3.5:2b 33.3% 38.9s 25.1s / 51.0s 11859 12
sorc/qwen3.5-claude-4.6-opus-q4:9b 83.3% 24.7s 14.9s / 57.1s 3415 12
sorc/qwen3.5-claude-4.6-opus-q4:4b 91.7% 33.1s 12.1s / 77.2s 6051 12
sorc/qwen3.5-claude-4.6-opus-q4:2b 100.0% 8.5s 4.2s / 19.5s 3663 12

Per-task results

Scores are passed runs / total runs (out of 2 iterations per task).

Task gemini-2.5-flash qwen3.5:9b qwen3.5:4b qwen3.5:2b sorc/...q4:9b sorc/...q4:4b sorc/...q4:2b
basic_chat 2/2 2/2 0/2 0/2 1/2 1/2 2/2
ha_parsing 2/2 1/2 0/2 0/2 2/2 2/2 2/2
json_structured 2/2 2/2 2/2 2/2 1/2 2/2 2/2
media_query 2/2 2/2 0/2 0/2 2/2 2/2 2/2
skill_prompt 2/2 2/2 2/2 2/2 2/2 2/2 2/2
summarization 2/2 0/2 0/2 0/2 2/2 2/2 2/2

Tool calling suite

Configuration: 2 iterations per scenario, 8 runs per model. All models scored 100% pass rate.

Model Pass rate Avg latency Min / max Total tokens Runs
gemini-2.5-flash 100.0% 1.5s 1.1s / 1.8s 2550 8
qwen3.5:9b 100.0% 16.4s 12.0s / 20.4s 4872 8
qwen3.5:4b 100.0% 11.8s 9.0s / 15.5s 4979 8
qwen3.5:2b 100.0% 6.6s 4.7s / 8.7s 4951 8
sorc/qwen3.5-claude-4.6-opus-q4:9b 100.0% 14.9s 12.4s / 15.7s 4755 8
sorc/qwen3.5-claude-4.6-opus-q4:4b 100.0% 9.4s 6.2s / 14.8s 4749 8
sorc/qwen3.5-claude-4.6-opus-q4:2b 100.0% 5.9s 3.9s / 9.1s 5079 8

Per-scenario results

Every scenario achieved 2/2 passes for all models listed above.

Scenario Description
tool_light_control Turn lights on/off with brightness
tool_media_search Search media libraries
tool_thermostat_set Set thermostat temperature
tool_weather_query Query weather conditions

Notes

Interpretation

  • Latency depends on hardware, Ollama load, and model preload or unload behavior.
  • Pass rate comes from automated validators, not subjective human ratings of answer quality.

Regenerating results

From the project root (with the correct Python environment):

python tests/llm/test_benchmark.py
python tests/llm/test_tool_calling.py