And for anyone who wants a real deep dive, here's the entire plan:
# Plan: Full Multi-Metric Evaluator (quality, security, human_like)
## Context
MorganBench scores runs on `functional` and `efficiency` today; `quality`,
`security`, and `human_like` are explicit `None` placeholders in
`harness/evaluator/scorer.py` and excluded from the weighted `total`. The
project's spec calls for all five metrics. This feature implements the three
missing ones so a run produces a complete, honest score — without faking values
when a tool or API key is unavailable.
Decisions (from clarifying questions):
- **Judges**: 3-judge `human_like` ensemble runs **on by default**, reusing the
run's `provider:model` with 3 distinct reviewer personas. `--judge-model`
overrides; if no API key/provider error, `human_like=null` with a reason
(mock runs judge for free, deterministically).
- **Languages**: analyzers cover **Python, TS/JS, Go, Java**. Python is fully
native (added pip deps). The others shell out to their toolchains *if present
on PATH*; when a tool is missing, that language reports `null` a recorded
reason — never a silent zero or inflated score.
- **Weights**: `functional 0.50, quality 0.15, security 0.15, efficiency 0.10,
human_like 0.10`, re-normalized over whichever metrics are non-null.
## Design overview
New modules under `harness/evaluator/`:
- `
langs.py` — file-extension → language map; per-language analyzer registries.
- `
quality.py` — `assess_quality(workspace, changed_files) -> MetricResult`.
- `
security.py` — `assess_security(workspace, changed_files) -> MetricResult`.
- `
judges.py` — `assess_human_like(issue, patch, verifier_summary, provider, n=3) -> MetricResult`.
- `
evaluate.py` — `evaluate_run(...)` orchestrator the loop calls; runs all
metrics, records per-metric trace events, returns the final scores dict.
`MetricResult` (pydantic): `score: float | None`, `details: dict` (tool used,
per-language/per-file counts, and `reason` when skipped). Scores are 0–1.
Analysis is scoped to the **agent's changed files** (the git diff), not the
whole repo, so we measure the agent's work. Add `_git_changed_files(workspace)`
to `harness/agent/loop.py` (parse `git diff --cached --name-only`).
###
quality.py
Per language, run the linter/complexity tools on changed files of that language;
combine sub-scores, then average across analyzed languages.
- **Python (native deps)**: `ruff check --output-format=json` (violation
density), `radon cc -j` (cyclomatic complexity), `radon mi -j` (maintainability
index). `quality_py = mean(lint_score, complexity_score, mi_score)` where
`lint_score = max(0, 1 - violations/changed_loc)`, `complexity_score` maps avg
CC (≤5 → 1.0, decaying), `mi_score = MI/100`. Formula constants live at module
top, documented and tunable.
- **TS/JS**: `npx --no-install eslint -f json` if resolvable, else `null` reason.
- **Go**: `gofmt -l` `go vet` if `go` on PATH, else `null` reason.
- **Java**: `checkstyle` if on PATH, else `null` reason.
- If no language analyzed → `score=None`, `details.reason="no analyzer available"`.
###
security.py
Per language, severity-weighted: `score = clamp(1 - (0.4·high 0.15·med 0.05·low), 0, 1)`.
- **Python (native dep)**: `bandit -r <changed .py> -f json`.
- **TS/JS**: `npm audit --json` if `package.json` lockfile present, else `null` reason.
- **Go**: `gosec -fmt=json` (or `govulncheck`) if on PATH, else `null` reason.
- **Java**: `spotbugs`/find-sec-bugs if on PATH, else `null` reason.
- Reuse: point `
LocalToolExecutor.security_scan` (`harness/agent/local_executor.py:121`)
at `assess_security` so the agent's `security_scan` tool returns real data.
###
judges.py
- 3 personas (CORRECTNESS, READABILITY, MAINTAINABILITY). Each builds a chat:
system = persona rubric; user = issue patch one-line verifier outcome
instruction to return **only** JSON `{"score": 0-100, "rationale": "..."}` and
the sentinel string `MORGANBENCH_JUDGE`.
- Call `provider.complete(messages, tools=[])`; `_extract_json()` robustly pulls
the JSON (first `{...}` block, regex `"score"` fallback). `human_like =
mean(valid votes)/100`. `details` = per-judge score rationale model persona.
- Each judge independent: a `ProviderError` or unparseable reply drops that vote
(note it); if zero valid votes → `human_like=null` reason. Judge token usage
is summed into `details.judge_tokens` and recorded as `judge_vote` trace events
— **not** added to the run's `total_tokens` (efficiency measures the agent).
- `MockProvider.complete` (`harness/agent/providers.py:227`): add a branch — if
any message content contains `MORGANBENCH_JUDGE`, return
`LLMResponse(content='{"score": 80, "rationale": "mock judge"}')`. Keeps
`mock:synthetic` deterministic for the demo and tests.
###
scorer.py changes
Extend `score_run` to accept `quality`, `security`, `human_like` kwargs (default
`None`); set `_WEIGHTS` to the five-metric defaults above. Existing
present-only re-normalization (
scorer.py:83-87) already handles `None` exclusion.
###
evaluate.py orchestrator
```
evaluate_run(*, functional, total_tokens, steps, workspace, patch,
changed_files, languages, issue, verifier_payload,
judges_enabled, judge_provider, collector=None) -> dict
```
Runs quality security judges (when enabled), records `quality_detail`,
`security_detail`, `judge_vote` events via `collector`, then returns
`score_run(...)` output augmented with each metric's `details` under a
`metric_details` key in the summary `extra`.
###
loop.py cli.py integration
- `harness/agent/loop.py:112-114`: after `_verify`, compute
`changed = _git_changed_files(workspace)`, build the judge provider
(`get_provider(judge_model or model)`), call `evaluate_run(...)` instead of
`score_run(...)`, then `collector.record("metric_computed", scores)`.
- `run_agent` signature gains `judges: bool = True`, `judge_model: str | None = None`.
- `harness/cli.py` `run` gains `--judges/--no-judges` (default on) and
`--judge-model` options, passed through to `run_agent`.
### Dependencies (`pyproject.toml`)
Move `ruff` into main `dependencies` (now used at runtime) and add `bandit>=1.7`
and `radon>=6.0`. Go/Java/JS tools are external (documented as optional in
`docs/ROADMAP.md`/README, not pip deps).
## Files to create / modify
- **Create**: `harness/evaluator/langs.py`, `
quality.py`, `
security.py`,
`
judges.py`, `
evaluate.py`.
- **Modify**: `harness/evaluator/scorer.py` (weights new kwargs),
`harness/evaluator/__init__.py` (exports), `harness/agent/loop.py`
(`_git_changed_files`, call `evaluate_run`, new args), `harness/agent/cli.py`
→ `harness/cli.py` (flags), `harness/agent/providers.py` (`MockProvider`
judge branch), `harness/agent/local_executor.py` (`security_scan` reuse),
`pyproject.toml` (deps).
- **Tests**: `tests/test_quality.py`, `tests/test_security.py`,
`tests/test_judges.py` (new); update `tests/test_evaluator.py` (the two tests
asserting `quality/security is None`), `tests/test_agent_loop.py` (assert
non-null metrics for mock hello-world), `tests/test_cli.py` (`--no-judges`).
## Honesty / degradation rules (must hold)
- A missing toolchain → that language's metric is `null` with a `reason`, logged
to the trace; aggregate over analyzed languages only; all-missing → metric `null`.
- Judge failure / no API key → `human_like=null` with reason; never a fabricated score.
- Judge tokens never inflate or deflate `efficiency`.
## Verification
1. `pip install -e ".[dev]"` (picks up bandit/radon/ruff).
2. `morganbench run --task hello-world --model mock:synthetic` → `summary.json`
`scores` now has non-null `quality`, `security`, `human_like`, and a `total`
that blends all five; `metric_details` present; `trace.jsonl` shows
`quality_detail`/`security_detail`/`judge_vote` events.
3. `morganbench run --task hello-world --model mock:synthetic --no-judges` →
`human_like` is `null`, total re-normalizes over the other four.
4. Negative-path check: a workspace file with a bandit-flagged pattern (e.g.
`
subprocess.call(x, shell=True)`) drops `security` below 1.0.
5. `ruff check harness backend tests` clean; `mypy harness` clean;
`pytest` green; core harness coverage stays ≥80%.
6. Dashboard `/run/[id]` already renders all five score keys — they now show
real values instead of `n/a` (no dashboard change needed).