GLM-5.2 just crushed our latest Python/data-engineering eval.
It didn’t win the approval-gate tests — but it came closest to Opus-4.8 in the areas that actually matter for real engineering work.
Key highlights:
Financial time-series / pandas logic (its strongest area)
• Saw that rolling(window=3) is row-based, not time-based
• Recognized that sparse rolling("3D") still misses calendar gaps
• Shifted to dense daily calendar rolling
• Handled empty, single-row, and all-NaN edge cases cleanly
• Self-corrected mid-process (“earlier fix was incomplete”)
Opus-4.8: ~9.5
GLM-5.2: ~9.3
Gap: ~0.2
On pure pandas mechanics, GLM-5.2 is basically in the same tier.
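For anyone who wants to see the rolling distinction concretely, here is a minimal sketch. It is not the eval task or either model's output: the price series, the helper name dense_rolling_mean, and the choice of min_periods equal to the window are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sparse daily closes: no rows for Jan 3, Jan 4, or Jan 6.
prices = pd.Series(
    [10.0, 11.0, 12.0, 13.0],
    index=pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-07"]
    ),
)

# Row-based: averages the last 3 rows, no matter how far apart the dates are.
row_based = prices.rolling(window=3).mean()

# Time-based: averages whatever rows fall in the trailing 3-day window.
# On sparse data this quietly averages fewer observations.
time_based = prices.rolling("3D").mean()


def dense_rolling_mean(s: pd.Series, days: int = 3) -> pd.Series:
    """Rolling mean over a dense daily calendar (illustrative helper).

    Assumption: a window only counts if every calendar day in it has a value,
    so gaps show up as NaN instead of being averaged away.
    """
    if s.empty:
        # Nothing to reindex; return an empty float series.
        return s.astype(float)
    daily = s.asfreq("D")  # missing calendar days become explicit NaN rows
    return daily.rolling(window=days, min_periods=days).mean()


dense_based = dense_rolling_mean(prices)
```

In this toy series the row-based mean on Jan 5 spans Jan 1 to Jan 5, the "3D" mean on Jan 5 is built from a single observation, and the dense version returns NaN for every window because none of them has three consecutive days of data. Whether gaps should stay NaN or be filled is a business decision the sketch does not settle.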
Daily revenue / grouped transaction metrics
Understood how to group multiple transactions per day, keep refunds negative, insert explicit zero-revenue days, and resample/normalize properly.
Opus-4.8: ~9.4–9.5
GLM-5.2: ~9.2–9.3
Gap: ~0.2–0.3
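A rough sketch of what that looks like in pandas, under assumed data: the tx frame, its column names (ts, amount), and the convention that a refund is a negative amount are hypothetical and not taken from the eval.

```python
import pandas as pd

# Hypothetical transactions: two sales on Mar 1, a refund on Mar 2,
# nothing on Mar 3, one sale on Mar 4.
tx = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            [
                "2024-03-01 09:15",
                "2024-03-01 17:40",
                "2024-03-02 11:05",
                "2024-03-04 08:30",
            ]
        ),
        "amount": [120.0, 80.0, -50.0, 200.0],  # negative = refund (assumed)
    }
)

daily_revenue = (
    tx.assign(day=tx["ts"].dt.normalize())  # drop the time of day
    .groupby("day")["amount"]
    .sum()  # refunds stay negative and net against sales
    .asfreq("D", fill_value=0.0)  # days with no transactions become 0.0
)
```

Here Mar 3 appears as an explicit 0.0 row rather than being absent, and Mar 2 stays at -50.0 instead of being clipped or dropped. tx.resample("D", on="ts")["amount"].sum() should give a comparable result, since empty bins sum to zero.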
React hidden-state debugging
Correctly separated filtering (table) from selection (detail panel) and fixed the bug at the architectural level.
Gap: ~0.4–0.6
Where it still falls short: Approval-gate discipline. It understands the gates conceptually but drifts back into PRD-style review inside the build gate and is less strict about self-rejection. That ~1.0–1.4 gap is why it doesn’t replace Opus/Fable/GPT in the gates yet.
Bottom line: GLM-5.2 is now the closest model to Opus-4.8 on pandas/financial data logic, grouped metrics, and second-pass refinement. Huge step forward on implementation mechanics. Still needs more work on design authority, gate discipline, and knowing when its own answer isn’t good enough.
Impressive run, GLM team.