Deep dives
Benchmark notes.
Longer writeups can sit here when a model needs more than a scorecard: tool-use behavior, evidence traps, safety misses, and domain-level analysis.