Work & Research

Development May 03, 2026

My Memory System Cheated To Beat LongMemEval Until I Fixed It

Last week I ran Weft, my homegrown memory layer, through LongMemEval and it scored 69.0% overall, 72.1% task-averaged. I was pleased. I shouldn't have been. The number was a lie, and the system that produced it was destroying the test data and calling it a win.


Research April 09, 2026

I Benchmarked Anthropic's Advisor Strategy on Task Decomposition. The Expensive Model Was the Worst.

Anthropic's Advisor Strategy promises near-Opus intelligence at near-Sonnet cost. It is a server-side tool that pairs a cheap executor model with an expensive advisor.
