They do, but I’ve seen a huge slowdown in “getting better” over the last year. I wonder if it’s my perception or reality. Each model does better on benchmarks, but I’m still experiencing at least a 50% failure rate on _basic_ task completion, and that number hasn’t improved in many months.