Why Cloudflare Workers AI Large Models Are Fixing the Agent Cost Crisis Field Notes: Finally making AI agents not bankrupt us We at A...
Why Cloudflare Workers AI Large Models Are Fixing the Agent Cost Crisis
Field Notes: Finally making AI agents not bankrupt us
We at ATXSoft have been hitting a wall with GPT-4 costs for months now. It's honestly exhausting. You build this incredible agent, it works perfectly in a sandbox, and then you see the bill for a week's worth of background testing and realize you just spent the rent money on tokens. It's the "agent tax." Every time your script loops, it has to re-read the same 2,000 tokens of instructions. You're basically paying for the AI to have short-term memory loss over and over again.
Stopping the short-term memory loss cycle
Anyway, Cloudflare finally did something about it. They added support for big models on Workers AI, starting with Kimi K2.5. The weird thing is that it’s not just about the 256k context window though that’s plenty for most real work, even if it’s not the million-plus tokens you get with Gemini 1.5 Pro. The real win is that it’s running on the edge where the latency doesn’t kill you. But here’s the kicker: they added this x-session-affinity header. I spent way too long looking at the docs on this, but it’s basically just a way to tell the load balancer to send your request back to the exact same GPU that already has your long system prompt cached. It’s prefix caching that actually works. You stop paying for the same instructions twice. It’s simple, it’s boring, and it’s beautiful.
Seventy-seven percent and the end of timeout anxiety
Seventy-seven percent. That is the cost reduction Cloudflare saw when they moved their internal security agents to this setup.
To be honest, the new pull-based Async API is the other half of the sanity check. No more holding connections open while a large model thinks, only for the gateway to time out and ruin the whole run. You just fire the job, let the infrastructure handle the queue, and pull the result when it’s done. It makes the whole workflow feel less like a house of cards. The team and I were just talking about this over coffee this is how we actually get agents to work in production without going broke. Anyway, my coffee’s cold and I’ve got three more deployment scripts to fix. Back to the terminal.
Reference
- Source: Powering the agents: Workers AI now runs large models (Cloudflare Blog, 2026).
![[featured] A modern, high-fidelity conceptual render of the Kimi K2.5 model running on Cloudflare’s Infire inference engine.](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj89zMuISmB8l655TJz9tQLN2bT2x8zzVGwB71V78tGuCtdh4lNpJ5bYgF0Db0XfxPElYoafi5v6PrSWSqyg4ZZz51rE1LwO0GXU2OxIK1YXEKbMtfxdCqWhETsg1WFipnTn43HlNqsfPgpzPtjvkl5_-LH6CBpyFuSCXAIkFAO1sJb8XB9iaCTbbAWYRJd/w320-h213/kimi-k25-model-infrastructure-cloudflare.webp)