inference

2026-06-11 google

DiffusionGemma: Text Diffusion Finally Reaches Mainstream Open Source

Google open-sourced the first mainstream text diffusion model. The real story isn't speed. It's that the local decode bottleneck moves from memory bandwidth to compute, because bidirectional attention generates 256 tokens at once. The costs: output quality, experimental status, and the trade-offs of a 26B MoE.

open-models inference local-ai


2026-06-10 deepseek

DeepSeek V4 Moves 1M Context Into the Cost-Structure Era

DeepSeek V4 matters because it turns 1M context from a capability demo into a cost, routing, and product-default problem for builders.

frontier-models frontier-progress ai-infra


2026-06-10 deepseek

DeepSeek V4's Open-Weight and API Strategy Is a Distribution Play

DeepSeek V4 pressures closed frontier models by pairing open weights with same-day API availability, compatibility, and a clear migration path.

frontier-models ai-infra inference


2026-06-10 xiaomi

MiMo UltraSpeed's Value Is the Real-Time Interaction Cost Curve

MiMo-V2.5-Pro-UltraSpeed's 1000 tps claim matters less as a speed stunt than as a change in long-output, parallel-sampling, and real-time interaction economics.

inference frontier-models ai-infra


2026-06-10 xiaomi

MiMo UltraSpeed Pulls 1T Models Toward Real-Time Agents, But Not as a General Entry Point

MiMo UltraSpeed is a strong signal for real-time agents, but limited capacity and controlled access make it a premium path rather than a universal production backend.

inference frontier-models ai-infra


2026-06-08 xiaomi

Xiaomi pushed a 1T model to 1000 tokens/s — without special hardware

MiMo-V2.5-Pro-UltraSpeed decodes a trillion-parameter model at more than 1000 tps on a single 8-GPU commodity node. The real signal is that model-system co-design broke the 'extreme speed needs custom silicon' equation, not the operating-room marketing wrapped around it.

inference frontier-models ai-infra
