Holo3.1: Pulling the Computer-Use Agent Back Onto Your Own Machine

H Company ships its first computer-use model you can run locally. It does not chase the top of the leaderboard; it tackles the problem cloud setups cannot escape: every step sends your screen off your machine.


Summary

H Company released Holo3.1, a family of computer-use models, on the official Hugging Face blog. Its central move is not to push the leaderboard peak higher. It is to ship quantized weights for the first time, so the same computer-use capability can run on a user’s own device. That is a directional bet: the next obstacle for computer-use agents may not be whether the cloud model is strong enough, but whether the local one runs fast enough and stays private enough.

The judgment worth taking seriously is this: for a sizable share of real workflows, keeping the agent local is now worth more than chasing a few more leaderboard points. That is exactly what Holo3.1 bets on. It offers four sizes (0.8B, 4B, 9B, 35B-A3B) and three quantization formats (FP8, Q4 GGUF, NVFP4), and the whole line is shaped around lower latency and data that never leaves the network, not around rank on any single benchmark.

What happened

Holo3.1 is built on the Qwen family. H Company says it improves robustness across the three dimensions that matter most in production: environments (web, desktop, mobile), agent frameworks, and deployment targets. Of the three, the latter two are where this release concentrates.

Mobile is the most visible gain. On AndroidWorld, the 35B-A3B model rises from 67% to 79.3%, while the smaller 4B and 9B variants go from 58% to 72%. The reason H Company gives is concrete: moving Holo3 from evaluation to production, they kept hitting the same wall. Strong performance in one setting does not necessarily transfer to another, and mobile devices, alternative agent harnesses, and different execution frameworks each bring their own distribution shift.

The second piece is cross-harness reach. On top of the structured JSON outputs already in Holo3, Holo3.1 natively supports function-calling protocols, which makes it easier to drop into third-party agent stacks. Per the blog, across OSWorld and an internal benchmark suite covering e-commerce, business software, and collaboration workflows, function-calling and native execution now reach near-parity; inside H Company’s own Holotab harness, Holo3.1 improves more than 25% over Holo3.
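To make the two output styles concrete, here is a minimal sketch of a harness that accepts either one and normalizes both into the same internal action. The `click` action, the `"action"` key, and the field names are hypothetical; the article does not describe Holo3.1's actual schema. The tool-call shape follows the widely used OpenAI-style convention, where arguments arrive as a JSON-encoded string.

```python
import json
from dataclasses import dataclass


@dataclass
class Action:
    name: str   # e.g. "click" (hypothetical action name)
    args: dict  # e.g. {"x": 10, "y": 20}


def from_structured_json(text: str) -> Action:
    """Parse a reply that is itself a JSON object (hypothetical shape)."""
    obj = json.loads(text)
    args = {key: value for key, value in obj.items() if key != "action"}
    return Action(name=obj["action"], args=args)


def from_tool_call(call: dict) -> Action:
    """Parse an OpenAI-style tool call; arguments are a JSON string."""
    function = call["function"]
    return Action(name=function["name"], args=json.loads(function["arguments"]))


if __name__ == "__main__":
    structured = '{"action": "click", "x": 10, "y": 20}'
    tool_call = {"function": {"name": "click", "arguments": '{"x": 10, "y": 20}'}}
    # Both paths should yield the same normalized action.
    print(from_structured_json(structured) == from_tool_call(tool_call))
```

Normalizing at the boundary keeps the rest of a harness indifferent to which protocol produced the action, which is one way to read the near-parity claim: the execution path is shared and only the parsing differs.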

The third piece is the key one: local. This is H Company’s first release with quantized weights, starting with three formats for the 35B-A3B checkpoint: FP8, Q4 GGUF, and NVFP4. The NVFP4 build is produced with NVIDIA’s Model Optimizer in a W4A16 configuration (4-bit weights, 16-bit activations). Per the blog, quantization costs almost nothing in accuracy: FP8 and NVFP4 score the same on OSWorld, only about two points below full-precision BF16. It also buys real speed: on DGX Spark, NVFP4 W4A16 delivers 1.41x the total token throughput of FP8 and 1.74x that of BF16.
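The two throughput ratios also imply how FP8 compares with BF16, a figure the article does not state directly. The sketch below derives it, assuming both ratios were measured under the same conditions on the same machine; the derived value is an inference, not a reported result.

```python
# Ratios reported for DGX Spark: NVFP4 W4A16 relative to each baseline.
NVFP4_OVER_FP8 = 1.41
NVFP4_OVER_BF16 = 1.74


def implied_fp8_over_bf16(nvfp4_over_fp8: float, nvfp4_over_bf16: float) -> float:
    """Derive FP8-vs-BF16 throughput from the two NVFP4 ratios.

    If NVFP4 = a * FP8 and NVFP4 = b * BF16, then FP8 = (b / a) * BF16.
    """
    return nvfp4_over_bf16 / nvfp4_over_fp8


if __name__ == "__main__":
    ratio = implied_fp8_over_bf16(NVFP4_OVER_FP8, NVFP4_OVER_BF16)
    print(f"Implied FP8 vs BF16 throughput: {ratio:.2f}x")
```

Under that assumption FP8 lands at roughly 1.23x BF16, so most of the remaining gain comes from the step down to 4-bit weights.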

The release also ships a Q4 GGUF build aimed at consumer hardware. The deployment shape is this: the agent itself runs on the user’s Windows or Mac machine, while the model runs either on that same machine (the blog gives reference numbers for Apple Silicon) or on a DGX Spark on the same local network. In both cases execution stays fully local, with nothing leaving the user’s network. On Spark, the agent-harness optimizations H Company developed with NVIDIA, together with NVFP4 quantization, compound to a roughly 2x end-to-end speedup over the FP8 baseline, cutting average step time from 6.8s to 3.3s.
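A quick check of what those step times mean over a whole task. The 6.8s and 3.3s figures come from the blog; the 40-step task length is a made-up illustration, and the calculation assumes every step costs the average.

```python
FP8_BASELINE_STEP_S = 6.8  # average step time before optimizations (from the blog)
OPTIMIZED_STEP_S = 3.3     # NVFP4 plus harness optimizations (from the blog)


def speedup(baseline_s: float, optimized_s: float) -> float:
    """End-to-end speedup as a ratio of average step times."""
    return baseline_s / optimized_s


def task_seconds(step_s: float, steps: int) -> float:
    """Wall-clock estimate for a task, assuming every step costs the average."""
    return step_s * steps


if __name__ == "__main__":
    steps = 40  # hypothetical task length, for illustration only
    before = task_seconds(FP8_BASELINE_STEP_S, steps)
    after = task_seconds(OPTIMIZED_STEP_S, steps)
    print(f"Speedup: {speedup(FP8_BASELINE_STEP_S, OPTIMIZED_STEP_S):.2f}x")
    print(f"{steps}-step task: {before:.0f}s before, {after:.0f}s after")
```

The ratio works out to about 2.06x, which matches the "roughly 2x" claim. The absolute saving grows linearly with step count, so long tasks benefit most.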

The family ships in four sizes: 0.8B (ultra-lightweight local agents), 4B (cost-efficient deployment), 9B (balanced performance and latency), and 35B-A3B (state-of-the-art performance). The quantized FP8, NVFP4, and Q4 GGUF checkpoints target local and edge deployment.
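For a rough sense of why quantization decides what fits on a given machine, the sketch below estimates weight storage alone from nominal parameter count and bits per weight. It is a back-of-the-envelope assumption, not a figure from the blog: it takes the size labels at face value, treats 35B-A3B as 35B total parameters that all need to be resident, and ignores the KV cache, activations, and quantization metadata. Real Q4 GGUF files use somewhat more than 4 bits per weight.

```python
# Nominal bits per weight for each format (ignores scales and other metadata).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "4-bit (NVFP4 / Q4)": 4}


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight storage in GiB from parameter count and precision."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Smaller sizes are shown at BF16 only, since only 35B-A3B ships quantized.
    for size in (0.8, 4, 9):
        print(f"{size}B BF16: {weight_gib(size, 16):.1f} GiB")
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"35B-A3B {name}: {weight_gib(35, bits):.1f} GiB")
```

By this estimate the 35B-A3B weights drop from about 65 GiB at BF16 to about 16 GiB at 4 bits, which is the difference between datacenter-class memory and a well-equipped desktop.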

Why it matters

Put Holo3.1 next to cloud computer-use agents like Claude’s or OpenAI’s, and the difference is not which model is smarter. It is where the agent runs and where your screenshots go.

Cloud computer-use works like this: at every step, the agent sends a screenshot (or accessibility tree) of your current screen to the provider’s servers, the model there decides the next action, and the action comes back. Two things follow by construction. First, every step pays a network round-trip of latency, which accumulates on long tasks. Second, everything on your screen leaves your machine in that moment. For many use cases this is fine. For others, those two facts are exactly the wall you cannot get past.
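The loop described above can be sketched generically. All names here (`run_agent`, `Action`, the callbacks) are hypothetical and do not correspond to any vendor's API; the point is only that the `decide` call is where a screenshot either crosses the network or stays on the machine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str      # e.g. "click", "type", or "done"
    payload: dict


def run_agent(
    capture: Callable[[], bytes],
    decide: Callable[[bytes], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> int:
    """Observe-decide-act loop. Returns the number of steps taken."""
    for step in range(1, max_steps + 1):
        screenshot = capture()       # observation: whatever is on screen now
        action = decide(screenshot)  # cloud: this call sends the screenshot out
        if action.kind == "done":
            return step
        execute(action)
    return max_steps


if __name__ == "__main__":
    sent = []  # stands in for "data that left the machine"
    script = iter([
        Action("click", {"x": 10, "y": 20}),
        Action("type", {"text": "hi"}),
        Action("done", {}),
    ])

    def fake_remote_decide(screenshot: bytes) -> Action:
        sent.append(screenshot)  # one upload per step
        return next(script)

    steps = run_agent(lambda: b"fake-screenshot", fake_remote_decide, lambda a: None)
    print(f"steps={steps}, screenshots sent={len(sent)}")
```

Swapping `decide` for a call to a model on the same machine or local network leaves the loop unchanged, which is why the cloud-versus-local choice is a deployment decision rather than a rewrite.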

The local route tears that wall down. On latency, Holo3.1’s number is an average step time of 3.3s (on Spark, after NVFP4 plus harness optimizations), with no round-trip across the public internet. On privacy it is more direct: H Company’s phrasing is that execution stays fully local, with nothing leaving the user’s network. For teams handling customer data, internal systems, or compliance-restricted information, “the screen never leaves the network” is not an optimization. It is the precondition for whether the thing is usable at all.

But the accounting has to be honest, and Holo3.1 does not claim to erase the cloud’s advantage. Cloud models are not bound by the user’s device compute; they can be larger, upgraded at any time, and require no hardware from the user. The cost of the local route is on the table: only the 35B-A3B size currently ships quantized, and the reference machine behind that 3.3s figure is a DGX Spark, not an ordinary laptop. Between “it runs on Apple Silicon” and “it runs fast enough to be usable on Apple Silicon,” the blog gives reference numbers but no conclusion you can copy wholesale.

So the real judgment is this: computer-use is splitting from one deployment shape into two paths. One is cloud-hosted, chasing the ceiling of the model. The other is locally self-hosted, chasing latency and data sovereignty. Holo3.1 is the most concrete step yet on the second path. It turns “local computer-use agent” from a slogan into a downloadable artifact with sizes, quantization formats, and speed numbers attached.

Builder impact

If you build computer-use products, this release should change your default choice, not just add another model to the shelf.

First decide which camp you are in. If your workflow handles public web pages, is not latency-sensitive, and touches no sensitive data, cloud computer-use is still the easier default; you do not need to provision hardware or carry the ops. But if you hit any of the following, the local route is now worth a serious look: data is not allowed to leave the internal network; latency matters and tasks have many steps; or you need to embed the capability on the customer’s own device rather than your servers. These three cases previously meant either no option or a hand-assembled one. Holo3.1 gives a ready starting point.
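The triage above can be written down as a small rule. This is only a sketch of the article's three conditions; the field names and the returned labels are invented for illustration, and a real decision would also weigh hardware budget and ops capacity.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    data_must_stay_on_network: bool
    latency_sensitive: bool
    many_steps: bool
    runs_on_customer_device: bool


def reasons_for_local(w: Workflow) -> list:
    """Return which of the three conditions push toward a local deployment."""
    reasons = []
    if w.data_must_stay_on_network:
        reasons.append("data may not leave the internal network")
    if w.latency_sensitive and w.many_steps:
        reasons.append("latency matters and tasks have many steps")
    if w.runs_on_customer_device:
        reasons.append("capability must run on the customer's device")
    return reasons


def suggested_route(w: Workflow) -> str:
    """Cloud stays the default unless at least one condition applies."""
    return "evaluate local" if reasons_for_local(w) else "cloud default"


if __name__ == "__main__":
    public_scraper = Workflow(False, False, True, False)
    internal_crm_bot = Workflow(True, True, True, False)
    print(suggested_route(public_scraper))
    print(suggested_route(internal_crm_bot), reasons_for_local(internal_crm_bot))
```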

Pick the size for the job; do not default to the largest. The blog states each size’s role clearly: 0.8B for ultra-lightweight local agents, 4B for cost-efficient deployment, 9B for the performance-latency balance, and 35B-A3B for top performance. Note one real constraint: quantized weights currently cover only 35B-A3B. So if you want the lightest possible combination, a small model that is also quantized, you still have to quantize it yourself for now; the release does not do it for you.

Treat the function-calling support as an integration signal. Holo3.1 natively supports function-calling protocols and reaches near-parity with native execution. If you already run a tool-calling agent framework, the cost of wiring it in is lower than with Holo3. But near-parity is not full parity. Before you switch, run a regression on your own tasks beyond OSWorld rather than trusting the internal benchmark alone.
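One way to run that regression is to execute the same task set in both modes and list the tasks that pass natively but fail through function-calling. The sketch below assumes you already have per-task pass/fail results; the task names and outcomes are made-up placeholders, not benchmark data.

```python
def success_rate(results: dict) -> float:
    """Fraction of tasks that passed; 0.0 for an empty result set."""
    return sum(results.values()) / len(results) if results else 0.0


def regressions(baseline: dict, candidate: dict) -> list:
    """Tasks that pass in the baseline mode but fail (or are missing) in the candidate."""
    return sorted(
        task for task, passed in baseline.items()
        if passed and not candidate.get(task, False)
    )


if __name__ == "__main__":
    # Placeholder results: task name -> did the agent complete it?
    native = {"checkout": True, "refund": True, "invite_user": False}
    function_calling = {"checkout": True, "refund": False, "invite_user": False}

    print(f"native: {success_rate(native):.0%}")
    print(f"function-calling: {success_rate(function_calling):.0%}")
    print("regressed tasks:", regressions(native, function_calling))
```

Looking at the per-task list rather than only the aggregate rate matters here: two modes can score within a point of each other while failing on different tasks, and the ones that regress may be the ones your product depends on.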

Finally, verify what the 3.3s figure becomes on your hardware. That number assumes a DGX Spark plus the harness optimizations built jointly with NVIDIA, and those optimizations are slated to land in an upcoming desktop agent harness. On Apple Silicon or other consumer machines, your step time will differ. Read it as the order of magnitude local can hit, not as the number you will inevitably get after deploying.
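Measuring this yourself takes little code. The sketch below times an arbitrary `run_step` callable, which is a stand-in for whatever performs one full agent step in your setup (screenshot, model call, action); the function name and the returned keys are assumptions for illustration.

```python
import statistics
import time
from typing import Callable


def measure_steps(
    run_step: Callable[[], None],
    n: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time n calls to run_step and summarize the per-step durations in seconds."""
    durations = []
    for _ in range(n):
        start = clock()
        run_step()
        durations.append(clock() - start)
    return {
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


if __name__ == "__main__":
    # Stand-in step: replace with one real screenshot -> model -> action cycle.
    def fake_step() -> None:
        time.sleep(0.01)

    stats = measure_steps(fake_step, n=5)
    print({key: round(value, 3) for key, value in stats.items()})
```

Reporting the median and maximum alongside the mean is a deliberate choice: a few slow steps (a cold model load, a large screenshot) can dominate the average on consumer hardware, and the blog's 3.3s is an average.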

What to ignore

Ignore the narrative that reads Holo3.1 as “local finally beats the cloud.” The blog never runs a head-to-head benchmark against Claude’s or OpenAI’s computer-use; its comparison points are mainly its own Holo3 and the Qwen 3.5 family. Local and cloud are two sets of trade-offs, not a contest with a winner. Which fits you depends on your latency, privacy, and hardware constraints, not on who scores higher.

Ignore the careless conclusion that “quantization barely costs accuracy, so go local with anything.” FP8 and NVFP4 sit only about two points below BF16, a clean number, but it covers only OSWorld and only the 35B-A3B size. Extrapolating it to “quantization is lossless on any size and any task” is not what the blog says.

Also ignore over-readings of the mobile gains. AndroidWorld rising from 67% to 79.3% is real progress, but 79.3% does not mean the agent runs reliably on your specific app, and mobile distribution shift is the difficulty H Company itself flagged. Reading it as “mobile agents are mature now” goes too far. Reading it as “the mobile line is being seriously addressed for the first time” is right.

FAQ

What hardware do you actually need to run Holo3.1 locally?

The blog gives two reference paths: the agent runs on the user's Windows or Mac machine, and the model runs either on that same machine (reference numbers are given for Apple Silicon) or on a DGX Spark on the same local network. Quantized weights (Q4 GGUF for consumer hardware, FP8/NVFP4 for Spark) are what make smaller machines viable. Note that only the 35B-A3B checkpoints currently ship quantized.

Does quantizing Holo3.1 hurt accuracy?

Per the blog, barely. FP8 and NVFP4 achieve the same OSWorld scores, only about two points below the full-precision BF16 checkpoint. The payoff is speed: on DGX Spark, NVFP4 W4A16 delivers 1.41x the total token throughput of FP8 and 1.74x that of BF16.

How is Holo3.1 better than Holo3?

Mostly in coverage rather than peak. Mobile is the biggest gain (35B-A3B rises from 67% to 79.3% on AndroidWorld, the 4B and 9B variants from 58% to 72%); it adds native function-calling support, with more than a 25% improvement over Holo3 inside H Company's own Holotab harness; and it is the first release to ship quantized weights for local inference.

Sources

  1. Holo3.1: Fast & Local Computer Use Agents (H Company, Hugging Face blog)
