JetBrains Ships Mellum2: A 12B MoE Coding Model, and the IDE Owner Is Now Building Its Own
JetBrains open-sourced Mellum2, a 12B MoE model that activates just 2.5B parameters, aimed at high-frequency routing, RAG, and sub-agent steps. It signals IDE vendors pulling the model in-house.
Summary
JetBrains announced Mellum2 on its official Hugging Face blog: a 12B-parameter Mixture-of-Experts model that activates only 2.5B parameters per token, released under an Apache 2.0 license. The interesting part is not the spec sheet but who is shipping it. JetBrains makes IDEs, not frontier models, yet it chose to train its own coding model from scratch, with no intention of competing on raw capability with the biggest models.
The takeaway is that this points to a fork in coding AI: the winner may be whoever owns the developer’s entry point, not whoever builds the largest model. JetBrains has IDEs that tens of millions of developers open every day, and it has built a model that hugs the high-frequency steps of its own workflow. Distribution itself is the moat. JetBrains calls Mellum2 a focal model in a system, and that wording is worth unpacking: it concedes that Mellum2 is not the strongest thing in the stack and aims instead to be the thing that gets called most often, which therefore has to be the fastest and cheapest.
What happened
Mellum2 is a 12B Mixture-of-Experts model trained from scratch on natural language and code, activating only 2.5B parameters per token, under Apache 2.0. JetBrains says it is competitive on benchmarks against similarly sized open models while delivering more than 2x faster inference, with the benchmark detail left to an arXiv technical report (paper 2605.31268); the blog itself lists no specific numbers. The weights are already on Hugging Face under the model id JetBrains/Mellum2-12B-A2.5B-Thinking.
The Mellum line started as a code-completion model, and Mellum2 extends it from pure completion to a broader set of natural language and software engineering tasks while deliberately holding the line on efficient inference and easy deployment. The uses JetBrains names are routing and orchestration (prompt classification, tool selection, intermediate control flow), RAG pipelines (context compression, summarization, retrieval post-processing), sub-agents (planning, validation, transformation, context preparation), and self-hosted deployment involving proprietary code or internal data. The model is explicitly not multimodal: it handles only text and code, which JetBrains says is what keeps it compact and efficient.
Why it matters
Put this on the competitive map of coding AI and it exposes a chain of logic that is easy to miss. The narrative of the past two years has been a capability arms race: whose coding benchmark is higher, whose context is longer, who can write a larger feature in one shot. But what users actually touch every day is not the model. It is the IDE, and specifically the completion box in the editor. JetBrains products like IntelliJ and PyCharm sit on tens of millions of developers’ machines, and that distribution position is something no pure-model vendor can buy at any price. When JetBrains decides to train its own model, it does not need to win the capability leaderboard. It needs a model that hugs its own workflow, is cheap enough to leave on by default, and is fast enough not to interrupt typing. Distribution is the moat, and in this category that holds with particular force.
More telling is the position JetBrains chose for Mellum2. It did not package the model as an in-IDE coding assistant. It told a more general story: modern AI systems increasingly rely on many model calls (routing, retrieval, summarization, planning, validation, tool use), many of those steps are latency-sensitive and do not need the largest model, and Mellum2 targets exactly those steps. That framing lifts it from being JetBrains’ completion engine to being high-frequency middleware for any AI stack. There is real product judgment here: the intermediate steps in completion and agent flows are called constantly and are acutely cost-sensitive, and whoever provides a fast, cheap model at those positions holds a volume that is hard for others to take. There is also some packaging, addressed below.
What does this mean for general coding-model vendors? It means the market is stratifying. The biggest models keep their irreplaceable value on the hardest reasoning and the most complex whole-feature generation, but the ring of high-frequency, low-latency, self-hostable work around them is being pulled back in-house by IDE vendors with their own small models. Coding models from OpenAI or Anthropic will still get called, but increasingly only at the steps that truly need reasoning, not as the default. For a vendor that earns on API call volume, that is a slow bleed in the revenue structure: the most frequent calls are migrating onto open small models deployed inside the user’s own infrastructure.
Builder impact
If you are building AI coding systems, the concrete move Mellum2 prompts is to revisit your model tiering. The judgment behind it is that not every step deserves the most expensive model. Routing, context compression, retrieval post-processing, and sub-agent planning and validation are the intermediate steps JetBrains is targeting with a model that computes only 2.5B parameters per token, which lets you save the expensive large model for the one step that needs reasoning. JetBrains states plainly that the goal is not to replace every model in the stack but to make the stack faster, cheaper, and easier to control, and that division of labor is worth borrowing whether or not you end up using Mellum2.
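The tiering described above can be sketched as a small dispatch table. This is a minimal illustration, not anything JetBrains ships: the tier names, the step-to-tier mapping, the TieredDispatcher class, and the two stand-in backends are all hypothetical, and in a real system each backend would wrap whatever inference client you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical tier names. "small" stands in for a compact model such as
# Mellum2, "large" for a frontier model. Neither is a real API client.
SMALL, LARGE = "small", "large"

# Assumed mapping from pipeline step to tier, following the step types
# named in the text. Adjust to your own pipeline.
STEP_TIER: Dict[str, str] = {
    "routing": SMALL,
    "context_compression": SMALL,
    "retrieval_postprocessing": SMALL,
    "subagent_planning": SMALL,
    "validation": SMALL,
    "reasoning": LARGE,
}


@dataclass
class TieredDispatcher:
    """Sends each step to the backend registered for its tier."""
    backends: Dict[str, Callable[[str], str]]
    default_tier: str = LARGE  # unknown steps fall back to the large model

    def tier_for(self, step: str) -> str:
        return STEP_TIER.get(step, self.default_tier)

    def run(self, step: str, prompt: str) -> str:
        return self.backends[self.tier_for(step)](prompt)


# Stand-in backends so the sketch runs without any model server.
def fake_small(prompt: str) -> str:
    return f"[small] {prompt}"


def fake_large(prompt: str) -> str:
    return f"[large] {prompt}"


dispatcher = TieredDispatcher(backends={SMALL: fake_small, LARGE: fake_large})
print(dispatcher.run("routing", "classify this prompt"))
print(dispatcher.run("reasoning", "design the migration"))
```

Defaulting unknown steps to the large tier is a conservative choice for this sketch: a step nobody has classified yet costs more but is less likely to fail quietly.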
For model selection, the 12B MoE size carries a clear tradeoff you can reason against. The 12B total keeps the capability ceiling, the 2.5B active per token holds down inference cost and latency, and Apache 2.0 lets you put it into your own product and private deployments cleanly. If your bottleneck is unit cost at high throughput (say, running completion in real time for every user), this large-total, small-active structure is built for that load. But be honest: the blog gives no concrete benchmark numbers, only “competitive with similarly sized models” and “more than 2x faster,” so if you are actually selecting, go check those numbers in the arXiv report and test on your own tasks. Do not decide on the word “competitive” alone.
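The large-total, small-active tradeoff can be made concrete with back-of-the-envelope arithmetic. The sketch below uses only the two parameter counts from the announcement; the 2-FLOPs-per-active-parameter rule of thumb and the bf16 weight format are assumptions made here for illustration, not figures from JetBrains.

```python
# Figures from the announcement.
TOTAL_PARAMS = 12e9     # all experts combined
ACTIVE_PARAMS = 2.5e9   # parameters used per token

# Share of the model that does work on any given token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumption: a forward pass costs roughly 2 FLOPs per active parameter
# per token. This is a common rule of thumb, not a measured number.
flops_per_token_moe = 2 * ACTIVE_PARAMS
flops_per_token_dense_12b = 2 * TOTAL_PARAMS
compute_ratio = flops_per_token_dense_12b / flops_per_token_moe

# Assumption: weights stored in bf16 (2 bytes per parameter). All experts
# must be resident, so memory follows the total, not the active count.
weights_gb_bf16 = TOTAL_PARAMS * 2 / 1e9

print(f"active fraction: {active_fraction:.1%}")
print(f"per-token compute vs a hypothetical dense 12B: {compute_ratio:.1f}x less")
print(f"weight memory at bf16: {weights_gb_bf16:.0f} GB (excluding KV cache)")
```

The compute ratio here is against a hypothetical dense 12B model and is not the same thing as the "more than 2x faster" claim, whose baselines the blog does not specify. Real throughput also depends on memory bandwidth, expert routing overhead, and batching, so treat this as a way to frame the tradeoff, not as a prediction.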
For founders building developer tools, there is a strategic warning here. JetBrains’ move shows that the player who owns the entry point will move upstream and build its own model to lock down the high-frequency steps. If your product is a thin shell over someone else’s API, both you and the model vendor you depend on are squeezed from two sides by this trend of entry-point owners building their own models. Either you also hold a distribution position that is hard to replace, or the value you add lives above the model itself. A pure API-forwarding middle layer will find it increasingly hard to stand.
What to ignore
Ignore the reading that JetBrains is squaring off against GPT or Claude on coding benchmarks. The post is clear: the goal is not to replace every model in the stack but to be the fast, well-scoped, high-frequency component. Reading Mellum2 as a challenger to frontier coding models misses that its entire design premise is to concede it is not the strongest and instead be the most-called. Comparing its benchmarks to the capability ceiling of the largest models is comparing the wrong things.
Also see through how much of the “general focal model” narrative is packaging. The post describes Mellum2 throughout as routing/RAG/sub-agent middleware usable in any AI stack, and barely mentions the IDE directly. But the most natural and most defensible landing spot for this model is exactly the high-frequency completion inside JetBrains’ own tens-of-millions install base. Calling it general middleware both attracts a wider pool of developers to try it and downplays the fact that it is, in effect, the dedicated engine for JetBrains’ own toolchain. The real strategic value is the IDE entry point; the general narrative is a market move on the side. Do not let it inflate your estimate of Mellum2’s readiness outside the JetBrains ecosystem.
Finally, do not equate “open weights plus competitive benchmarks” with “you should swap it in now.” Apache 2.0 and 2x inference speed are real benefits, but the blog gives no absolute numbers and no head-to-head against named competitors, and “competitive” is a word you have to verify yourself. Whether it fits you depends on whether your load is genuinely high-frequency and low-latency and whether you genuinely need self-hosting. The right posture is to put it on the shortlist and test it on your tasks, not to change course off a launch post.
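Testing on your own tasks can start very small. The harness below is a minimal sketch under stated assumptions: evaluate, stub_router, and the three sample cases are hypothetical names and data invented for illustration, and exact-match scoring only suits classification-style steps such as routing. To use it, replace stub_router with a function that calls the candidate model.

```python
import time
from statistics import median
from typing import Callable, Dict, Iterable, Tuple


def evaluate(
    model_fn: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Runs model_fn over (prompt, expected) pairs.

    Returns exact-match accuracy, median wall-clock latency in seconds,
    and the number of cases.
    """
    latencies = []
    correct = 0
    for prompt, expected in cases:
        start = time.perf_counter()
        output = model_fn(prompt)
        latencies.append(time.perf_counter() - start)
        correct += int(output.strip() == expected)
    if not latencies:
        raise ValueError("no test cases supplied")
    return {
        "accuracy": correct / len(latencies),
        "median_latency_s": median(latencies),
        "n": len(latencies),
    }


# Stand-in for a real model call: a trivial keyword router.
def stub_router(prompt: str) -> str:
    return "search" if "find" in prompt.lower() else "edit"


# Hypothetical cases; use prompts sampled from your own workload.
CASES = [
    ("Find usages of foo", "search"),
    ("Rename variable x", "edit"),
    ("find the failing test", "search"),
]

report = evaluate(stub_router, CASES)
print(report)
```

Running the same cases against your current model and the candidate gives you a like-for-like accuracy and latency comparison on your load, which is the number the launch post cannot give you.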
FAQ
Is Mellum2 a coding model or an orchestration model?
The launch post positions it as a focal model in a larger system. It is aimed at the high-frequency, low-latency steps (routing, RAG post-processing, sub-agent planning) and is not a flagship that writes whole features. It grew out of the Mellum code-completion model, so coding ability is the foundation, but what JetBrains is selling here is fast, cheap middleware inside a multi-model stack.
Can Mellum2 replace a general frontier coding model?
By JetBrains’ own account, no, and it does not try to. The post states the goal is not to replace every model in the stack but to make the stack faster, cheaper, and easier to control. Mellum2 activates only 2.5B parameters per token, which suits routing, context compression, and sub-tasks and leaves the expensive large model for steps that truly need reasoning.
What workloads suit a 12B MoE that activates only 2.5B parameters?
High-throughput, low-latency, self-hostable workloads: in-IDE completion, retrieval post-processing in RAG pipelines, intermediate control steps in agent flows. The 12B total keeps the capability ceiling, the 2.5B active per token holds down inference cost, and JetBrains claims more than 2x faster inference than similar-sized models.
Sources
No official primary source available; this analysis is based on reliable secondary reporting (named outlets, cross-confirmed).