An alien flying in from space aboard a comet would look down on Earth and see that there is this highly influential and famous software company called Nvidia that just so happens to have a massively complex and ridiculously profitable hardware business running a collection of proprietary and open source software that about three quarters of its approximately 40,000 employees create.
So it is not a surprise to us at all that as the proprietary model makers - OpenAI, Anthropic, and Google are the biggies - continue their rise and intensify their competition, not only is Meta Platforms considering a shift to closed models dubbed Avocado - the open source Llama 4 models are toast, after all - but Nvidia is doubling down on its Nemotron open source models.
It's very simple. Nvidia can get AI clusters of any scale it needs to do AI training at cost, and given its hugely profitable AI hardware business, Nvidia is the only one that can afford to give its models away for free and charge very little for its AI Enterprise software stack that has libraries to support all kinds of AI and HPC models. (It is $4,500 per GPU per year, which is relatively cheap against a GPU accelerator that costs maybe $35,000 to $45,000 depending on volume and model in the "Blackwell" series.)
In a sense, this is a return to the way hardware and software were sold during the early days of IBM's System/360 mainframe, which broadened the use of computation and data storage in the second wave of computer commercialization six decades ago. Back then, you bought a very expensive mainframe system and it came with a team of blue-suited techies that would help you program it for free. Over the years, companies took control of developing their own application software or went to third parties for it, and Big Blue turned customer service into a profit center through its Global Services behemoth.
This will ultimately, we think, be Nvidia's trajectory as it goes for full stack integration, including datacenters, and vertical integration from the chip up through the highest levels of the software stack. Nvidia could even end up being an AI utility in its own right. (Utility is a much better word than cloud, which is a vague term and intentionally so.)
Nvidia is not new to open source AI models, and obviously has been involved in running just about every open source AI model ever created as well as the closed ones that have become household names such as Google Gemini, Anthropic Claude, and OpenAI GPT. In a prebriefing ahead of the Nemotron 3 unveiling, Kari Briski, vice president of generative AI software for enterprise at Nvidia, said that in the past two and a half years, there have been somewhere around 350 million downloads of open source AI frameworks and models, that the Hugging Face repository has over 2.8 million open models with every kind of variation under the sun to create a model for specific use cases, and that around 60 percent of companies are using open source AI models and tools. Briski added that in 2025 Nvidia was the largest open source contributor on Hugging Face, with 650 open models and 250 open datasets set free.
Nvidia got its start with homegrown transformer models with Megatron-LM, announced in 2019. Megatron-LM could train against 8 billion parameters, and do so across 512 GPU accelerators (using eight-way GPU nodes for model parallelism and 64 of these nodes for data parallelism). Megatron was expanded to 530 billion parameters through a collaboration with Microsoft in 2021 with Megatron-Turing NLG. The Neural Modules, or NeMo for short, toolkit was put out at the same time as the original Megatron-LM model, and the Nemotron models were built using this toolkit and its related libraries.
The original Nemotron models were called Nemotron-4 just to confuse us all, and they came out in June 2024 and spanned 340 billion parameters. With the Nemotron 1 models, Nvidia mashed up the Llama 3.1 foundation models with Nemotron reasoning techniques to create Llama Nemotron, spanning 8B, 49B, 70B, and 235B parameter scales.
With Nemotron 2 Nano, which came out earlier this year and which has variants with 9 billion and 12 billion parameters, Nvidia takes the transformer approach pioneered by Google in June 2017 and brought to fruition with its BERT model back in October 2018 and interleaves it with the Mamba selective state space approach developed by researchers at Carnegie Mellon and Princeton University. The former is good at extracting features and dependencies from a lot of data, and the latter is very good at zeroing in on smaller subsets of data and their dependencies.
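The memory contrast between the two layer types is easier to see in code. Here is a heavily simplified, purely illustrative sketch - real Mamba layers use selective, input-dependent state updates, and all shapes, layer counts, and the decay value below are made up for illustration - showing why an attention layer's cost grows with the square of sequence length while a state space layer carries only a fixed-size state:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 8  # sequence length and hidden size (illustrative, not Nemotron's)

def attention(x):
    # Transformer-style layer: every token attends to every other token,
    # building a T x T attention map that grows quadratically with length.
    scores = x @ x.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ x

def ssm(x, decay=0.9):
    # Mamba-style layer (heavily simplified): a fixed-size recurrent state
    # summarizes the past in O(1) memory per step, no attention map needed.
    state = np.zeros(d)
    out = np.empty_like(x)
    for t in range(len(x)):
        state = decay * state + (1 - decay) * x[t]
        out[t] = state
    return out

x = rng.standard_normal((T, d))
# Interleave the two layer types, as the hybrid architecture does.
for layer in (ssm, attention, ssm, ssm, attention):
    x = layer(x) + x  # residual connection
print(x.shape)  # (16, 8)
```

The hybrid bet is that a few attention layers are enough to capture the hard global dependencies, with the cheap state space layers doing the bulk of the sequence mixing.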
With Nemotron 3, which was unveiled this week, Nvidia builds on that hybrid Mamba-Transformer architecture with a mixture of experts (MoE) design aimed at driving multi-agent systems. The result, says Briski, is increased reasoning efficiency thanks to the hybrid architecture.
"The hybrid Mamba-Transformer architecture runs several times faster with less memory because it avoids these huge attention maps and key-value caches for every single token," Briski explained. "So that architecture really reduces that memory footprint, which allows you to have more experts. We are going to introduce in the Super and Ultra versions a breakthrough called latent mixture of experts. All of these experts that are in your model share a common core and keep only a small part private. So it is kind of like chefs sharing one big kitchen, but they get to use their own spice rack. So there you are going to get even more memory efficiency with Super and Ultra through this latent MOE."
The Nemotron 3 family has three members at the moment, two of which Briski mentioned by name there.
The Nemotron 3 family could very likely expand to larger and smaller models over time. Like other MoE models, each has an aggregate parameter count that the model is trained at and a smaller subset of parameters that is activated as it is being fine tuned or doing inference. Nemotron 3 Nano has 30 billion parameters, with 3 billion activated at any time, and is designed specifically so it can fit on a single Nvidia L40S GPU inference accelerator. The Super variant has 100 billion parameters, with up to 10 billion activated at once, and the Ultra version has 500 billion parameters with 50 billion activated at any given time.
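That total-versus-active split comes from a router that sends each token to only a handful of experts. This toy sketch - the shapes, expert count, and single-expert top_k are illustrative and are not Nemotron 3's actual routing configuration - shows how the total parameter count and the active parameter count diverge, mirroring the 30 billion total / 3 billion active ratio of the Nano model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 32, 10, 1  # illustrative, not Nemotron's real config

# One tiny feed-forward expert per slot; only top_k of them run per token.
experts = [rng.standard_normal((d, d)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts)) * 0.02

def moe_forward(x):
    scores = x @ router
    chosen = np.argsort(scores)[-top_k:]   # pick the top-k scoring experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()               # softmax over the chosen experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.standard_normal(d)
y = moe_forward(x)

# Total versus active parameters: the model is "big" on disk and in HBM,
# but only a fraction of the weights do work for any one token.
total = n_experts * d * d
active = top_k * d * d
print(total, active, total // active)  # 10240 1024 10
```

The 10X ratio between total and active parameters in this toy matches the 30B/3B split of Nemotron 3 Nano, which is the whole point of the MoE approach: capacity without proportional compute.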
Briski said that the fine tuning of the model is different between Nemotron 2 Nano and the Nemotron 3 models. Nemotron 2 Nano had a lot of supervised learning - meaning people correcting the model's output and feeding that back into the model - and a dash of reinforcement learning, where the model teaches itself from feedback as it is used. Nemotron 3 flips that balance, relying heavily on reinforcement learning, and also adds a context window of up to 1 million tokens.
There is a technical blog from Nvidia here that explains some of the finer points of the Nemotron 3 models, but the gist is that Mamba cuts down on memory use while capturing long-range dependencies, the transformer layers have attention algorithms that handle complex planning and reasoning, and the MoE approach allows for a model to be effectively large but only activated where necessary (an approach that Google pioneered with its sparsely gated MoE research and its Switch Transformer and GLaM models, which came into the field after BERT).
The latent MoE feature coming in the Super and Ultra versions allows an intermediary representation layer to be added between model layers that can be shared as token processing is performed, which allows 4X the number of experts to be called while delivering the same inference performance. More experts lead to better answers and more intelligence. Nemotron 3 has multi-token prediction, a kind of speculative execution for AI models, and the Super and Ultra variants have been pretrained in Nvidia's NVFP4 4-bit data precision to boost effective throughput on inference. This training was done on a 25 trillion token pretraining dataset. (It is not clear that Nvidia is opening this dataset up to everyone - or that it even can.)
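Briski's chefs-sharing-one-kitchen analogy for latent MoE can be made concrete with a toy parameter-counting sketch. This is our illustration of the idea as described - a shared core plus a small private low-rank piece per expert - and not Nvidia's actual implementation; all names and shapes below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_private, n_experts = 64, 8, 16  # illustrative sizes

# Shared "kitchen": one big weight matrix used by every expert.
W_shared = rng.standard_normal((d_model, d_model)) * 0.02
# Private "spice racks": a small low-rank pair per expert.
A = rng.standard_normal((n_experts, d_model, d_private)) * 0.02
B = rng.standard_normal((n_experts, d_private, d_model)) * 0.02

def expert(x, i):
    # Each expert = the shared core plus its own small low-rank correction.
    return x @ W_shared + (x @ A[i]) @ B[i]

x = rng.standard_normal(d_model)
y = expert(x, 3)

# Parameter accounting: sharing the core slashes per-expert memory,
# which is how you afford several times more experts in the same budget.
dense_params = n_experts * d_model * d_model
latent_params = d_model * d_model + n_experts * 2 * d_model * d_private
print(dense_params, latent_params)  # 65536 20480
```

In this toy, sixteen fully private experts would cost 65,536 parameters while the shared-core version costs 20,480, roughly a 3X saving at these made-up sizes; scale the private rank down and the expert count up and you get the kind of headroom Nvidia is claiming for Super and Ultra.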
So how does Nemotron 3 stack up? Let's go to Artificial Analysis, which is the AI benchmark for the moment. So far, only Nemotron 3 Nano 30B/3B is available, and here is how it rates in terms of output tokens per second for inference workloads:
This is a big performance boost compared to Nemotron 2 models. Not activating the whole model clearly helps with MoEs, which is kinda the design spec.
Here is how Nemotron 3 Nano 30B/3B compares when you plot out model accuracy (intelligence, on the Y axis) against token throughput (X axis):
And finally, here is how Nemotron 3 Nano compares with the Openness Index - how open your model is - plotted on the Y axis against intelligence (correctness of answers) on the X axis:
It will be interesting to see if Nemotron 3 models can get technical support subscriptions from Nvidia, either as part of the AI Enterprise stack or separately. If Nvidia offers support, it does not have to charge a lot - just enough to cover its model development costs - to undercut the increasingly closed AI model makers.