A small story about building a long-context World Model with no money
After leaving my job, I was officially unemployed, but don't worry, I had plenty of time to train some models since I'm already training to be a disappointment.
Originally, I wanted to build an AI for drones that's as smart as a bee. From first principles, this seemed doable: bees have tiny brains, and with just a brief exposure to the real world they can navigate reliably without ever getting lost.
In ML terms, this problem should require little compute and little data (in principle!).
But I needed an environment to train the RL policy in. Smashing real drones wasn't an option, and there were no good off-the-shelf long-context world models to train in. So I had to make one myself.
In this blog post, I'll talk about building that virtual environment. There won't be any RL yet- the Bee AI has to wait for now.
A world model is a model that predicts the future state of a (simulated) world given the past states and external inputs. There are many ways of building a world model.
When I started this project (around February), the hottest world modeling papers were DIAMOND and GameNGen.
It wasn't going to be easy. I knew Google was working on the same problem. I knew I was entering a strength competition against a titan. I knew I would lose- but I also knew I'd come out stronger than when I started!
Oh, and I needed a name for the project. I called it Oniris.
I didn't really want to train a VAE. Ideally, I'd just use an off-the-shelf one and focus on the diffusion part. But most video VAEs I found were… weird. None of them did exactly what I needed.
I needed a VAE that:
I also came across a paper arguing that video VAEs should be group causal
Problem was, I didn't like the authors' code. So, as if the project wasn't already time-consuming enough, I decided to write my own implementation from scratch.
I thought it'd be smart to test my world modeling approach on a “simple” environment. So I fired up Gymnasium and tried compressing Lunar Lander videos.
Yeah, not so simple.
Turns out when your dataset is 98% black background and a white floor, you can get a tiny loss and a bad reconstruction at the same time.
I spent weeks trying to get it working without any hacks, but the local minima were just too good.
That's when I met Matteo online. He loved the project and jumped in to help. Unlike me, he wasn't too bothered by the Bitter Lesson.
“What if we only calculate the loss on the top 0.5% of pixels with the highest error?”

At first, I hesitated- it felt like smuggling knowledge into the loss function. But in the end, the one who needed convincing wasn't me, it was the computer. So he tried it.
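A minimal sketch of that idea (a hypothetical helper, not our exact training code):

```python
import torch

def top_fraction_pixel_loss(recon: torch.Tensor, target: torch.Tensor,
                            frac: float = 0.005) -> torch.Tensor:
    # Squared error per pixel, flattened to [batch, num_values]
    err = ((recon - target) ** 2).flatten(1)
    # Keep only the `frac` fraction of values with the highest error
    k = max(1, int(err.shape[1] * frac))
    worst, _ = err.topk(k, dim=1)
    return worst.mean()
```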
For the first time, the VAE produced meaningful reconstructions. With that solved, I could finally move on to building the world model.
With frame compression solved, I needed next-frame generation.
To do next-frame generation I math-ed out what would happen if next-token prediction and diffusion modelling had a child.
Basically, diffusion models estimate the score function of the noised data distribution:
$$
s(x,\sigma)=\nabla_x \log p(x;\sigma)
$$
and LLMs estimate the log-probability of the next token given the past:
$$
\textrm{LLM}(x_i,\dots,x_0)=\log p(x_{i+1}\mid x_i,\dots,x_0)
$$
So to generate videos autoregressively, you estimate the score conditioned on previous frames. It's essentially image generation conditioned on history.
$$
s(x_i,\sigma;x_{i-1},\dots,x_0)=\nabla_{x_i} \log p(x_i\mid x_{i-1},\dots,x_0;\sigma)
$$
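In code, autoregressive generation then looks roughly like this (a sketch with a hypothetical `denoise(x, sigma, past_frames)` interface and a plain Euler sampler, not the actual Oniris sampling loop):

```python
import torch

def generate(denoise, num_frames, frame_shape, sigmas):
    """Generate frames one by one; each frame is produced by a full
    diffusion sampling loop conditioned on the previously generated frames."""
    frames = []
    for _ in range(num_frames):
        x = torch.randn(frame_shape) * sigmas[0]        # start from pure noise
        for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
            x0_hat = denoise(x, sigma, frames)          # model's guess of the clean frame
            d = (x - x0_hat) / sigma                    # direction given by the score estimate
            x = x + d * (sigma_next - sigma)            # Euler step toward lower noise
        frames.append(x)
    return torch.stack(frames)
```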
The real challenge wasn't just building the model- it was exploiting the transformer's ability to train in a parallelizable way across the sequence dimension. Without that, there was no chance of scaling to long sequences.
But why does sequence-parallel training matter?
When I first learned about language modeling, I made a rookie mistake: I trained a model to predict only the last token in a sentence. The result? It was shit. The model didn't learn a thing.
Why? Because transformers don't make just one prediction per forward pass- they make one prediction per token. If your sequence has 128 tokens, that's 128 supervised training signals, not just one. That parallelism is the difference between a working model and a random number generator.
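In code, the difference is just which positions you supervise (standard PyTorch cross-entropy; `logits` and `tokens` are assumed names):

```python
import torch.nn.functional as F

# logits: [batch, seq_len, vocab_size], tokens: [batch, seq_len]
# Position i is trained to predict token i+1, so every sequence yields
# seq_len - 1 training signals instead of just one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    tokens[:, 1:].reshape(-1),
)
```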
For Oniris, leveraging sequence-level parallelism wasn't optional- it was non-negotiable.
But there’s a catch: implementing this naively creates a mismatch. If you feed an entire noised sequence into a causal transformer during training, the model sees noisy context everywhere. At inference time, however, past frames are clean while only the current frame is noised. This mismatch breaks autoregressive video generation.
I needed a way to make sure that the information fed to each noisy frame came from clean past frames.
After some thinking, I figured out "the trick" for having sequence-parallelism during training and named it DART (Duplicate Augmented Replica Training).
Here is how it works: each training sequence is duplicated into a clean replica and a noised replica, and both are fed to the transformer at once. The noised frames are the ones being denoised, while the clean replica serves as context.
This way, each noisy frame can still learn from past clean frames, while preserving causal information flow. A carefully designed masking scheme enforces this (sketched below).
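Roughly, the masking rule can be written down like this (a frame-level sketch under my layout assumption of clean replicas followed by noisy replicas; the real mask also has to cover the spatial tokens inside each frame):

```python
import torch

def dart_frame_mask(T: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a sequence laid out
    as [clean_0, ..., clean_{T-1}, noisy_0, ..., noisy_{T-1}]."""
    allow = torch.zeros(2 * T, 2 * T, dtype=torch.bool)
    for i in range(T):
        allow[i, : i + 1] = True        # clean frame i: causal over clean frames
        allow[T + i, : i] = True        # noisy frame i: sees only the clean past
        allow[T + i, T + i] = True      # ...plus itself, the frame being denoised
    return allow
```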
At inference time things are simpler: standard causal masking suffices, and the model generates video autoregressively without mismatch.
One of the drawbacks of DART is the higher memory footprint: the token sequence doubles to $2\times T \times H\times W$, where $T, H, W$ are respectively the number of frames, the height, and the width of the frames, so the attention matrix grows accordingly.
To improve performance, I decided to use a UNet-like architecture.
But did it work? YES!
After ~4h of training I got the model to simulate the Lunar Lander gym environment!
And with that the Oniris architecture was now ready and tested! I was so excited! I was ready to scale up!
If you want to know more on how it works, check out the repository
So now, it was time to move on to the real deal. All I had to do was swap out the Gymnasium environment for Counter-Strike gameplay, increase the parameter count, and everything would work fine. Then the project would be complete! After all, we all know that scaling is foolproof and works 100% of the time without any hiccups whatsoever, right? Right?
My first serious run of training a Counter-Strike VAE on a 4090 went pretty well. I was so excited!
[t,h,w,3] -> [t/4,h/4,w/4,8]
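For reference, that shape change corresponds to the 24× compression ratio that comes up again below:
$$
\frac{t\cdot h\cdot w\cdot 3}{\tfrac{t}{4}\cdot\tfrac{h}{4}\cdot\tfrac{w}{4}\cdot 8}=\frac{3\cdot 4^3}{8}=24
$$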
For training the diffusion model we even secured compute credits:
It felt like swimming in money! We love you both ❤️
I managed to run some jobs with 8×H100s for about two hours, but… things weren't so straightforward. No matter what I tried, the diffusion model either refused to learn, or we simply didn't have enough compute. It was impossible to tell which.
By then, we had burned through about 10% of our credits. I considered going full YOLO, but something told me it would fail. So I checked the numbers from DIAMOND and GameNGen:
| Model | Compute Used | Equivalent (H100, bf16) | Works? |
|---|---|---|---|
| DIAMOND | 12 RTX 4090-days | ≈ 3 H100-days | ✅ |
| GameNGen | 180 TPUv5e-days | ≈ 3 H100-days | ✅ |
| Oniris (mine) | 16 H100-hours | ≈ 0.7 H100-days | ❌ |
So what was wrong? After a lot of head-scratching, I pinned the blame on the VAE. Its compression ratio was only 24×, and that probably wasn't enough.
Back to square one: retraining the VAE, this time aiming for a 96× compression ratio.
This is where things take a dark turn folks. Everything was shit. I tried to debug the model for months to no avail. I tried everything.
It's hard to convey with words the sheer amount of things I tried and NONE of them worked. The model simply would not compress beyond 24x and I needed 4 times as much! I was wasting so much time, running out of energy and still nothing seemed to work.
I realised I was going nowhere, so I chose to take a month off to reset.
During that time, Matteo and I started a small business called Noteician. We even got our first customers (check it out!).
I tried to resist the urge to go back to work on Oniris, but then, one night something surreal pulled me back to the project…
One night, I dreamt of deriving the Mean Squared Error (MSE) loss.
In the dream, I saw that its derivation hinged on the assumption of constant data uncertainty, where the only goal is to estimate the mean.
Isn't this too restrictive? Why shouldn't our model also account for the uncertainty itself? Surely, there had to be a way.
Then I remembered that for classification tasks the cross-entropy loss implicitly handles uncertainty. This sparked an idea:
Could I achieve the same for regression by calculating the negative log-likelihood of a Gaussian distribution, but with a non-constant standard deviation?
When I woke up, I jumped to my desk, took pen and paper, wrote down the Gaussian probability distribution, and calculated the loss function.
Suppose we have some datapoints $x_i$ and $y_i$, and we want to predict the ground-truth probability $p(y|x)$.
We don't have direct access to $p$, so we are going to approximate $p(y|x)$ with $q(y|x)$, a Gaussian whose mean and variance are learned functions $\mu_\theta(x)$ and $\sigma_\phi(x)$:
$$
q(y|x) = \frac {1}{\sqrt{2 \pi \sigma_\phi^2}} \exp\left(-\frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}\right)
$$
The loss is simply the negative log-likelihood of $q(y|x)$:
$$
L = -\log q(y|x) = -\log\left[\frac {1}{\sqrt{2 \pi \sigma_\phi^2}} \exp\left(-\frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}\right)\right]
$$
Doing some simple calculations, we get that the loss is equal to
$$
L = \log \sigma_\phi + \frac{(y-\mu_\theta)^2}{2\sigma_\phi^2} + \textrm{const}
$$
As a sanity check, you can see that if we consider $\sigma_\phi$ to be a constant, we recover the $L_2$ loss.
The loss function written like this $$L = \log \sigma_\phi + \frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}$$ can have stability issues because of the $\log\sigma_\phi$ and $\sigma^{-2}_\phi$ terms.
This goes away by making the network predict the log-variance $l_\phi$ $$l_\phi = \log \sigma^2_\phi$$ Substituting this into the loss equation, we get (up to a constant factor of $\tfrac12$) $$L = l_\phi + (y-\mu_\theta)^2e^{-l_\phi}$$ This form is stable both numerically and during training.
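In code, the stable form is a one-liner (a minimal PyTorch sketch; `mu` and `logvar` are whatever two outputs your network produces per value):

```python
import torch

def gaussian_nll(mu: torch.Tensor, logvar: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # L = l + (y - mu)^2 * exp(-l), with l = log(sigma^2)
    return (logvar + (y - mu) ** 2 * torch.exp(-logvar)).mean()
```

PyTorch also ships `nn.GaussianNLLLoss`, which implements essentially the same objective, parameterized by the variance instead of the log-variance.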
Now comes the interesting part:
If we want to find the minimum with respect to $\sigma_\phi$, we just set its derivative to zero
$$
\frac{\partial L}{\partial \sigma_\phi} = \frac 1{\sigma_\phi} - \frac {(y-\mu_\theta)^2}{\sigma_\phi^3}=0
$$
And we get that the loss is at the minimum when
$$
\sigma_\phi^2 = (y-\mu_\theta)^2
$$
This means that $\sigma_\phi$ learns to estimate the expected prediction error!
This means that the gradient used to estimate the expected value of $y$ points in the same direction as the $L_2$ gradient, but each sample's contribution is scaled by $1/\sigma_\phi^2$, i.e. down-weighted where the expected error is large- which is exactly what you want!
In short, people shy away from using the $L_2$ loss in VAEs because it penalizes large errors more than $L_1$. Thus, using $L_2$ leads to blurry images and makes it harder for auxiliary losses such as LPIPS and GAN losses to do their job.
This led people to use the $L_1$ loss because it's more "kind".
That's not math, that's bro-science.
The loss I've just introduced fixes it!
As you'll see in a later section, the adaptive uncertainty estimation lets the model follow the $L_2$ loss where it's more confident in its prediction, and lean more on the auxiliary losses where it's less confident.
Problem solved.
To perform a sanity check, I ran a quick linear regression where the standard deviation of the samples was not constant. My goal was to see if a model trained with this new loss function could correctly infer the changing error band. And it did!
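For the curious, the sanity check was along these lines (a hypothetical toy reconstruction, not the exact script):

```python
import torch
from torch import nn

# 1-D data whose noise grows with x: y = 2x + 1 + eps, std(eps) = 0.1 + 0.5x
x = torch.rand(4096, 1) * 4
y = 2.0 * x + 1.0 + torch.randn_like(x) * (0.1 + 0.5 * x)

# A tiny net that outputs both the mean and the log-variance
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(2000):
    mu, logvar = model(x).chunk(2, dim=-1)
    loss = (logvar + (y - mu) ** 2 * torch.exp(-logvar)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# exp(logvar / 2) should now track the true, x-dependent standard deviation
```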
After some research, it seems that this method was already discovered
Curiously, this approach doesn't seem to be used in the image generation literature. I think I'm the first one to try it! With this renewed sense of purpose, I was ready to dive back in.
I started reading tons and tons and tons of stuff again. At some point I found this amazing blog post by OpenWorldLabs that explained how to train diffusable autoencoders.
It was a gold mine of information, so I implemented lots of stuff that was in there as well!
Finally, after blood, sweat, and tears, I now have a 96× compression VAE. Just look at the results: *dramatic music*
It's really good! Moreover, because the errors are weighted by the predicted uncertainty $\sigma_\phi$, training pushes the model to follow the auxiliary losses (LPIPS and GAN loss) more in areas of high uncertainty, and forces it to follow the $L_2$ loss where $\sigma_\phi$ is low.
Now that I had a decent VAE I was able to train the world model much faster and more efficiently, however...
It was still shit.
Even with the improved latents, it wouldn't work correctly. I'm still trying to figure out where the problem lies, but for now, the story pauses here. No flashy interactive demo- just lessons learned.
While we were working on this, Google revealed Genie 3. The titan won the strength contest. Their demo is breathtaking- and in many ways, exactly what I wanted to build.
But this doesn't discourage me. It only shows what's possible. If Genie 3 proves anything, it's that this approach is viable. There's still space to innovate- to build efficient world models that run on consumer hardware.
So if you've got spare GPU hours, a knack for coding, or just a wild idea you want to test- come help.
Matteo and I are building this one piece at a time, and it's way more fun with company.
After all, bees work best in swarms 🐝
Thanks to Davide Locatelli for teaching us how to tell good stories with blog posts. Check out his blog!
Thanks to Gianluigi Silvestri for the informal peer-review and tips.
We also want to thank Lambda and SF Compute for the free credits.
You can do so by sending an email to francesco215@live.it or by messaging me on Discord at sacco215.
For attribution in academic contexts, please cite this work as
Sacco, et al., "Oniris", zenodo, 2025
BibTeX citation
```bibtex
@article{sacco2025Oniris,
  author  = {Sacco, Francesco and Peluso, Matteo},
  title   = {Oniris},
  journal = {Zenodo},
  year    = {2025},
  doi     = {10.5281/zenodo.16927467},
  url     = {francesco215.github.io/autoregressive_diffusion/}
}
```