A small story about building a long-context World Model with no money
After leaving my job, I was officially unemployed, but don't worry, I had plenty of time to train some models since I'm already training to be a disappointment.
Originally, I wanted to build an AI for drones that's as smart as a bee. From first principles, this seemed doable: bees have tiny brains, and with just a brief exposure to the real world they can navigate reliably without ever getting lost.
In ML terms, this problem should require little compute and little data (in principle!).
But I needed an environment to train the RL policy in. Smashing real drones wasn't an option, and there were no good off-the-shelf long-context world models to train in. So I had to make one myself.
In this blog post, I'll talk about building that virtual environment. There won't be any RL yet- the Bee AI has to wait for now.
A world model is a model that predicts the future state of a (simulated) world given the past states and external inputs. There are many ways of building a world model.
When I started this project (around February), the hottest world modeling papers were DIAMOND and GameNGen.
It wasn't going to be easy. I knew Google was working on the same problem. I knew I was entering a strength competition against a titan. I knew I would lose- but I also knew I'd come out stronger than when I started!
Oh, and I needed a name for the project. I called it Oniris.
I didn't really want to train a VAE. Ideally, I'd just use an off-the-shelf one and focus on the diffusion part. But most video VAEs I found were… weird. None of them did exactly what I needed.
I needed a VAE that:
I also came across a paper arguing that video VAEs should be group causal
Problem was, I didn't like the authors' code. So, as if the project wasn't already time-consuming enough, I decided to write my own implementation from scratch.
I thought it'd be smart to test my world modeling approach on a “simple” environment. So I fired up Gymnasium and tried compressing Lunar Lander videos.
Yeah, not so simple.
Turns out when your dataset is 98% black background and a white floor, you can get a tiny loss and a bad reconstruction at the same time.
I spent weeks trying to get it working without any hacks, but the local minima were just too good.
That's when I met Matteo online. He loved the project and jumped in to help. Unlike me, he wasn't too bothered by the Bitter Lesson.
“What if we only calculate the loss on the top 0.5% of pixels with the highest error?”

At first, I hesitated- it felt like smuggling knowledge into the loss function. But in the end, the one who needed convincing wasn't me, it was the computer. So he tried it.
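A minimal sketch of that idea (a hypothetical helper, not our exact training code):

```python
import torch

def top_fraction_pixel_loss(recon: torch.Tensor, target: torch.Tensor,
                            frac: float = 0.005) -> torch.Tensor:
    # Squared error per pixel, flattened to [batch, num_values]
    err = ((recon - target) ** 2).flatten(1)
    # Keep only the `frac` fraction of values with the highest error
    k = max(1, int(err.shape[1] * frac))
    worst, _ = err.topk(k, dim=1)
    return worst.mean()
```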
For the first time, the VAE produced meaningful reconstructions. With that solved, I could finally move on to building the world model.
With frame compression solved, I needed next-frame generation.
To do next-frame generation I math-ed out what would happen if next-token prediction and diffusion modelling had a child.
Basically, diffusion models estimate the score function of the noised data distribution:
$$
s(x,\sigma)=\nabla_x \log p(x;\sigma)
$$
and LLMs estimate the log-probability of the next token given the past:
$$
\textrm{LLM}(x_i,\dots,x_0)=\log p(x_{i+1}\mid x_i,\dots,x_0)
$$
So to generate videos autoregressively, you estimate the score conditioned on previous frames. It's essentially image generation conditioned on history.
$$
s(x_i,\sigma;x_{i-1},\dots,x_0)=\nabla_{x_i} \log p(x_i\mid x_{i-1},\dots,x_0;\sigma)
$$
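In code, autoregressive generation then looks roughly like this (a sketch with a hypothetical `denoise(x, sigma, past_frames)` interface and a plain Euler sampler, not the actual Oniris sampling loop):

```python
import torch

def generate(denoise, num_frames, frame_shape, sigmas):
    """Generate frames one by one; each frame is produced by a full
    diffusion sampling loop conditioned on the previously generated frames."""
    frames = []
    for _ in range(num_frames):
        x = torch.randn(frame_shape) * sigmas[0]        # start from pure noise
        for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
            x0_hat = denoise(x, sigma, frames)          # model's guess of the clean frame
            d = (x - x0_hat) / sigma                    # direction given by the score estimate
            x = x + d * (sigma_next - sigma)            # Euler step toward lower noise
        frames.append(x)
    return torch.stack(frames)
```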
The real challenge wasn't just building the model- it was exploiting the transformer's ability to train in a parallelizable way across the sequence dimension. Without that, there was no chance of scaling to long sequences.
But why does sequence-parallel training matter?
When I first learned about language modeling, I made a rookie mistake: I trained a model to predict only the last token in a sentence. The result? It was shit. The model didn't learn a thing.
Why? Because transformers don't make just one prediction per forward pass- they make one prediction per token. If your sequence has 128 tokens, that's 128 supervised training signals, not just one. That parallelism is the difference between a working model and a random number generator.
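In code, the difference is just which positions you supervise (standard PyTorch cross-entropy; `logits` and `tokens` are assumed names):

```python
import torch.nn.functional as F

# logits: [batch, seq_len, vocab_size], tokens: [batch, seq_len]
# Position i is trained to predict token i+1, so every sequence yields
# seq_len - 1 training signals instead of just one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    tokens[:, 1:].reshape(-1),
)
```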
For Oniris, leveraging sequence-level parallelism wasn't optional- it was non-negotiable.
But there’s a catch: implementing this naively creates a mismatch. If you feed an entire noised sequence into a causal transformer during training, the model sees noisy context everywhere. At inference time, however, past frames are clean while only the current frame is noised. This mismatch breaks autoregressive video generation.
I needed a way to make sure that the information fed to each noisy frame came from clean past frames.
After some thinking, I figured out "the trick" for having sequence-parallelism during training and named it DART (Duplicate Augmented Replica Training).
Here is how it works: each training sequence is duplicated into a clean replica and a noised replica, and both are fed to the transformer at once. The noised frames are the ones being denoised, while the clean replica serves as context.
This way, each noisy frame can still learn from past clean frames, while preserving causal information flow. A carefully designed masking scheme enforces this (sketched below).
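Roughly, the masking rule can be written down like this (a frame-level sketch under my layout assumption of clean replicas followed by noisy replicas; the real mask also has to cover the spatial tokens inside each frame):

```python
import torch

def dart_frame_mask(T: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a sequence laid out
    as [clean_0, ..., clean_{T-1}, noisy_0, ..., noisy_{T-1}]."""
    allow = torch.zeros(2 * T, 2 * T, dtype=torch.bool)
    for i in range(T):
        allow[i, : i + 1] = True        # clean frame i: causal over clean frames
        allow[T + i, : i] = True        # noisy frame i: sees only the clean past
        allow[T + i, T + i] = True      # ...plus itself, the frame being denoised
    return allow
```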
At inference time things are simpler: standard causal masking suffices, and the model generates video autoregressively without mismatch.
One of the drawbacks of DART is the higher memory footprint: the token sequence doubles to $2\times T \times H\times W$, where $T, H, W$ are respectively the number of frames, the height, and the width of the frames, so the attention matrix grows accordingly.
To improve performance, I decided to use a UNet-like architecture.
But did it work? YES!
After ~4h of training I got the model to simulate the Lunar Lander gym environment!
And with that the Oniris architecture was now ready and tested! I was so excited! I was ready to scale up!
If you want to know more on how it works, check out the repository
So now, it was time to move on to the real deal. All I had to do was swap out the Gymnasium environment for Counter-Strike gameplay, increase the parameter count, and everything would work fine. Then the project would be complete! After all, we all know that scaling is foolproof and works 100% of the time without any hiccups whatsoever, right? Right?
My first serious run of training a Counter-Strike VAE on a 4090 went pretty well. I was so excited!
[t,h,w,3] -> [t/4,h/4,w/4,8]
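For reference, that shape change corresponds to the 24× compression ratio that comes up again below:
$$
\frac{t\cdot h\cdot w\cdot 3}{\tfrac{t}{4}\cdot\tfrac{h}{4}\cdot\tfrac{w}{4}\cdot 8}=\frac{3\cdot 4^3}{8}=24
$$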
For training the diffusion model we even secured compute credits:
It felt like swimming in money! We love you both ❤️
I managed to run some jobs with 8×H100s for about two hours, but… things weren't so straightforward. No matter what I tried, the diffusion model either refused to learn, or we simply didn't have enough compute. It was impossible to tell which.
By then, we had burned through about 10% of our credits. I considered going full YOLO, but something told me it would fail. So I checked the numbers from DIAMOND and GameNGen:
| Model | Compute Used | Equivalent (H100, bf16) | Works? |
|---|---|---|---|
| DIAMOND | 12 RTX 4090-days | ≈ 3 H100-days | ✅ |
| GameNGen | 180 TPUv5e-days | ≈ 3 H100-days | ✅ |
| Oniris (mine) | 16 H100-hours | ≈ 0.7 H100-days | ❌ |
So what was wrong? After a lot of head-scratching, I pinned the blame on the VAE. Its compression ratio was only 24×, and that probably wasn't enough.
Back to square one: retraining the VAE, this time aiming for a 96× compression ratio.
This is where things take a dark turn folks. Everything was shit. I tried to debug the model for months to no avail. I tried everything.
It's hard to convey with words the sheer amount of things I tried and NONE of them worked. The model simply would not compress beyond 24x and I needed 4 times as much! I was wasting so much time, running out of energy and still nothing seemed to work.
I realised I was going nowhere, so I chose to take a month off to reset.
During that time, Matteo and I started a small business called Noteician. We even got our first customers (check it out!).
I tried to resist the urge to go back to work on Oniris, but then, one night something surreal pulled me back to the project…
One night, I dreamt of deriving the Mean Squared Error (MSE) loss.
In the dream, I saw that its derivation hinged on the assumption of constant data uncertainty, where the only goal is to estimate the mean.
Isn't this too restrictive? Why shouldn't our model also account for the uncertainty itself? Surely, there had to be a way.
Then I remembered that for classification tasks the cross-entropy loss implicitly handles uncertainty. This sparked an idea:
Could I achieve the same for regression by calculating the negative log-likelihood of a Gaussian distribution, but with a non-constant standard deviation?
When I woke up, I jumped to my desk, took pen and paper, wrote down the Gaussian probability distribution, and calculated the loss function.
Suppose we have some datapoints $x_i$ and $y_i$, and we want to predict the ground-truth probability $p(y|x)$.
We don't have direct access to $p$, so we are going to approximate $p(y|x)$ with $q(y|x)$, a Gaussian whose mean and variance are learned functions $\mu_\theta(x)$ and $\sigma_\phi(x)$:
$$
q(y|x) = \frac {1}{\sqrt{2 \pi \sigma_\phi^2}} \exp\left(-\frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}\right)
$$
The loss is simply the negative log-likelihood of $q(y|x)$:
$$
L = -\log q(y|x) = -\log\left[\frac {1}{\sqrt{2 \pi \sigma_\phi^2}} \exp\left(-\frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}\right)\right]
$$
Doing some simple calculations, we get that the loss is equal to
$$
L = \log \sigma_\phi + \frac{(y-\mu_\theta)^2}{2\sigma_\phi^2} + \textrm{const}
$$
As a sanity check, you can see that if we consider $\sigma_\phi$ to be a constant, we recover the $L_2$ loss.
The loss function written like this $$L = \log \sigma_\phi + \frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}$$ can have stability issues because of the $\log\sigma_\phi$ and $\sigma^{-2}_\phi$ terms.
This goes away by making the network predict the log-variance $l_\phi$ $$l_\phi = \log \sigma^2_\phi$$ Substituting this into the loss equation, we get (up to a constant factor of $\tfrac12$) $$L = l_\phi + (y-\mu_\theta)^2e^{-l_\phi}$$ This form is stable both numerically and during training.
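In code, the stable form is a one-liner (a minimal PyTorch sketch; `mu` and `logvar` are whatever two outputs your network produces per value):

```python
import torch

def gaussian_nll(mu: torch.Tensor, logvar: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # L = l + (y - mu)^2 * exp(-l), with l = log(sigma^2)
    return (logvar + (y - mu) ** 2 * torch.exp(-logvar)).mean()
```

PyTorch also ships `nn.GaussianNLLLoss`, which implements essentially the same objective, parameterized by the variance instead of the log-variance.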
Now comes the interesting part:
If we want to find the minimum with respect to $\sigma_\phi$, we just set its derivative to zero
$$
\frac{\partial L}{\partial \sigma_\phi} = \frac 1{\sigma_\phi} - \frac {(y-\mu_\theta)^2}{\sigma_\phi^3}=0
$$
And we get that the loss is at the minimum when
$$
\sigma_\phi^2 = (y-\mu_\theta)^2
$$
This means that $\sigma_\phi$ learns to estimate the expected prediction error!
This means that the gradient used to estimate the expected value of $y$ points in the same direction as the $L_2$ gradient, but each sample's contribution is scaled by $1/\sigma_\phi^2$, i.e. down-weighted where the expected error is large- which is exactly what you want!
In short, people shy away from using the $L_2$ loss in VAEs because it penalizes large errors more than $L_1$. Thus, using $L_2$ leads to blurry images and makes it harder for auxiliary losses such as LPIPS and GAN losses to do their job.
This led people to use the $L_1$ loss because it's more "kind".
That's not math, that's bro-science.
The loss I've just introduced fixes it!
As you'll see in a later section, the adaptive uncertainty estimation lets the model follow the $L_2$ loss where it's more confident in its prediction, and lean more on the auxiliary losses where it's less confident.
Problem solved.
To perform a sanity check, I ran a quick linear regression where the standard deviation of the samples was not constant. My goal was to see if a model trained with this new loss function could correctly infer the changing error band. And it did!
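For the curious, the sanity check was along these lines (a hypothetical toy reconstruction, not the exact script):

```python
import torch
from torch import nn

# 1-D data whose noise grows with x: y = 2x + 1 + eps, std(eps) = 0.1 + 0.5x
x = torch.rand(4096, 1) * 4
y = 2.0 * x + 1.0 + torch.randn_like(x) * (0.1 + 0.5 * x)

# A tiny net that outputs both the mean and the log-variance
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(2000):
    mu, logvar = model(x).chunk(2, dim=-1)
    loss = (logvar + (y - mu) ** 2 * torch.exp(-logvar)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# exp(logvar / 2) should now track the true, x-dependent standard deviation
```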
After some research, it seems that this method was already discovered
Curiously, this approach doesn't seem to be used in the image generation literature. I think I'm the first one to try it! With this renewed sense of purpose, I was ready to dive back in.
I started reading tons and tons and tons of stuff again. At some point I found this amazing blog post by OpenWorldLabs that explained how to train diffusable autoencoders.
It was a gold mine of information, so I implemented lots of stuff that was in there as well!
Finally, after blood, sweat, and tears, I now have a 96× compression VAE. Just look at the results: *dramatic music*
It's really good! Moreover, because the errors are weighted by the predicted uncertainty $\sigma_\phi$, training pushes the model to follow the auxiliary losses (LPIPS and GAN loss) more in areas of high uncertainty, and forces it to follow the $L_2$ loss where $\sigma_\phi$ is low.
Now that I had a decent VAE I was able to train the world model much faster and more efficiently, however...
It was still shit.
Even with the improved latents, it wouldn't work correctly. I'm still trying to figure out where the problem lies, but for now, the story pauses here. No flashy interactive demo- just lessons learned.
While we were working on this, Google revealed Genie 3. The titan won the strength contest. Their demo is breathtaking- and in many ways, exactly what I wanted to build.
But this doesn't discourage me. It only shows what's possible. If Genie 3 proves anything, it's that this approach is viable. There's still space to innovate- to build efficient world models that run on consumer hardware.
So if you've got spare GPU hours, a knack for coding, or just a wild idea you want to test- come help.
Matteo and I are building this one piece at a time, and it's way more fun with company.
After all, bees work best in swarms 🐝
Thanks to Davide Locatelli for teaching us how to tell good stories with blog posts. Check out his blog!
Thanks to Gianluigi Silvestri for the informal peer-review and tips.
We also want to thank Lambda and SF Compute for the free credits.
You can do so by sending an email to francesco215@live.it or by messaging me on Discord at sacco215.
For attribution in academic contexts, please cite this work as
Sacco, et al., "Oniris", zenodo, 2025
BibTeX citation
```bibtex
@article{sacco2025Oniris,
  author  = {Sacco, Francesco and Peluso, Matteo},
  title   = {Oniris},
  journal = {Zenodo},
  year    = {2025},
  doi     = {10.5281/zenodo.16927467},
  url     = {francesco215.github.io/autoregressive_diffusion/}
}
```