Oniris

A small story about building a long-context World Model with no money

Authors and affiliations

Francesco Sacco (POV), no affiliation

Matteo Peluso, University of Zurich

Date

August 22nd, 2025

DOI

10.5281/zenodo.16927467

Where it all started: The Humble Bee

After leaving my job, I was officially unemployed, but don't worry, I had plenty of time to train some models since I'm already training to be a disappointment.

Apis mellifera, commonly known as the honey bee, with a custom-made VR headset. It allows our little user to learn how to fly without bumping around.

A bee brain has less compute than a modern smartphone chip. Bees also don't need extended periods of training. Even if you assume that part of their ability is encoded in their DNA, it's worth noting that a bee's genome is equivalent to 1.2 Gb of data. That's not a lot of data.

Originally, I wanted to build an AI for drones that's as smart as a bee. From first principles, this seemed doable: bees have tiny brains, and with just a brief exposure to the real world they can navigate reliably without ever getting lost.
In ML terms, this problem should require little compute and little data (in principle!).

But I needed an environment to train the RL policy in. Smashing real drones wasn't an option, and there were no good off-the-shelf long-context world models to train in. So I had to make one myself.


In this blog post, I'll talk about building that virtual environment. There won't be any RL yet; the Bee AI will have to wait for now.

How to build a world model

A world model is a model that predicts the future state of a (simulated) world given the past states and external inputs.
There are many ways of building a world model, but to me the most natural one is predicting future frames based on past frames.

To build this type of world model you need two components: a VAE and an autoregressive diffusion model. The VAE compresses the video into a smaller latent representation, and the autoregressive diffusion model generates the actual video in that latent space.

When I started this project (around February), the hottest world modeling papers were DIAMOND and GameNGen. But both suffered from extreme amnesia, so I decided to try something better.

It wasn't going to be easy. I knew Google was working on the same problem. I knew I was entering a strength competition against a titan. I knew I would lose, but I also knew I'd come out stronger than when I started!

Oh, and I needed a name for the project. I called it Oniris.

The first version of the VAE

I didn't really want to train a VAE. Ideally, I'd just use an off-the-shelf one and focus on the diffusion part. But most video VAEs I found were… weird. None of them did exactly what I needed.

I needed a VAE that:

I also came across a paper arguing that video VAEs should be group causal. The reason is that you can't really do a fully causal VAE that has non-zero time compression.
Problem was, I didn't like the authors' code. So, as if the project wasn't already time-consuming enough, I decided to write my own implementation from scratch.

An interactive visualization of the group-causal VAE architecture.
The encoder (left) compresses the input into a latent space with 4x time compression, and the decoder (right) reconstructs it.

When you hover over one of the nodes you will see highlighted all of the nodes that causally affect or are causally affected by it.

Now, try to spot the difference between causal and group-causal.

Even when the network tries to be fully causal, it is unable to do so because of the time compression.

For example, if you highlight the last node of a group on the encoder side, you will see that it affects the first node of the same group on the decoder side, even if the VAE is set to be fully causal.

The group-causal VAE is fully causal in latent space and group-causal in pixel space.
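To make this concrete, here is a minimal sketch (my own illustration, not the actual Oniris code) of a group-causal temporal convolution with a time compression of 4: it pads only on the left of the time axis, so latent frame $k$ never sees pixel frames beyond the end of its own group.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupCausalConv3d(nn.Module):
    """Temporal downsampling conv that is causal at the group level:
    latent frame k only sees pixel frames up to the end of group k."""
    def __init__(self, c_in, c_out, kernel_t=8, group=4, kernel_hw=3):
        super().__init__()
        assert kernel_t >= group
        self.pad_t = kernel_t - group  # left-only temporal padding
        self.conv = nn.Conv3d(
            c_in, c_out,
            kernel_size=(kernel_t, kernel_hw, kernel_hw),
            stride=(group, 1, 1),
            padding=(0, kernel_hw // 2, kernel_hw // 2),
        )

    def forward(self, x):  # x: [B, C, T, H, W], with T divisible by `group`
        # F.pad order for 5D input: (W_left, W_right, H_left, H_right, T_left, T_right)
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))
        return self.conv(x)  # -> [B, c_out, T // group, H, W]
```

With `kernel_t=8` and `group=4`, latent frame $k$ sees pixel frames $4k-4$ through $4k+3$: its own group plus the previous one, but never the future.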
Nothing is Ever Simple

I thought it'd be smart to test my world modeling approach on a “simple” environment. So I fired up Gymnasium and tried compressing Lunar Lander videos.

Yeah, not so simple.

The fuzziness of the lander seemed impossible to remove. It was there no matter the choice of hyperparameters or the choice of $L_1$ vs $L_2$ loss. It was so frustrating.

Turns out when your dataset is 98% black background and a white floor, you can get a tiny loss and a bad reconstruction at the same time.

The Trick

I spent weeks trying to get it working without any hacks, but the local minima were just too good.

That's when I met Matteo online. He loved the project and jumped in to help. Unlike me, he wasn't too bothered by the Bitter Lesson. Instead, he suggested:

“What if we only calculate the loss on the top 0.5% of pixels with the highest error?”
At first, I hesitated: it felt like smuggling knowledge into the loss function. But in the end, the one who needed convincing wasn't me, it was the computer. So he tried it.
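For concreteness, the trick looks roughly like the sketch below. The "top 0.5% of pixels by error" idea is from the post; the use of absolute error and per-sample selection are my own assumptions.

```python
import torch

def topk_pixel_loss(pred, target, frac=0.005):
    """Reconstruction loss computed only on the fraction `frac` of pixels
    with the largest error, so the easy background pixels are ignored."""
    err = (pred - target).abs().flatten(start_dim=1)  # [B, num_pixels]
    k = max(1, int(frac * err.shape[1]))
    worst, _ = err.topk(k, dim=1)                     # hardest k pixels per sample
    return worst.mean()
```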

And it worked! Beautifully so!

Quality of the VAE reconstruction if we only calculate the loss on the top 0.5% of pixels with the highest error. It even works in different environments!

For the first time, the VAE produced meaningful reconstructions. With that solved, I could finally move on to building the world model.

The World Model

With frame compression solved, I needed next-frame generation.

To do next-frame generation I math-ed out what would happen if next-token prediction and diffusion modelling had a child. Basically, diffusion models estimate the score function $$ s(x,\sigma)=\nabla_x \log p(x;\sigma) $$ while LLMs are trained to predict the next token given the past by minimizing its negative log-likelihood $$ \mathcal L_{\textrm{LLM}}=-\log p(x_{i+1}\mid x_i,\dots,x_0) $$ So to generate videos autoregressively, you estimate the score of the current frame conditioned on the previous frames. It's essentially image generation conditioned on history. $$ s(x_i,\sigma;x_{i-1},\dots,x_0)=\nabla_{x_i} \log p(x_i\mid x_{i-1},\dots,x_0;\sigma) $$
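In code, the sampling loop implied by this formulation looks roughly like the sketch below. It assumes a VE-style denoiser that predicts the clean frame; `denoiser` and `sigmas` are hypothetical names, not the actual Oniris API.

```python
import torch

@torch.no_grad()
def generate_frames(denoiser, context, n_new, sigmas):
    """Autoregressive video generation: each new frame comes from a small
    diffusion loop whose denoiser is conditioned on all past (clean) frames."""
    frames = list(context)                              # clean history, each [C, H, W]
    for _ in range(n_new):
        x = torch.randn_like(frames[-1]) * sigmas[0]    # start the new frame from noise
        for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
            past = torch.stack(frames)                  # condition on the clean history
            x0_hat = denoiser(x, sigma, past)           # predicted clean frame
            x = x0_hat + (x - x0_hat) * sigma_next / sigma  # DDIM-style update
        frames.append(x)
    return torch.stack(frames[len(context):])
```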

The Real Challenge: Sequence-Level Parallelism

The real challenge wasn't just building the model: it was exploiting the transformer's ability to train in a parallelizable way across the sequence dimension. Without that, there was no chance of scaling to long sequences.

Unlike RNNs, transformers predict all tokens in parallel during training: each position in the sequence provides a supervised signal simultaneously. This sequence-level parallelism gives transformers far greater training efficiency.

But why does sequence-parallel training matter?

When I first learned about language modeling, I made a rookie mistake: I trained a model to predict only the last token in a sentence. The result? It was shit. The model didn't learn a thing.

Why? Because transformers don't make just one prediction per forward pass; they make one prediction per token. If your sequence has 128 tokens, that's 128 supervised training signals, not just one. That parallelism is the difference between a working model and a random number generator.

For Oniris, leveraging sequence-level parallelism wasn't optional; it was non-negotiable.

But there’s a catch: implementing this naively creates a mismatch. If you feed an entire noised sequence into a causal transformer during training, the model sees noisy context everywhere. At inference time, however, past frames are clean while only the current frame is noised. This mismatch breaks autoregressive video generation.

Wrong approach: during training the model sees every frame noised, while during inference all but the last frame are clean. This mismatch leads to poor performance.

I needed a way to make sure that the information given to each noisy frame came from clean past frames.

DART: duplicate the sequence. One copy is noised, and the loss is evaluated on it. The other copy stays clean and provides context. Information flows only from past clean frames into the noised ones. See the next image for a detailed view of how information flows.
DART: Duplicate Augmented Replica Training

After some thinking, I figured out "the trick" for having sequence-parallelism during training and named it DART (Duplicate Augmented Replica Training).

Here is how it works: during training, the sequence of frames is duplicated. One copy is noised, and the loss is evaluated on it; the other copy stays clean and only provides context. Attention is masked so that information flows exclusively from past clean frames into the noised ones.

This way, each noisy frame can still learn from past clean frames, while preserving causal information flow. A carefully designed masking scheme enforces this.

At inference time things are simpler: standard causal masking suffices, and the model generates video autoregressively without mismatch.
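Here is my reading of that masking scheme as a boolean attention mask at the frame level (True means attention is allowed). The index convention and the choice of letting each noised frame attend to itself are my assumptions, and in practice the mask would be broadcast over the spatial tokens inside each frame.

```python
import torch

def dart_attention_mask(T: int) -> torch.Tensor:
    """[2T, 2T] boolean mask for DART training.
    Tokens 0..T-1 are the clean copy, tokens T..2T-1 the noised copy.
    Clean frames attend causally to clean frames; noised frame i attends to
    strictly-past clean frames and to itself, never to other noised frames."""
    ones = torch.ones(T, T, dtype=torch.bool)
    mask = torch.zeros(2 * T, 2 * T, dtype=torch.bool)
    mask[:T, :T] = torch.tril(ones)                  # clean i  -> clean j <= i
    mask[T:, :T] = torch.tril(ones, diagonal=-1)     # noised i -> clean j <  i
    mask[T:, T:] = torch.eye(T, dtype=torch.bool)    # noised i -> itself only
    return mask
```

A mask like this can be passed as the boolean `attn_mask` of `torch.nn.functional.scaled_dot_product_attention`, where True marks the positions that are allowed to attend.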

This interactive visualization explains how attention masking in DART works. Hover over any box to see its connections. Click on a noisy query to see how one step of autoregressive inference works and how it relates to a step in the training mode.
UNet-based DART architecture, inspired by prior work. The first layers, in light gray, are convolutional layers; closer to the bottleneck, self-attention layers (in dark grey) appear; and in the bottleneck itself the DART layers (colored) sit. This way the information is compressed before being passed to the DART layers.


One of the drawbacks of DART is the higher memory footprint. The attention now runs over $2\times T \times H\times W$ tokens, where $T$, $H$, $W$ are respectively the number of frames, the height, and the width of the frames, so the attention matrix is four times larger than it would be without the duplication.

To improve performance I decided to use a UNet-like architecture. This allowed me to save lots and lots of memory by using DART only in the bottleneck of the UNet. The result: with DART I could train with full sequence-level parallelism, without the train/inference mismatch, while keeping the memory footprint manageable.

But did it work? YES!

In the image above, the first three rows are given as context and the next three rows are generated by Oniris. As you can see, the model works pretty well!

After ~4h of training I got the model to simulate the lunar lander gym environment!

And with that the Oniris architecture was now ready and tested! I was so excited! I was ready to scale up!

If you want to know more on how it works, check out the repository

Scaling Up

So now it was time to move on to the real deal. All I had to do was swap out the Gymnasium environment for gameplay of Counter-Strike, increase the parameter count, and everything would work fine. Then the project would be complete! After all, we all know that scaling is foolproof and works 100% of the time without any hiccups whatsoever, right? Right?

My first serious run of training a Counter-Strike VAE on a 4090 went pretty well. I was so excited!

Training run for the Counter-Strike VAE. It has a compression factor of 24×: $[t,h,w,3] \to [t/4,\,h/4,\,w/4,\,8]$.

For training the diffusion model we even secured compute credits:

It felt like swimming in money! We love you both ❤️

SF Compute and Lambda logos.
Also, the prices on SF Compute were really low! That meant the $300 were worth way more!

I managed to run some jobs with 8×H100s for about two hours, but… things weren't so straightforward. No matter what I tried, the diffusion model either refused to learn, or we simply didn't have enough compute. It was impossible to tell which.

Training dashboard for one of my compute-heavy Counter-Strike runs. As you can see, the average loss plateaus and the generated images are very fuzzy.

By then, we had burned through about 10% of our credits. I considered going full YOLO, but something told me it would fail. So I checked the numbers from DIAMOND and GameNGen:

| Model | Compute Used | Equivalent (H100, bf16) | Works? |
|---|---|---|---|
| DIAMOND | 12 RTX 4090-days | ≈ 3 H100-days | Yes |
| GameNGen | 180 TPUv5e-days | ≈ 3 H100-days | Yes |
| Oniris (mine) | 16 H100-hours | ≈ 0.7 H100-days | Not yet |

So what was wrong? After a lot of head-scratching, I pinned the blame on the VAE. Its compression ratio was only 24×, and that probably wasn't enough.

Back to square one: retraining the VAE, this time aiming for a 96× compression ratio.
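For reference, the compression ratio is just the number of input values divided by the number of latent values. One latent layout that would hit 96× (my example; the actual layout isn't stated here) is $[t,h,w,3] \to [t/4,\,h/8,\,w/8,\,8]$:

$$ \frac{3\,t\,h\,w}{8\cdot\frac{t}{4}\cdot\frac{h}{8}\cdot\frac{w}{8}} = \frac{3\cdot 4\cdot 8\cdot 8}{8} = 96 $$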

The Valley of Despair
Training results of the VAE with 96x compression. Pure garbage.

This is where things take a dark turn folks. Everything was shit. I tried to debug the model for months to no avail. I tried everything.

It's hard to convey with words the sheer amount of things I tried, and NONE of them worked. The model simply would not compress beyond 24x, and I needed four times that. I was wasting so much time, running out of energy, and still nothing seemed to work.

The commit graph during The Valley of Despair. The sheer amount of dead ends is, frankly, defeating.
Taking a break

I realised I was going nowhere, so I chose to take a month-long break to reset.

During that time, Matteo and I started a small business called Noteician. We even got our first customers (check it out!).

I tried to resist the urge to go back to work on Oniris, but then, one night something surreal pulled me back to the project…

The Dream

One night, I dreamt of deriving the Mean Squared Error (MSE) loss.
In the dream, I saw that its derivation hinged on the assumption of constant data uncertainty, where the only goal is to estimate the mean.
Isn't this too restrictive? Why shouldn't our model also account for the uncertainty itself? Surely, there had to be a way.

Then I remembered that for classification tasks the cross-entropy loss implicitly handles uncertainty. This sparked an idea:

Could I achieve the same for regression by calculating the negative log-likelihood of a Gaussian distribution, but with a non-constant standard deviation?

When I woke up I jumped to my desk, took pen and paper, wrote down the Gaussian probability distribution, and calculated the loss function.

The Theory

Suppose we have some datapoints $X_i$ and $Y_i$, and we want to predict the ground-truth probability $p(y|x)$.

We don't have direct access to $p$, so we are going to approximate $p(y|x)$ with $q(y|x)$, a Gaussian whose mean and variance are learned functions $\mu_\theta(x)$ and $\sigma_\phi(x)$: $$ q(y|x) = \frac {1}{\sqrt{2 \pi \sigma_\phi^2}} \exp\left(-\frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}\right) $$ The loss is simply the negative log-likelihood of $q(y|x)$: $$ L = -\log q(y|x) = -\log\left[\frac {1}{\sqrt{2 \pi \sigma_\phi^2}} \exp\left(-\frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}\right)\right] $$ Doing some simple calculations, we get that the loss is equal to $$ L = \log \sigma_\phi + \frac{(y-\mu_\theta)^2}{2\sigma_\phi^2} + \textrm{const} $$ As a sanity check, you can see that if we take $\sigma_\phi$ to be a constant, we recover the $L_2$ loss.

A small footnote about computational stability:

The loss function written like this $$L = \log \sigma_\phi + \frac{(y-\mu_\theta)^2}{2\sigma_\phi^2}$$ can have stability issues because of the $\log\sigma_\phi$ and $\sigma^{-2}_\phi$ terms.
This goes away by making the network predict the log-variance $l_\phi$ $$l_\phi = \log \sigma^2_\phi$$ Substituting this into the loss equation, we get (up to an overall factor of $\tfrac12$, which doesn't affect the optimum) $$L = l_\phi + (y-\mu_\theta)^2e^{-l_\phi}$$ This form is stable both numerically and during training.
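In PyTorch, the stable log-variance form is a couple of lines (a minimal sketch of the loss above):

```python
import torch

def gaussian_nll(mu, logvar, target):
    """Heteroscedastic regression loss in the stable log-variance form:
    logvar + (target - mu)^2 * exp(-logvar), averaged over all elements."""
    return (logvar + (target - mu) ** 2 * torch.exp(-logvar)).mean()
```

PyTorch also ships a built-in equivalent, `torch.nn.GaussianNLLLoss`, which takes the variance instead of the log-variance.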

Properties of this Loss

Now comes the interesting part:
If we want to find the minimum with respect to $\sigma_\phi$, we just set its derivative to zero $$ \frac{\partial L}{\partial \sigma_\phi} = \frac 1{\sigma_\phi} - \frac {(y-\mu_\theta)^2}{\sigma_\phi^3}=0 $$ And we get that the loss is at its minimum when $$ \sigma_\phi^2 = (y-\mu_\theta)^2 $$ This means that $\sigma_\phi$ learns to estimate the expected prediction error!

On the other hand, if we calculate the gradient with respect to $\mu_\theta$ we get $$ \frac{\partial L}{\partial \mu_\theta} = -\frac{y-\mu_\theta}{\sigma_\phi^2} $$

Dummy dataset where I tested this loss function. As you can see, the model manages to predict the mean and variance of the datapoints perfectly.
I've written a colab if you want to check it out. Be careful: there are several local minima, so you might need to run the training several times.

This means that the gradient used to estimate the expected value of $y$ has the same form as the one of the $L_2$ loss, but here each sample is weighted by the inverse of its predicted error variance $\sigma_\phi^2$, which is exactly what you want! This also settles the $L_1$ vs $L_2$ debate.

In short, people shy away from using the $L_2$ loss in VAEs because it penalizes large errors more than $L_1$ does. Using $L_2$ thus leads to blurry images and makes it harder for auxiliary losses such as LPIPS and GAN losses to do their job.
This led people to use the $L_1$ loss because it's more "kind".

That's not math, that's bro-science.

The loss I've just introduced fixes it!

As you'll see in a later section, the adaptive uncertainty estimation lets the model follow the $L_2$ loss where it's more confident in its prediction, and lean more on the auxiliary losses where it's less confident.

Problem solved.

Sanity check

To perform a sanity check, I ran a quick linear regression where the standard deviation of the samples was not constant. My goal was to see if a model trained with this new loss function could correctly infer the changing error band. And it did!
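The check itself fits in a few lines. This is a minimal reconstruction of the idea, not the actual colab, and the data-generating function is made up:

```python
import torch
import torch.nn as nn

# Synthetic data: y = 2x + 1, with noise whose std grows linearly with x
x = torch.rand(2048, 1) * 4
true_std = 0.1 + 0.5 * x
y = 2 * x + 1 + torch.randn_like(x) * true_std

# Tiny network with two outputs: predicted mean and predicted log-variance
model = nn.Sequential(nn.Linear(1, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(3000):
    mu, logvar = model(x).chunk(2, dim=-1)
    loss = (logvar + (y - mu) ** 2 * torch.exp(-logvar)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, exp(0.5 * logvar) should track the true std (0.1 + 0.5 * x)
with torch.no_grad():
    mu, logvar = model(x).chunk(2, dim=-1)
    print((logvar.mul(0.5).exp() - true_std).abs().mean())
```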

After some research, it turns out this method had already been discovered; it even has a PyTorch page.

Curiously, this approach doesn't seem to be used in the image generation literature. I think I'm the first one to try it there! With this renewed sense of purpose, I was ready to dive back in.

Back to the VAE

I started reading again tons and tons and tons of stuff. At some point I found this amazing blog post by OpenWorldLabs that explained how to train diffusable autoencoders.
It was a gold mine of information, so I implemented lots of stuff that was in there as well!

Finally, after blood, sweat and tears, I now have a 96x compression VAE. Just look at the results: *dramatic music*

The first row shows the original frames $y$, the second row the reconstructed image $\mu_\theta$, and the third row the uncertainty $\sigma_\phi$.
The quality of this new VAE is substantially better! As you can see, the areas where the model is less precise (mainly some edges and tree leaves) are highlighted in the uncertainty heatmap.

It's really good! Moreover, because the errors are weighted by the predicted uncertainty $\sigma_\phi$, the learning process pushes the model to follow the auxiliary losses (LPIPS and GAN loss) more in the areas of high uncertainty, and forces it to follow the $L_2$ loss where $\sigma_\phi$ is low.

Back to the World Model

Now that I had a decent VAE I was able to train the world model much faster and more efficiently, however...

It was still shit.

Training dashboard of the world model with the new VAE. The model doesn't seem to learn effectively: it's not pure noise, but clearly something is not working correctly.

Even with the improved latents, it wouldn't work correctly. I'm still trying to figure out where the problem lies, but for now, the story pauses here. No flashy interactive demo, just lessons learned.

Where are we now

While we were working on this, Google revealed Genie 3. The titan won the strength contest. Their demo is breathtaking, and in many ways exactly what I wanted to build.

But this doesn't discourage me. It only shows what's possible. If Genie 3 proves anything, it's that this approach is viable. There's still space to innovate: to build efficient world models that run on consumer hardware.

So if you've got spare GPU hours, a knack for coding, or just a wild idea you want to test, come help.
Matteo and I are building this one piece at a time, and it's way more fun with company.
After all, bees work best in swarms 🐝

Acknowledgements

Thanks to Davide Locatelli for teaching us how to tell good stories with blog posts. Check out his blog!
Thanks to Gianluigi Silvestri for the informal peer-review and tips.
We also want to thank Lambda and SF Compute for the free credits.

Contact

You can reach us by sending an email to francesco215@live.it or by messaging sacco215 on Discord.

Citation

For attribution in academic contexts, please cite this work as

      Sacco, et al., "Oniris", zenodo, 2025
    

BibTeX citation

      @article{sacco2025Oniris,
        author = {Sacco, Francesco and Peluso, Matteo},
        title = {Oniris},
        journal = {Zenodo},
        year = {2025},
        doi = {10.5281/zenodo.16927467},
        url = {francesco215.github.io/autoregressive_diffusion/}
      }