Author: shayk

Diffusion large language models (dLLMs)
Almost every field in ML has been affected by the advance of LLMs, and while the free and commercial models keep improving, they all follow a similar process for generating their results. Diffusion large language models (dLLMs) are a new paradigm for token generation, promising enough to be considered a new approach for building large language models that can be cheaper and potentially better for tasks like code generation. Let’s have a look at diffusion large language models and how they work.

The current stack of decoding in LLMs

Transformers are the building block of LLMs despite all the improvements and changes that have been achieved in recent years. Transformers are made up of 2 main components: encoder and decoder. Most of the existing LLMs are only built using the decoder component, and that’s why they are sometimes referred to as decoder-only transformers.

The process of decoding in this approach is fully causal and auto-regressive, meaning that the decoder generates tokens strictly left to right, and to generate the next token it needs to digest all the previously generated tokens as well.

If you think about it, this makes sense: natural language is causal, and choosing the next word depends on the previously picked words.

The following illustration shows what the process of decoding looks like in the transformers:

A simple visualization of how AR decoding works. At each step the tokens that are not masked (<m>) are fed to the model to predict the next token, and each generated token is then used to predict the following one, until an <end> token is generated
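To make the loop concrete, here is a minimal sketch of greedy auto-regressive decoding. The `toy_next_token` function is a hypothetical stand-in for a real LLM (it just looks up the last token in a small table), so only the shape of the loop matters here.

```python
from typing import Callable, List

END = "<end>"


def toy_next_token(prefix: List[str]) -> str:
    """Hypothetical stand-in for an LLM: picks the next token from a lookup table."""
    table = {"<start>": "the", "the": "cat", "cat": "sleeps", "sleeps": END}
    return table.get(prefix[-1], END)


def greedy_decode(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int = 16,
) -> List[str]:
    """Generate strictly left to right, one token per model call."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # the model sees every token generated so far
        tokens.append(token)
        if token == END:
            break
    return tokens


print(greedy_decode(toy_next_token, ["<start>"]))
# ['<start>', 'the', 'cat', 'sleeps', '<end>']
```

Note that producing N new tokens takes N model calls here, which is exactly the cost that diffusion decoding tries to reduce.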

An introduction to diffusion and discrete diffusion

Before getting into the diffusion large language models, we need to have a basic understanding of what diffusion means and how it can be used for discrete structures like languages.

Continuous diffusion

Diffusion models were first popularized for image generation tasks, in which the model is trained to generate an image from pure random Gaussian noise. Some of the first impressive image generation models, like DALL-E 2, were based on diffusion generation as well.

In this approach, the goal is to train the model to learn how to de-noise any given data in a specific domain that it is trained on. To achieve this, we gradually add noise to the input image in T time-steps and then in the decoding process, teach the model to remove the noise while being aware of the timestep, so the model will try to guess and remove the noise that was added in a specific timestep to the image.

We can write this process using mathematics as well. For the forward noising process we have:

$$ q(x_{1:T}|x_0) = \prod^{T}_{t=1}q(x_t|x_{t-1}) $$

This is describing the process of adding noise to the original x₀ data over T time-steps in which each time-step is only directly depending on the previous time-step.
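As a rough sketch of this forward chain, the snippet below assumes the common DDPM-style Gaussian step, where each value is scaled by sqrt(1 - beta_t) and noise with variance beta_t is added. The linear beta schedule and the function names are assumptions made for illustration; the text above does not fix a particular noise schedule.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed schedule: beta grows linearly from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_step(x_prev, beta, rng):
    """One step of q(x_t | x_{t-1}): shrink the signal and add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    noise_scale = math.sqrt(beta)
    return [keep * v + noise_scale * rng.gauss(0.0, 1.0) for v in x_prev]


def forward_process(x0, betas, rng):
    """Return the whole trajectory [x_0, x_1, ..., x_T]."""
    trajectory = [list(x0)]
    for beta in betas:
        # each state depends only on the previous one (Markov chain)
        trajectory.append(forward_step(trajectory[-1], beta, rng))
    return trajectory


rng = random.Random(0)
trajectory = forward_process([1.0, -1.0, 0.5], linear_beta_schedule(1000), rng)
print(len(trajectory))  # 1001 states: x_0 plus T = 1000 noisy versions
```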

For the backward de-noising process, we need the joint probability of the entire de-noising sequence, starting from pure noise x_T and ending at the data x₀:

$$ p_{\Theta}(x_{0:T}) = p_{\Theta}(x_T)\prod^T_{t = 1} p_{\Theta}(x_{t-1} | x_t) $$

This backward process is parameterized and is learned during training.

This process can generate a valid sample that is close to the domain the model was trained on. For example, if the training dataset consists only of images of cars, then given pure noise the model will generate an image of a car that doesn’t necessarily exist in the training set, but is close to the ones that do.

Discrete diffusion

Discrete diffusion is a little different from the continuous case. Instead of adding Gaussian noise, we have a discrete set of items that are masked over consecutive time-steps and then unmasked to form a valid construct again, like building a sentence from a given set of words.

The general process is still the same. We have an input sequence that should be corrupted, so we define a special masking token, and at each timestep we randomly replace some tokens with it.

Using mathematical notation, the forward process is:

$$ q(x_t | x_{t-1}) = \mathrm{Cat}(x_t; Q_t^{\top} x_{t-1}) $$

This shows that the corrupted sequence at time-step t is based on the sequence at the previous step, and that a transition matrix Q_t decides which tokens in the input sequence are replaced by the masking token. The transition matrix itself is defined as follows:

$$ Q_t = (1 - \beta_t) I + \beta_t \mathbf{1} m^{\top} $$

Here I is the identity matrix, 1 is a vector of ones, and m is the one-hot vector of the masking token, so each token is kept with probability 1 - β_t and replaced by the mask with probability β_t. By increasing the value of beta at each timestep, more items in the sequence will be replaced by the mask token.
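The masking process can be sketched in a few lines. This is a toy version that works on a list of word tokens: at each step every token that is still visible is replaced by the mask with probability beta, and a masked token stays masked. The beta values in the usage example are arbitrary and only chosen for illustration.

```python
import random

MASK = "<m>"


def mask_step(tokens, beta, rng):
    """One forward step: each visible token becomes MASK with probability beta."""
    return [MASK if tok == MASK or rng.random() < beta else tok for tok in tokens]


def forward_masking(tokens, betas, rng):
    """Return the states x_0, x_1, ..., x_T of the masking chain."""
    states = [list(tokens)]
    for beta in betas:
        states.append(mask_step(states[-1], beta, rng))
    return states


rng = random.Random(0)
sentence = ["the", "cat", "sat", "on", "the", "mat"]
for state in forward_masking(sentence, [0.2, 0.4, 0.7, 1.0], rng):
    print(" ".join(state))
```

With the last beta set to 1.0, the final state is fully masked, which is the starting point for generation.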

Diffusion large language models

Diffusion large language models are LLMs that follow the diffusion approach for generating their outputs. This means the model can behave non-autoregressively and generate tokens in different orders.

Despite the causal nature of languages, allowing the model to be non-autoregressive has the added benefit of speeding up generation, because it can finish in fewer timesteps than full auto-regressive generation requires.

Diffusion models are also presumed to be capable of global context planning, which could improve the performance of the model in tasks like code generation.

Process of generating a text based on a prompt in DREAM, a diffusion large language model

Training dLLMs

To train a dLLM, we can take an already trained LLM and gradually adapt it to perform the diffusion task, or we can train a model from scratch to perform the diffusion.

Most of the existing dLLMs are following the adaptation path and are based on a well known LLM like Qwen.

The auto-regressive decoding step in LLMs only predicts a single next token which means the loss we try to minimize can be defined like this:

$$ L_{AR} = -\sum_{i = 0}^{N} \log p_{\Theta}(x_i | x_{<i}) $$

This loss function focuses only on predicting a single next token correctly. In the adaptation process, we start with a pre-trained LLM, pick a dataset, mask the tokens over consecutive time steps, and expect the model to unmask them while being aware of the timestep. The model is expected to generate the correct tokens regardless of their position in the sequence.
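A simplified sketch of what such an unmasking objective can look like: corrupt a sequence by masking tokens, then score the model only on the positions it had to fill in. The `uniform_predictor` below is a hypothetical placeholder for a real network, and details that real dLLMs use (such as weighting the loss by timestep) are left out, so treat this as an illustration of the idea rather than the exact loss of any model.

```python
import math
import random

MASK = "<m>"
VOCAB = ["a", "b", "c", "d"]


def corrupt(tokens, mask_ratio, rng):
    """Mask each token independently with probability mask_ratio."""
    return [MASK if rng.random() < mask_ratio else tok for tok in tokens]


def uniform_predictor(corrupted, position):
    """Hypothetical model: assigns the same probability to every vocabulary token."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def masked_nll(target, corrupted, predict):
    """Average negative log-likelihood over the masked positions only."""
    losses = []
    for i, (gold, seen) in enumerate(zip(target, corrupted)):
        if seen != MASK:
            continue  # visible tokens do not contribute to the loss
        probs = predict(corrupted, i)  # the model can look at the whole sequence
        losses.append(-math.log(probs.get(gold, 1e-12)))
    return sum(losses) / len(losses) if losses else 0.0


target = ["a", "b", "c", "d"]
noisy = corrupt(target, 0.5, random.Random(0))
print(noisy, masked_nll(target, noisy, uniform_predictor))
```

Unlike the AR loss, a position here can be predicted with context from both its left and its right.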

Models like Dream and DiffuCoder have a separate training phase after adaptation pre-training, called annealing, in which datasets with fewer tokens but higher quality are used to improve the model’s capability on general tasks like speaking or generating code.

And finally we get to the instruction-tuning step where the model is taught to perform instructions like question-answering and be guided to generate answers that are compliant with human preferences.

This is a representation of the training process of DiffuCoder, a dLLM designed specifically for code generation tasks.

Training stages of DiffuCoder: 1. Adaptation pretraining, 2. Annealing, 3. Instruction tuning, 4. RL training with coupled-GRPO (Image from the original paper)

Since language generation falls into the category of discrete diffusion, after each pass of diffusion decoding we pick one or more of the predicted tokens with the highest probability, keep them, and mask the rest of the predicted values again so they can be processed in the next pass.
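Here is a minimal sketch of that decoding loop. The predictor is a hypothetical toy that always proposes the right token with a fixed confidence per position; a real dLLM would recompute its predictions from the partially unmasked sequence on every pass. `tokens_per_step` is an assumed parameter name for how many tokens we keep per pass.

```python
MASK = "<m>"


def make_toy_predictor(target, confidence):
    """Hypothetical model: returns (token, confidence) for a masked position."""
    def predict(seq, position):
        return target[position], confidence[position]
    return predict


def diffusion_decode(predict, length, tokens_per_step=2):
    """Start fully masked; on each pass keep only the most confident predictions."""
    seq = [MASK] * length
    order = []  # positions in the order they were unmasked
    while MASK in seq:
        proposals = {i: predict(seq, i) for i, tok in enumerate(seq) if tok == MASK}
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            seq[i] = proposals[i][0]
            order.append(i)
        # every other position stays masked and is predicted again next pass
    return seq, order


predict = make_toy_predictor(["print", "(", "x", ")"], [0.9, 0.4, 0.7, 0.8])
print(diffusion_decode(predict, length=4, tokens_per_step=2))
# (['print', '(', 'x', ')'], [0, 3, 2, 1])
```

Four tokens are produced in two passes instead of four, and the closing bracket is filled in before the tokens in the middle.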

Auto-regressiveness in dLLMs

Human language is inherently auto-regressive, meaning that picking a word in a sequence should change the probability of the next words that can be used to generate a correct sequence.

While being auto-regressive is not the outcome we are aiming for in diffusion large language models, it is commonly observed that if models generate their responses semi-autoregressively (for example, generating blocks of text in left-to-right order), the generated response has better quality.

To study the auto-regressiveness of a model we need a measure for it, and the measures of local and global AR-ness were introduced by the researchers of DiffuCoder.
- The local AR-ness is defined as a consecutive next-token prediction pattern:
  - For a span of length K, we measure how many of the generated tokens follow the pattern of next-token prediction. As the span length K increases, the local AR-ness decays.
- The global AR-ness is defined as earliest mask selection:
  - We specify a length K and count how many of the unmasked tokens fall within the first K positions that are still masked. If we pick a bigger K, the global AR-ness grows.
By specifying these metrics for AR-ness we can study how the AR-ness changes during the training stages and whether the model shows different AR-ness on different types of tasks (for example mathematics and coding).
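The two measures can be sketched from the order in which positions get unmasked. This is a simplified reading of the descriptions above and not necessarily the exact formula from the DiffuCoder paper: `order` is assumed to be the list of positions in the order they were unmasked, one position per step.

```python
def local_ar_ness(order, k):
    """Fraction of steps that end a run of k consecutive 'next position' moves."""
    steps = range(k, len(order))
    if len(steps) == 0:
        return 0.0
    hits = sum(
        all(order[t - j + 1] - order[t - j] == 1 for j in range(1, k + 1))
        for t in steps
    )
    return hits / len(steps)


def global_ar_ness(order, k):
    """Fraction of steps that unmask one of the k leftmost still-masked positions."""
    if not order:
        return 0.0
    remaining = sorted(order)  # positions that are still masked, left to right
    hits = 0
    for pos in order:
        if pos in remaining[:k]:
            hits += 1
        remaining.remove(pos)
    return hits / len(order)


left_to_right = [0, 1, 2, 3]
ends_first = [0, 3, 2, 1]
print(local_ar_ness(left_to_right, 1), global_ar_ness(left_to_right, 1))  # 1.0 1.0
print(local_ar_ness(ends_first, 1), global_ar_ness(ends_first, 1))        # 0.0 0.5
print(global_ar_ness(ends_first, 2))                                      # 0.75
```

A pure left-to-right decoder scores 1.0 on both measures, and the global score can only go up as K grows, because the first K masked positions are always included in the first K + 1.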

The published results in the DiffuCoder paper are interesting, as they point out that:
- The model shows an increased amount of AR-ness after the annealing training phase, when it is exposed to higher quality data.
- The model has lower global AR-ness when generating code than when answering mathematical questions. When generating code, it usually generates some of the late tokens first before unmasking the early tokens.
- The global AR-ness falls slightly after the RL training for preference alignment and instruction tuning.
dLLMs for code generation

One interesting aspect of using dLLMs for code generation is the concept of global planning in diffusion models. In contrast to AR LLMs, dLLMs don’t have to start from a given point to generate the code. Thanks to bi-directional attention, they can go back and forth between functions and classes, and even change some of the early tokens based on a token in one of the late positions, which seems closer to how coding actually happens.

None of the existing dLLMs are big enough to beat the commercial AR LLMs, but comparisons between small dLLMs and LLMs on coding benchmark datasets show that there is hope for a new paradigm to dominate the architecture of LLMs in the near future.

Conclusion
- dLLMs are a relatively new approach for generating text that is inspired by the concept of discrete diffusion.
- They can, to some degree, leave the traditional auto-regressive text generation of LLMs behind and generate text starting from arbitrary positions in the final output.
- dLLMs can be used to generate texts with a much lower cost, since they can unmask multiple tokens at each pass and increase the generation speed by a factor of 2 and possibly even more.
- Given that dLLMs can have bi-directional attention and global planning during the decoding phase, tokens generated at a late position can affect and change the tokens that should be generated at earlier positions which seems to be more useful for code generation tasks.
Resources
- DREAM
- DIFFUCODER
July 25, 2025
A brief look at FastVLM
What are VLMs?

Vision language models, or VLMs for short, are multi-modal models that allow a (large) language model to understand images and perform tasks that have an image-based input. For example, we can ask the model how many dogs are in a picture and send the picture along with the text to be processed. When we add the capability of understanding other forms of input to a Large Language Model (LLM), we call the result a Large Multi-Modal Model (LMM).

How do VLMs work?

Disclaimer: this image is not representing my artwork skills.

Vision language models usually consist of a vision encoder backbone model, a projection layer, and an LLM. The goal of the vision encoder is to compute a set of representative tokens from the input images; it is usually trained separately on an image dataset.

The projector layer is supposed to take the computed tokens from the image backbone and project them to the LLM encoding space. In other words, it applies some transformations on image tokens to make them meaningful to the LLM.
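As a toy illustration, the simplest possible projector is a single linear layer that maps each image token from the vision encoder’s dimension to the LLM’s embedding dimension. Real projectors are often small MLPs with learned weights; the hand-written weights and the tiny dimensions below are assumptions made purely for readability.

```python
def project(image_tokens, weight, bias):
    """Map n tokens of size d_vision to n tokens of size d_llm.

    image_tokens: n x d_vision, weight: d_vision x d_llm, bias: d_llm.
    """
    return [
        [
            sum(tok[i] * weight[i][j] for i in range(len(tok))) + bias[j]
            for j in range(len(bias))
        ]
        for tok in image_tokens
    ]


# Two image tokens with d_vision = 2, projected to d_llm = 3.
image_tokens = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
print(project(image_tokens, weight, bias))
# [[1.0, 2.0, 3.5], [3.0, 4.0, 7.5]]
```

The number of tokens stays the same; only the size of each token changes so that it matches what the LLM expects.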

The role of the LLM is pretty clear as well. It takes the projected image tokens as well as the raw text embeddings and is supposed to generate relevant results in return.

The process of feeding the image tokens to the LLM is also interesting. In most current works, the image tokens are fused with the text tokens at some intermediate layer inside the LLM. There are some new models that take the image tokens or even raw images at the same time as the text, but they still do not perform well enough to completely replace the current methods of token fusion.

The training process of VLMs is also interesting. If we consider that we have a trained vision backbone and a trained LLM, the first step is to train the Projector layer to learn how to convert the image tokens to language tokens. BLIP-2 is the original paper that was published with this idea.

Then the next stage of the training includes fine-tuning all the modules, including the vision backbone and the LLM, on a visual instruction dataset. These datasets are similar to QA datasets, but with the additional image for each training record.

The following images are showing which modules are trained at each stage of training with green color:

VLM stage 1 training

VLM stage 2 training

Alright, now that we have a general understanding of what VLMs are and how they work, let’s see the main problem that the FastVLM paper sets out to solve.

The tradeoff of image resolution/accuracy-latency

Previous works have shown that increasing the resolution of the input image has a direct positive effect on the accuracy of the model. But as always, there is a trade-off to consider here. When the image resolution is increased, the vision backbone will probably need a longer time to compute the tokens for the image. The number of computed tokens will increase too, and as a result, the LLM will need a longer time to process all the input tokens before it can generate the output. The time it takes the LLM to run a forward pass over all of its input tokens is called the pre-filling time, and the time it takes for the first output token to be generated is an efficiency metric called time to first token, or TTFT for short. So a VLM that takes longer to process the input image and generates more image tokens, which in turn increases the pre-filling time of the LLM, will have a longer TTFT, which is not a good thing in general.

By combining the TTFT metric with the accuracy, we can plot a curve that can depict the tradeoff that happens between accuracy and TTFT when the input image resolution increases. This is called a Pareto-optimal curve.
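To see which configurations end up on such a curve, we can filter out every point that is beaten by another point that is both faster and more accurate. The (TTFT, accuracy) pairs below are made up for illustration and are not taken from the paper.

```python
def pareto_front(points):
    """Keep the (ttft_ms, accuracy) points that no other point dominates.

    A point is dominated if another point has lower-or-equal TTFT
    and higher-or-equal accuracy.
    """
    front = []
    best_accuracy = float("-inf")
    # Walk from fastest to slowest; among equal TTFT, the most accurate comes first.
    for ttft, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            front.append((ttft, accuracy))
            best_accuracy = accuracy
    return front


# Hypothetical measurements: (TTFT in milliseconds, average accuracy)
measurements = [(100, 60.0), (150, 58.0), (200, 65.0), (400, 66.0), (450, 64.0)]
print(pareto_front(measurements))
# [(100, 60.0), (200, 65.0), (400, 66.0)]
```

Improving the curve means pushing these surviving points towards lower TTFT and higher accuracy.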

So, is there a way to improve the accuracy of the model by using high-resolution images while keeping the TTFT metric relatively low?

Well, according to what we said, there are 3 main reasons that the TTFT increases:
1. The vision backbone requires a longer time to compute the tokens
2. The vision backbone generates more tokens when the image resolution increases which increases the pre-filling time of the LLM
3. The time that it takes for the LLM to take the input and generate the first output token.
In this work, Apple researchers are trying to move towards a better solution by tackling the first 2 reasons and introducing an improved vision encoder as the backbone that is both faster than the existing image encoders and generates fewer tokens per image.

Swapping the image encoder of the VLM

This paper continues the work that the same team of researchers conducted in the FastViT paper, which introduced a new architecture for image encoding. We’ll look at the main architecture and innovations of FastViT here as well. The first thing the authors try is to use FastViT as the image encoder of the VLM. FastViT is a hybrid convolutional-transformer architecture that is already faster than the family of vision transformers, and as reported in the results table shown below, it is much faster and creates fewer tokens for input images of the same resolution. It does, however, show a lower average accuracy score at the same input resolution. When the image resolution is increased, FastViT is still much faster (about 4 times) and achieves a better average accuracy.

Comparison results between FastViT and ViT-L/14 on benchmark datasets

Now that we saw FastViT is a faster image backbone that can achieve the same accuracy in VLMs as the existing image backbone models, let’s take a look at its architecture and what makes it a good and efficient image backbone.

FastViT image encoder architecture

The main layers are abstracted and are shown on the left side of this image taken from the original paper. Then, each layer is shown in more detail on the right side of the image with the same corresponding colors.

The first layer, called the stem, is the input layer. It receives the original image, extracts low-level details (e.g. edges and lines), and downsamples the image to reduce its dimensions. It then passes the feature maps to the next layer, where the generated tokens are mixed using a new module called RepMixer, which is influenced by ConvMixer, and are further processed in a convolutional feed-forward network to extract higher-level features.

The third layer is called Patch Embedding and it was introduced to image processing through vision transformer architectures where the input image is broken into patches and each patch is then projected into a vector that would act as the token for that part of the image. In convolutional patch embedding, we’re still breaking the image into patches, but use separate depth-wise convolutions for processing each patch which acts as a projection layer, but requires fewer parameters and retains the spatial information present in the image. We’re also downsampling the feature maps in this layer as well to only pass high-level information to the next layers and reduce the computations.

The stack of feature extraction from layer 2 and patch embedding can be repeated multiple times and is repeated 3 times here, before moving to the last feature extraction layer which is similar to what we have in layer 2, but with the addition of using a self-attention module instead of RepMixer for mixing the computed tokens.

The final layers are the standard layers you can find in any image classification model that flatten the computed feature maps and apply a fully connected MLP to classify the object in the image.

Now let’s review some important decisions that make this model more mobile-friendly and efficient:

Wide use of Depth-wise Convolutions

This isn’t the right place to go into the details of how convolutions work, but you can think of it as a limited receptive field that is used to process a part of an image. (Think: seeing a room in the dark with a flashlight that only lights up a single part at a time.)

Normal convolution layers look like this:

A normal convolution block

At the top we have an input image with 3 channels (like RGB), and we’re applying a convolution kernel of size 3×3 to create a feature map. As you can see, this normal convolution converts each 3×3 window (9 pixels in every input channel) into 1 pixel in the next feature map. This means if we want to have 20 feature maps for the next level in the network, we need to apply 20 separate kernels and stack the output feature maps to reach the desired number of output channels. The total number of parameters that this adds to our model can be computed as input-ch * output-ch * kernel-width * kernel-height. In this example, that amounts to 3 * 20 * 3 * 3 = 540 parameters (the numbers scale very fast in deep nets).

Depthwise convolutions and Depthwise separable convolutions are attempts at making the convolution operation more efficient and they are widely used in the family of MobileNet and EfficientNet models.

Let’s take a look at what Depthwise separable convolution looks like:

Depth-wise separable convolution

As you can see, the trick is to first apply a separate kernel with a channel size of 1 to each channel in the feature map, which produces one intermediate feature map per channel. Then, to combine the intermediate feature maps and effectively get the same kind of result as we would from the normal convolution, we apply a point-wise (1×1) convolution across the channels for every output feature map. The total number of parameters for depth-wise separable convolutions is computed as: depthwise params + pointwise params = k * k * input-ch + 1 * 1 * input-ch * output-ch. So in the same setup as above (3 input channels, 20 output channels, 3×3 kernels), we’d have 27 + 60 = 87 learnable parameters instead of 540.
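The two formulas are easy to check with a few lines of code. This sketch ignores bias terms, as the formulas above do. With 3 input channels, 20 output channels and a 3×3 kernel, a standard convolution needs 540 weights, while the depth-wise separable version needs 27 + 60 = 87.

```python
def standard_conv_params(in_ch, out_ch, k):
    """Weights of a standard convolution with a k x k kernel (no bias)."""
    return in_ch * out_ch * k * k


def depthwise_separable_params(in_ch, out_ch, k):
    """Depth-wise (one k x k kernel per input channel) plus point-wise 1 x 1 weights."""
    depthwise = k * k * in_ch
    pointwise = 1 * 1 * in_ch * out_ch
    return depthwise + pointwise


print(standard_conv_params(3, 20, 3))        # 540
print(depthwise_separable_params(3, 20, 3))  # 87

# The gap widens quickly with more channels:
print(standard_conv_params(256, 256, 3))        # 589824
print(depthwise_separable_params(256, 256, 3))  # 67840
```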

Convolutional token mixer for early stages

One of the important contributions of the FastViT paper is introducing RepMixer for token mixing, which is a convolutional token mixer that requires a lot fewer parameters when compared to self-attention and is more efficient than ConvMixer due to re-parameterization at inference time.

Let’s talk a little bit about token mixing and what it means:

As you know, our input image is broken up into separate parts during processing, and we call each part a patch. Each patch is computed separately from a grid cell in the image and only includes information about that part of the image, like a single puzzle piece. In token mixing, the main goal is to add more global context from the other puzzle pieces to each patch.

In FastViT, a convolutional token mixer is used in the early stages due to computational efficiency, followed by a single self-attention token mixer in the last stage.

Training time over-parameterization and re-parameterizing at inference

As mentioned earlier, FastViT heavily uses depth-wise convolutions, which have fewer parameters than standard convolution operations. Having fewer parameters is good if we can reach the same accuracy, but on paper a model with fewer parameters generally has less capacity to learn. To overcome this potential limitation, a technique called train-time over-parameterization is used, where the model has more parameters during training than it uses during inference. This is depicted in the FastViT visual architecture as well.
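The core idea of re-parameterization can be shown with a tiny 1D example: a training-time block that adds a skip connection to a convolution can be folded into a single convolution for inference, because the skip connection is itself a convolution whose kernel is 1 in the center and 0 elsewhere. This is a simplified sketch; the actual FastViT blocks also fold in normalization layers and work on 2D feature maps, which is left out here.

```python
import math


def conv1d_same(x, kernel):
    """1D convolution with zero padding so the output has the same length as x."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(x))]


def train_time_block(x, kernel):
    """Training-time structure with two branches: y = x + conv(x)."""
    return [a + b for a, b in zip(x, conv1d_same(x, kernel))]


def reparameterize(kernel):
    """Fold the skip connection into the kernel by adding 1 to its center tap."""
    merged = list(kernel)
    merged[len(kernel) // 2] += 1.0
    return merged


x = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, -0.5]

two_branches = train_time_block(x, kernel)
one_branch = conv1d_same(x, reparameterize(kernel))
print(all(math.isclose(a, b) for a, b in zip(two_branches, one_branch)))  # True
```

The output is identical, but at inference time only one convolution has to run.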

Making some improvements to the FastViT model

Let’s get back to the main goal of this paper: Improving the Pareto-Optimal curve for Accuracy Vs TTFT in VLMs.
We saw that by using FastViT as the image encoder in VLMs, we get an instant boost in the latency and can even match the accuracy of the leading models while being more efficient.

By applying some minor changes to the FastViT model, the authors are introducing a new model called FastViTHD, that has the following architecture:

FastViTHD architecture

Let’s go over the changes in this architecture compared to the original FastViT:

Using 2 self-attention token mixing stages

There are 2 main reasons for adding an extra stage with self-attention to the architecture. First, increasing the scale of the image encoder has a direct impact on improving the generalizability of the model. Second, self-attention blocks are able to generate better enriched tokens by using all the available tokens and performing pairwise dot products to generate the next tokens.
The problem is that simply adding a new self-attention layer means having more parameters, which makes it sub-optimal. That’s why this issue is mitigated by adding another layer of downsampling to the architecture:

Downsampling by a factor of 32 instead of 16

Adding a new layer of downsampling to the architecture means that the final self-attention layer needs to perform fewer computations. Since the last downsampling leaves us with a smaller feature map, we end up with fewer tokens, which are likely to carry the same amount of information as before thanks to the extra layer of self-attention.
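A quick back-of-the-envelope calculation shows how much the extra downsampling helps. The 1024×1024 input size is just an example picked for round numbers; the important part is that the token count shrinks by 4 and the number of token pairs that self-attention has to compare shrinks by 16.

```python
def num_tokens(height, width, downsample):
    """Number of tokens left after downsampling both sides by the given factor."""
    return (height // downsample) * (width // downsample)


def attention_pairs(tokens):
    """Self-attention compares every token with every token."""
    return tokens * tokens


for factor in (16, 32):
    tokens = num_tokens(1024, 1024, factor)
    print(factor, tokens, attention_pairs(tokens))
# 16 4096 16777216
# 32 1024 1048576
```

Fewer tokens also means a shorter input for the LLM, which directly reduces its pre-filling time.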

Multi-Scale Features

Another change in the architecture is using the features from different layers and different scales as the output of the vision encoder, instead of just picking the features from the penultimate layer.
Features at different levels have different granularity and usually include complementary information that, when aggregated, can lead to a better overall performance.

At last let’s take a look at the final architecture of FastVLM with FastViTHD as the main vision encoder:

FastVLM full architecture

Conclusion

We looked at the main parts of a new vision encoder introduced by Apple that tries to improve the Pareto-optimal curve of accuracy over time to first token in vision language models. As you can see in the presented results, when combined with the Qwen LLM family, FastViTHD has a much lower TTFT while still showing good performance.

Pareto-Optimal curve for accuracy over TTFT between VLMs based on FastViT and FastViTHD

Sources
- FastVLM
- FastViT
- https://medium.com/@zurister/depth-wise-convolution-and-depth-wise-separable-convolution-37346565d4ec
June 28, 2025
My main intention for writing this blog

These past few days, we’ve been caught up in a war (yes, a literal one), and it’s been really hard for me to bring myself to do anything day-in and day-out except watch the news helplessly and worry about my friends and family over and over again.

I thought maybe starting a —mostly— technical writing blog that was on my mind for a while would be a good idea as a distraction, and who knows, maybe it’ll have some value as well.

I’m planning to write mostly about the web platform, general software engineering concepts, and the current state of AI —Augmented Intelligence— (it’s not just Apple who gets to change the acronyms). It’ll probably be things that you already know, and I don’t want to make an impression of being an `expert` in any of these areas, but maybe we could make the ride a little more interesting!

June 15, 2025