Seven years ago, the paper "Attention Is All You Need" introduced the transformer architecture, which went on to revolutionize the field of deep learning.
Today, virtually all major models are built on the transformer architecture, yet how transformers work internally remains largely a mystery.
Last year, Llion Jones, one of the authors of the transformer paper, announced the founding of the artificial intelligence company Sakana AI. Recently, Sakana AI published a paper titled "Transformer Layers as Painters" that explores the flow of information through pre-trained transformers, running a series of experiments on frozen decoder-only and encoder-only transformer models. Note that the study did not perform any kind of fine-tuning on the pre-trained models.
Paper link:
The study suggests that the internal mechanism of the transformer, especially its intermediate layers, can be understood by analogy to an assembly line of painters.
The assembly line passes a canvas (the input) along a series of painters. Some painters specialize in birds, others in wheels. Each painter receives the canvas from the painter before them and then decides either to add a few strokes or simply to pass it along to the next painter (via the residual connections).
This analogy is not meant as a rigorous theory, but as a tool for thinking about transformer layers. Guided by it, the study posed and tested the following questions:
Are all layers using the same representational space?
Are all layers necessary?
Do all intermediate layers perform the same function?
Is the order of the layers important?
Can these layers run in parallel?
For some tasks, is the order more important than other factors?
Does looping help parallelized layers?
Which variants have the least impact on model performance?
The study ran a series of experiments on pre-trained large language models (LLMs), testing variations of the standard transformer execution strategy and measuring their impact on model performance across a range of benchmarks, for both decoder-only (Llama) and encoder-only (BERT) models.
Do all layers use the same representational space?
To answer whether different layers use the same representational space, the authors tested the robustness of the Transformer when skipping specific layers or switching the order of adjacent layers. For example, in Llama2-7B, the 6th layer is typically expected to receive the output from the 5th layer. Would it exhibit 'catastrophic' behavior if given the output from the 4th layer instead?
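To make this concrete, here is a minimal sketch (not the authors' code) of how such an intervention can be set up, assuming the HuggingFace transformers layout for LlamaForCausalLM, where the decoder blocks live in model.model.layers as an nn.ModuleList; swapping two adjacent middle blocks is just a reordering of that list:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; any Llama-2-7B-compatible checkpoint should behave similarly.
# The model is large, so move it (and the inputs) to a GPU as needed.
name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).eval()

blocks = list(model.model.layers)  # 32 frozen decoder blocks, 0-indexed

# Swap the 5th and 6th layers (0-based indices 4 and 5): the 6th layer now runs
# first and receives the output of the 4th layer instead of the 5th.
blocks[4], blocks[5] = blocks[5], blocks[4]
model.model.layers = torch.nn.ModuleList(blocks)

# Evaluate the modified (still untrained, unfine-tuned) model with a plain
# forward pass; use_cache=False keeps the sketch independent of KV-cache details.
inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs, use_cache=False).logits
print(tok.decode([logits[0, -1].argmax().item()]))  # most likely next token
```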
In Figure 2, we can see that, aside from the first and last few layers, the layers of Llama2-7B are quite robust to layer skipping and layer switching. The experiment suggests that the intermediate layers share a common representational space, one that is distinct from that of the "peripheral" layers (the first layer and the last few layers). To further verify this hypothesis, the authors, following prior work, measured the average cosine similarity between the hidden-state activations of different layers in the benchmark models (Llama2-7B, Llama2-13B, and BERT-Large). Figure 3 shows the consistency among all intermediate layers.
This suggests that the model may have three distinct representational spaces for the "beginning," "middle," and "end" layers. Answering question 1: Yes, the intermediate layers appear to share a common representational space.
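As a rough illustration of how such a cosine-similarity measurement can be computed, here is a sketch reusing the model and tok objects from the snippet above; taking the final-token hidden state as each layer's representation and using a single prompt are assumptions of this sketch, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

prompt = "Large language models process text one token at a time."
with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors of shape
# [batch, seq_len, hidden_dim]; entry 0 is the embedding output.
acts = [h[0, -1, :].float() for h in out.hidden_states]  # last-token activations

# Pairwise cosine similarity between layer activations (cf. the paper's Figure 3);
# in practice this would be averaged over many prompts.
sim = torch.stack([torch.stack([F.cosine_similarity(a, b, dim=0) for b in acts])
                   for a in acts])
print(sim.round(decimals=2))
```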
Are all layers necessary?
To further test whether the representational space of the intermediate layers is truly shared (beyond merely having high cosine similarity), the study tried "skipping layers": sending the output of layer N directly to the input of layer N + M (where M > 1), thereby skipping M - 1 layers, as shown in Figure 1a. The question is whether layer N + M can make sense of activations from layer N, even though it was trained only on inputs coming from layer N + M - 1. Figure 4 shows that Llama2-7B and BERT-Large suffer only a modest performance decline on many benchmarks. This answers question 2, are all layers necessary:
No, at least some intermediate layers can be removed without catastrophic failure.
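Under the same assumed layout, the skip variant amounts to dropping a contiguous run of middle blocks from a freshly loaded model, so that layer N's output feeds layer N + M directly (again a sketch, not the authors' implementation):

```python
import torch

N, M = 8, 5  # 0-indexed: keep blocks 0..N, drop the M - 1 blocks after block N
blocks = list(model.model.layers)         # from a freshly loaded model
kept = blocks[:N + 1] + blocks[N + M:]    # block N now feeds block N + M directly
model.model.layers = torch.nn.ModuleList(kept)
print(f"running {len(kept)} of {len(blocks)} blocks")
```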
Do all intermediate layers perform the same function?
If the intermediate layers share a common representational space, are they redundant beyond that? To test this, the researchers re-ran the "skip" experiment from the previous subsection, but this time replaced the weights of the middle layers with copies of the center layer's weights, effectively looping on that center layer T - 2N + 1 times, where T is the total number of layers (32 for Llama2-7B, 24 for BERT-Large). As shown in Figure 5, the model's benchmark scores decline rapidly as the number of replaced layers increases. Figure 11, later in the paper, suggests that this replacement is worse than any other variant the researchers tried. The researchers therefore conclude that the intermediate layers perform different functions, and that sharing weights among them is not feasible.
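A corresponding sketch of this middle-repetition variant under the same assumptions, pointing every middle position at the single center block so its weights are applied over and over:

```python
import torch

blocks = list(model.model.layers)   # from a freshly loaded model
T, N = len(blocks), 8               # keep the first N and last N blocks untouched
center = blocks[T // 2]

# All middle positions reference the same center block, so its weights run
# T - 2N times in a row in place of the original middle layers.
repeated = blocks[:N] + [center] * (T - 2 * N) + blocks[T - N:]
model.model.layers = torch.nn.ModuleList(repeated)

# Evaluate with use_cache=False: the repeated copies share one KV-cache slot,
# which would be corrupted if caching were enabled.
with torch.no_grad():
    logits = model(**tok("2 + 2 =", return_tensors="pt"), use_cache=False).logits
```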
Is the order of layers important?
Previous experiments showed that the intermediate layers share a representational space but perform different functions within it. The next question is how much the order of those functions matters. To address it, the researchers designed two sets of experiments. In the first, the middle layers were run in the reverse of their training order: the output of layer T - N was fed into layer T - N - 1, that layer's output into layer T - N - 2, and so on down to layer N, whose output was then passed on to the final layers. In the second, the middle layers were run in a random order, with results averaged over 10 random seeds.
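Both orderings can be sketched with the same kind of block reordering as before (fresh model, first and last N blocks left in place; the exact N is an assumption of this sketch):

```python
import random
import torch

blocks = list(model.model.layers)   # from a freshly loaded model
T, N = len(blocks), 8
middle = blocks[N:T - N]

middle_reversed = list(reversed(middle))     # "reverse order" variant
middle_shuffled = middle[:]                  # "random order" variant
random.Random(0).shuffle(middle_shuffled)    # one of the 10 random seeds

model.model.layers = torch.nn.ModuleList(
    blocks[:N] + middle_reversed + blocks[T - N:])  # or use middle_shuffled
```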
Figures 6 and 7 show the results of running the middle layers in reverse and in random order, respectively. The model degrades gracefully across all benchmarks, indicating that while layer order matters to some extent, the layers still function reasonably well when run out of order.
More interestingly, random shuffling of the layer order performed better than completely reversing it. This may be because the random shuffle retains some of the original relationships between layers to some extent (i.e., layer i is after layer j, where i > j), while completely reversing it breaks these relationships entirely.
Can these layers be run in parallel?
To test whether the presence of the layers matters more than the order in which they are executed, the researchers designed an experiment that runs the middle layers in parallel on the same input and sends their averaged output to the final N layers.
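Running the middle blocks in parallel requires a small wrapper module rather than a reordering. Here is one hedged way to write it, assuming each Llama decoder block returns its hidden states either directly or as the first element of a tuple, and evaluating with use_cache=False so the parallel blocks do not interfere through the KV cache:

```python
import torch

class ParallelBlocks(torch.nn.Module):
    """Feed the same input to several decoder blocks and average their outputs."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, hidden_states, *args, **kwargs):
        outs = [blk(hidden_states, *args, **kwargs) for blk in self.blocks]
        hidden = [o[0] if isinstance(o, tuple) else o for o in outs]
        avg = torch.stack(hidden, dim=0).mean(dim=0)
        first = outs[0]
        if isinstance(first, tuple):
            return (avg,) + first[1:]  # keep whatever extras the block returned
        return avg

blocks = list(model.model.layers)   # from a freshly loaded model
T, N = len(blocks), 8
parallel = ParallelBlocks(blocks[N:T - N])
model.model.layers = torch.nn.ModuleList(blocks[:N] + [parallel] + blocks[T - N:])
```

The wrapper simply passes through whatever positional and keyword arguments the parent model supplies, so the surrounding first and last N blocks behave exactly as in the unmodified model.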
As shown in Figure 8, model performance declines gently on all benchmarks, with the exception of the GSM8K math word problems. The results show that this approach works in most cases but handles some complex mathematical problems poorly. Running the layers in parallel is more effective than simply skipping them, though not as good as running them in reverse order. Based on this, the researchers conclude that running layers in parallel is feasible in general, but may be unsuitable for math problems that require sequential logical reasoning.
For some tasks, is the order more important than other factors?
Most of the "modified" models show their steepest decline on the abstract reasoning (ARC) and mathematical reasoning (GSM8K) benchmarks. This is likely because step-by-step reasoning tasks are far more sensitive to layer order than common-sense tasks that rely mainly on semantic understanding: unlike tasks that can be solved from semantics alone, reasoning tasks require the model to capture both structure and meaning. This observation is consistent with the hypothesis that the model performs a degree of order-dependent reasoning within a single forward pass.
Researchers used a metaphor to illustrate: if you are drawing a collage composed of many different elements, the order of drawing may not be so important; but if you are drawing an accurate architectural scene, the order of each stroke becomes very important. Based on this, researchers concluded that mathematical and reasoning tasks have a higher dependency on the order of model layers, while the impact of order is relatively smaller for tasks that mainly rely on semantic understanding.
Does looping help parallelized layers?
Continuing the painting metaphor from the previous section: a painter does not paint everything at once, but first paints one part, say the body of a car, and only then adds other elements, such as the wheels, on top of it. In the model, the layers are the painters and processing the information is the painting: if a layer first receives the right input (the car body is already there), it can do its job better (add the wheels).
For transformers, this would mean that a layer contributes in the forward pass only when given appropriate input, otherwise simply passing the input along through its residual connection. If that is the case, then iterating the parallelized layers from the previous experiment should improve performance compared with executing them only once. The researchers tested this by feeding the averaged output of the parallelized layers back into those same layers for a fixed number of iterations.
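The looped variant only adds an iteration count on top of the ParallelBlocks sketch above (again an illustration, not the paper's code): the averaged output is fed back into the same parallel blocks a fixed number of times before moving on to the final layers.

```python
import torch

class LoopedParallelBlocks(ParallelBlocks):
    """ParallelBlocks whose averaged output is fed back in for several iterations."""

    def __init__(self, blocks, iterations=3):
        super().__init__(blocks)
        self.iterations = iterations

    def forward(self, hidden_states, *args, **kwargs):
        out = hidden_states
        for _ in range(self.iterations):
            result = super().forward(out, *args, **kwargs)
            out = result[0] if isinstance(result, tuple) else result
        return result

blocks = list(model.model.layers)   # from a freshly loaded model
T, N = len(blocks), 8
looped = LoopedParallelBlocks(blocks[N:T - N], iterations=3)
model.model.layers = torch.nn.ModuleList(blocks[:N] + [looped] + blocks[T - N:])
# As before, evaluate with use_cache=False so the repeated calls do not
# contend for the KV cache.
```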
Figure 9 shows the results of looping the parallelized layers 3 times, which are significantly better than executing the parallelized layers only once. When the starting layer N is 15 (for Llama2-7B) or 11 (for BERT), at the far left of each plot, only a single layer is affected; in that special case, three iterations of the parallel variant are equivalent to simply repeating the middle layer three times, while the plain parallel variant at that point matches the full, unmodified model.

The researchers also repeated the experiment for different numbers of iterations. Figure 10 shows how Llama2-7B's performance varies with the number of parallelized layers M and the number of iterations; the best-performing iteration count for each M is highlighted with a red box. Except for M = 29 and M = 31 (which parallelize almost all layers), the optimal number of iterations grows roughly in proportion to the number of parallelized layers, leading the researchers to conclude that the optimal iteration count scales linearly with the number of parallelized layers.
Which adjustments to the layers have the least impact on model performance?
Finally, in Figure 11, the researchers compare all of the transformer "modifications" from the experiments above, showing the median or average performance over all benchmarks in a single chart.
Middle repetition, that is, replacing the middle layers with an equal number of copies of the center layer, performs the worst, quickly dropping to the level of a random baseline. In contrast, looped parallelism and random layer order cause the least damage. The researchers therefore conclude that repeating a single layer has the most severe impact, while randomizing the layer order and looped parallelism have the smallest.
Overall, these experiments show graceful performance degradation, but it is still unclear why the layers remain relatively robust under most of these perturbations; the researchers leave that question to future work.
For more details, please refer to the original paper.