
The authors have declared that no competing interests exist.

How can we effectively regularize BERT? Although BERT proves its effectiveness in various NLP tasks, it often overfits when there are only a small number of training instances. A promising direction for regularizing BERT is to prune its attention heads with a proxy score for head importance. However, such methods are usually suboptimal since they rely on arbitrarily determined numbers of attention heads to prune and do not directly aim for performance enhancement. To overcome this limitation, we propose AUBER, an automated BERT regularization method that leverages reinforcement learning to automatically prune the proper attention heads from BERT. We also minimize the model complexity and the action search space by proposing a low-dimensional state representation and a dually-greedy approach for training. Experimental results show that AUBER outperforms existing pruning methods, achieving up to 9.58% better performance. In addition, an ablation study demonstrates the effectiveness of our design choices for AUBER.

How can we effectively regularize BERT (Bidirectional Encoder Representations from Transformers) [

Despite its recent success and wide adoption, fine-tuning BERT on a downstream task is prone to overfitting due to overparameterization; BERT-base has 110M parameters and BERT-large has 340M parameters. The overfitting worsens when the target downstream task has only a small number of training examples. [

To mitigate this critical issue, multiple studies attempt to regularize BERT by pruning parameters or using dropout to decrease its model complexity [

In this paper, we propose AUBER, an effective method for regularizing BERT. AUBER overcomes the limitations of past attempts to prune attention heads from BERT by leveraging reinforcement learning. When pruning attention heads from BERT, AUBER automates the process by learning policies rather than relying on rule-based policies and heuristics. Thanks to this automation, AUBER does not require us to predetermine any key parameter, such as the number of attention heads to be pruned. AUBER prunes BERT sequentially in a layer-wise manner to avoid a prohibitively large search space. For each layer, AUBER extracts features that represent the state of the layer and feeds them to the reinforcement learning agent, which determines which attention head to prune from the layer. Among the numerous ways to represent the state, AUBER summarizes the state of the layer into a low-dimensional vector for the sake of the scalability of the reinforcement learning agent. The final pruning policy found by the agent is used to prune the corresponding layer. Before AUBER proceeds to the next layer, BERT is fine-tuned to recapture the information lost by pruning attention heads.

AUBER successfully regularizes the BERT model, enhancing model performance by up to 9.58%. AUBER provides the best performance among state-of-the-art BERT attention head pruning methods.

The figure shows the transition from

In the rest of this paper, we first introduce the related works and preliminaries. Then, we describe our proposed method and experimentally evaluate the performance of AUBER and its competitors. The code for AUBER can be found in

To prevent overfitting of BERT on downstream NLP tasks, various regularization techniques have been proposed. Variants of dropout improve the stability of fine-tuning large pre-trained language models even when presented with a small number of training examples [

A number of studies have analyzed the effectiveness of pruning parameters in BERT. [

Thanks to the unique structure of BERT that consists of multi-headed attention, studies on the attention heads [

To automate the process of Convolutional Neural Network pruning, [

An attention function maps a query and a set of key-value pairs to an output. Given an input embedding matrix, the query, key, and value matrices Q, K, and V are obtained through learned projection matrices W^{Q}, W^{K}, and W^{V}, and the output is computed as Attention(Q, K, V) = softmax(QK^{T}/√d_k)V, where d_k is the dimension of the keys.

In multi-headed attention, head_i = Attention(QW_i^{Q}, KW_i^{K}, VW_i^{V}) is the output of the attention function with head-specific projection matrices W_i^{Q}, W_i^{K}, and W_i^{V}; the outputs of all heads are concatenated and projected to produce the final output.

A self-attention function follows the same mapping methods as a general attention function except that all the query, key, and value embeddings come from the same sequence. Likewise, multi-headed self-attention is a multi-headed attention function that takes the input embeddings from a common sequence.
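As a concrete illustration, the scaled dot-product self-attention described above can be sketched in a few lines of NumPy. This is a minimal sketch with arbitrary toy dimensions, not the implementation used in this paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

# Self-attention: queries, keys, and values all come from the same sequence X.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens, hidden size 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (5, 8)
```

In multi-headed self-attention, this computation is repeated per head with smaller head-specific projections, and the head outputs are concatenated.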

BERT [

BERT-base has 12 layers of Transformer encoder blocks and each layer has 12 self-attention heads; there is a total of 144 self-attention heads in BERT-base. Despite its success in various NLP tasks, BERT sometimes overfits when the training dataset is small due to overparameterization. Thus, there has been a growing interest in BERT regularization through various methods such as dropout [

Deep Q Network (DQN) approximates the action-value function with a neural network, called the policy network, that maps a d_s-dimensional state space to a d_a-dimensional action space. Here, value is the expectation of the total reward under the consideration of a decaying (discount) factor.

Two important features of the DQN algorithm are the target network and experience replay. The target network has the same architecture as the policy network, and the parameters of the policy network are copied to it every fixed number of steps; keeping the target network frozen in between stabilizes the learning targets.

Experience replay introduces a first-in-first-out memory buffer, the replay memory, to resolve the limitations of on-the-fly sampling. Without the memory buffer, training samples are obtained based only on the current state; therefore, the samples are strongly correlated with each other and dominated by the currently optimal action. Experience replay continuously stores transition tuples (i.e., (state, action, reward, next state)) in the replay memory, and a mini-batch randomly sampled from the memory updates the parameters of the policy network. This eliminates the detrimental correlation among training samples and increases data efficiency by allowing each sample to contribute to multiple parameter updates.
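A minimal sketch of such a replay memory follows; the toy transitions are placeholders, and the capacity and batch size mirror the values used later in the experiments:

```python
import random
from collections import deque, namedtuple

# Transition tuple stored in the replay memory.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """First-in-first-out buffer that decorrelates DQN training samples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted first

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive
        # transitions and lets each sample serve many parameter updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=5000)
for t in range(200):
    memory.push([t], t % 13, 0.0, [t + 1])   # dummy transitions
batch = memory.sample(128)
print(len(batch))  # 128
```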

We propose AUBER, our method for automatically regularizing BERT by learning the best strategy to prune attention heads from BERT. After presenting the overview of the proposed method, we describe how we frame the problem of pruning attention heads into a reinforcement learning problem. Then, we explain how states are represented in AUBER and provide a justification for the process. The next section describes how AUBER reduces the extremely large search space.

We observe that BERT is prone to overfitting on tasks with few training examples. However, existing head pruning methods rely on hand-crafted heuristics and hyperparameters, which yield suboptimal results. The goal of AUBER is to automate the pruning process for successful regularization. Designing such a regularization method entails the following challenges:

We propose the following main ideas to address the challenges:

AUBER leverages reinforcement learning to search for a regularization strategy efficiently without relying on heuristics. Among various reinforcement learning frameworks, we adopt DQN, which has shown superior performance in model-free, off-policy settings. The overall flow is described in

AUBER trains a DQN to find the attention heads that should be pruned for better regularization, following the illustrated steps.

There are numerous ways to summarize the input state for the DQN using the query, key, or value matrices, which are independent of the input data. For example, a naive approach directly uses the whole query, key, and value matrices of the current layer to represent its state. However, this yields a complicated, high-dimensional state representation that results in a prohibitively large DQN. Thus, we aim to obtain a concise but effective state representation and reduce the number of parameters in the DQN.

Each layer of BERT has multiple attention heads, each of which has its own query, key, and value matrices. For layer l, AUBER computes the initial state vector s_l using the L1 norm of the value matrix of each attention head. Further details of this computation are elaborated in the next section.

The action space of AUBER is discrete. For a BERT model with H attention heads per layer, the agent chooses among H + 1 actions: action i (1 ≤ i ≤ H) indicates that the i-th attention head is pruned, while one additional action terminates pruning for the current layer. The action is selected with an epsilon-greedy strategy whose exploration rate ε decays as training proceeds.

The exploration rate follows the exponential schedule ε = ε_final + (ε_initial − ε_final) · exp(−N_action / ε_decay). Here, N_action is the total number of actions taken by the agent up to the current episode, ε_initial is the starting value of ε (when N_action = 0), ε_final is the value that ε converges to as N_action → ∞, and ε_decay is a hyperparameter that adjusts the rate of decay of ε.
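Under these definitions, the schedule can be written as a small function. The exact closed form below is an assumption consistent with the described behavior, with the hyperparameter values taken from the experimental setup (ε_initial = 1, ε_final = 0.05, ε_decay = 256):

```python
import math

def epsilon(n_action, eps_initial=1.0, eps_final=0.05, eps_decay=256):
    """Exponentially decayed exploration rate for the epsilon-greedy strategy."""
    return eps_final + (eps_initial - eps_final) * math.exp(-n_action / eps_decay)

print(round(epsilon(0), 2))  # 1.0  (fully exploratory at the start)
```

As N_action grows, ε approaches ε_final, so the agent shifts from exploration toward exploiting the learned Q-values.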

After the i-th head is pruned, the value at the i-th index of the state vector s_l is set to 0, and the updated vector is provided to the agent as the next state.

The reward of AUBER is the change in model performance induced by the pruning action.

To evaluate the reward, the training data are split into two sets: mini-training set and mini-dev set. We use the mini-dev set for the reward evaluation and the mini-training set for the fine-tuning, which will be described in the next paragraph.

If we set the reward simply as

After the best pruning policy for layer l is found, the policy is applied to the model, and the model is fine-tuned before AUBER proceeds to the next layer.

In order to make the DQN scalable, we summarize the state of each layer into a low-dimensional vector: the i-th element of the initial state vector s_l of layer l is the L1 norm of the value matrix of the i-th attention head. The justification for using the L1 norm of the value matrix is given by Theorem 1, which states that the L1 norm of the value matrix of a head upper-bounds the L1 norm of its output matrix; the output norm reflects the importance of the head in the layer.

Theorem 1. Let O_i be the output matrix of the i-th head in a layer and W_i^{V} its value matrix. Then ‖O_i‖_1 ≤ c · ‖W_i^{V}‖_1, where ‖A‖_1 = ∑_j ∑_k |A_jk| denotes the entrywise L1 norm and c is a constant shared by all heads in the layer.

Proof. For the i-th head in the layer, let X be the input embedding matrix, A_i the attention weight matrix (each of whose rows is a softmax output and hence a probability vector), and V_i = X W_i^{V} the value embedding matrix.

The output of the i-th head, O_i, is evaluated as O_i = A_i V_i. Then, by the Cauchy-Schwarz inequality, each entry satisfies |(O_i)_jk| ≤ ‖(A_i)_j‖_2 · ‖(V_i)_k‖_2, where (A_i)_j denotes the j-th row of A_i and (V_i)_k the k-th column of V_i.

Since the L1 norm of a vector is always greater than or equal to the L2 norm of the vector, ‖(A_i)_j‖_2 ≤ ‖(A_i)_j‖_1 = 1 and ‖(V_i)_k‖_2 ≤ ‖(V_i)_k‖_1. Summing over all entries therefore gives ‖O_i‖_1 ≤ n · ‖V_i‖_1 ≤ n · ‖X‖_1 · ‖W_i^{V}‖_1, where n is the number of rows of A_i.

All heads in the same layer take the same X as input, so the factor n‖X‖_1 is common to all heads, and ‖O_i‖_1 is bounded by a quantity proportional to ‖W_i^{V}‖_1. ∎

Theorem 1 implies that the influence of the i-th attention head on its layer is bounded by the L1 norm of the head's value matrix W_i^{V}; heads with small value-matrix norms can contribute only little to the layer's output, which makes these norms an informative state representation.
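The bound of Theorem 1 and the resulting state vector can be checked numerically. The sketch below uses small toy dimensions, not BERT's:

```python
import numpy as np

rng = np.random.default_rng(42)
n_tokens, hidden, n_heads, d_head = 6, 16, 4, 4
X = rng.normal(size=(n_tokens, hidden))        # input shared by all heads

def l1(M):
    return np.abs(M).sum()                     # entrywise L1 norm

state = []
for _ in range(n_heads):
    W_V = rng.normal(size=(hidden, d_head))    # value matrix of one head
    # Row-stochastic attention weights, as softmax outputs would be.
    A = rng.random(size=(n_tokens, n_tokens))
    A /= A.sum(axis=1, keepdims=True)
    O = A @ (X @ W_V)                          # head output O = A * V
    # The bound of Theorem 1: ||O||_1 <= n * ||X||_1 * ||W_V||_1.
    assert l1(O) <= n_tokens * l1(X) * l1(W_V)
    state.append(l1(W_V))                      # state entry for this head

state = np.array(state)                        # one entry per attention head
print(state.shape)  # (4,)
```

In AUBER the same construction yields a 12-dimensional state vector per layer of BERT-base.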

The total number of attention heads in BERT-base is 144, as it consists of 12 layers each of which has 12 attention heads. Naively designing actions over all subsets of heads would lead to 2^{144} possibilities, which is prohibitively large. Our idea to reduce the search space is dually-greedy pruning: we prune layer-by-layer in a greedy manner (from lower to upper layers), and within each layer, we greedily prune a single attention head at a time.

For each layer l, the agent receives the initial state s_l, which encodes useful characteristics (the L1 norm of the value matrix of each attention head) of this layer. Then, the agent outputs the index of an attention head that is expected to increase the training performance when removed. After an attention head i is pruned, the i-th index of s_l is set to 0, and the updated vector is provided as the next state to the agent. This process is repeated until the termination action is selected; the model is then fine-tuned, and the initial state s_{l+1} of the next layer is calculated from the fine-tuned model.

Algorithm 1 illustrates the process of AUBER. AUBER receives a BERT model B_t fine-tuned on a specific task t. Lines 2-30 are conducted in a layer-wise manner. In line 2, we initialize a policy network. For each layer, we compute the initial state s_l, prune attention heads from B_t according to the learned policy, and finally fine-tune B_t. After pruning a layer, AUBER proceeds to the next layer.

Algorithm 1: AUBER. Input: a BERT model B_t fine-tuned on task t. Output: the regularized model B_t. For each layer, AUBER initializes a policy network (line 2), repeatedly selects and prunes attention heads based on the state s_l, and fine-tunes B_t before moving on to the next layer (lines 29-30).
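The layer-wise procedure can be sketched as follows. Here `initial_state` and `choose_action` are hypothetical stand-ins for the L1-norm state computation and the trained DQN policy, and the actual model pruning and fine-tuning steps are elided:

```python
import random

N_HEADS = 12       # heads per layer in BERT-base
STOP = N_HEADS     # extra action index that ends pruning for a layer

def initial_state(layer):
    # Hypothetical stand-in: L1 norms of each head's value matrix.
    return [1.0 + 0.1 * h for h in range(N_HEADS)]

def choose_action(state):
    # Hypothetical stand-in for the epsilon-greedy DQN policy:
    # pick among still-unpruned heads or the stop action.
    candidates = [h for h, v in enumerate(state) if v > 0] + [STOP]
    return random.choice(candidates)

def auber_layerwise(n_layers=12):
    """Dually-greedy loop: lower to upper layers, one head at a time."""
    pruned = []
    for layer in range(n_layers):
        state = initial_state(layer)
        heads = []
        while True:
            action = choose_action(state)
            if action == STOP:
                break
            state[action] = 0.0   # pruned head is zeroed out in the state
            heads.append(action)
        # ...prune `heads` from the model and fine-tune before the next layer.
        pruned.append(sorted(heads))
    return pruned

random.seed(0)
plan = auber_layerwise()
print(len(plan))  # 12
```

Each inner loop corresponds to one layer's pruning episode; the real reward computation on the mini-dev set drives the policy updates omitted here.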

We conduct experiments to answer the following questions about AUBER.

We test AUBER on four GLUE datasets [

| Dataset | # of classes | # of train | # of dev | Metric |
|---|---|---|---|---|
| MRPC | 2 | 3668 | 408 | Accuracy |
| CoLA | 2 | 8551 | 1043 | Matthews correlation |
| RTE | 2 | 2490 | 277 | Accuracy |
| WNLI | 2 | 635 | 71 | Accuracy |

The URLs for the datasets are as follows: MRPC (

We use the pre-trained BERT-base model.

AUBER gives the best performance for the same number of pruned heads. Bold font indicates the best performance among competing pruning methods.

| Method | MRPC | CoLA | RTE | WNLI |
|---|---|---|---|---|
| Original | 84.07 | 57.01 | 63.54 | 46.48 |
| AUBER | | | | |
| Random | 84.02±1.12 | 57.89±0.90 | 63.47±1.29 | 54.08±2.14 |
| Confidence | 83.70±0.47 | 57.69±2.19 | 64.26±1.64 | 55.77±0.77 |
| Michel et al. | 84.22±0.33 | 58.86±0.64 | 63.90±0.00 | 55.21±1.84 |
| Voita et al. | 83.92±0.71 | 55.34±0.81 | 64.12±1.65 | 52.96±5.51 |

We use a 4-layer feedforward neural network for the DQN agent. The input dimension is 12 and the output dimension is 13. The dimension of all hidden layers is set to 512. LeakyReLU is applied after all layers except for the last one. We train the DQN agent for 150 episodes. For the epsilon-greedy strategy used to choose actions, the initial epsilon value ε_initial and the final epsilon value ε_final are set to 1 and 0.05, respectively, and epsilon decreases exponentially with a decay rate ε_decay of 256. The replay memory size is set to 5000, and the batch size for training the DQN agent is set to 128. The discount value
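The agent architecture described above can be sketched as a plain NumPy forward pass. The paper's implementation uses PyTorch; the weight initialization here is arbitrary and only the shapes and activations follow the description:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

# 4-layer feedforward Q-network: 12 -> 512 -> 512 -> 512 -> 13.
dims = [12, 512, 512, 512, 13]
layers = [(rng.normal(scale=0.05, size=(m, n)), np.zeros(n))
          for m, n in zip(dims[:-1], dims[1:])]

def q_values(state):
    h = state
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:   # LeakyReLU after all layers but the last
            h = leaky_relu(h)
    return h

q = q_values(rng.normal(size=12))   # one Q-value per action (12 heads + stop)
print(q.shape)  # (13,)
```

The 13th output corresponds to the action that terminates pruning for the current layer.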

We compare AUBER with other methods that prune BERT's attention heads. If AUBER prunes a certain number of attention heads, each competing method prunes the same number of heads for a fair comparison.

We construct all models using the PyTorch framework. All the models are trained and tested on a GeForce GTX 1080 Ti GPU.

We evaluate the performance of AUBER against competitors. We repeat the experiments five times and report the average and the standard deviation of the performance.

We empirically demonstrate the effectiveness of our design choices for AUBER. More specifically, we validate the choice of the initial state representation and the order in which the layers are processed.

Comparison of AUBER with four variants (AUBER-Query, AUBER-Key, AUBER-L2, and AUBER-Reverse) on four GLUE datasets to demonstrate the effectiveness of various ways to calculate the initial state. AUBER-Query and AUBER-Key use the query and key matrices, respectively, and AUBER-L2 uses the L2 norm of the value matrix to obtain the initial state. AUBER-Reverse processes BERT starting from the final layer (i.e., the 12^{th} layer for BERT-base). Bold font indicates the best performance among pruning methods.

| Method | MRPC | CoLA | RTE | WNLI |
|---|---|---|---|---|
| AUBER | | | | |
| AUBER-Query | 83.87±0.84 | 55.81±0.84 | 65.05±1.06 | 47.61±5.12 |
| AUBER-Key | 83.68±0.75 | 56.90±1.46 | 63.83±0.39 | 50.14±7.56 |
| AUBER-L2 | 82.90±1.39 | 57.46±1.97 | 64.55±1.74 | 40.28±12.7 |
| AUBER-Reverse | 84.56±1.39 | 58.07±1.27 | 62.24±1.43 | 43.67±8.15 |

Among the query, key, and value matrices of each attention head, we show that the value matrix best represents the current state of BERT. We evaluate the performance of AUBER against AUBER-Query and AUBER-Key. AUBER-Query and AUBER-Key use the query and key matrices respectively to obtain the initial state.

AUBER uses the L1 norm of the value matrices to compute the state vector, based on the theoretical derivation. In this ablation study, we experimentally show that the L1 norm of the value matrices is appropriate for the state vector. We introduce a new variant, AUBER-L2, which leverages the L2 norm of the value matrices instead of the L1 norm to compute the initial state vector. The performance of AUBER is far superior to that of AUBER-L2 in most cases, bolstering the claim that the L1 norm of the value matrices effectively represents the state of BERT.

We empirically demonstrate how the order in which the layers are processed affects the final performance. We evaluate the performance of AUBER against AUBER-Reverse which processes BERT layers in the opposite direction (i.e. starting from the 12^{th} layer) to what AUBER does. As shown in

We visualize how the model performance changes as each layer is processed by AUBER and the four competitors. AUBER improves the performance at almost every layer, except for two layers (one of them the 10^{th}) in which performance degradation occurred. This indicates that AUBER does not prune the critical heads whose removal would bring a significant performance drop.

AUBER consistently improves the model performance and achieves outstanding final performance, while all the other methods fail to enhance the model performance.

We propose AUBER, an effective method to regularize BERT by automatically pruning attention heads. Instead of depending on heuristics or rule-based policies, AUBER leverages reinforcement learning to learn a pruning policy that determines which attention heads should be pruned for better regularization. Experimental results demonstrate that AUBER effectively regularizes BERT, increasing the performance of the original model on the dev dataset by up to 9.58%. In addition, we experimentally demonstrate the effectiveness of our design choices for AUBER.