The authors have declared that no competing interests exist.

Knowledge Distillation (KD) is one of the most widely used methods for model compression. In essence, KD trains a smaller student model under the guidance of a larger teacher model, aiming to retain as much of the teacher's performance as possible. However, existing KD methods suffer from the following limitations. First, since the student model is smaller in absolute size, it inherently lacks model capacity. Second, the absence of an initial guide for the student model makes it difficult for the student to fully imitate the teacher. These limitations cause conventional KD methods to yield low performance. In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher’s Predictions (PTP). Using this combination, we alleviate both limitations of KD. SPS is a new parameter sharing method that increases the student model's capacity. PTP is a KD-specialized initialization method that acts as a good initial guide for the student. Combined, these methods yield a significant increase in the student model's performance. Experiments conducted on BERT with different datasets and tasks show that the proposed approach improves the student model's performance by 4.4% on average on four GLUE tasks, outperforming existing KD baselines by significant margins.

Naturally, there have been many studies and attempts to improve the accuracy of KD. Sun et al. [

In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel KD method designed especially for Transformer-based models [

We apply SPS in order to increase the effective model capacity of the student model without increasing the number of parameters. SPS has two steps: 1) stacking layers that share parameters and 2) shuffling the parameters between shared pairs of layers. Doing so increases the model’s effective capacity which enables the student to better replicate the teacher model.
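To make the two steps concrete, the following minimal sketch (ours, not the authors' code) builds the layer-to-parameter mapping for a student whose layers are doubled and whose Query/Key parameters are swapped in the shared copies. The dict-based layer representation and the names `W_Q`/`W_K` are hypothetical placeholders.

```python
# Illustrative sketch of Shuffled Parameter Sharing (SPS).
# The layer representation and weight names are hypothetical placeholders,
# not the paper's actual implementation.

def build_sps_layers(n_independent):
    """Map 2*n stacked layers onto n shared parameter groups.

    Step 1: layer i and layer i + n share parameter group i (paired sharing).
    Step 2: in the upper half, the Query and Key parameter slots are
    swapped (shuffled) relative to the lower copy.
    """
    layers = []
    for i in range(2 * n_independent):
        group = i % n_independent  # step 1: bottom and upper halves share
        if i < n_independent:
            query, key = f"W_Q[{group}]", f"W_K[{group}]"
        else:  # step 2: shuffle Query and Key in the shared (upper) copy
            query, key = f"W_K[{group}]", f"W_Q[{group}]"
        layers.append({"group": group, "query": query, "key": key})
    return layers
```

For a 3-layer student this yields 6 layers backed by only 3 parameter groups: layers 0 and 3 share weights, but layer 3 uses layer 0's Key matrix as its Query and vice versa.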

We apply a pretraining task called PTP for the student. Through PTP, the student model learns general knowledge about the teacher and the task. With this additional pretraining, the student more efficiently acquires and utilizes the teacher’s knowledge during the actual KD process.

Throughout the paper we use Pea-KD applied on BERT model (PeaBERT) as an example to investigate our proposed approach. We summarize our main contributions as follows:

The framework of first pretraining language models and then finetuning for downstream tasks has now become the industry standard for Natural Language Processing (NLP) models. Pretrained language models, such as BERT [

It is known that through pretraining using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), the attention matrices in BERT can capture substantial linguistic knowledge. BERT has achieved the state-of-the-art performance on a wide range of NLP tasks, such as the GLUE benchmark [

However, these modern pretrained models are very large in size and contain millions of parameters, making them nearly impossible to deploy on edge devices with limited resources.

As deep learning algorithms are adopted, implemented, and researched in diverse fields, high computation costs and memory requirements have become challenging factors. This is especially true in NLP, where pretrained language models typically require a large number of parameters, resulting in extensive computation and memory costs. Model compression has thus become an important task for deep learning. There have already been many attempts to tackle this problem, including quantization [

As briefly covered in the introduction, KD [

Sharing parameters across different layers is a widely used idea for model compression. There have been several attempts to apply parameter sharing to the Transformer architecture and the BERT model. However, existing parameter sharing methods exhibit a large tradeoff between model performance and model size: they reduce the model's size significantly but suffer a great loss in performance as a result.

In the following, we provide an overview of the main challenges faced in KD and our methods to address them. We then discuss the precise procedures of SPS and PTP in detail. Lastly, we explain our final method, PeaBERT and the training details. The code for Pea-KD is available at

The BERT-base model contains over 109 million parameters. Its extensive size makes model deployment often infeasible and computationally expensive in many settings, such as on mobile devices. As a result, industry practitioners commonly use a smaller version of BERT and apply KD. However, the existing KD methods face the following challenges:

We propose the following main ideas to address these challenges:

The following subsections describe the procedures of SPS, PTP, and PeaBERT in detail.

SPS improves the student model's capacity while using the same number of parameters, addressing the capacity limitation of typical KD. SPS is composed of the following two steps.

In the first step, we double the number of layers in the student model. We then share the parameters between the bottom half and the upper half of the model, as graphically represented in

The panels represent (a) the first step of SPS, (b) the second step of SPS, and (c) the modified SPS for a 6-layer student, respectively.

In the second step, we shuffle the Query and Key parameters between the shared pairs as shown in

SPS is depicted in

There can be several candidates for KD-specialized initialization. We propose a pretraining approach called PTP, and experimentally show that it improves KD accuracy.

Most previous studies on KD do not elaborate on the initialization of the student model. Some studies use a pretrained student model as an initial state, but those pretraining tasks are irrelevant to both the teacher model and the downstream task. To the best of our knowledge, our study is the first to pretrain the student model with a task relevant to the teacher model and its downstream task. PTP consists of the following two steps.

In the first step, we generate PTP labels for our data based on the teacher's predictions. We first feed the training data into the teacher model and collect the teacher model's predictions. We then define “confidence” as follows: we apply the softmax function to the teacher model's predictions, and the maximum value of the resulting distribution is the confidence. Next, with a specific threshold “t” (a hyperparameter between 0.5 and 1.0), we assign a new label to each training instance according to the rules listed in

Teacher’s prediction is correct? | confidence > t? | PTP label |
---|---|---|
True | True | confidently correct |
True | False | unconfidently correct |
False | True | confidently wrong |
False | False | unconfidently wrong |
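The labeling rules above can be sketched as follows. This is a minimal pure-Python illustration; the function names are ours, and the teacher's raw logits are assumed to be available for each training example.

```python
# Sketch of PTP label generation (step 1), assuming per-example teacher
# logits. Function names are illustrative, not from the paper's code.
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ptp_label(teacher_logits, true_label, t=0.8):
    """Assign one of the four PTP labels using the paper's rules.

    "confidence" is the maximum softmax probability; the prediction is
    "correct" when the argmax matches the true label; threshold t is a
    hyperparameter between 0.5 and 1.0.
    """
    probs = softmax(teacher_logits)
    confidence = max(probs)
    correct = probs.index(confidence) == true_label
    conf = "confidently" if confidence > t else "unconfidently"
    return f"{conf} {'correct' if correct else 'wrong'}"
```

Running this over the whole training set yields the artificial four-class dataset used in step 2.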

In the second step, using the artificial PTP labels we created, we pretrain the student model to predict the PTP label for a given input. We train the student model until convergence. Once these two steps are complete, we use this PTP-pretrained student model as the initial state for the KD process.

The motivation behind PTP is that it can serve as a prior before moving on to the actual KD process, by allowing the student model to learn the high-level knowledge of the teacher's predictions in advance. The core idea is to construct PTP labels that explicitly express important high-level information from the teacher's softmax outputs, such as whether the teacher model predicted correctly and how confident the teacher model is. Pretraining with these labels helps the student acquire the teacher's generalized knowledge latent in the teacher's softmax output, leaving the student model better prepared for the actual KD process. For example, if a teacher makes an incorrect prediction for a data instance x, then we know that x is generally a difficult one to predict. Since this knowledge is obtainable only by directly comparing the true label with the teacher's output, it would be difficult for the student to acquire it in the conventional KD process. Representing this type of information through PTP labels and training the student to predict them helps the student acquire such deeply latent knowledge in the teacher's output much more easily. Intuitively, we expect that a student model that has undergone such a pretraining session is better prepared for the actual KD process and will likely achieve better results.

PeaBERT applies SPS and PTP together on BERT for maximum impact on performance. Given a student model, PeaBERT first transforms it into an SPS model and applies PTP. Once PTP is completed, we use this model as the initial state of the student model for the KD process. The overall framework of PeaBERT is depicted in

The bottom left box represents applying SPS to the student model. The middle box illustrates applying PTP to the SPS-applied student model. The last box represents applying KD on the PTP-trained and SPS-applied student model. The final output is our PeaBERT model.

For the starting point of the KD process, a well-finetuned teacher model should be used. We use the 12-layer BERT-base model as the teacher. The learned parameters are denoted as follows, where θ^{t} denotes the parameters of the teacher, x_{i} the training data, z_{t} the teacher model's output predictions, and y_{i} the true labels.
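The equation itself is elided in this version of the text. Assuming the teacher is finetuned with the standard cross-entropy objective, a plausible reconstruction in the notation above is:

```latex
\theta^{t} = \arg\min_{\theta} \sum_{i} \mathrm{CE}\left( y_{i},\ \mathrm{softmax}\left( z_{t}(x_{i}; \theta) \right) \right)
```

where CE denotes the cross-entropy loss.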

We then pretrain the student model with PTP labels using the following loss:
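The PTP loss is likewise elided here. Since PTP is a four-way classification over the artificial labels, it is presumably a standard cross-entropy; a plausible reconstruction, writing \tilde{y}_{i} for the PTP label of x_{i} and z_{s} for the student's output logits:

```latex
\mathcal{L}_{\mathrm{PTP}} = - \sum_{i} \log\, \mathrm{softmax}\left( z_{s}(x_{i}; \theta^{s}) \right)_{\tilde{y}_{i}}
```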

The loss function is as follows, where z^{k} denotes the output logits of the k-th layer; more details can be found in [

Note that during the KD process, we use a softmax temperature T, which controls the softness of the teacher model's output predictions, as introduced in [
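The softmax-temperature can be sketched as follows (a minimal stdlib-only illustration; the function name is ours):

```python
# Softmax with temperature T: dividing the logits by T before the softmax
# softens (T > 1) or sharpens (T < 1) the resulting distribution.
import math

def softmax_with_temperature(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]
```

A larger T flattens the distribution, exposing the relative probabilities the teacher assigns to non-target classes, which is the extra signal the student learns from during KD.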

We discuss experimental results to assess the effectiveness of our proposed method. Our goal is to answer the following questions.

We use four of the most widely used datasets in the General Language Understanding Evaluation (GLUE) benchmark [

We use Patient Knowledge Distillation (PatientKD [

We use the 12-layer original BERT model [. We choose the learning rate from candidate values on the order of 10^{−5}, the number of epochs from {4, 6, 10},

We follow the official GLUE leaderboard's metric for each dataset: accuracy for RTE, SST-2, and QNLI. For MRPC, we use the F1 score as our main metric, following previous works [

We summarize the performance of PeaBERT against the standard baseline PatientKD in

Model | RTE (Acc) | MRPC (F1) | SST-2 (Acc) | QNLI (Acc) | Avg |
---|---|---|---|---|---|
BERT_{1}-PatientKD | 52.8 | 80.6 | 83.6 | 64.0 | 70.3 |
PeaBERT_{1} | 53.0 | 81.0 | 86.9 | 78.8 | 75.0 |
BERT_{2}-PatientKD | 53.5 | 80.4 | 87.0 | 80.1 | 75.2 |
PeaBERT_{2} | 64.1 | 82.7 | 88.2 | 86.0 | 80.3 |
BERT_{3}-PatientKD | 58.4 | 81.9 | 88.4 | 85.0 | 78.4 |
PeaBERT_{3} | 64.5 | 85.0 | 90.4 | 87.0 | 81.7 |

The results are evaluated on the test set of GLUE official benchmark. The subscript numbers denote the number of independent layers of the student.

Model | # of params | RTE (Acc) | MRPC (F1) | SST-2 (Acc) | QNLI (Acc) | Avg |
---|---|---|---|---|---|---|
DistilBERT | 42.6M | 59.9 | 87.5 | 91.3 | 89.2 | 82.0 |
TinyBERT | 42.6M | 70.4 | 90.6 | 93.0 | 91.1 | 86.3 |
BERT-of-Theseus | 42.6M | 69.0 | 91.4 | 91.5 | 89.9 | 85.5 |
PeaBERT | 42.6M | 73.6 | 92.9 | 93.5 | 90.3 | 87.6 |

The cited results of the competitors are from the official papers of each method. For fair comparison, model dimensions are fixed to six layers across all models compared. The results are derived from the GLUE development set.

First, we see that PeaBERT consistently improves accuracy across all datasets and student model sizes, with the largest gain observed for the one-layer student on QNLI (BERT_{1}-QNLI). These results validate the effectiveness of PeaBERT across varying downstream tasks and student model sizes.

Second, using the same number of parameters, PeaBERT outperforms the state-of-the-art KD baselines DistilBERT, TinyBERT, and BERT-of-Theseus by 5.6%, 1.3%, and 2.1% on average, respectively. We use a 6-layer student model for this comparison. An advantage of PeaBERT is that it achieves remarkable accuracy improvement using only the downstream dataset: unlike its competitors DistilBERT and TinyBERT, PeaBERT does not touch the original pretraining tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). This reduces training time significantly. For example, DistilBERT took approximately 90 hours with eight 16GB V100 GPUs, while PeaBERT took between one minute (PeaBERT_{1} with RTE) and one hour (PeaBERT_{3} with QNLI) using just two NVIDIA T4 GPUs.

Finally, another advantage of PeaBERT is that it can be applied to other transformer-based models with minimal modifications. The SPS method can be directly applied to any transformer-based models, and the PTP method can be applied to any classification task.

We perform ablation studies to verify the effectiveness of SPS. We compare three models: BERT_{3}, SPS-1, and SPS-2. BERT_{3} is the original BERT model with 3 layers, which applies neither of the SPS steps. SPS-1 applies only the first SPS step (paired parameter sharing) to BERT_{3}. SPS-2 applies both the first step and the second step (shuffling) to BERT_{3}.

The results are summarized in the table below. Compared with BERT_{3}, SPS-1 shows improved accuracy on all the downstream datasets, by 1.1% on average, verifying our first motivation. Comparing SPS-1 with SPS-2, we note that SPS-2 consistently performs even better, by 1.9% on average, which validates our second motivation. Based on these results, we conclude that both steps of the SPS process work as intended, increasing the student model's capacity without increasing the number of parameters used.

Model | # of params | RTE (Acc) | MRPC (F1) | SST-2 (Acc) | QNLI (Acc) | Avg |
---|---|---|---|---|---|---|
BERT_{3} | 21.3M | 61.4 | 84.3 | 89.4 | 84.8 | 80.0 |
BERT_{3}+SPS-1 | 21.3M | 63.5 | 85.8 | 89.6 | 85.5 | 81.1 |
BERT_{3}+SPS-2 | 21.3M | 68.6 | 86.8 | 90.2 | 86.5 | 83.0 |

The results are derived using GLUE dev set.

We perform an ablation study to validate the effectiveness of using PTP as an initial guide for the student model. We use BERT_{3}+SPS, which is SPS applied to BERT_{3}, as our base model. We compare the results of PTP with those of its variants PTP-1 and PTP-2. Note that PTP uses four labels constructed from two types of information latent in the teacher model's softmax prediction: (1) whether the teacher predicted correctly, and (2) how confident the teacher is. PTP-1 is a variant of PTP that uses only two labels stating whether the teacher predicts correctly or not. PTP-2 is another variant that uses two labels stating whether the teacher's prediction is confident or not. PTP-1 and PTP-2 thus each encode one of the two types of the teacher's information in their labels. From the results summarized in

Model | # of params | RTE (Acc) | MRPC (F1) | SST-2 (Acc) | QNLI (Acc) | Avg |
---|---|---|---|---|---|---|
BERT_{3}+SPS | 21.3M | 68.6 | 86.8 | 90.2 | 86.5 | 83.0 |
BERT_{3}+SPS+PTP-1 | 21.3M | 69.0 | 88.0 | 90.4 | 86.6 | 83.5 |
BERT_{3}+SPS+PTP-2 | 21.3M | 69.3 | 88.3 | 90.9 | 86.8 | 83.8 |
BERT_{3}+SPS+PTP | 21.3M | 70.8 | 88.7 | 91.2 | 87.1 | 84.5 |

The results are derived using GLUE dev set.

These results demonstrate two things: (1) a student that goes through only the conventional KD process does not fully utilize the knowledge included in the teacher model's softmax outputs, and (2) PTP helps the student better utilize that knowledge. This firmly supports the efficacy of our PTP method and validates our main claim that initializing a student model with a KD-specialized method prior to applying KD can improve accuracy. As existing KD methods do not place much emphasis on the initialization process, this finding highlights a potentially major, undiscovered path to improving model accuracy. Further research on KD-specialized initialization could be promising.

In this paper, we propose Pea-KD, a new KD method for transformer-based distillation, and show its efficacy. Our goal is to address and reduce the limitations of currently available KD methods: insufficient model capacity and the absence of a proper initial guide for the student. We first introduce SPS, a new parameter sharing approach with a shuffling mechanism, which enhances the capacity of the student model while using the same number of parameters. We then introduce PTP, a KD-specific initialization method for the student model. Our proposed PeaBERT applies these two methods, SPS and PTP, to BERT. Through extensive experiments conducted on multiple datasets and with varying model sizes, we show that our method improves KD accuracy by an average of 4.4% on the GLUE test set. We also show that PeaBERT works well across different datasets, outperforming the original BERT as well as other state-of-the-art BERT distillation baselines by an average of 3.0%.

One limitation of this work is that the proposed method is verified only on the BERT model. As future work, we plan to apply our technique to other pretrained language models and to delve deeper into the concept of KD-specialized initialization of the student model. Since PTP and SPS are independent processes, we also plan to combine them with other model compression techniques, such as weight pruning and quantization.