PLOS ONE | Research Article

SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression

Tairen Piao, Ikhyun Cho, U. Kang* (https://orcid.org/0000-0002-8774-6950)
Seoul National University, Seoul, Republic of Korea
Editor: Sergio Consoli, European Commission, Italy
The authors have declared that no competing interests exist.
* Email: ukang@snu.ac.kr

PLOS ONE 17(4): e0265621. Received: 26 August 2021; Accepted: 4 March 2022; Published: 18 April 2022.

© 2022 Piao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Given a pre-trained BERT, how can we compress it to a fast and lightweight one while maintaining its accuracy? Pre-trained language models, such as BERT, are effective for improving the performance of natural language processing (NLP) tasks. However, heavy models like BERT suffer from large memory cost and long inference time. In this paper, we propose SensiMix (Sensitivity-Aware Mixed Precision Quantization), a novel quantization-based BERT compression method that considers the sensitivity of different modules of BERT. SensiMix effectively applies 8-bit index quantization and 1-bit value quantization to the sensitive and insensitive parts of BERT, maximizing the compression rate while minimizing the accuracy drop. We also propose three novel 1-bit training methods to minimize the accuracy drop: Absolute Binary Weight Regularization, Prioritized Training, and Inverse Layer-wise Fine-tuning. Moreover, for fast inference, we apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to the 8-bit and 1-bit quantization parts of the model, respectively. Experiments on four GLUE downstream tasks show that SensiMix compresses the original BERT model to an equally effective but lightweight one, reducing the model size by a factor of 8× and shrinking the inference time by around 80% without noticeable accuracy drop.
Funding: This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) ([grant number No.2020000894, Flexible and Efficient Model Compression Method for Various Applications and Environments], [grant number No.2017001772, Development of QA systems for Video Story Understanding to pass the Video Turing Test], [grant number No.2021002068, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)], and [grant number No.2021001343, Artificial Intelligence Graduate School Program (Seoul National University)]). The Institute of Engineering Research and ICT at Seoul National University provided research facilities for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: The data underlying this study have been uploaded to GitHub and are accessible using the following link: https://github.com/snudatalab/SensiMix.

Introduction
Given a pre-trained BERT, how can we compress it to a fast and lightweight one? Pre-trained language models, such as BERT [1], RoBERTa [2], and ERNIE 2.0 [3], have been shown to be effective for improving many natural language processing (NLP) tasks, such as language inference, named entity recognition, and question answering.
However, these models usually have an extremely large number of parameters, which leads to high training cost and long inference time. For example, BERT-base [1] has 12 layers and 110 million parameters, and training BERT-base from scratch typically takes about four days on 4 to 16 Cloud TPUs. Even fine-tuning on downstream tasks may take several hours to finish on a typical GPU like NVIDIA RTX 2080Ti. Moreover, Kovaleva et al. [4] demonstrate that there is redundancy in BERT. Therefore, it is crucial to reduce the large model size, long inference time, and computational overhead of BERT while retaining its accuracy.
Model compression has been widely studied in recent years [5–11] due to the increase of model sizes, and there are several methods shown to be effective for BERT compression [12–17], such as knowledge distillation (KD)-based, pruning-based, parameter sharing-based, and quantization-based ones. Nevertheless, these methods have several limitations. First, KD-based methods do not give high compression rates. For example, DistilBERT-base [12] compresses the model size only to 61.2% of the original model. Second, there is a huge accuracy degradation when compressing a large proportion of BERT. For example, a pruning-based method [13] reduces the model size considerably, but it has a significant accuracy drop. Third, parameter sharing-based methods (e.g., ALBERT [14]) successfully reduce the model size, but fail to decrease the inference time. Furthermore, various approaches such as modifying the pre-training task of BERT [18], designing a hardware-friendly version of BERT [19], and applying a plug-and-play approach to efficiently reuse the parameters of BERT [20] have also been recently proposed to increase the computational efficiency of BERT from different perspectives.
Among these various compression methods, in this paper we focus on quantization due to its superior capability of compressing models. However, previous quantization-based methods (e.g., Q8BERT [15]) quantize BERT directly without considering the sensitivity of different modules of BERT. We define the sensitivity of a module to be the degree of accuracy change caused by changing the module, at the same compression rate. When compressing different modules at the same compression rate, those causing a severe accuracy drop are more sensitive. Methods that do not consider the sensitivity of modules suffer from two problems: they lose accuracy by compressing too many sensitive parameters, and they leave the insensitive parameters sub-optimally compressed.
In this paper, we propose SensiMix, a novel quantization-based BERT compression method. We improve the efficiency of BERT compression in the following three aspects: model size, accuracy, and inference speed. We decrease the size of BERT by our proposed sensitivity-aware mixed precision quantization, which improves the previous quantization approaches by choosing target compression ratios based on the sensitivity of modules in BERT. We demonstrate that the encoders close to the input layer are more sensitive than those near the output layer in BERT, and that the Self-Attention layer is more sensitive than the feed-forward network (FFN) in an encoder. Hence, SensiMix quantizes these sensitive parts to 8-bit and the remaining parts to 1-bit. For the 8-bit quantization, we introduce 8-bit index quantization, which reduces the model size while retaining the accuracy by using 8-bit indices together with the minimum and maximum weights to efficiently represent all weights of each layer. We initialize the SensiMix model using a pre-trained BERT model and fine-tune it on downstream tasks based on the sensitivity-aware quantization strategy.
We then improve the accuracy of SensiMix with our three proposed training methods for 1-bit quantization. First, we propose Absolute Binary Weight Regularization (ABWR), which pushes the absolute value of each 32-bit floating point (FP32) full-precision weight close to 1 in the training phase to reduce the precision loss of binarization. Second, we propose Prioritized Training (PT). PT lets the FP32 full-precision weights learn the binary input features before binarizing both inputs and weights, alleviating the accuracy drop caused by the initial full-precision weights' lack of knowledge of binary input features. Third, we introduce Inverse Layer-wise Fine-tuning (ILF). ILF gradually increases the proportion of 1-bit parameters during training, which alleviates the accuracy drop caused by quantizing too many parameters to 1-bit at once. In the inference phase, SensiMix applies FP16 general matrix multiplication (GEMM) to the 8-bit parts of the model and XNOR-Count GEMM to the 1-bit parts to achieve a fast inference speed. Fig 1 shows that SensiMix achieves the best trade-off between accuracy, model size, and inference time.
Fig 1. SensiMix shows the best trade-off between accuracy, model size, and inference time among the competitors.
We report the average accuracy of four GLUE tasks (QQP, QNLI, SST-2, and MRPC).
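The idea behind ABWR, introduced above, is to pull each full-precision weight's magnitude toward 1 so that the later sign-based 1-bit quantization loses less information. The sketch below only illustrates this idea; it assumes a squared penalty γ · mean((|w| − 1)²), which may differ from the exact regularizer used by SensiMix (γ denotes the regularization strength, as in Table 1).

```python
def abwr_penalty(weights, gamma=1e-4):
    """Illustrative Absolute Binary Weight Regularization term.

    Pulls the absolute value of each full-precision weight toward 1,
    so that binarizing the weight to its sign later loses less precision.
    Assumed form: gamma * mean((|w| - 1)^2); SensiMix's exact formula
    may differ.
    """
    return gamma * sum((abs(w) - 1.0) ** 2 for w in weights) / len(weights)

# Weights already near {-1, +1} incur almost no penalty,
# while weights near 0 are penalized the most.
print(abwr_penalty([1.0, -1.0, 0.99]))
print(abwr_penalty([0.1, -0.2, 0.0]))
```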
Our main contributions are as follows:
Sensitivity-aware mixed precision quantization. We propose SensiMix, a BERT compression method which exploits mixed precision quantization considering the sensitivity of different modules. SensiMix quantizes sensitive parts of the model using 8-bit index quantization and insensitive parts using standard 1-bit quantization, to achieve both a high compression rate and high accuracy.
Training methods for improving accuracy. We propose Absolute Binary Weight Regularization, Prioritized Training, and Inverse Layer-wise Fine-tuning for training the 1-bit parts of the model, which alleviate the accuracy drop caused by applying the 1-bit quantization.
Inference strategy. We apply FP16 GEMM to the 8-bit parts of the model and XNOR-Count GEMM to the 1-bit parts to achieve a fast inference speed.
Experiments. We conduct experiments on four GLUE downstream tasks. Experiments show that SensiMix compresses BERT 8× in terms of model size and gives 5× faster inference speed. An ablation study shows that our three 1-bit training methods ABWR, PT, and ILF improve the average accuracy of SensiMix by 1.1%, 1.4%, and 1.4%, respectively, compared to not applying them.
In the rest of this paper, we first introduce the related work and describe our proposed method. Then, we experimentally evaluate the performance of SensiMix and its competitors. The code of SensiMix is available at https://github.com/snudatalab/SensiMix. Table 1 shows the symbols used in this paper.
Table 1. Table of symbols.

Symbol | Definition
a_fp | Full-precision activation
a_b | Binary activation
w_fp | Full-precision weight
w_b | Binary weight
W_fp^l | Full-precision weight matrix of the layer l
W_b^l | Binary weight matrix of the layer l
W_q^l | Quantized 8-bit weight matrix of the layer l
W_dq^l | Dequantized FP32 weight matrix of the layer l
L | Loss function
γ | Strength of the ABWR regularization term
O_ij | (i, j)-th element of the output matrix
Related work

BERT

BERT [1] is a pre-trained language model which has achieved state-of-the-art performance on many downstream natural language processing tasks. The model consists of a WordPiece embedding layer, several Transformer [21] encoder layers, and a task-dependent classifier. The BERT model is trained on a large corpus with the masked language modeling and next sentence prediction tasks. The BERT-base model has shown great success in a variety of NLP tasks, but its 110M parameters incur huge memory cost and long inference time. Hence, there has been a growing interest in compressing BERT to a tiny one through various methods such as pruning, parameter sharing, knowledge distillation, and quantization.
Network pruning
Network pruning is a technique to reduce the model size by removing weights of a deep neural network, and it has been widely studied in recent years [5, 22, 23]. Based on the assumption that deep neural networks have redundancy, and that many parameters in a deep neural network are unimportant or unnecessary, network pruning is used to remove the unimportant parameters. In this way, pruning methods increase the sparsity of the parameters significantly. After pruning, the sparse model requires less space when stored in the compressed sparse row (CSR) or compressed sparse column (CSC) format.
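To make the space saving concrete, the CSR format mentioned above stores only the non-zero weights plus two index arrays. A minimal plain-Python sketch (real systems use libraries such as SciPy's `csr_matrix`; this is only illustrative):

```python
def to_csr(dense):
    """Convert a dense 2-D list into CSR arrays: non-zero values,
    their column indices, and row pointers marking where each row
    starts in the value array."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # cumulative non-zero count
    return values, col_idx, row_ptr

# A 3x4 pruned weight matrix with only 3 surviving weights:
m = [[0, 0, 0.5, 0],
     [0, 0, 0, 0],
     [1.2, 0, 0, -0.7]]
vals, cols, ptrs = to_csr(m)
print(vals, cols, ptrs)  # [0.5, 1.2, -0.7] [2, 0, 3] [0, 1, 1, 3]
```

Instead of 12 floats, only 3 floats and 7 small integers are stored, which is why high sparsity translates into memory savings.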
BERT pruning
Gordon et al. [13] explore the effectiveness of weight pruning on BERT and figure out how pruning during pre-training affects the accuracy of the model. The authors find that pruning affects the performance of BERT in three broad regimes. Low levels of pruning (30–40%) have a negligible effect on the pre-training loss and on transferring the knowledge to downstream tasks. Medium levels of pruning (40–70%) increase the pre-training loss and hinder useful pre-training information from being transferred to downstream tasks. High levels of pruning (70–90%) additionally prevent models from fitting to downstream tasks, leading to further accuracy degradation. Experiments show that pruning 60% or fewer of BERT's parameters affects the accuracy of the model marginally on GLUE benchmark [24] tasks, but pruning more than 70% starts to considerably influence the accuracy. On top of that, the method has a significant disadvantage: it is difficult to reduce the inference time without special hardware [25]. Compared to pruning methods, SensiMix reduces both the number of layers and the size of parameters without pruning any parameter in the model, retaining higher accuracy and gaining faster inference speed.
Parameter sharing
Parameter sharing is a widely used technique to reduce the number of parameters of a model [6, 26]. The basic idea of parameter sharing is to use the same parameters across different layers in the model. However, despite the desirable reduction in the number of parameters, it generally suffers from a considerable accuracy drop.
ALBERT
ALBERT [14] is a novel lightweight version of BERT using the parameter sharing-based method. The cross-layer parameter sharing method is one of the main ideas of this model, which is to share the parameters across the layers of BERT. The authors use various parameter sharing methods such as all-shared, shared-attention, and shared-FFN methods. All these sharing schemes reduce the accuracy of the model, but the authors scale up the size of the model after sharing parameters not only to recover but also to exceed the original accuracy of the model. However, ALBERT fails to reduce the inference time because the input still needs to pass through 12 or more layers, which is a critical disadvantage since inference time is a significant factor in model compression. Compared to ALBERT, SensiMix compresses the model by reducing the number of layers and quantizing the parameters, leading to much faster inference speed.
Knowledge distillation
Knowledge distillation (KD) [7] aims to train a compact or smaller student model to approximate the function learned by a large and complex teacher model. Recently, KD has become one of the main techniques for model compression, by training a small student model to imitate the soft output labels of a large teacher model. Romero et al. [27] demonstrate that intermediate representations learned by the large model could serve as hints to improve the training process and the final performance of the student model. Moreover, Liu et al. [28] apply KD to transfer knowledge from ensemble models to improve the performance of a single model on naturallanguage understanding (NLU) tasks.
DistilBERT
DistilBERT [12] explores the problem of compressing BERT by applying KD. DistilBERT has the same general architecture as BERT, but the number of encoder layers is reduced by a factor of 2. DistilBERT applies KD to a student model during pre-training, allowing the student model to learn the soft labels of the teacher's outputs. DistilBERT is trained on huge batches leveraging gradient accumulation, using dynamic masking, and without the next sentence prediction task. Experiments show that DistilBERT achieves a good result on the GLUE benchmark with only a slight accuracy drop compared to BERT-base. However, the downside of DistilBERT is that even though it removes half of the encoder layers, the embedding layer makes up about 21% of the total parameters, so the model size is reduced to only 61.2% of the original. Compared to DistilBERT, SensiMix additionally compresses the parameters by quantization, reducing the model size and improving the inference speed.
Network quantization
Network quantization uses smaller bit-width integers to represent and compress the parameters of deep neural networks. A typical deep learning model uses the 32-bit floating point (FP32) format for its parameters. [8, 29] demonstrate that weights and activations can be represented using 8-bit numbers without significant accuracy drop. The use of even lower bit-widths such as 4, 2, and 1 bit has also shown remarkable progress [30–33]. For example, the binarized neural network (BNN) [32] uses 1 bit for each parameter to save storage and reduce computation. However, a quantized model such as BNN suffers from severe loss of precision and accuracy drop.
Q8BERT
Q8BERT [15] applies 8-bit quantization-aware training during the fine-tuning process of BERT. In forward propagation, Q8BERT first quantizes the activation and weight matrices to INT8 format by multiplying by the two scaling factors of the two matrices, performs INT8 GEMM, which multiplies and accumulates two INT8 matrices into an INT32 matrix, and then dequantizes the INT32 matrix to an FP32 matrix by dividing by the scaling factors of the weight and activation matrices. In backward propagation, Q8BERT uses the 8-bit clip function to approximate the 8-bit round function for training the model. Q8BERT reduces the model size by a factor of 4 and maintains almost the same accuracy as the FP32-precision BERT on eight different NLP tasks. However, Q8BERT does not consider the sensitivity of layers in BERT and does not apply a mixed precision quantization strategy; in contrast, our SensiMix applies mixed precision quantization considering the sensitivity and provides superior performance compared to Q8BERT.
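The quantize–GEMM–dequantize round trip described above can be sketched in a simplified 1-D form. This is not Q8BERT's actual implementation (which operates on matrices with quantization-aware training); it only illustrates a linear symmetric scheme with per-tensor scaling factors:

```python
def sym_quant(xs, num_bits=8):
    """Symmetric linear quantization: map floats to signed integers
    using one per-tensor scaling factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8-bit
    scale = qmax / max(abs(x) for x in xs)  # scaling factor of this tensor
    q = [max(-qmax, min(qmax, round(x * scale))) for x in xs]
    return q, scale

def int8_dot(a, sa, b, sb):
    """INT8 GEMM reduced to one dot product: accumulate integer
    products (INT32 in hardware), then dequantize by dividing by the
    two scaling factors."""
    acc = sum(x * y for x, y in zip(a, b))
    return acc / (sa * sb)

a, sa = sym_quant([0.5, -1.0, 0.25])
b, sb = sym_quant([1.0, 2.0, -4.0])
print(int8_dot(a, sa, b, sb))  # close to the exact dot product -2.5
```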
KDLSQ-BERT & k-means quantization

Jin et al. [33] propose KDLSQ-BERT, a framework to combine quantization with knowledge distillation using BERT. KDLSQ-BERT adopts learned step size quantization (LSQ; [34]), a variant of the original quantization-aware training, which has been shown to be effective in computer vision. Different from ordinary quantization-aware training, LSQ additionally learns the scale factor for each weight and activation during the training process. By applying LSQ along with KD on BERT, KDLSQ-BERT shows a decent trade-off between performance and memory footprint. Also, Zhao et al. [35] propose a variant of quantization-aware training by adopting the idea of k-means clustering into quantization. They show that this k-means clustering-based variant is comparable to the original quantization-aware training on BERT-based models.
In our paper, since considering every variant of quantization method is impractical, we focus on linear and symmetric quantization-aware training, the most basic and universal quantization scheme, to verify the effectiveness of our SensiMix. Since SensiMix proves to be effective on this most fundamental quantization scheme, we believe SensiMix can be deployed for those variant settings as well. In addition, the three proposed training methods ABWR, PT, and ILF of SensiMix are orthogonal techniques, making them complementary to other mixed and/or low-precision quantization methods.
I-BERT

Kim et al. [36] present I-BERT, which quantizes the entire inference of BERT-based models with integer-only arithmetic, avoiding any floating point calculations. The major merit of integer-only inference is that it can benefit from faster inference speed by using a family of specialized hardware that supports efficient integer computations (e.g., Turing Tensor Cores, ARM Cortex-M, etc.). Specifically, I-BERT approximates the GELU and Softmax functions with second-order polynomials and LayerNorm with a fair number of iterations of integer arithmetic, which can then be evaluated on integer-arithmetic-only hardware devices.

Different from I-BERT, SensiMix focuses on simulated quantization, which includes both quantization and dequantization steps in inference, rather than restricting the setting to integer-only. This results in higher flexibility and applicability of SensiMix.
EvoQ & Lee et al. [37]

Yuan et al. [38] present EvoQ, a post-training quantization method which applies tournament selection, a classical evolutionary algorithm, to find the best quantization policy. Their sensitivity metric measures the output and intermediate layer differences between the quantized model and the full-precision model. Using this sensitivity metric along with the tournament selection algorithm, EvoQ improves the search efficiency for finding the best quantization policy.

Similarly, Lee et al. [37] also propose a metric to measure the layers' sensitivity to quantization. They measure the sensitivity of a specific layer by considering the effect of quantization on both the task loss (final output) and other intermediate layers using the concept of gradient perturbation. They also develop a neural network-agnostic data generation method to improve the quality of the quantized network.

These works are similar to ours in the sense that they attempt to measure the sensitivity of each layer to quantization. However, both of them are specialized to CNN-based models and thus not directly applicable to BERT.
Proposed method
We propose SensiMix, a sensitivity-aware mixed precision quantization method for BERT compression. We first provide a brief overview of our method. Then, we describe the sensitivity-aware mixed precision quantization strategy. After introducing the 8-bit index quantization and the 1-bit value quantization, we describe how to perform inference of our model.
Overview
Our goal is to compress BERT to a fast and lightweight model. We concentrate on the following challenges for the goal.
Accuracy degradation caused by compression. Many existing BERT compression methods lose accuracy after compression. For example, Gordon et al. [13] reduce the model size of BERT considerably, but the accuracy is degraded significantly. How can we compress BERT to a lightweight one while keeping its accuracy?
Challenges of 1-bit quantization. We apply 1-bit quantization to insensitive parts of the model, but there are several challenges when applying 1-bit quantization. First, quantizing the original full-precision weights to +1 or −1 leads to a huge precision loss. Second, binarizing both weights and activations at the beginning of the training phase causes an accuracy drop due to the lack of binary input feature knowledge learned by the pre-trained FP32-precision model. Third, models that quantize too many parameters to 1-bit at once are hard to train. How can we overcome these challenges and improve the accuracy of the quantized model?
Limitation of inference speed. Many BERT compression techniques do not improve the inference speed. For example, ALBERT-large [14] has about 5× fewer parameters compared to BERT-base, but its inference time is 3× longer. How can we achieve a fast inference speed while maintaining a small model size and similar accuracy?
We address the mentioned challenges with the following ideas:
Sensitivity-aware mixed precision quantization. SensiMix quantizes the sensitive parts of BERT using 8-bit index quantization and the insensitive parts using 1-bit value quantization, which maximizes the compression rate and minimizes the accuracy drop caused by quantizing sensitive modules to low bits.
Training methods for 1-bit quantization. We propose three training methods for the 1-bit quantization: Absolute Binary Weight Regularization (ABWR), Prioritized Training (PT), and Inverse Layer-wise Fine-tuning (ILF). They overcome the challenges of the conventional 1-bit quantization method mentioned above, improving the accuracy of the model.
Fast matrix multiplications for inference. We apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to replace the original GEMM for the 8-bit and the 1-bit quantization parts of the model to achieve a fast inference speed.
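The XNOR-Count idea above replaces floating point multiply-accumulates with bitwise operations: vectors over {−1, +1} are packed into bit masks, and a dot product becomes an XNOR followed by a bit count. A minimal sketch of a single binary dot product (plain Python integers stand in for the packed machine words a real GPU kernel would use):

```python
def pack_bits(xs):
    """Pack a {-1, +1} vector into an integer bitmask (1 bit per element,
    bit set when the element is +1)."""
    mask = 0
    for i, x in enumerate(xs):
        if x > 0:
            mask |= 1 << i
    return mask

def xnor_count_dot(a_bits, b_bits, n):
    """Binary dot product via XNOR-Count: XNOR the packed operands,
    count matching bit positions, then
    dot = matches - mismatches = 2 * matches - n."""
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
dot = xnor_count_dot(pack_bits(a), pack_bits(b), len(a))
print(dot)  # equals sum(x * y for x, y in zip(a, b))
```

Because one machine word holds 32 or 64 such entries, a whole chunk of the dot product is computed by a single XNOR and popcount instruction, which is the source of the 1-bit speedup.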
Fig 2 compares the architecture of SensiMix to that of the original BERT. SensiMix effectively applies the 8-bit index quantization and the 1-bit value quantization to sensitive and insensitive modules of BERT, respectively.
Fig 2. Overview of SensiMix.
SensiMix applies 1-bit value quantization to the insensitive feed-forward networks (FFN) near the output layer, and applies 8-bit index quantization to the remaining sensitive parts.
Our goal is to find an effective quantization strategy to compress the parameters of BERT while maintaining the accuracy of the model. BERT consists of a WordPiece embedding layer and 12 Transformer [21] encoder layers, where each encoder layer is composed of a Self-Attention layer and a feed-forward network (FFN). There is a classifier after the last encoder layer. Previous BERT compression methods do not consider the sensitivity of different modules of BERT, which leads to two problems. First, they cause an accuracy drop by compressing too many sensitive parameters of the model. Second, they do not compress parameters efficiently since the insensitive parts are not optimally compressed.
Input: the l-th layer's input matrix I^l, the full-precision weight matrix of the l-th layer W_fp^l, the number of encoder layers N, the index of the l-th layer l, and the activation function Act
Output: the updated weight matrix of the l-th layer W_updated^l
19:   Use the 8-bit clip function to replace the round function when computing the gradient of W_fp^l // (Eq 4)
20:   Update W_fp^l to W_updated^l
21: else if the l-th layer is the MP encoder then
22:   Use the 1-bit clip function to replace the round function when computing the gradient of W^l // (Eq 6)
23:   Update W_fp^l to W_updated^l
24: l ← l − 1
To alleviate these problems, we pay attention to the sensitivity of different modules of the model. First, we discover that the Self-Attention layer is more important than the FFN in an encoder. The Self-Attention layer calculates the relations between input word embeddings, which plays a crucial role in improving the accuracy of BERT. Moreover, [39, 40] demonstrate that the Self-Attention distributions of pre-trained language models capture a rich hierarchy of linguistic information, which reveals the importance of the Self-Attention layer. We also discover that the encoders close to the input layer are more sensitive than the encoders near the output layer in BERT. The encoders near the input layer extract important low-level features from the input embeddings, which are crucial for the model accuracy. Lin et al. [41] show that the layers close to the input layer have the most important information about linear word order, which signifies the importance of these layers. We verify these claims in the experiments section.
Quantization strategy
There are many choices for the number of bits used to quantize the full-precision model. As we lower the number of bits used, we can save more memory, but the model accuracy also falls significantly. We basically want to compress the model to 1-bit because it provides a good compression rate and fast inference speed, but applying 1-bit value quantization to all modules causes severe accuracy degradation. On the other hand, 8-bit index quantization reduces the required memory storage to one-fourth while maintaining most of the accuracy compared to the FP32 model. Hence, we apply 1-bit value quantization to the insensitive parts of the model and 8-bit index quantization to the remaining parts.
Based on this motivation, we propose SensiMix, which contains two types of encoders: the 8-bit encoder and the Mixed Precision (MP) encoder. An 8-bit encoder is composed of an 8-bit index-quantized FFN and Self-Attention layer, and an MP encoder is composed of a 1-bit value-quantized FFN and an 8-bit index-quantized Self-Attention layer. Given n layers from a pre-trained BERT model, we use m MP encoders near the output layer, and n − m 8-bit encoders for the remaining n − m layers. Note that if n is smaller than the total number of layers of the BERT model, we initialize the model using the lower layers of the BERT model. For example, we initialize the SensiMix (3+3) model using layers 1–6 of the BERT-base model. Additionally, we apply 8-bit index quantization to the embedding layer. We do not quantize the bias layers, the LayerNorm layers, and the classifier since they occupy only a small part of the model. One advantage of SensiMix is that it is more flexible than existing methods (e.g., Q8BERT) because the type of each encoder, as well as the number of layers, is flexible. The overall process of SensiMix is shown in Algorithm 1.
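The encoder layout above can be summarized as a per-layer precision plan. The sketch below is only illustrative (the function and dictionary keys are ours, not the paper's code); it builds the SensiMix (3+3) layout, i.e., n = 6 encoders of which the last m = 3 are MP encoders:

```python
def sensimix_layout(n, m):
    """Assign a precision plan to each of n encoders: the first n - m
    encoders are 8-bit encoders (8-bit Self-Attention + 8-bit FFN);
    the last m encoders, near the output, are Mixed Precision encoders
    (8-bit Self-Attention + 1-bit FFN). Illustrative only."""
    plan = []
    for layer in range(n):
        ffn = "8-bit" if layer < n - m else "1-bit"
        plan.append({"layer": layer, "attention": "8-bit", "ffn": ffn})
    return plan

for enc in sensimix_layout(6, 3):  # SensiMix (3+3)
    print(enc)
```

Only the FFN precision varies across encoders; every Self-Attention layer stays at 8-bit, reflecting its higher sensitivity.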
SensiMix is trained during the fine-tuning stage of BERT. We initialize the model using a pre-trained BERT model and fine-tune the SensiMix model on downstream tasks based on the above sensitivity-aware mixed precision quantization strategy.

8-bit index quantization
We describe our 8-bit index quantization method in detail. The main idea of the 8-bit index quantization method is to transform and shrink the original 32-bit weight matrices into 8-bit matrices composed of integer indices, thereby reducing the memory needed to store the model weights. As we lower the number of bits used, we can save more memory, but the model accuracy also falls significantly. Therefore, an appropriate compromise is required, and we empirically choose 8-bit for our index quantization in SensiMix. Specifically, 8-bit index quantization reduces the required memory storage to one-fourth while maintaining most of the accuracy in comparison to the FP32 version of the model. 8-bit index quantization consists of two steps: quantization and dequantization. The quantization step quantizes FP32 values to 8-bit integer values to reduce the memory requirement, and the dequantization step dequantizes the values back to 32-bit to make the output similar to the original FP32 model. In the following, we elaborate on the quantization step, the dequantization step, and how we apply them in the training process. Fig 3 shows the overall process of the 8-bit index quantization.
Fig 3. The overall process of the 8-bit index quantization.
In the forward propagation, the original weights are first quantized to 8-bit indices and then dequantized back to FP32 weights; in the backward propagation, we replace the round function with the 8-bit clip function to train the model.
Quantization
The quantization step reduces FP32 weight matrices into INT8 weight matrices. From an FP32 weight matrix, each FP32 weight is mapped to a number within the [−128, 127] range (8 bits). We divide the range between the minimum and the maximum weight into 256 uniform intervals. Then for each weight in the matrix, if the value is in the i^{th} interval, i = 0, …, 255, the integer (i − 128) is assigned as its index. Through this quantization step, we map the FP32 weight matrix (the weight matrix of each layer) into an INT8 index matrix, which reduces memory storage by a factor of 4. In addition to the mapped INT8 index matrix, we also store the minimum and maximum values of the original FP32 weight matrix for the later dequantization step. For example, assume the original FP32 weight matrix consists of four numbers: −1.28, 0.005, 1.00, and 1.27. The range of the weights is [−1.28, 1.27], so the quantization unit is 0.01. Since −1.28, 0.005, 1.00, and 1.27 are in the 0^{th}, 128^{th}, 228^{th}, and 255^{th} intervals, respectively, they are assigned indices −128, 0, 100, and 127, respectively. After the quantization step, the integer indices −128, 0, 100, and 127 are stored along with the minimum and maximum values −1.28 and 1.27. The formula of this quantization step is as follows:
$$W_q^l = \mathrm{round}\!\left[\left(W_{fp}^l - \min(W_{fp}^l)\right) \times \frac{127-(-128)}{\max(W_{fp}^l)-\min(W_{fp}^l)} - 128\right], \tag{1}$$
where $W_q^l$ denotes the 8-bit index-quantized weight matrix of the $l^{th}$ layer, $W_{fp}^l$ denotes the FP32 full-precision weight matrix of the $l^{th}$ layer, and $\min(W_{fp}^l)$ and $\max(W_{fp}^l)$ denote the minimum and maximum weights of the full-precision matrix $W_{fp}^l$, respectively.
Dequantization
In addition to the quantization step, a dequantization step is required to preserve the accuracy of the original FP32 model. Since the quantized index values differ greatly from the original FP32 weight values, directly using the quantized indices produces a very different output from that of the original FP32 model. Therefore, we use a dequantization step to map each quantized index back to an FP32 value close to the original weight, making the output similar to that of the FP32 model. Dequantization is roughly the reverse of quantization: we map the indices back to FP32 values based on the minimum and maximum weights stored in the quantization step. Specifically, we divide the stored range between the minimum and maximum weights into 256 uniform units. Then each index j (−128 ≤ j ≤ 127) is assigned the starting FP32 value of the (j + 128)^{th} unit. For example, in the earlier quantization example, the stored minimum and maximum weight values are −1.28 and 1.27, respectively. Thus, the interval unit is 0.01, and the indices −128, 0, 100, and 127 are mapped back to −1.28, 0.00, 1.00, and 1.27, respectively, close to the original weights −1.28, 0.005, 1.00, and 1.27. The formula of this dequantization step is as follows:
$$W_{dq}^l = \left[\left(W_q^l - (-128)\right) \times \frac{\max(W_{fp}^l)-\min(W_{fp}^l)}{127-(-128)}\right] + \min(W_{fp}^l), \tag{2}$$
where $W_{dq}^l$ denotes the 8-bit dequantized weight matrix of the $l^{th}$ layer. The quantization step reduces the required memory storage, and the dequantization step preserves the accuracy of the model.
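Concretely, the quantization and dequantization steps above can be sketched in NumPy. This is an illustrative re-implementation, not the authors' code; the function names are ours.

```python
import numpy as np

def quantize_8bit(w_fp):
    """Map an FP32 weight matrix to INT8 indices in [-128, 127],
    storing the min/max of the original matrix for dequantization."""
    lo, hi = float(w_fp.min()), float(w_fp.max())
    scale = (127.0 - (-128.0)) / (hi - lo)
    idx = np.round((w_fp - lo) * scale - 128.0)
    return np.clip(idx, -128, 127).astype(np.int8), lo, hi

def dequantize_8bit(w_q, lo, hi):
    """Map INT8 indices back to approximate FP32 weights."""
    scale = (hi - lo) / (127.0 - (-128.0))
    return (w_q.astype(np.float32) + 128.0) * scale + lo
```

Running this on the paper's example matrix [−1.28, 0.005, 1.00, 1.27] reproduces the indices −128, 0, 100, 127, and dequantization recovers values within one quantization unit (0.01) of the originals.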
Training with 8-bit index quantization
Applying the 8-bit index quantization method to deep learning models requires slight modifications to the training process, which we describe below.
In the forward propagation of the training process, we apply quantization and dequantization to the weights of the model. We then use the dequantized weight matrices instead of the original weight matrices to compute the forward propagation. In the backward propagation, we update the weights using standard gradient descent. However, the derivative of the round function is zero almost everywhere, making the weights impossible to update with gradient-descent-based training. To tackle this, we approximate the round function with the 8-bit clip function when updating the weights. The update rule and the 8-bit clip function are as follows:
$$W_{fp}^{l,t+1} = W_{fp}^{l,t} - \eta\,\frac{\partial L}{\partial W_{fp}^{l,t}} = W_{fp}^{l,t} - \eta\,\frac{\partial L}{\partial W_{dq}^{l,t}}\,\frac{\partial W_{dq}^{l,t}}{\partial W_{fp}^{l,t}}, \tag{3}$$
$$\mathrm{round}(x) \approx \mathrm{clip}(x, -128, 127) = \min(\max(x, -128), 127), \tag{4}$$
where $L$ is the loss function of the model, $\eta$ is the learning rate, and superscripts $t$ and $t+1$ represent the moments before and after the weight update, respectively. Note that in Eq 1, all original FP32 weight values are mapped into the range [−128, 127] before being passed to the round function, which means all FP32 weights are updated. After training, we store only the quantized 8-bit index weight matrices and the minimum and maximum weights of each matrix, which reduces the model size by a factor of 4.
1-bit value quantization and additional techniques
We apply 1-bit value quantization instead of 8-bit index quantization to the insensitive weight matrices to further compress the model while maintaining most of its accuracy. Unlike 8-bit index quantization, 1-bit value quantization has no dequantization step; the weights are quantized to ±1 and directly propagated to the next layer of the model. For 1-bit value quantization, we adopt the approach of [32], which reduces the model size by binarizing both weights and activations. Binarizing both activations and weights allows us to apply XNOR-Count GEMM in the inference phase and boost the inference speed. We also propose three additional 1-bit training methods, ABWR, PT, and ILF, that minimize the accuracy drop caused by 1-bit value quantization. In the following, we first introduce 1-bit value quantization in both forward and backward propagation, and then introduce our three proposed 1-bit quantization-aware training methods.
Forward propagation
In the forward propagation, we first use the sign function to binarize weights and activations to either +1 or −1. We then use the binarized weights and activations to perform the standard forward pass of the model. The mathematical formulas are as follows:
$$a_b = \mathrm{sign}(a_{fp}) = \begin{cases} -1 & (a_{fp} \le 0) \\ +1 & (a_{fp} > 0) \end{cases}, \qquad w_b = \mathrm{sign}(w_{fp}) = \begin{cases} -1 & (w_{fp} \le 0) \\ +1 & (w_{fp} > 0) \end{cases}, \tag{5}$$
where a_{fp}, w_{fp}, a_{b}, and w_{b} represent fullprecision activation, fullprecision weight, binary activation, and binary weight, respectively.
Backward propagation
We describe the backward propagation, showing how to update the binary weights $W_b^l$ of the $l^{th}$ layer. The derivative of the sign function is zero almost everywhere, making it incompatible with gradient-descent-based training. To tackle this, 1-bit value quantization approximates the sign function with the 1-bit clip function to update $W_{fp}^l$:
$$W_{fp}^{l,t+1} = W_{fp}^{l,t} - \eta\,\frac{\partial L}{\partial W_{fp}^{l,t}} = W_{fp}^{l,t} - \eta\,\frac{\partial L}{\partial W_b^{l,t}}\,\frac{\partial W_b^{l,t}}{\partial W_{fp}^{l,t}}, \tag{6}$$
$$\mathrm{clip}(x, -1, 1) = \min(\max(x, -1), 1), \tag{7}$$
where $L$ is the loss function of the model, $\eta$ is the learning rate, and $W_b$ represents the binary weight matrix. By replacing the sign function with the 1-bit clip function, $\frac{\partial W_b^{l,t}(i,j)}{\partial W_{fp}^{l,t}(i,j)}$ equals 1 if $W_{fp}^{l,t}(i,j) \in [-1, 1]$ and 0 otherwise. Note that Eq 6 is a basic training rule; other gradient-descent-based update rules (e.g., Adam) can be used as well.
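This forward/backward pair (binarize in the forward pass, pass the gradient through the clip region in the backward pass) can be sketched in NumPy. The helper names, the toy weights, and the learning rate are ours, for illustration only.

```python
import numpy as np

def binarize(w_fp):
    # forward: sign function, with sign(0) mapped to -1 as in the text
    return np.where(w_fp > 0, 1.0, -1.0)

def ste_grad(w_fp, grad_binary):
    # backward: the clip approximation passes the gradient only where the
    # latent full-precision weight lies in [-1, 1], and zeroes it elsewhere
    return grad_binary * ((w_fp >= -1.0) & (w_fp <= 1.0))

# one SGD step on the latent full-precision weights (lr = 0.5)
w = np.array([-1.5, -0.3, 0.0, 0.8, 2.0])
g = np.full_like(w, 0.1)            # hypothetical upstream gradient
w_next = w - 0.5 * ste_grad(w, g)
```

Note that the weights outside [−1, 1] (here −1.5 and 2.0) receive no update, exactly as the clip-based derivative prescribes.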
Absolute Binary Weight Regularization (ABWR)
We propose Absolute Binary Weight Regularization (ABWR), a regularization method that reduces the precision loss caused by applying 1-bit quantization. Our goal is to learn a new weight distribution that fits 1-bit value quantization. Fig 4 shows the full-precision weight distributions of the three binarized FFNs in SensiMix (3+3) before and after applying ABWR on the QQP task. Note that 90% of the weights of pre-trained BERT lie in the range [−0.07, 0.07], far from ±1. This is one of the main causes of the large drop in accuracy when 1-bit quantization is applied. To tackle this, we introduce a regularization term $L_R$ that pushes the absolute values of the full-precision weights close to 1 during training:
$$L_R = \frac{1}{2}\left(\left|w_{fp}\right| - 1\right)^2, \tag{8}$$
and the overall loss function is as follows:
$$L = L_B + \gamma \times L_R, \tag{9}$$
where L_{B} denotes the original objective function of BERT, and γ denotes the regularization coefficient.
Fig 4. Full-precision weight distributions of the three binarized FFNs in SensiMix (3+3) before and after applying ABWR on the QQP task.
The intuition of ABWR is to train the absolute values of the weights to be close to 1 in the first place, thereby minimizing the drop in accuracy when 1-bit quantization is applied. Adding this regularizer to the loss function during training reduces the precision loss caused by 1-bit quantization.
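The regularized loss can be sketched as follows, assuming (as an interpretation on our part) that the regularizer is summed over all binarized weights; the γ value is illustrative, not the paper's setting.

```python
import numpy as np

def abwr_regularizer(w_fp):
    """L_R = 0.5 * (|w| - 1)^2, summed over the binarized weights."""
    return 0.5 * np.sum((np.abs(w_fp) - 1.0) ** 2)

def total_loss(task_loss, w_fp, gamma=1e-4):
    """L = L_B + gamma * L_R (gamma is an illustrative coefficient)."""
    return task_loss + gamma * abwr_regularizer(w_fp)
```

Weights already at ±1 contribute zero penalty, so the regularizer only pushes weights whose magnitude deviates from 1.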
Prioritized Training (PT)
We propose a new training method called Prioritized Training (PT) to overcome the difficulty of training the model with binary weights. Conventional 1-bit quantization methods binarize both input and weights from the beginning of training, which means the full-precision weights never see any binary input features before 1-bit quantization is applied. Hence, conventional methods do not give the model a good initial state for 1-bit quantization, because the initial full-precision weights have no knowledge of binary input. This is one reason why 1-bit quantized models are difficult to train. To tackle this problem, PT keeps the input binarized as in the conventional methods, but first trains the weights in FP32 precision and only then applies 1-bit quantization. The intuition behind PT is to provide a better initial state for 1-bit quantization by letting the full-precision weights learn binary input features before both input and weights are binarized. Thanks to this additional training, the model performs better than with the traditional methods.
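The two-phase idea can be sketched as a training schedule. The epoch split is an assumption for illustration; the paper does not fix the length of the full-precision phase.

```python
def prioritized_training_phases(total_epochs, fp_weight_epochs):
    """Prioritized Training schedule: activations are binarized from the
    start, while weights stay full-precision for the first
    fp_weight_epochs epochs and are binarized afterwards."""
    for epoch in range(total_epochs):
        yield {
            "epoch": epoch,
            "binarize_activations": True,
            "binarize_weights": epoch >= fp_weight_epochs,
        }
```

For example, with 4 total epochs and 2 full-precision-weight epochs, the weights are binarized only in the last two epochs, while the input is binary throughout.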
Inverse Layer-wise Fine-tuning (ILF)
We propose Inverse Layer-wise Fine-tuning (ILF) to overcome the difficulty of training a SensiMix model that applies 1-bit value quantization to a large proportion of the model at once. Fig 5 shows the process of ILF. We observe that models that deploy half or more of their layers as MP encoders at once are hard to train. To tackle this problem, ILF first fine-tunes the model with a single MP encoder layer, letting it converge. The first MP encoder is deployed at the top of the model (close to the output layer). After one training epoch, we iteratively perform the following procedure for k = 1, 2, …, m − 1: given the model with k MP encoder layers, we add one more MP encoder layer below the existing MP encoder layers and fine-tune the model with k + 1 MP encoder layers for one epoch. We stop when we have m MP encoder layers in total. Note that we do not freeze any parameters of the model in this process. ILF enables more effective training of the final model through this gradual quantization approach.
Fig 5. Given the model with k MP encoder layers, ILF adds one more MP encoder layer below them and fine-tunes the model with k + 1 MP encoder layers.
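The resulting per-epoch schedule can be sketched as follows; the 1-based layer indexing (1 = bottom layer) and the function name are our own conventions.

```python
def ilf_schedule(n_layers, m_mp):
    """Yield, for each training epoch, the layer indices that are
    MP-quantized: MP encoders start at the top layer and grow downward,
    one additional layer per epoch, until m_mp layers are reached."""
    for k in range(1, m_mp + 1):
        yield list(range(n_layers - k + 1, n_layers + 1))
```

For SensiMix (3+3), i.e. n = 6 layers with m = 3 MP encoders, the schedule is [6], then [5, 6], then [4, 5, 6].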
Model inference
We describe our implementation details for fast model inference.
Inference for 8-bit
For 8-bit index quantization, we adopt FP16 GEMM to replace FP32 GEMM. FP16 GEMM is a conventional GEMM where all matrix elements are in FP16 precision. After training, we store the 8-bit index-quantized weight matrix and the minimum and maximum weights of each weight matrix. In the inference phase, we first dequantize the saved INT8 indices back to FP32 weights following Eq (2). We then convert all the FP32 weights and activations to FP16 precision following the IEEE 754 standard [42]. After that, FP16 weights and activations are used instead of the FP32 versions throughout the entire process. We also apply FP16 precision to the bias and LayerNorm layers to achieve fast inference.
Inference for 1-bit
For 1-bit value quantization, we apply XNOR-Count GEMM instead of the traditional GEMM to achieve an even faster inference speed. XNOR-Count GEMM is a fast matrix multiplication method for binary matrices, widely used in the inference phase of binarized neural networks. We explain how XNOR-Count GEMM works with an example. Consider multiplying two binary matrices (all elements are +1 or −1) A and B. To compute the (i, j)^{th} element of the output matrix, XNOR-Count GEMM first performs the XNOR operation on the i^{th} row vector of A and the j^{th} column vector of B, yielding an output vector. It then counts the number of +1s in the output vector, multiplies the count by 2, and subtracts the dimension of the i^{th} row vector of A to get a scalar value. This value equals the (i, j)^{th} element of the output matrix. The mathematical formula of XNOR-Count GEMM is as follows:
$$O_{ij} = 2 \times \mathrm{count}\!\left(\mathrm{xnor}(A_i, B_j)\right) - \dim(A_i), \tag{10}$$
where O_{ij} represents the (i, j)^{th} element of the output matrix, A_{i} represents the i^{th} row vector of the binary matrix A, B_{j} represents the j^{th} column vector of the binary matrix B, and dim(A_{i}) represents the dimension of the vector A_{i}.
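As a sanity check, this element-wise identity can be written directly in NumPy (a reference implementation for clarity, not the fast bit-packed version; the function name is ours). For ±1 entries, XNOR yields +1 exactly where the two vectors agree, so the dot product equals 2 × (number of agreements) − dimension.

```python
import numpy as np

def xnor_count_gemm(A, B):
    """Multiply +-1 matrices A (p x m) and B (m x q) via the
    XNOR-Count identity: O_ij = 2 * (#agreements) - m."""
    m = A.shape[1]
    # agree[i, j] = number of positions where row A_i equals column B_j
    agree = (A[:, :, None] == B[None, :, :]).sum(axis=1)
    return 2 * agree - m
```

The result matches the ordinary matrix product `A @ B` for any ±1 matrices.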
However, it is difficult to apply this method directly in practice because modern deep learning frameworks do not support storing individual values in a 1-bit numerical format. Therefore, in the implementation, we encode each binary matrix into an FP32 matrix where each FP32 element stores 32 binary values. We row-encode the left matrix and column-encode the right matrix of the multiplication, and elements with values +1 and −1 are encoded as bits 1 and 0, respectively. We then apply XNOR-Count GEMM to these FP32 matrices. For example, suppose we multiply two binary matrices A and B with sizes (i, m) and (m, j), respectively. We first encode them into FP32 matrices with sizes (i, m/32) and (m/32, j), where each FP32 number holds 32 encoded 1-bit values. We then apply XNOR-Count GEMM to the encoded matrices with bitwise operations as follows:
$$O_{ij} = 2 \times \sum_{k=1}^{m/32}\left[\mathrm{popcount}\!\left(\mathrm{xnor}(A_{ik}, B_{kj})\right)\right] - 32 \times \dim(A_i), \tag{11}$$
where popcount represents the bitwise count operation. XNOR and popcount operations have a much lower computational cost than the traditional GEMM, resulting in a faster inference speed.
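The bit-packing idea can be sketched in NumPy using uint8 bit-planes and XOR in place of the 32-value FP32 packing and XNOR described above (the popcount of an XNOR equals the dimension minus the popcount of the XOR); the helper names are ours, and this is not the authors' CUDA kernel.

```python
import numpy as np

def pack_pm1(M):
    """Encode a +-1 matrix row-wise into packed uint8 bit-planes:
    +1 -> bit 1, -1 -> bit 0 (trailing pad bits are zero)."""
    return np.packbits((M > 0).astype(np.uint8), axis=1)

def xnor_popcount_gemm(A, B):
    """Multiply +-1 matrices via packed bitwise ops:
    O_ij = m - 2 * popcount(xor(packed A_i, packed B_j))."""
    m = A.shape[1]
    pa, pb = pack_pm1(A), pack_pm1(B.T)        # rows of A, columns of B
    out = np.empty((A.shape[0], B.shape[1]), dtype=np.int64)
    for i in range(pa.shape[0]):
        x = np.bitwise_xor(pa[i], pb)          # XOR bytes against every column
        # popcount of mismatches; count=m trims the zero pad bits
        mism = np.unpackbits(x, axis=1, count=m).sum(axis=1).astype(np.int64)
        out[i] = m - 2 * mism
    return out
```

Again the output matches the ordinary product `A @ B`, but each 8-element chunk of a row is compared with a single byte-level XOR plus a popcount.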
Theorem 1 (FLOPs of SensiMix). Let $F_B$ be the FLOPs of the full-precision (32-bit) BERT-base model (12 layers). The FLOPs $F_S$ of SensiMix with $n$ layers and $m$ MP encoders is given by $\Theta\!\left(\left(\frac{1}{24}n - \frac{5}{192}m\right)F_B\right)$.
Proof. We omit the bias and LayerNorm layers and the final classifier since they occupy a negligible part of the model. The Self-Attention layer occupies 1/3 of an encoder and the FFN occupies the remaining 2/3. A layer with 8-bit index quantization uses FP16 GEMM for inference, which has half the FLOPs of FP32 GEMM. A layer with 1-bit value quantization uses XNOR-Count GEMM for inference, which has 1/32 of the FLOPs (both input and weights are 1-bit) when using 32-bit popcount operations. Then,
$$F_S = \left((n-m) \times \frac{1}{2} \times \frac{F_B}{12}\right) + \left(\frac{1}{3}m \times \frac{1}{2} \times \frac{F_B}{12}\right) + \left(\frac{2}{3}m \times \frac{1}{32} \times \frac{F_B}{12}\right).$$
Thus,
$$F_S = \left(\frac{1}{24}n - \frac{5}{192}m\right)F_B.$$
We observe that the FLOPs of SensiMix is much less than that of BERT, and decreases as the number m of MP encoders increases. The theoretical speedup of SensiMix (3+3) over BERT-base is $\left(\frac{6}{24} - \frac{5 \times 3}{192}\right)^{-1} = \frac{64}{11} \approx 5.82$. Tables 2 and 3 show that SensiMix (3+3) gives 5.05 times faster inference than BERT-base, which is close to the theoretical speedup.
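The ratio from Theorem 1 can be checked with exact arithmetic (the helper name is ours, for illustration):

```python
from fractions import Fraction

def sensimix_flops_ratio(n, m):
    """F_S / F_B = n/24 - 5m/192 for n layers and m MP encoders."""
    return Fraction(n, 24) - Fraction(5 * m, 192)

ratio = sensimix_flops_ratio(6, 3)   # SensiMix (3+3): 11/64
speedup = 1 / ratio                  # 64/11, about 5.82x over BERT-base
```

As a consistency check, n = 12 and m = 0 (all layers 8-bit) gives a ratio of 1/2, matching the fact that FP16 GEMM alone halves the FLOPs.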
Table 2. Overall performance of SensiMix compared to the competitors. SensiMix achieves the smallest model size and the shortest inference time while maintaining similar accuracy.

| Methods | Model Size (%) | Inference Time (sec) | QQP (Acc) | QNLI (Acc) | SST-2 (Acc) | MRPC (F1) | Avg. |
|---|---|---|---|---|---|---|---|
| BERT-base | 100.0 | 480 | 90.8±0.3 | 90.5±0.3 | 92.1±0.2 | 90.2±0.5 | 90.9 |
| BERT pruning 70% | 30.0 | - | 88.6±0.5 | 86.4±0.2 | 89.5±0.3 | 87.3±0.1 | 88.0 |
| BERT pruning 80% | 20.0 | - | 86.9±0.2 | 82.3±0.6 | 87.1±0.6 | 84.8±0.4 | 85.3 |
| BERT pruning 90% | 10.0 | - | 81.1±0.2 | 72.5±0.7 | 80.3±0.3 | 83.1±0.3 | 79.3 |
| DistilBERT | 61.2 | 248 | 89.9±0.4 | 88.5±0.4 | 91.3±0.3 | 89.4±0.3 | 89.8 |
| ALBERT-large | 16.2 | 1401 | 90.9±0.4 | 90.1±0.3 | 91.8±0.1 | 90.1±0.1 | 90.7 |
| Q8BERT | 25.0 | - | 90.2±0.0 | 88.9±0.5 | 91.5±0.3 | 89.6±0.2 | 90.1 |
| SensiMix (3+6) | 17.3 | 140 | 90.4±0.3 | 89.0±0.5 | 92.0±0.1 | 90.0±0.1 | 90.4 |
| SensiMix (3+3) | 12.5 | 95 | 89.6±0.2 | 86.5±0.2 | 90.3±0.1 | 87.2±0.2 | 88.4 |
Table 3. Inference time of SensiMix compared to the competitors. SensiMix achieves the fastest inference speed among the competitors.

| Methods | QQP (sec) | QNLI (sec) | SST-2 (sec) | MRPC (sec) | Avg. (sec) |
|---|---|---|---|---|---|
| BERT-base | 1362±20 | 320±5 | 198±5 | 9±1 | 480 |
| DistilBERT | 712±8 | 168±4 | 101±3 | 6±1 | 248 |
| ALBERT-large | 4087±30 | 900±18 | 591±15 | 24±2 | 1401 |
| SensiMix (3+6) | 408±12 | 89±5 | 61±4 | 3±1 | 140 |
| SensiMix (3+3) | 272±10 | 66±3 | 40±3 | 2±1 | 95 |
Experiments
We run experiments to verify the effectiveness of our proposed method. Our goal is to answer the following questions.
Q1. Overall performance. How does SensiMix perform compared to other methods in terms of accuracy, model size, and inference speed?
Q2. Effectiveness of 1-bit training methods. How do our ABWR, PT, and ILF affect the accuracy of SensiMix?
Q3. Sensitivity. Which part of the model is more sensitive in an encoder, and which encoder is more sensitive in the model?
Experimental settings
Dataset
We assess the performance of SensiMix on the General Language Understanding Evaluation (GLUE) benchmark [24]. GLUE is an evaluation suite widely used in NLP; it covers a wide range of tasks, including single-sentence, paraphrase, and inference tasks, to evaluate a model's overall performance as a language model. We choose the QQP, QNLI, SST-2, and MRPC tasks for our evaluation because they cover all three of these areas of the GLUE benchmark. QQP is a sentence-pair classification task that aims to indicate whether a pair of questions on the Quora website are duplicates. QNLI is a sentence-pair classification task for predicting whether a question-answer pair is an entailment. SST-2 is a single-sentence classification task with two labels, which aims to predict the sentiment of movie reviews. MRPC is a sentence-pair classification task, which aims to predict the semantic equivalence of each pair of sentences. We report scores on the development sets of these tasks after fine-tuning on each task.
Competitors
We compare the performance of our proposed SensiMix to the following competitors.
BERT pruning. This method compresses BERT by applying a pruning technique. The method prunes small weights in the embedding and encoder layers of BERT.
DistilBERT. This method uses half of the layers of the BERT-base model and applies knowledge distillation during pre-training. DistilBERT consists of an embedding layer and six Transformer encoder layers with the same dimensions as BERT-base.
ALBERT. This method applies parameter sharing and embedding factorization to compress BERT. Specifically, the model uses cross-layer parameter sharing on the Transformer encoder layers and applies matrix factorization to the original embedding layer of BERT.
Q8BERT. This method applies 8bit quantization to BERT. It quantizes both input and weights to INT8 numerical format and multiplies them using simulated INT8 GEMM.
Model architecture
The architecture of SensiMix is the same as that of BERT except for the type and number of encoder layers. SensiMix consists of two types of encoders. One is the 8-bit encoder, which applies 8-bit index quantization to all parameters in the encoder layer except the bias and LayerNorm layers. The other is the mixed precision (MP) encoder, which applies 8-bit index quantization to the Self-Attention layer and 1-bit value quantization to the FFN. In our experiments, we choose SensiMix (3+6) and SensiMix (3+3) as our main models, considering model size, accuracy, and inference time together. SensiMix (3+6) denotes SensiMix with nine layers, three of which are MP encoders and the other six 8-bit encoders. SensiMix (3+3) has six layers, half of which are 8-bit encoders and the other half MP encoders. The first model achieves outstanding accuracy with a relatively small model size compared to the competitors. The second model has a better compression rate and inference speed, with slightly lower but still comparable accuracy.
Model training
We train for 6 epochs on all tasks (QQP, QNLI, SST-2, and MRPC), and set the initial learning rate to 3e-5, the maximum sequence length to 128, and the training batch size to 16.
Model inference
We set the maximum sequence length to 128 and the batch size to 128. We use a single NVIDIA RTX 2080 Ti GPU for all experiments. We conduct the experiments in a GPU environment because it is one of the most commonly used hardware settings in recent deep learning research. The SensiMix model is implemented in PyTorch, and XNOR GEMM is implemented as a PyTorch CUDA extension. We initialize SensiMix with a pre-trained BERT model and conduct experiments on the four GLUE downstream tasks mentioned above. We run each experiment five times and report the average and standard deviation.
Overall Performance (Q1)
We summarize the accuracy, model size, and inference speed of SensiMix along with the competitors. Table 2 and Fig 6 show the overall performance of our proposed SensiMix compared to the competitors. We analyze the experimental results from two perspectives: (1) accuracy vs. model size and (2) inference speed.
Fig 6. Accuracy vs. model size for the QQP, QNLI, SST-2, and MRPC tasks.
SensiMix shows the best trade-off between accuracy and model size. The two points of SensiMix represent SensiMix (3+3) and SensiMix (3+6). The three points of BERT pruning represent pruning ratios of 90%, 80%, and 70%.
Accuracy vs. model size
We examine the relationship between accuracy and model size for SensiMix and the competitors. Fig 6 shows the results for the four GLUE tasks, where the x-axis and y-axis denote the model size and the accuracy (F1 score for MRPC), respectively. The model size percentage (%) is calculated with respect to BERT-base, i.e., BERT-base has 100% model size. SensiMix achieves the best compression rate while maintaining comparable accuracy. Specifically, BERT-base shows the best average accuracy, 90.9, but it has the largest model size. By contrast, our SensiMix (3+6) achieves a very similar average accuracy of 90.4 with an up to 6 times smaller model size than BERT-base. The pruning method shows decent accuracy at 20% and 30% of the model size, but suffers a significant accuracy drop when the model size shrinks to 10%. Our SensiMix (3+3) shows much better accuracy than the pruning method at a similar model size. DistilBERT gives good accuracy on the tasks, but it still has a large model size, 61.2% of BERT-base. ALBERT-large also shows good accuracy with a small model size, but it has the significant disadvantage that its inference time is much longer than the competitors', as shown in Table 2.
Compared to Q8BERT, SensiMix gains benefits in both accuracy and model size. Q8BERT shows a decent trade-off between accuracy and model size, with 25% of the original model size and 90.2 average accuracy. However, our SensiMix (3+6) reduces the model size further, to 17.3% of the original model, while achieving an even higher average accuracy of 90.4. Besides, our SensiMix (3+3) has half the model size of Q8BERT with only 0.6 points of average accuracy degradation. Overall, SensiMix gives the best trade-off between accuracy and model size.
Inference speed
We assess the inference speed of SensiMix and its competitors in Table 3. We set BERT-base as the baseline model. We do not include the BERT pruning method in the comparison because pruning requires special hardware to accelerate inference. We also do not consider Q8BERT as a competitor in inference speed because Q8BERT only simulates INT8 inference using fake quantization. According to [15], Q8BERT can be accelerated on specialized hardware, whereas SensiMix is designed for inference in the NVIDIA GPU environment.
In Table 3, note that BERT-base takes a long inference time of 480 seconds, while SensiMix shows the best inference speed among the competitors. ALBERT-large shows about 3× longer inference time than BERT-base, since it still consists of 24 parameter-shared encoder layers. DistilBERT achieves a noticeable improvement in inference time, 248 seconds on average, because it has only half the encoder layers of BERT-base. SensiMix (3+3) takes only 95 seconds on average to run inference on all the tasks' training sets, about 5× faster than BERT-base, with only a marginal drop in accuracy.
Effectiveness of 1-bit training methods (Q2)
We demonstrate through experiments the effectiveness of our three proposed 1-bit quantization-aware training methods: ABWR, PT, and ILF. We choose SensiMix (3+3) as the representative model and evaluate each method by comparing SensiMix (3+3) with and without it. The experimental results are summarized in Table 4.
Table 4. Effectiveness of the three proposed 1-bit quantization-aware training methods. ABWR, PT, and ILF improve the performance (average score on the GLUE tasks) of SensiMix by 1.1%, 1.4%, and 1.4%, respectively.

| Methods | Avg. | QQP (Acc) | QNLI (Acc) | SST-2 (Acc) | MRPC (F1) |
|---|---|---|---|---|---|
| SensiMix | 88.4 | 89.6±0.2 | 86.5±0.2 | 90.3±0.1 | 87.2±0.2 |
| SensiMix without ABWR | 87.3 | 89.2±0.2 | 85.8±0.1 | 88.8±0.3 | 85.4±0.4 |
| SensiMix without PT | 87.0 | 89.1±0.4 | 85.1±0.4 | 88.7±0.3 | 85.2±0.4 |
| SensiMix without ILF | 87.0 | 89.0±0.3 | 85.2±0.1 | 88.8±0.2 | 85.1±0.5 |
Effectiveness of Absolute Binary Weight Regularization (ABWR)
Introducing ABWR in the loss function increases the accuracy by 0.4%, 0.7%, 1.5%, and 1.8% on the QQP, QNLI, SST-2, and MRPC tasks, respectively, compared to omitting it, an average improvement of 1.1%. These consistent results show that ABWR minimizes the accuracy drop when 1-bit quantization is applied to the model.
Effectiveness of Prioritized Training (PT)
Applying PT in the training phase improves the accuracy of SensiMix on the QQP, QNLI, SST-2, and MRPC tasks by 0.5%, 1.4%, 1.6%, and 2.0%, respectively, an average improvement of 1.4% across the four GLUE tasks, which demonstrates the effectiveness of PT.
Effectiveness of Inverse Layer-wise Fine-tuning (ILF)
ILF enhances the accuracy of SensiMix on the QQP, QNLI, SST-2, and MRPC tasks by 0.6%, 1.3%, 1.5%, and 2.1%, respectively. Overall, ILF increases the accuracy of the model on all tasks by an average of 1.4%, which validates its effectiveness.
Sensitivity analysis (Q3)
We mentioned that different parts of BERT have different degrees of sensitivity to quantization. We experimentally examine the sensitivity of different parts of the model.
Sensitivity of the Self-Attention layer and FFN
We compare the sensitivity of the Self-Attention layer and the FFN in an encoder. For this purpose, we compare two models. The first model, which is SensiMix, applies 8-bit index quantization to the Self-Attention layer and 1-bit quantization to the FFN. The second model applies 1-bit quantization to the Self-Attention layer and 8-bit index quantization to the FFN, the opposite of the original SensiMix. SensiMix (3+3) is chosen as the representative model for this study. Table 5 shows that SensiMix with 1-bit Self-Attention layers has lower accuracy, by an average of 3%, than SensiMix with 1-bit FFNs. We conclude that the Self-Attention layer is more sensitive to 1-bit quantization than the FFN in an encoder.
Table 5. Comparison of the sensitivity of the Self-Attention layer and the FFN in BERT. The result indicates that the Self-Attention (SA) layer is more sensitive than the FFN.

| Methods | Avg. | QQP (Acc) | QNLI (Acc) | SST-2 (Acc) | MRPC (F1) |
|---|---|---|---|---|---|
| SensiMix (1-bit FFN) | 88.4 | 89.6±0.2 | 86.5±0.2 | 90.3±0.1 | 87.2±0.2 |
| SensiMix (1-bit SA layer) | 85.4 | 86.5±0.2 | 84.5±0.3 | 87.5±0.2 | 83.0±0.1 |
Sensitivity of different encoder layers
We compare the sensitivity of different encoder layers in BERT. For this purpose, we compare three models: SensiMix, SensiMix-L, and SensiMix-E. Using the (3+3) configuration with three 8-bit encoders and three MP encoders, SensiMix applies MP encoders to the upper three layers (close to the output layer), SensiMix-L applies MP encoders to the lower three layers, and SensiMix-E applies MP encoders to the even-numbered layers 2, 4, and 6. Table 6 shows that the original SensiMix (3+3), which deploys MP encoders in the upper three layers, achieves the best accuracy. This demonstrates that the encoder layers near the output layer are relatively less sensitive to quantization than the other layers.
Table 6. Comparison of the sensitivity of different encoder layers. SensiMix applies MP encoders to the upper three layers, SensiMix-L to the lower three layers, and SensiMix-E to the even-numbered layers 2, 4, and 6. SensiMix shows the best accuracy.

| Methods | Avg. | QQP (Acc) | QNLI (Acc) | SST-2 (Acc) | MRPC (F1) |
|---|---|---|---|---|---|
| SensiMix | 88.4 | 89.6±0.2 | 86.5±0.2 | 90.3±0.1 | 87.2±0.2 |
| SensiMix-L | 85.1 | 86.7±0.2 | 84.2±0.4 | 87.0±0.2 | 82.3±0.3 |
| SensiMix-E | 85.4 | 87.5±0.2 | 84.2±0.1 | 87.3±0.1 | 82.5±0.3 |
Conclusion
We propose SensiMix, a novel sensitivity-aware mixed precision quantization method for BERT compression. SensiMix quantizes the sensitive and insensitive parts of the model to 8 bits and 1 bit, respectively, to maximize the compression rate while minimizing the accuracy drop. For the 8-bit quantization, we exploit 8-bit index quantization, which uses 8-bit indices to efficiently represent all weights of each layer, reducing the model size while keeping accuracy similar to BERT. For the 1-bit quantization, we propose three novel 1-bit training methods, Absolute Binary Weight Regularization (ABWR), Prioritized Training (PT), and Inverse Layer-wise Fine-tuning (ILF), to minimize the accuracy drop. For fast inference, we apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to the 8-bit and 1-bit parts of the model, respectively. Experiments show that SensiMix provides the best compression rate and the fastest inference speed while maintaining accuracy similar to BERT on four GLUE benchmark tasks. Future work includes deploying SensiMix on various small devices, comparing it to other BERT quantization methods, designing advanced BERT quantization methods, and exploring how to effectively combine genetic algorithms with SensiMix to achieve better compression, higher accuracy, and faster inference.
References
1. Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019. Association for Computational Linguistics; 2019. p. 4171–4186.
2. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach.
3. Sun Y, Wang S, Li Y, Feng S, Tian H, Wu H, et al. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. In: AAAI 2020, New York, NY, USA, February 7–12, 2020; 2020. p. 8968–8975.
4. Kovaleva O, Romanov A, Rogers A, Rumshisky A. Revealing the Dark Secrets of BERT. In: EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics; 2019. p. 4364–4373.
5. Han S, Pool J, Tran J, Dally WJ. Learning both Weights and Connections for Efficient Neural Network. In: NIPS 2015, December 7–12, 2015, Montreal, Quebec, Canada; 2015. p. 1135–1143.
6. Han S, Mao H, Dally WJ. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. In: ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings; 2016.
7. Hinton GE, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network.
8. Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard AG, et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In: CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018; 2018. p. 2704–2713.
9. Jang JG, Quan C, Lee HD, Kang U. FALCON: Lightweight and Accurate Convolution; 2020.
10. Yoo J, Cho M, Kim T, Kang U. Knowledge Extraction with No Observable Data. In: NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada; 2019. p. 2701–2710.
11. Kim J, Jung J, Kang U. Compressing deep graph convolution network with multi-staged knowledge distillation.
12. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
13. Gordon MA, Duh K, Andrews N. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. In: Proceedings of the 5th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2020, Online, July 9, 2020; 2020. p. 143–155.
14. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In: ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net; 2020.
15. Zafrir O, Boudoukh G, Izsak P, Wasserblat M. Q8BERT: Quantized 8Bit BERT.
16. Lee HD, Lee S, Kang U. AUBER: Automated BERT regularization.
17. Cho I, Kang U. Pea-KD: Parameter-efficient and Accurate Knowledge Distillation on BERT; 2020.
18. Hu P, Dong L, Zhan Y. BERT Pre-training Acceleration Algorithm Based on MASK Mechanism. In: Journal of Physics: Conference Series. vol. 2025. IOP Publishing; 2021. p. 012038.
19. Liu Z, Li G, Cheng J. Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing. arXiv preprint arXiv:2103.02800. 2021.
20. Zhou X, Ma R, Gui T, Tan Y, Zhang Q, Huang X. Plug-Tagger: A Pluggable Sequence Labeling Framework Using Language Models. arXiv preprint arXiv:2110.07331. 2021.
21. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. In: NIPS 2017, 4–9 December 2017, Long Beach, CA, USA; 2017. p. 5998–6008.
22. He Y, Lin J, Liu Z, Wang H, Li L, Han S. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In: ECCV 2018, Munich, Germany, September 8–14, 2018; 2018.
23. Liu Z, Sun M, Zhou T, Huang G, Darrell T. Rethinking the Value of Network Pruning. In: ICLR 2019, New Orleans, LA, USA, May 6–9, 2019; 2019.
24. Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In: BlackboxNLP@EMNLP. Association for Computational Linguistics; 2018. p. 353–355.
25. Han S, Liu X, Mao H, Pu J, Pedram A, Horowitz MA, et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In: ISCA 2016, Seoul, South Korea, June 18–22, 2016. IEEE Computer Society; 2016. p. 243–254.
26. Ullrich K, Meeds E, Welling M. Soft Weight-Sharing for Neural Network Compression. In: ICLR 2017, Toulon, France, April 24–26, 2017; 2017.
27. Romero A, Ballas N, Kahou SE, Chassang A, Gatta C, Bengio Y. FitNets: Hints for Thin Deep Nets. In: ICLR 2015, San Diego, CA, USA, May 7–9, 2015; 2015.
28. Liu X, He P, Chen W, Gao J. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding.
29. Cheng Y, Wang D, Zhou P, Zhang T. A Survey of Model Compression and Acceleration for Deep Neural Networks.
30. Zhu C, Han S, Mao H, Dally WJ. Trained Ternary Quantization. In: ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings; 2017.
31. Banner R, Nahshan Y, Soudry D. Post training 4-bit quantization of convolutional networks for rapid-deployment. In: NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada; 2019. p. 7948–7956.
32. Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y. Binarized Neural Networks. In: NIPS 2016, December 5–10, 2016, Barcelona, Spain; 2016. p. 4107–4115.
33. Jin J, Liang C, Wu T, Zou L, Gan Z. KDLSQ-BERT: A Quantized BERT Combining Knowledge Distillation with Learned Step Size Quantization. arXiv preprint arXiv:2101.05938. 2021.
34. Esser SK, McKinstry JL, Bablani D, Appuswamy R, Modha DS. Learned step size quantization. arXiv preprint arXiv:1902.08153. 2019.
35. Zhao Z, Liu Y, Chen L, Liu Q, Ma R, Yu K. An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models. In: CCF International Conference on Natural Language Processing and Chinese Computing. Springer; 2020. p. 359–371.
36. Kim S, Gholami A, Yao Z, Mahoney MW, Keutzer K. I-BERT: Integer-only BERT Quantization. arXiv preprint arXiv:2101.01321. 2021.
37. Lee D, Cho M, Lee S, Song J, Choi C. Data-free mixed-precision quantization using novel sensitivity metric. arXiv preprint arXiv:2103.10051. 2021.
38. Yuan Y, Chen C, Hu X, Peng S. EvoQ: Mixed precision quantization of DNNs via sensitivity guided evolutionary search. In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE; 2020. p. 1–8.
39. Jawahar G, Sagot B, Seddah D. What Does BERT Learn about the Structure of Language? In: ACL 2019, Florence, Italy, July 28 – August 2, 2019. Association for Computational Linguistics; 2019. p. 3651–3657. Available from: https://doi.org/10.18653/v1/p19-1356.
40. Clark K, Khandelwal U, Levy O, Manning CD. What Does BERT Look At? An Analysis of BERT's Attention.
41. Lin Y, Tan YC, Frank R. Open Sesame: Getting inside BERT's Linguistic Knowledge. In: Proceedings of the 2019 ACL Workshop BlackboxNLP. Florence, Italy: Association for Computational Linguistics; 2019. p. 241–253.
42. IEEE. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008). 2019. p. 1–84. https://doi.org/10.1109/IEEESTD.2019.8766229
10.1371/journal.pone.0265621.r001
Decision Letter 0
Sergio Consoli, Academic Editor
Submission Version 0
5 Oct 2021
PONE-D-21-27639
SensiMix: Sensitivity-Aware 8-bit Index & 1-bit Value Mixed Precision Quantization for BERT Compression
PLOS ONE
Dear Dr. Kang,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
Overall, this is a rich and interesting article that is well written and clearly described. The authors should further improve it by following all the minor comments provided by the three reviewers to meet PLOS ONE publication criteria.
Please take carefully into account the comments of all the referees for improving the manuscript to meet the required standards by PLOS ONE before resubmitting it to the journal.
Please submit your revised manuscript by Nov 19 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A markedup copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=author-letters&utm_campaign=protocols.
We look forward to receiving your revised manuscript.
Kind regards,
Sergio Consoli
Academic Editor
PLOS ONE
Journal Requirements:
1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at
https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and
2. Thank you for stating the following financial disclosure:
[This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2020-0-00894, Flexible and Efficient Model Compression Method for Various Applications and Environments, and No.2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test). The Institute of Engineering Research and ICT at Seoul National University provided research facilities for this work.]
Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."
If this statement is not correct you must amend it as needed.
Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.
3. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Table 2 in your text; if accepted, production will need this reference to link the reader to the Table.
4. Please include a copy of Table ?? which you refer to in your text on pages 13 and 14.
5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
**********
2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
**********
3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: No
**********
4. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: No
**********
5. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: The research presented here is very interesting, and I'm excited to see such nice size and speed gains through smart use of quantization. I feel that the manuscript is mostly clear and concise; however, there are a few areas I would like to see clarified or elaborated on.
I'm confused by the terms "8-bit index" and "1-bit value". If I am understanding correctly, both 8-bit and 1-bit are quantization strategies on the weight matrices. I'm not sure what the index in "8-bit index" refers to, or why it's different from the value in "1-bit value", other than in bitwidth. If the difference between the two strategies is only the amount of quantization (and which layers you apply them to), I would suggest you either name them similarly to reflect this (8-bit value and 1-bit value quantization), or just refer to them as 8-bit quantization and 1-bit quantization.
In table 1, you state that matrix Wldq is an "8-bit index-dequantized weight matrix of the layer l". However, my understanding is that your dequantization process takes a matrix of INT8s and (with the min and max) regenerates an approximation of the original matrix of 32FPs. Shouldn't this matrix contain 32-bit values? If not, why not?
This also leads to confusion in 8-bit dequantization, where the dequantized values are weights, e.g., -1.28, -0.005, 1.00, and 1.27, but Wldq is said to be an 8-bit matrix. How can you store a value of -1.28 inside an INT8?
Line 312 then describes it as an 8-bit clip function, but if the input to alg. 4 was indeed an 8-bit signed integer, the clip function should have no effect. The only way I can make sense of this is if x is actually an FP32, but that doesn't agree with what you've previously written about Wldq.
Line 268: "8-bit index quantization consists of two steps: quantization and dequantization."
This sentence is very confusing. I understand that both quantization and dequantization are elaborated on later, but I find it confusing to have step 1 of quantization be itself, and step 2 be undoing itself. I think what you're getting at is something like,
"In order to utilize 8-bit quantization, SensiMix is able to quantize 32FP values to 8-bit int values, and dequantize the values back to 32-bit.", but elaboration would make this clearer.
I have some issues with the differentiability of your clipping functions. In the 8-bit clip, if the inputs are FP32 values, values outside of the range (-128, 127) will be clipped to -128 and 127. However, the clip function is undifferentiable at these values. Why is this not a problem? You mention in the 1-bit clip section that most values are near zero, and are unlikely to be near the clipping limits, but I don't have a similar kind of intuition about the 8-bit clipping. I would appreciate if you would elaborate about why both clip functions being undifferentiable at their min and max is not a problem (or if it is, explain).
Figure 3. I'm really not sure what this figure is trying to convey. Shouldn't the weights after ABWR be clustered at -1 and 1? To my understanding of ABWR, the left and right subfigures seem swapped; the values before ABWR should be clustered around 0, and afterwards they should be clustered around -1 and 1. Could you explain why Fig 3 shows what it does?
392: Figure 4 seems to show a layer's FP32 FFN being converted into a 1-bit FFN after 1 epoch, but that doesn't really agree with your description of adding MP layers to the bottom (I assume the bottom of the existing MP encoder layers, but you don't specify this). Which is correct for ILF?

small details
Line 43: you might want to describe 1-bit quantization here, or cite a description. I assume it quantizes values to -1 or 1, but a reader could easily assume it quantizes them to 0 or 1.
"For the 1-bit quantization, we apply standard 1-bit quantization." This sentence adds nothing; either explain what 1-bit quantization is, or remove this sentence.
Line 146: suggest changing "low-bit numbers" to something like "reduced precision numbers" or "smaller bitwidth integers", since "low-bit" is ambiguous, and to my ears "low-bit" refers to the position of a bit, not the width of an integer.
452: a table link is broken.
520: broken table link.
Reviewer #2: The authors present a method for adaptive weight quantization for the BERT model. The results support the novelty of their method, and the paper is well-written and easy to follow. Here are a few comments to help the authors revise the manuscript:
1) Outside the BERT domain there seems to be a body of literature on Mixed Precision Quantization based on the sensitivity of the outputs. e.g. these came up with a simple search: https://ieeexplore.ieee.org/document/9207413, https://arxiv.org/abs/2103.10051
2) There also seems to be more literature on quantization of BERT models e.g. https://arxiv.org/abs/2101.01321, https://arxiv.org/abs/2101.05938, https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15742249.pdf, https://arxiv.org/pdf/2010.07109.pdf which seems to be ignored by the authors.
Could the authors elaborate more on this to stress their novelty?
3) Line 452 and 520, 538 reference to Table is missing.
4) In the questions outlined in Experiments, the section numbering is missing.
Reviewer #3: (1) The term BERT should be expounded before its first usage.
(2) The introduction was well prepared (well done!)
(3) The section on related works should be expanded.
(4) The section "Related Work" should be "Related Works".
(5) The proposed method SENSIMIX should be expanded to include an algorithmic description in the form of a flowchart or pseudocode. This is necessary to see the actual operation of SENSIMIX.
(6) The authors did nice work with the textual description, but the work should be expanded to include diagrammatic illustrations.
(7) Is there any way that a Genetic Algorithm (GA) can be used for model reduction in this work?
(8) Both the 8-bit and MP encoders should be described with a flowchart or pseudocode.
(9) How is the FB32 matrix in line 276 represented? (In terms of a matrix)
(10) Why the use of the method in Equation 6? Is it better than the ADAM method, or SGM?
(11) Overall, this is a very rich article, well written and described. The authors should include more diagrams and include another section on the description of the computational platform (software) used.
**********
6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Reviewer #3: Yes: Oluleye Hezekiah Babatunde
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
10.1371/journal.pone.0265621.r002Author response to Decision Letter 0Submission Version1
15 Nov 2021
1. Response to Reviewer 1
We would like to thank Reviewer 1 for the insightful and detailed feedback. Comments and how we address them in our paper are summarized below.
[Comment 1]
• I'm confused by the terms "8-bit index" and "1-bit value". If I am understanding correctly, both 8-bit and 1-bit are quantization strategies on the weight matrices. I'm not sure what the index in "8-bit index" refers to, or why it's different from the value in "1-bit value", other than in bitwidth. If the difference between the two strategies is only the amount of quantization (and which layers you apply them to), I would suggest you either name them similarly to reflect this (8-bit value and 1-bit value quantization), or just refer to them as 8-bit quantization and 1-bit quantization.
[Response to comment 1]
Q. About the difference between the index quantization and the value quantization.
R. The main difference between the 8-bit index quantization and the 1-bit value quantization, in addition to the bit sizes, is whether there is a dequantization process. The 1-bit value quantization does not have a dequantization process: all weights are quantized to ±1 and propagated to the next layer directly. In contrast, the 8-bit index quantization has a dequantization process where FP32 values are first quantized to INT8 indices (we call them indices because they are not the final values that are propagated to the next layer) and then dequantized back to FP32 values before being propagated to the next layer. In the revised paper, we added a precise description of the difference between the value and the index quantization to lines 343-345.
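The contrast can be made concrete with a small sketch. The snippet below is our illustration, not the paper's code: the function names and the four-weight example are hypothetical, and the 8-bit part assumes the min/max-based mapping into 256 ranges that the response describes.

```python
def one_bit_value_quantize(w):
    # 1-bit value quantization: each weight becomes +1 or -1 and is
    # propagated to the next layer directly -- there is no dequantization.
    return [1.0 if x >= 0 else -1.0 for x in w]

def eight_bit_index_quantize(w):
    # 8-bit index quantization: divide [min, max] into 256 ranges and
    # temporarily store each weight as an INT8 index in [-128, 127].
    lo, hi = min(w), max(w)
    scale = (hi - lo) / 255.0  # assumes hi > lo
    idx = [round((x - lo) / scale) - 128 for x in w]
    return idx, lo, scale

def eight_bit_index_dequantize(idx, lo, scale):
    # Dequantize the INT8 indices back to FP32 approximations before
    # they are propagated to the next layer.
    return [(i + 128) * scale + lo for i in idx]

w = [-1.0, -0.3, 0.2, 0.9]
idx, lo, scale = eight_bit_index_quantize(w)
w_dq = eight_bit_index_dequantize(idx, lo, scale)
# Reconstruction error is bounded by one quantization step.
assert max(abs(a - b) for a, b in zip(w, w_dq)) <= scale
```

The key point of the sketch: the 1-bit path discards magnitudes entirely, while the 8-bit path keeps a per-layer min and max so the indices can be mapped back to close FP32 approximations.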
[Comment 2]
• In table 1, you state that matrix Wldq is an "8-bit index-dequantized weight matrix of the layer l". However, my understanding is that your dequantization process takes a matrix of INT8s and (with the min and max) regenerates an approximation of the original matrix of 32FPs. Shouldn't this matrix contain 32-bit values? If not, why not? This also leads to confusion in 8-bit dequantization, where the dequantized values are weights, e.g., -1.28, -0.005, 1.00, and 1.27, but Wldq is said to be an 8-bit matrix. How can you store a value of -1.28 inside an INT8? Line 312 then describes it as an 8-bit clip function, but if the input to alg. 4 was indeed an 8-bit signed integer, the clip function should have no effect. The only way I can make sense of this is if x is actually an FP32, but that doesn't agree with what you've previously written about Wldq.
[Response to comment 2]
Q. About the definition of the matrix Wldq
R. It is right that Wldq is the dequantized FP32 weight matrix which contains FP32 values; we described this matrix as the "8-bit index-dequantized weight matrix of the layer l", which might mislead readers. In the revised paper, we revised the definitions of Wlq and Wldq in Table 1 to make the statement clearer.
[Comment 3]
• Line 268: "8-bit index quantization consists of two steps: quantization and dequantization." This sentence is very confusing. I understand that both quantization and dequantization are elaborated on later, but I find it confusing to have step 1 of quantization be itself, and step 2 be undoing itself. I think what you're getting at is something like, "In order to utilize 8-bit quantization, SensiMix is able to quantize 32FP values to 8-bit int values, and dequantize the values back to 32-bit.", but elaboration would make this clearer.
[Response to comment 3]
Q. About the elaboration of the process of the 8bit index quantization.
R. It is right that the quantization step quantizes FP32 values to INT8 values (indices), and the dequantization step dequantizes the INT8 values back to FP32 values. In the revised paper, we added a precise elaboration of this part to lines 279-281.
[Comment 4]
• I have some issues with the differentiability of your clipping functions. In the 8-bit clip, if the inputs are FP32 values, values outside of the range (-128, 127) will be clipped to -128 and 127. However, the clip function is undifferentiable at these values. Why is this not a problem? You mention in the 1-bit clip section that most values are near zero, and are unlikely to be near the clipping limits, but I don't have a similar kind of intuition about the 8-bit clipping. I would appreciate if you would elaborate about why both clip functions being undifferentiable at their min and max is not a problem (or if it is, explain).
[Response to comment 4]
Q1. About the intuition of the 8-bit clip function.
R1. We use the 8-bit clip function to replace the 8-bit round function in the training process to solve the zero-gradient problem of the round function. The derivative of the 8-bit round function is zero in almost all ranges, which disables training the model with gradient descent-based methods, while replacing it with the 8-bit clip function enables the model training.
Q2. About values outside of the range (-128, 127).
R2. In the 8-bit index quantization, FP32 weights in a weight matrix are divided into 256 ranges which are temporarily stored as INT8 numbers (indices), and then we use these INT8 numbers, the minimum weight, and the maximum weight to approximate the original FP32 weight matrix (dequantization step). Note that in Equation 1, all original FP32 weight values are mapped into the range [-128, 127], which means all FP32 weights are updated. In the revised paper, we added statements about this in lines 335-337 to make it clearer.
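The zero-gradient problem and the clip surrogate can be illustrated with a scalar sketch. This is our illustration of the general straight-through idea, under the assumption (stated in R2 above) that the forward mapping puts all values inside [-128, 127]; the names clip8 and clip8_grad are hypothetical, not the paper's implementation.

```python
def clip8(x):
    # Clip surrogate used in place of round during training:
    # identity inside [-128, 127], saturated outside.
    return max(-128.0, min(127.0, x))

def clip8_grad(x):
    # (Sub)gradient of the clip surrogate: 1 inside the range, 0 outside.
    # A round function would have gradient 0 almost everywhere, which
    # blocks gradient descent-based training entirely; the clip lets
    # gradients flow for every in-range value.
    return 1.0 if -128.0 <= x <= 127.0 else 0.0

assert clip8(50.4) == 50.4        # in-range values pass through unchanged
assert clip8(300.0) == 127.0      # out-of-range values saturate
assert clip8_grad(50.4) == 1.0    # gradient flows in range
assert clip8_grad(300.0) == 0.0   # gradient vanishes only when saturated
```

Because the forward quantization already maps all weights into [-128, 127], the saturated (zero-gradient) branch is never hit in practice, which is why the non-differentiable endpoints are not a problem.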
[Comment 5]
• Figure 3. I'm really not sure what this figure is trying to convey. Shouldn't the weights after ABWR be clustered at -1 and 1? To my understanding of ABWR, the left and right subfigures seem swapped; the values before ABWR should be clustered around 0, and afterwards they should be clustered around -1 and 1. Could you explain why Fig 3 shows what it does?
[Response to comment 5]
R. The order of the figures in Figure 3 was wrong; we corrected the order in the revised paper.
[Comment 6]
• 392: Figure 4 seems to show a layer's FP32 FFN being converted into a 1-bit FFN after 1 epoch, but that doesn't really agree with your description of adding MP layers to the bottom (I assume the bottom of the existing MP encoder layers, but you don't specify this). Which is correct for ILF?
[Response to comment 6]
R. It is right that the "bottom" represents the bottom of the existing MP encoder layers. To make the statement clearer, we revised lines 407, 408, 410, and 411, which now give a more precise description of ILF in the revised paper.
[Comment 7]
• Line 43: you might want to describe 1-bit quantization here, or cite a description. I assume it quantizes values to -1 or 1, but a reader could easily assume it quantizes them to 0 or 1. "For the 1-bit quantization, we apply standard 1-bit quantization." This sentence adds nothing; either explain what 1-bit quantization is, or remove this sentence.
[Response to comment 7]
R. We removed the sentence in the revised paper.
[Comment 8]
• Line 146: suggest changing "low-bit numbers" to something like "reduced precision numbers" or "smaller bitwidth integers", since "low-bit" is ambiguous, and to my ears "low-bit" refers to the position of a bit, not the width of an integer.
[Response to comment 8]
R. We changed the words "low-bit numbers" to "smaller bitwidth integers" in the revised paper.
[Comment 9]
• 452: a table link is broken. 520: broken table link.
[Response to comment 9]
R. We fixed the broken reference links in the revised paper.
2. Response to Reviewer 2
We would like to thank Reviewer 2 for the highquality review and constructive comments. Comments and how we address them in our paper are summarized below.
[Comment 1]
• 1) Outside the BERT domain there seems to be a body of literature on Mixed Precision Quantization based on the sensitivity of the outputs, e.g., these came up with a simple search: [5, 4]. 2) There also seems to be more literature on quantization of BERT models, e.g., [3, 2, 1, 6]. Could the authors elaborate more on this to stress their novelty?
[Response to comment 1]
Q1. About the contribution and novelty of our paper.
R1. Our main claim is to compress the BERT model to a small one while maintaining its accuracy, and we have the following contributions:
– We discover the sensitivity of different modules of BERT. We demonstrate that the encoders close to the input layer are more sensitive than those near the output layer in BERT, and that the Self-Attention layer is more sensitive than the feed-forward network (FFN) in an encoder.
– We effectively compress the model considering the sensitivity of different modules. We compress the sensitive parts of BERT using 8-bit index quantization and the insensitive parts using 1-bit value quantization, which maximizes the compression rate while minimizing the accuracy degradation caused by quantizing sensitive modules to low bits.
– We propose three 1-bit quantization-aware training methods: Absolute Binary Weight Regularization (ABWR), Prioritized Training (PT), and Inverse Layer-wise Fine-tuning (ILF). They overcome the challenges of training the 1-bit quantized parts, improving the accuracy of the model.
– We introduce a fast inference strategy of SENSIMIX on an NVIDIA GPU. We apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to replace the original GEMM for the 8-bit and the 1-bit quantization parts of the model, respectively, to achieve fast inference speed.
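The XNOR-Count trick behind the 1-bit GEMM can be sketched at the level of a single dot product: with {-1, +1} values packed into bit patterns (bit 1 encoding +1, bit 0 encoding -1), the dot product of two length-n vectors equals 2·popcount(XNOR(a, b)) - n. The helpers below are an illustrative pure-Python sketch of that identity, not the paper's CUDA kernels.

```python
def encode(vec):
    # Pack a {-1, +1} vector into an integer bit pattern (bit 1 <-> +1).
    bits = 0
    for v in vec:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits

def xnor_count_dot(a_bits, b_bits, n):
    # XNOR marks the positions where the two vectors agree; each
    # agreement contributes +1 and each disagreement -1, hence
    # dot = 2 * popcount(XNOR(a, b)) - n.
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    return 2 * bin(xnor).count("1") - n

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
# Matches the arithmetic dot product: 1*1 + (-1)*1 + 1*(-1) + 1*1 = 0.
assert xnor_count_dot(encode(a), encode(b), 4) == sum(x * y for x, y in zip(a, b))
```

On hardware, the XNOR and popcount each process 32 or 64 weight products per instruction, which is the source of the 1-bit speedup over FP32 GEMM.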
Q2. About comparisons with other mixed precision quantization methods.
R2. We compared our method to Q8BERT, which is an effective quantization method for BERT. Other methods are excluded for the following reasons. First, there are difficulties in comparing our method to quantization methods designed for specialized hardware, because the performance changes when the hardware environment changes. Second, methods that are designed for other models (MLP, CNN, and RNN), such as [5] and [4], are not directly applicable to BERT. However, we agree that comparing to other mixed precision BERT quantization methods is one of the next steps. We added the comment in line 635 in the revised paper.
[Comment 2]
• Line 452 and 520, 538 reference to Table is missing.
[Response comment 2]
R. We fixed the broken reference links in the revised paper.
[Comment 3]
• In the questions outlined in Experiments, the section numbering is missing.
[Response to comment 3]
R. In the revised paper, we removed the section numbers in the experimental questions and added the corresponding question numbers to the title of each section.
3. Response to Reviewer 3
We would like to thank Reviewer 3 for the insightful and detailed comments. Comments and our responses are summarized below.
[Comment 1]
• The term BERT should be expounded first before first time usage.
[Response to comment 1]
R. In the revised paper, we added an introduction of BERT in the first subsection of Related Works (lines 84-92) to expound the term BERT.
[Comment 2]
• The introduction was well prepared (well done!)
[Response to comment 2]
R. Thank you for your kind comment.
[Comment 3]
• The section on related works should be expanded.
[Response to comment 3]
R. We added an introduction of BERT to the first subsection of Related Works (lines 84-92). This subsection expands the related works and provides a clear elaboration of BERT to readers.
[Comment 4]
• The section "Related Work" should be "Related Works".
[Response to comment 4]
R. We revised the term "Related Work" to "Related Works" in the revised paper.
[Comment 5]
• The proposed method SENSIMIX should be expanded to include an algorithmic description in the form of a flowchart or pseudocode. This is necessary to see the actual operation of SENSIMIX.
[Response to comment 5]
R. In the revised paper, we added a pseudocode algorithm of SENSIMIX (Algorithm 1) in Proposed Method.
[Comment 6]
• The authors did nice work with textual description, but the work should be expanded to include diagramatic illustrations
[Response to comment 6]
R. In the revised paper, we added a pseudocode algorithm of SENSIMIX (Algorithm 1) and a figure (Fig 3) on the process of the 8-bit index quantization to Proposed Method to expand the diagrammatic illustrations of our method.
[Comment 7]
• Is there any way that a Genetic Algorithm (GA) can be used for model reduction in this work?
[Response to comment 7]
R. We have not combined Genetic Algorithms with our method. However, we added it as future work in lines 636-637 of the revised paper.
[Comment 8]
• Both the 8-bit and MP encoders should be described with a flowchart or pseudocode.
[Response to comment 8]
R. In the revised paper, we added a pseudocode algorithm of SENSIMIX (Algorithm 1) to Proposed Method, which includes the algorithms of the 8-bit and MP encoders. Furthermore, we added a figure (Fig 3) on the process of the 8-bit index quantization to Proposed Method to give a diagrammatic explanation of the 8-bit encoder.
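For readers unfamiliar with 8-bit weight quantization, its general flavor can be sketched as follows. This is a generic symmetric linear quantization in NumPy, offered purely as an illustration; the function names are our own, and the exact 8-bit index scheme used in SensiMix may differ.

```python
import numpy as np

def quantize_8bit(w):
    """Map a float weight matrix to int8 codes with a per-tensor scale.

    Illustrative symmetric linear quantization; the paper's 8-bit index
    quantization may use a different mapping.
    """
    scale = np.max(np.abs(w)) / 127.0
    codes = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize_8bit(codes, scale):
    # Recover an approximation of the original full-precision weights
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
codes, scale = quantize_8bit(w)
w_hat = dequantize_8bit(codes, scale)
max_err = float(np.max(np.abs(w - w_hat)))  # bounded by about scale / 2
```

Storing the int8 codes plus one float scale per tensor reduces the weight memory roughly fourfold compared to FP32, at the cost of a small, bounded reconstruction error.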
[Comment 9]
• How is the FB32 matrix in line 276 represented? (In terms of a matrix)
[Response to comment 9]
R. The FP32 matrix in your comment is the full-precision weight matrix of each layer in the model (corresponding to Wlfp in Table 1). We gave a clearer elaboration in lines 289-290 of the revised paper.
[Comment 10]
• Why the use of the method in Equation 6? Is it better than ADAM method, or SGM?
[Response to comment 10]
R. Equation 6 is a basic training rule to solve the zero-gradient problem of the binary function; other gradient-descent-based update rules (e.g., Adam) can be used as well. To make the statement clearer, we added an explanation of Equation 6 to lines 367-369 in the revised paper.
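To illustrate the kind of update rule referred to here, a minimal straight-through-estimator sketch is shown below. This is our own illustration under simplifying assumptions (sign binarization, a squared-error loss on a toy linear layer), not the paper's exact Equation 6: the forward pass uses binarized weights, whose true derivative is zero almost everywhere, while the update applies the gradient directly to the latent full-precision weights.

```python
import numpy as np

def binarize(w):
    # 1-bit value quantization via sign: its true gradient is zero
    # almost everywhere, which is the problem the update rule sidesteps.
    return np.where(w >= 0.0, 1.0, -1.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # latent full-precision weights
x = rng.normal(size=(2, 4))   # toy input batch
t = rng.normal(size=(2, 3))   # toy targets
lr = 0.01

# One straight-through update: the gradient w.r.t. the binarized weights
# is applied to W as if d(binarize)/dW were the identity.
y = x @ binarize(W)            # forward with 1-bit weights
grad_Wb = x.T @ (y - t)        # grad of 0.5 * ||y - t||^2 w.r.t. binarize(W)
W_new = W - lr * grad_Wb       # update the full-precision weights instead
```

Because the update acts on the full-precision copy, any gradient-based optimizer (SGD as above, or Adam) can drive it; only the forward pass is constrained to 1-bit values.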
[Comment 11]
• Overall, this is a very rich article, well written and described. The authors should include more diagrams and add another section describing the computational platform (software) used.
[Response to comment 11]
R. Thank you for your valuable comments. In the revised paper, we added a pseudocode algorithm of SENSIMIX (Algorithm 1) and an illustration of the process of the 8-bit index quantization (Fig 3) in Proposed Method, and we added a description of our software environment to lines 530-531.
[3] Kim, S., Gholami, A., Yao, Z., Mahoney, M.W., Keutzer, K.: I-BERT: Integer-only BERT quantization (2021)
[4] Lee, D., Cho, M., Lee, S., Song, J., Choi, C.: Data-free mixed-precision quantization using novel sensitivity metric. CoRR abs/2103.10051 (2021). URL https://arxiv.org/abs/2103.10051
[5] Yuan, Y., Chen, C., Hu, X., Peng, S.: EvoQ: Mixed precision quantization of DNNs via sensitivity guided evolutionary search. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1-8 (2020). DOI 10.1109/IJCNN48605.2020.9207413
[6] Zhao, Z., Liu, Y., Chen, L., Liu, Q., Ma, R., Yu, K.: An investigation on different underlying quantization schemes for pre-trained language models (2020)
Submitted filename: rebuttal.pdf
10.1371/journal.pone.0265621.r003. Decision Letter 1. Sergio Consoli, Academic Editor. 2022 Sergio Consoli. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Submission Version 1.
3 Jan 2022
PONE-D-21-27639R1. SensiMix: Sensitivity-Aware 8-bit Index & 1-bit Value Mixed Precision Quantization for BERT Compression. PLOS ONE.
Dear Dr. Kang,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
The paper has improved evidently and the contents are worth of interest for the community. There are however still points to be addressed before the manuscript can reach an acceptable standard level for being published.
In particular, make sure to address the comment by R2, who raised a concern about the background section and the experimental evaluation relative to other methods in the literature, which calls the novelty of the proposed method into question.
If an additional experimental evaluation is not possible, a full literature review of quantization methods for neural nets should at least be reported in the background section. Such a literature review should explain the main differences with each method, explaining where applicable why it was not included in the authors' comparison. It should also stress why the proposed SensiMix method is novel relative to these methods.
Please take carefully into account the comments of all the referees for improving the manuscript to meet PLOS ONE standards before resubmitting it to the journal.
Please submit your revised manuscript by Feb 17 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A markedup copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorialemail&utm_source=authorletters&utm_campaign=protocols.
We look forward to receiving your revised manuscript.
Kind regards,
Sergio Consoli
Academic Editor
PLOS ONE
Journal Requirements:
Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #2: (No Response)
Reviewer #4: All comments have been addressed
**********
2. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #2: Partly
Reviewer #4: Yes
**********
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #2: Yes
Reviewer #4: Yes
**********
4. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #2: Yes
Reviewer #4: Yes
**********
5. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #2: Yes
Reviewer #4: Yes
**********
6. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #2: Unfortunately the authors have not addressed my concerns with regards to comparison with other quantization methods, making it very difficult to judge the actual novelty of the paper. If an experimental evaluation is not possible, I think there should be at least a full literature review in the background section on quantization methods for neural nets. Such a literature review should explain the differences with each one, why it wasn't used for comparison, and why SensiMix is novel.
Reviewer #4: The authors propose an acceleration of the BERT family of models using quantization techniques. Although the contribution is not difficult to pursue, the results are promising in practice. Therefore, I believe that the paper has merit to be accepted in a multidisciplinary journal such as Plos One.
I suggest introducing and discussing recent references on how to increase the computational efficiency of BERT from different perspectives.
Some examples are
BERT Pretraining Acceleration Algorithm Based on MASK Mechanism
Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing
PlugTagger: A Pluggable Sequence Labeling Framework Using Language Models
A Comprehensive Survey on Training Acceleration for Large Machine Learning Models in IoTs
**********
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #2: No
Reviewer #4: No
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
10.1371/journal.pone.0265621.r004. Author response to Decision Letter 1. Submission Version 2.
31 Jan 2022
We tried our best to reflect all the reviewers' comments in our revised manuscript. Thank you for the meaningful comments.
Submitted filename: rebuttal.pdf
10.1371/journal.pone.0265621.r005. Decision Letter 2. Sergio Consoli, Academic Editor. 2022 Sergio Consoli. Submission Version 2.
7 Mar 2022
SensiMix: Sensitivity-Aware 8-bit Index & 1-bit Value Mixed Precision Quantization for BERT Compression
PONE-D-21-27639R2
Dear Dr. Kang,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.
Within one week, you’ll receive an email detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.
An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double-check that your user information is up-to-date. If you have any billing-related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Sergio Consoli
Academic Editor
PLOS ONE
Additional Editor Comments (optional):
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #2: All comments have been addressed
Reviewer #4: All comments have been addressed
**********
2. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #2: Partly
Reviewer #4: Yes
**********
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #2: Yes
Reviewer #4: Yes
**********
4. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #2: Yes
Reviewer #4: Yes
**********
5. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #2: Yes
Reviewer #4: Yes
**********
6. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #2: The authors have added a detailed background section referring to similar works and addressing my earlier concerns.
Reviewer #4: The authors added my suggestions and improved the manuscript in this review round. Congratulations on your work!
**********
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #2: No
Reviewer #4: No
10.1371/journal.pone.0265621.r006. Acceptance letter. Sergio Consoli, Academic Editor. 2022 Sergio Consoli.
8 Apr 2022
PONE-D-21-27639R2
SensiMix: Sensitivity-Aware 8-bit Index & 1-bit Value Mixed Precision Quantization for BERT Compression
Dear Dr. Kang:
I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.
If we can help with anything else, please email us at plosone@plos.org.
Thank you for submitting your work to PLOS ONE and supporting open access.