
The Nice And Accurate Model Compression Post

Recently I’ve been looking into model compression, and the next few posts will cover the SoTA for that area of research.

In this post I will cover some general topics, such as what model compression is and how I think about it when approaching the subject.

What is model compression

The subject of model compression focuses on compressing (duh) models (double duh!). This definition is useful to differentiate it from other types of compression where the focus is on compressing the data (JPG for images, gzip for general information…). We are interested in compressing the actual model we use to process the data, and we do it for the same reasons we compress things in general: to reduce the memory footprint, runtime and resource consumption of our algorithms.

The core ideas in compression are directly linked to Information Theory, and while I do not plan to delve into that, it is useful to be aware of the tools developed there and, in general, to adopt a frame of mind which considers the representational power of a model.

Dealing with abstract notions such as “representation power” or [Insert other information theory jargon] can very easily obscure the underlying mechanisms you are working with, so for the first, and definitely not the last, time I will say – there is no such thing as a free lunch. If it is too good to be true, it is almost always false.

Where would we use model compression?

Extremely large models

As models and datasets scale in size and complexity, their associated costs skyrocket. Larger datasets come with higher hosting costs and requirements for better infrastructure and network bandwidth. Larger models require more, and larger, GPUs as well as higher infrastructure and energy costs. Finally, the combination of large models with large datasets means that even if you have access to insanely expensive corporate compute clusters, you would still spend an extremely long time training the model, grinding your research to a halt.

Long inference times, hosting and bandwidth costs mean that even when deploying a model on the cloud we would still aim to make it as efficient as possible.

Deploying models on edge-devices

Edge devices are devices with limited resources: limited memory, limited compute power and often limited physical space. Even if we could fit a large model on your phone, it would heat it up to the point of melting, consume all of the available RAM, or simply take hours to process that image you just took.
Compressing the models allows us to fit models which run in real time and don’t use up your entire battery.
More extreme cases exist where you might be limited to extremely small memory (<256K!) or have extremely limited compute throughput.

Main approaches

Pruning

As the name suggests, pruning aims to remove the less useful bits of the model, be they individual weights, connections between neurons, entire layers, or even larger parts of the architecture.

The first and most important division is between structured and unstructured pruning. Structured pruning encompasses methods where we prune a sub-structure of the network which can be excised entirely, leading to a direct reduction in size, resource requirements and runtime. Unstructured pruning is everything else. It is easier to understand this division from the point of view of weight pruning.

If we want to remove the i-th weight of a given linear layer, we would have to change the shape of the weight matrix, which makes it incompatible with the next layer as well. Instead, what we do is zero that specific weight out, either by overwriting the weight value or, more commonly, by using a mask to zero out its contribution on the forward pass. While this does not reduce the size of the model, it does make gradient calculations much faster and, with a proper implementation, can reduce runtime and resource requirements.
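To make the masking idea concrete, here is a minimal sketch of mask-based unstructured pruning, assuming PyTorch. The magnitude-based criterion (drop the smallest-magnitude weights) is just one common choice, and the `MaskedLinear` wrapper is an illustrative name, not a standard API.

```python
import torch
import torch.nn as nn

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that zeroes out the smallest-magnitude weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    # Threshold = k-th smallest absolute value; everything at or below it is pruned.
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

class MaskedLinear(nn.Module):
    """Linear layer whose weights are multiplied by a fixed pruning mask on every forward pass."""
    def __init__(self, linear: nn.Linear, sparsity: float):
        super().__init__()
        self.linear = linear
        # A buffer, not a parameter: the mask itself is not trained.
        self.register_buffer("mask", magnitude_prune_mask(linear.weight.data, sparsity))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)

# Usage: prune 90% of the weights of a toy layer.
layer = MaskedLinear(nn.Linear(512, 256), sparsity=0.9)
out = layer(torch.randn(8, 512))
```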

Other than how to prune, we also need to decide what to prune: how do we choose which weights, layers, etc. we want to remove?

Quantization

Quantization replaces full-precision weights (and sometimes activations) with lower-precision representations, for example 8-bit integers instead of 32-bit floats. The main differentiation between quantization methods is at what stage of the training cycle the quantization occurs.

Training a quantized model from scratch: because of the reduced precision of the representation, it can be very difficult to optimize a quantized model, and using quantized gradients becomes increasingly unstable the stronger the quantization. While these approaches exist, they are very rarely used.

Post-training quantization:

Given a trained model, we can quantize the learned weights directly. To maintain fidelity to the training or data distributions, some methods condition the quantization on some data or even on a forward pass, while many other methods quantize with respect to the weight distribution only.

Note that unless we use a data-aware approach, it is not possible to accurately quantize the weights with respect to the activations, which has been shown to lead to worse performance.
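As an illustration of the data-free flavour, here is a minimal sketch of symmetric per-tensor post-training quantization to int8, assuming PyTorch; the scale is derived from the weight distribution only (no calibration data), and the function names are mine.

```python
import torch

def quantize_symmetric_int8(weight: torch.Tensor):
    """Data-free symmetric per-tensor quantization of a weight tensor to int8."""
    # Scale chosen from the weight distribution only; clamp avoids division by zero.
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor from the int8 representation."""
    return q.to(torch.float32) * scale

w = torch.randn(256, 512)
q, scale = quantize_symmetric_int8(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())  # worst-case rounding error, roughly scale / 2
```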

Quantization-aware Training (QAT): If we know that we want to quantize the model, we might as well introduce the quantization into the training scheme so the model can learn to compensate for it. The same problems we had when training a quantized model from scratch appear here, unless we take the extra step of quantizing and dequantizing the model between training batches.

Another option is to use pseudo-quantization, where we simulate the quantization during training while keeping all parameters in a float format.
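A minimal sketch of that pseudo-quantization (“fake quantization”) idea, assuming PyTorch and a straight-through estimator for the gradient; the rounding scheme and class name are illustrative, not a specific library API.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulated (fake) quantization: quantize-dequantize on the forward pass,
    pass gradients straight through on the backward pass."""
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: ignore the rounding when computing gradients.
        return grad_output, None

def fake_quantize(x: torch.Tensor) -> torch.Tensor:
    scale = x.detach().abs().max() / 127.0
    return FakeQuantSTE.apply(x, scale)

# The parameter stays in float; only its *effective* value is quantized.
w = torch.randn(64, 64, requires_grad=True)
loss = (fake_quantize(w) ** 2).sum()
loss.backward()  # gradients still flow to w thanks to the STE
```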

There are many other nuances in the quantization realm which I hope to cover in a future post.

Model Decomposition

Applying a single linear layer $l \in \mathbb{R}^{m \times n}$ has a runtime and memory complexity of $O(mn)$, but in principle we can replace it by applying two layers, $l_1 \in \mathbb{R}^{m \times r}$ and $l_2 \in \mathbb{R}^{r \times n}$, such that the resulting operation is equivalent. Doing so costs $O(mr + rn)$ runtime and memory, and since $mr + rn < mn$ exactly when $r < \dfrac{mn}{m+n}$, for any such $r$ we save on both time and memory requirements!

Obviously, there is no such thing as a free lunch, and the resulting two-layer model has lower expressivity. But can we decompose the original layer in a way that preserves as much as possible of the information learned in it?

This is exactly the question that model decomposition focuses on: looking for a smaller, more efficient model which behaves like the original one. Classic solutions for the example above would use SVD or random projections to decompose the weights of the layer into two new, smaller layers.
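A minimal sketch of the SVD route for the example above, assuming PyTorch; `decompose_linear` and the chosen rank are illustrative.

```python
import torch
import torch.nn as nn

def decompose_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one Linear(m -> n) with two Linear layers through a rank-r bottleneck,
    using a truncated SVD of the original weight matrix."""
    W = layer.weight.data                               # shape (n, m) in PyTorch convention
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh_r                            # (rank, m)
    second.weight.data = U_r * S_r                      # (n, rank), columns scaled by singular values
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

layer = nn.Linear(512, 256)
compressed = decompose_linear(layer, rank=64)
x = torch.randn(8, 512)
print((layer(x) - compressed(x)).abs().max())  # approximation error from truncating the SVD
```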

Knowledge distillation

Knowledge distillation takes ideas from Model decomposition a step further – why try to replace a single block in the network when we can replace the entire thing?

The knowledge distillation paradigm introduces a teacher-student relationship between the larger trained model and the smaller compressed one. The idea is to train the “student” model for whatever task you have in mind while using the “teacher” to guide the convergence of the student.

A very simple approach might be to train the student model as you would any other model, but add a loss term which encourages the features of its last layer (e.g. the logits in a classification model) to conform to those of the teacher model.
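A minimal sketch of such a distillation loss, assuming PyTorch and the classic soft-target formulation (temperature-scaled KL divergence between student and teacher logits); the temperature and weighting values are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Standard task loss on the labels plus a KL term pulling the student's
    softened logits toward the teacher's."""
    # Hard-label cross entropy: train the student on the task as usual.
    task_loss = F.cross_entropy(student_logits, labels)
    # Soft-label term: match the teacher's temperature-scaled distribution.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return alpha * task_loss + (1 - alpha) * kd_loss

# Toy usage with random logits for a 10-class problem.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```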

There have been numerous additions to this basic idea – distilling other subnetworks, incorporating task-aware distillation losses, cross-distillation and many more.
