This tutorial explains the usage of model compression in NNI in more detail.

Setup compression goal

Specify the configuration

Users can specify the configuration (i.e., config_list) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, use different ratios for different types of operations, exclude certain types of operations, or compress only certain types of operations. To express these requirements, we define a configuration specification. It is a Python list object in which each element is a dict object.

The dicts in the list are applied one by one. That is, for operations that fall within the scope of more than one dict, the configuration in a later dict overwrites the configuration in an earlier one.

There are different keys in a dict. Some of them are common keys supported by all the compression algorithms:

  • op_types: Specifies which types of operations are compressed. ‘default’ means following the algorithm’s default setting. All supported module types are defined in default_layers.py for PyTorch.

  • op_names: Specifies by name which operations are compressed. If this field is omitted, operations are not filtered by name.

  • exclude: Default is False. If this field is True, the operations with the specified types and names are excluded from compression.

Other keys are often specific to a certain algorithm. Users can refer to pruning algorithms and quantization algorithms for the keys allowed by each algorithm.

To prune all Conv2d layers with the sparsity of 0.6, the configuration can be written as:

 [{
    'sparsity': 0.6,
    'op_types': ['Conv2d']
 }]

To control the sparsity of specific layers, the configuration can be written as:

   [{
      'sparsity': 0.8,
      'op_types': ['default']
   }, {
      'sparsity': 0.6,
      'op_names': ['op_name1', 'op_name2']
   }, {
      'exclude': True,
      'op_names': ['op_name3']
   }]

This configuration compresses the operations covered by the algorithm’s default setting with sparsity 0.8, uses sparsity 0.6 for op_name1 and op_name2, and does not compress op_name3.
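The override rule can be illustrated with a small, NNI-independent sketch. The function `select_config` and the list `DEFAULT_OP_TYPES` are hypothetical names introduced here for illustration; they only mimic the behaviour described above (later dicts win, ‘exclude’ removes an operation) and are not NNI's implementation.

```python
# Hypothetical stand-in for the module types an algorithm compresses by default.
DEFAULT_OP_TYPES = ['Conv2d', 'Linear']


def select_config(config_list, op_name, op_type):
    """Return the effective config for one operation, or None if it is not compressed."""
    selected = None
    for config in config_list:
        op_types = config.get('op_types')
        if op_types is not None:
            # Expand 'default' into the assumed default module types.
            expanded = []
            for t in op_types:
                expanded.extend(DEFAULT_OP_TYPES if t == 'default' else [t])
            if op_type not in expanded:
                continue
        if 'op_names' in config and op_name not in config['op_names']:
            continue
        # A later matching dict overrides an earlier one.
        selected = config
    if selected is None or selected.get('exclude'):
        return None
    return selected


config_list = [
    {'sparsity': 0.8, 'op_types': ['default']},
    {'sparsity': 0.6, 'op_names': ['op_name1', 'op_name2']},
    {'exclude': True, 'op_names': ['op_name3']},
]

for name in ['op_name1', 'op_name3', 'op_name4']:
    config = select_config(config_list, name, 'Conv2d')
    print(name, None if config is None else config['sparsity'])
```

Under these assumptions, op_name1 resolves to the second dict, op_name3 is excluded, and any other Conv2d operation falls back to the first dict.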

Quantization specific keys

Besides the keys explained above, quantization algorithms require additional keys in config_list, which are explained below.

  • quant_types : list of string.

The types of quantization to apply. Currently ‘weight’, ‘input’, and ‘output’ are supported. ‘weight’ applies quantization to the weight parameter of modules. ‘input’ applies quantization to the input of the module forward method. ‘output’ applies quantization to the output of the module forward method, which is often called ‘activation’ in some papers.

  • quant_bits : int or dict of {str : int}

Bit length of the quantization. As a dict, the key is the quantization type and the value is the bit length, e.g.

   'quant_bits': {
      'weight': 8,
      'output': 4,
   },

When the value is an int, all quantization types share the same bit length, e.g.

   'quant_bits': 8,  # weight and output quantization both use 8 bits

  • quant_dtype : str or dict of {str : str}

Quantization dtype, which determines the range of quantized values. Two choices are available:

  • int: the range is signed

  • uint: the range is unsigned

There are two ways to set it. One is a dict in which the key is the quantization type and the value is the quantization dtype, e.g.

   'quant_dtype': {
      'weight': 'int',
      'output': 'uint',
   },

The other is a single str value, in which case all quantization types share the same dtype, e.g.

   'quant_dtype': 'int',  # weight and output quantization both use 'int'

Only two values of quant_dtype are supported: ‘int’ and ‘uint’.

  • quant_scheme : str or dict of {str : str}

Quantization scheme, which determines how quantization is performed. Four choices are available:

  • per_tensor_affine: per tensor, asymmetric quantization

  • per_tensor_symmetric: per tensor, symmetric quantization

  • per_channel_affine: per channel, asymmetric quantization

  • per_channel_symmetric: per channel, symmetric quantization

There are two ways to set it. One is a dict in which the key is the quantization type and the value is the quantization scheme, e.g.

   'quant_scheme': {
      'weight': 'per_channel_symmetric',
      'output': 'per_tensor_affine',
   },

The other is a single str value, in which case all quantization types share the same quant_scheme, e.g.

   'quant_scheme': 'per_channel_symmetric',  # weight and output quantization both use 'per_channel_symmetric'

Only four values of quant_scheme are supported: ‘per_tensor_affine’, ‘per_tensor_symmetric’, ‘per_channel_affine’ and ‘per_channel_symmetric’.
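To make the four scheme names more concrete, here is a minimal pure-Python sketch of how scale and zero_point are commonly derived for each combination. It assumes the textbook definitions (affine maps the observed range onto an unsigned integer range, symmetric uses the maximum absolute value and a zero_point of 0) and treats each row of a nested list as one channel. The function names are hypothetical, and NNI's own computation may differ in details such as rounding or range handling.

```python
def affine_params(values, bits=8):
    """Asymmetric: map [min, max] (extended to include 0) onto [0, 2**bits - 1]."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-lo / scale)
    return scale, zero_point


def symmetric_params(values, bits=8):
    """Symmetric: zero_point is fixed at 0 and the range is set by max |x|."""
    scale = max(abs(v) for v in values) / (2 ** (bits - 1) - 1) or 1.0
    return scale, 0


def per_tensor(weight, params_fn, bits=8):
    """One (scale, zero_point) pair shared by the whole tensor."""
    return params_fn([v for row in weight for v in row], bits)


def per_channel(weight, params_fn, bits=8):
    """One (scale, zero_point) pair per channel (here: per row)."""
    return [params_fn(row, bits) for row in weight]


weight = [[-1.0, 0.5], [0.0, 4.0]]
print('per_tensor_symmetric ', per_tensor(weight, symmetric_params))
print('per_channel_symmetric', per_channel(weight, symmetric_params))
print('per_tensor_affine    ', per_tensor(weight, affine_params))
print('per_channel_affine   ', per_channel(weight, affine_params))
```

Per-channel schemes give the first row a much finer scale than the second, because its values span a smaller range. That is the usual motivation for choosing them for weights.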

The following example shows a more complete config_list. It uses op_names (or op_types) to specify the target layers along with the quantization bits for those layers.

config_list = [{
   'quant_types': ['weight'],
   'quant_bits': 8,
   'op_names': ['conv1'],
   'quant_dtype': 'int',
   'quant_scheme': 'per_channel_symmetric'
}, {
   'quant_types': ['weight'],
   'quant_bits': 4,
   'quant_start_step': 0,
   'op_names': ['conv2'],
   'quant_dtype': 'int',
   'quant_scheme': 'per_tensor_symmetric'
}, {
   'quant_types': ['weight'],
   'quant_bits': 3,
   'op_names': ['fc1'],
   'quant_dtype': 'int',
   'quant_scheme': 'per_tensor_symmetric'
}, {
   'quant_types': ['weight'],
   'quant_bits': 2,
   'op_names': ['fc2'],
   'quant_dtype': 'int',
   'quant_scheme': 'per_channel_symmetric'
}]

In this example, ‘op_names’ gives the layer names, and the four layers are quantized with different quant_bits.
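Since quant_bits, quant_dtype and quant_scheme each accept either a single value or a per-type dict, it can help to see one entry expanded into explicit per-type settings. The sketch below is illustrative only: `per_type`, `expand` and `QUANT_RANGES` are hypothetical helpers, and the integer ranges are the conventional two's-complement and unsigned ranges, which NNI may narrow or clamp differently.

```python
# Conventional integer ranges for each dtype (an assumption, not taken from NNI).
QUANT_RANGES = {
    'int': lambda bits: (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1),
    'uint': lambda bits: (0, 2 ** bits - 1),
}


def per_type(value, quant_types):
    """Expand a shared value, or select from a dict, into {quant_type: value}."""
    if isinstance(value, dict):
        return {t: value[t] for t in quant_types}
    return {t: value for t in quant_types}


def expand(config):
    """Resolve one config dict into explicit settings for each quantization type."""
    types = config['quant_types']
    bits = per_type(config['quant_bits'], types)
    dtypes = per_type(config['quant_dtype'], types)
    schemes = per_type(config['quant_scheme'], types)
    settings = {}
    for t in types:
        qmin, qmax = QUANT_RANGES[dtypes[t]](bits[t])
        settings[t] = {'bits': bits[t], 'dtype': dtypes[t],
                       'scheme': schemes[t], 'qmin': qmin, 'qmax': qmax}
    return settings


config = {
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 4},
    'quant_dtype': 'int',
    'quant_scheme': {'weight': 'per_channel_symmetric', 'output': 'per_tensor_affine'},
    'op_names': ['conv1'],
}

for quant_type, setting in expand(config).items():
    print(quant_type, setting)
```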

Export compression result

Export the pruned model

If you are pruning your model, you can export the pruned model with the following API. The state_dict of the sparse model weights is stored in model.pth, which can be loaded with torch.load('model.pth'). Note that the exported model.pth has the same parameters as the original model, except that the masked weights are zero. mask_dict stores the binary masks produced by the pruning algorithm, which can be used later to speed up the model.

# export model weights and mask
pruner.export_model(model_path='model.pth', mask_path='mask.pth')

# apply mask to model; 'mask.pth' is the mask file exported above, and
# device is the torch device the model lives on
from nni.compression.pytorch import apply_compression_results

apply_compression_results(model, 'mask.pth', device)

To export the model in ONNX format, input_shape must also be specified:

pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
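The relationship between the exported weights and masks can be pictured with a toy example that uses nested lists in place of tensors. This is only a sketch of the idea (element-wise product of weight and binary mask), with hypothetical helper names, and does not reproduce NNI's export code.

```python
def apply_mask(weight, mask):
    """Element-wise product of a weight matrix and its binary mask."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weight, mask)]


def sparsity(mask):
    """Fraction of entries that the mask sets to zero."""
    flat = [m for row in mask for m in row]
    return 1 - sum(flat) / len(flat)


weight = [[0.3, -1.2, 0.7], [0.05, 0.9, -0.4]]
mask = [[1, 0, 1], [0, 1, 0]]

masked = apply_mask(weight, mask)
print(masked)          # same shape as weight, masked entries are zero
print(sparsity(mask))  # share of weights removed by the mask
```

The masked matrix keeps the original shape, which is why a masked model alone is not smaller or faster.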

Export the quantized model

You can export the quantized model directly with the torch.save API, and the quantized model can be loaded with torch.load without any extra modification. The following example shows the typical procedure for saving and loading a quantized model and getting the related parameters in QAT.

import torch
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

# Save quantized model which is generated by using NNI QAT algorithm
torch.save(model.state_dict(), "quantized_model.pth")

# Simulate model loading procedure
# Have to init new model and compress it before loading
qmodel_load = Mnist()
optimizer = torch.optim.SGD(qmodel_load.parameters(), lr=0.01, momentum=0.5)
quantizer = QAT_Quantizer(qmodel_load, config_list, optimizer)

# Load quantized model
qmodel_load.load_state_dict(torch.load("quantized_model.pth"))

# Get scale, zero_point and weight of conv1 in loaded model
conv1 = qmodel_load.conv1
scale = conv1.module.scale
zero_point = conv1.module.zero_point
weight = conv1.module.weight
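As a reminder of what scale and zero_point mean, the following sketch applies the standard affine quantization formulas to a few scalar values. The numbers and the unsigned 8-bit range are made up for illustration, and the exact rounding and clamping used by NNI's quantizers may differ.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to an integer and clamp it to [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to the real value it represents."""
    return (q - zero_point) * scale


# Illustrative values only; real ones come from the loaded model.
scale, zero_point = 0.02, 128

for x in [-1.0, 0.0, 0.505, 10.0]:
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

Values inside the representable range are recovered to within half a quantization step, while values outside it (10.0 here) saturate at the range boundary.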

Speed up the model

Masks alone do not provide a real speedup of your model. The model has to be sped up based on the exported masks, so we provide an API for this, as shown below. After the model is sped up with ModelSpeedup, it becomes a smaller model with shorter inference latency.

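import torch
from nni.compression.pytorch import apply_compression_results, ModelSpeedup

# masks_file is the path of the exported mask file, e.g. 'mask.pth'
dummy_input = torch.randn(config['input_shape']).to(device)
m_speedup = ModelSpeedup(model, dummy_input, masks_file, device)
m_speedup.speedup_model()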

Please refer to the model speedup documentation for a detailed description and example code.
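Why masks alone give no speedup, and what physically removing masked channels changes, can be seen in a toy matrix-vector example. The helpers below are hypothetical and only cover a single layer. A real speedup also has to propagate the shape change to the following layers, which this sketch does not attempt.

```python
def prune_masked_rows(weight, mask):
    """Drop output channels (rows) whose mask is entirely zero."""
    kept = [i for i, m_row in enumerate(mask) if any(m_row)]
    return [weight[i] for i in kept], kept


def matvec(weight, x):
    """Plain matrix-vector product; its cost grows with the number of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


weight = [[1.0, 2.0], [0.0, 0.0], [3.0, -1.0], [0.0, 0.0]]
mask = [[1, 1], [0, 0], [1, 1], [0, 0]]
x = [0.5, 2.0]

small, kept = prune_masked_rows(weight, mask)
full_out = matvec(weight, x)   # still computes the two all-zero rows
small_out = matvec(small, x)   # computes only the two remaining rows
print(kept, full_out, small_out)
```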

Control the Fine-tuning process

Enhance the fine-tuning process

Knowledge distillation effectively learns a small student model from a large teacher model. Users can enhance the fine-tuning process with knowledge distillation to improve the performance of the compressed model. Example code is available in the NNI repository.
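For readers unfamiliar with the technique, a common form of the distillation loss is the KL divergence between temperature-softened teacher and student outputs, scaled by the squared temperature. The sketch below implements that generic formulation in pure Python. The function names and the temperature value are assumptions for illustration, and this is not necessarily the loss used in NNI's example code. In practice this term is usually combined with the ordinary task loss during fine-tuning.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by temperature ** 2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, -2.0]
print(distillation_loss([4.0, 1.0, -2.0], teacher))  # student matches teacher
print(distillation_loss([0.0, 0.0, 0.0], teacher))   # student is uninformed
```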