Major Enhancement of Compression in NNI 3.0

To support more compression scenarios and more fine-grained compression configurations, we have redesigned the compression application programming interface (API) in NNI 3.0. If you are new to NNI Compression, you can skip this document. If you have used NNI Compression before and want to try the latest version, this document will help you understand the major interface changes in 3.0.

Compression Target

Compression target is a new concept introduced in NNI 3.0. It refers to the specific parts of a module that should be compressed, such as its input, output, or weights.

In previous versions, NNI assumed that every module type had parameters named weight and bias, and only produced masks for these parameters. This assumption held for a large share of simulated compression cases. However, many modules do not fit it, customized modules in particular.

Therefore, in NNI 3.0, compression can be configured separately for the inputs, outputs, and parameters of a module. With this fine-grained configuration, NNI can not only compress module types that previously could not be compressed, but also simulate compression more faithfully. As a result, the accuracy gap between simulated compression and real speedup becomes extremely small.

For instance, in previous versions, a softmax operation would significantly weaken the effect of simulated pruning, because an input of 0 is still a meaningful value for softmax. In NNI 3.0, this can be avoided by setting the input and output masks and apply_method so that softmax produces the correct simulated pruning result.
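The softmax issue can be illustrated with a small sketch in plain Python. It is not NNI code: the `softmax` helper, the sample logits, and the two masking strategies are assumptions chosen for illustration, and the additive variant only shows the general idea of masking in a way that softmax respects.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
mask = [1, 0, 1]  # 0 marks the position that pruning is supposed to remove

# Multiplicative masking: the pruned logit becomes 0.0, which softmax
# still treats as a real score, so the pruned position keeps some probability.
mul_input = [x * m for x, m in zip(logits, mask)]
probs_mul = softmax(mul_input)

# Additive-style masking (illustrative): the pruned logit is pushed to -inf,
# so its exponential is exactly 0 and it no longer contributes.
add_input = [x if m else float("-inf") for x, m in zip(logits, mask)]
probs_add = softmax(add_input)

print(probs_mul)
print(probs_add)
```

In the first case the pruned position still receives a nonzero probability, which is why a multiplicative mask alone misrepresents the pruned model.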

Please consult the sections on target_names and target_settings for further details.

Compression Mode

In versions of NNI before 3.0, three pruning modes were supported: normal, global, and dependency-aware.

In the normal mode, each module had to be assigned a sparse ratio, and the pruner generated masks directly on that module's weight elements according to this ratio.

In the global mode, a sparse ratio was set for a group of modules, and the pruner generated masks whose overall sparse ratio matched the setting, although the sparsity of each module in the group could differ.
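The difference between the normal and global modes can be sketched in plain Python. This is an illustration only: it assumes magnitude-based pruning, and the function names and sample weights are hypothetical rather than taken from NNI.

```python
def normal_masks(weights, sparse_ratio):
    """Normal-mode sketch: prune the smallest-magnitude fraction of each module on its own."""
    masks = {}
    for name, w in weights.items():
        n_prune = int(len(w) * sparse_ratio)
        order = sorted(range(len(w)), key=lambda i: abs(w[i]))
        pruned = set(order[:n_prune])
        masks[name] = [0 if i in pruned else 1 for i in range(len(w))]
    return masks

def global_masks(weights, sparse_ratio):
    """Global-mode sketch: rank all elements of the group together, then prune the smallest."""
    entries = sorted(
        (abs(v), name, i) for name, w in weights.items() for i, v in enumerate(w)
    )
    n_prune = int(len(entries) * sparse_ratio)
    masks = {name: [1] * len(w) for name, w in weights.items()}
    for _, name, i in entries[:n_prune]:
        masks[name][i] = 0
    return masks

def sparsity(mask):
    """Fraction of masked (zero) entries."""
    return 1 - sum(mask) / len(mask)

# Hypothetical flattened weights of two modules in one group.
weights = {
    "fc1": [0.9, -0.8, 0.7, 0.6],
    "fc2": [0.1, -0.2, 0.65, 0.05],
}

per_module = {n: sparsity(m) for n, m in normal_masks(weights, 0.5).items()}
grouped = {n: sparsity(m) for n, m in global_masks(weights, 0.5).items()}
print(per_module)
print(grouped)
```

With these sample values, the normal mode prunes half of each module, while the global mode keeps the overall ratio at one half but removes more elements from the module with smaller weights.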

The dependency-aware mode constrained modules with operational dependencies to generate related masks.

For instance, if the outputs of two modules had an add relationship, then the two modules would have the same masks in the output dimension.
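A minimal sketch of the add-relationship case, in plain Python, is given here. It assumes that per-channel importance scores are summed across the dependent modules before choosing which output channels to prune; that aggregation rule, the function name, and the sample scores are assumptions for illustration and not necessarily what NNI does internally.

```python
def dependency_aware_masks(channel_scores, sparse_ratio):
    """Give every module in a dependency group the same output-channel mask.

    channel_scores maps a module name to one importance score per output channel.
    """
    names = list(channel_scores)
    n_channels = len(channel_scores[names[0]])
    # Assumed aggregation: sum the scores of each channel over all modules.
    combined = [sum(channel_scores[n][c] for n in names) for c in range(n_channels)]
    n_prune = int(n_channels * sparse_ratio)
    order = sorted(range(n_channels), key=lambda c: combined[c])
    pruned = set(order[:n_prune])
    shared = [0 if c in pruned else 1 for c in range(n_channels)]
    return {n: list(shared) for n in names}

# Two hypothetical convolutions whose outputs are added together.
scores = {
    "conv_a": [0.9, 0.1, 0.5, 0.3],
    "conv_b": [0.2, 0.8, 0.6, 0.1],
}
masks = dependency_aware_masks(scores, 0.5)
print(masks)
```

Because both modules drop the same channels, the element-wise add still lines up after the pruned channels are physically removed.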

Different modes suited different compression scenarios and gave better compression results in each. Nevertheless, we believe that more flexible combinations should be allowed. For example, within one compression process, certain modules of similar levels could share an overall sparse ratio, while other modules with operational dependencies could generate related masks at the same time.

In NNI 3.0, users can directly set global_group_id and dependency_group_id to implement the global and dependency-aware modes. Additionally, align is supported to generate a mask from another module's mask, such as generating a batch normalization mask from a convolution mask. By setting the appropriate keys in the configuration list, you can combine these modes to explore more options and achieve better performance.
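A configuration list that combines the two group keys might look like the sketch below. Only global_group_id and dependency_group_id come from the text above; the other keys (op_types, op_names, sparse_ratio), the module names, and the group id values are assumptions and should be checked against the configuration reference. The align setting is left out here; see the configuration reference for its layout.

```python
# Hypothetical config list; key names other than the two group ids are assumed.
config_list = [
    {
        # All Linear modules share one overall sparse ratio (global mode).
        "op_types": ["Linear"],
        "sparse_ratio": 0.5,
        "global_group_id": "linear_group",
    },
    {
        # Two convolutions whose outputs are added generate related masks
        # (dependency-aware mode). The module names are placeholders.
        "op_names": ["conv1", "conv2"],
        "sparse_ratio": 0.5,
        "dependency_group_id": "residual_add",
    },
]
```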

Pruning Speedup

The previous method of pruning speedup relied on torch.jit.trace to trace the model graph. However, this method had several limitations and required additional support to perform certain operations. These limitations resulted in excessive maintenance costs, making it difficult to continue development.

To address these issues, in NNI 3.0, we refactored the pruning speedup based on concrete_trace. This is a useful utility for tracing a model graph, based on torch.fx. Unlike torch.fx.symbolic_trace, concrete_trace executes the entire model, resulting in a more complete graph. As a result, most operations that couldn’t be traced in the previous pruning speedup can now be traced.

In addition to concrete_trace, users who already have a suitable torch.fx.GraphModule for their traced model can use that torch.fx.GraphModule directly. Furthermore, the new pruning speedup supports customized mask propagation logic and module replacement methods, so that various customized modules can also be sped up.

Model Fusion

Model fusion is supported in NNI 3.0. You can use it by setting fuse_names in each configuration in the config_list. Please refer to Module Fusion for more details.

Distillation

Two distillers are supported in NNI 3.0. Combining distillation with pruning or quantization can give better compression results and higher precision.

Please refer to Distiller for more details.

Fusion Compression

Thanks to the new unified compression framework, it is now possible to perform pruning, quantization, and distillation simultaneously, without having to apply them one by one.

Please refer to fusion compression for more details.