Major Enhancement of Compression in NNI 3.0
To support more compression scenarios and more specific compression configurations, we have redesigned the compression API in NNI 3.0. If you are new to NNI Compression, you can skip this document. If you have used NNI Compression before and want to try the new version, this document will help you understand the major interface changes in 3.0.
Compression target is a new concept introduced in NNI 3.0. It refers to the specific parts of a module that should be compressed, such as its input, output, or weights. In previous versions, NNI assumed that all module types had parameters named weight and bias, and only produced masks for these parameters. This assumption worked well enough for a large class of simulated compression, but many modules do not fit it, particularly customized modules.
Therefore, in NNI 3.0, compression can be configured separately for a module's input, output, and parameters. With this fine-grained configuration, NNI can not only compress module types that were previously uncompressible, but also achieve more faithful simulated compression. As a result, the accuracy gap between simulated compression and real speedup becomes very small.
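As a rough sketch of what such a fine-grained configuration might look like (key names follow the NNI 3.0 pruning config schema as we understand it; verify them against the documentation of your installed version):

```python
# Illustrative NNI 3.0-style pruning config entry (a sketch, not a verified
# schema): it targets a Linear module's weight plus its input and output
# activations, rather than only the weight as in older versions.
config_list = [{
    'op_types': ['Linear'],                        # which module types to compress
    'sparse_ratio': 0.5,                           # desired sparsity
    'target_names': ['weight', '_input_', '_output_'],  # compression targets
}]

# The config is plain Python data, so it can be inspected or extended freely.
assert config_list[0]['sparse_ratio'] == 0.5
```

Because the config is an ordinary list of dicts, extending it to more module types or different targets is just a matter of appending entries.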
For instance, in previous versions a softmax operation would significantly diminish the effect of simulated pruning, since an input of 0 is still meaningful to softmax. In NNI 3.0, this can be avoided by setting input and output masks on the module, so that softmax produces the correct simulated pruning result.
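A tiny numeric check (plain Python, independent of NNI) illustrates why zeroing a softmax input alone is not enough: softmax assigns nonzero probability to a 0 input, so an explicit output mask is needed to simulate pruning correctly.

```python
import math

def softmax(xs):
    # standard softmax over a list of floats
    exps = [math.exp(v) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

x = [2.0, 0.0, 0.0]   # last two positions "pruned" by zeroing the input
probs = softmax(x)

# The pruned positions still receive nonzero probability mass (exp(0) = 1),
# so simulated pruning must also mask the softmax output.
assert probs[1] > 0 and probs[2] > 0
```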
In previous versions of NNI (below 3.0), three pruning modes were supported:

- normal mode: each module was assigned a sparse ratio, and the pruner generated masks directly on the weight elements according to this ratio.
- global mode: a sparse ratio was set for a group of modules, and the pruner generated masks whose overall sparse ratio conformed to the setting, though the sparsity of each module in the group could differ.
- dependency-aware mode: modules with operational dependencies were constrained to generate related masks. For instance, if the outputs of two modules were combined by an add operation, the two modules would receive the same masks in the output dimension.
Different modes suited different compression scenarios and achieved different compression effects. Nevertheless, we believe more flexible combinations should be allowed. For example, within one compression process, some modules of similar levels could share an overall sparse ratio, while other modules with operational dependencies generate related masks at the same time.
Now, in NNI 3.0, users can directly set global_group_id and dependency_group_id in the configuration to combine these modes. Additionally, align is supported to generate a mask from another module's mask, such as generating a batch normalization mask from an associated convolution mask. By setting the appropriate keys in the config list, you can combine these modes to achieve better performance and exploration.
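As a sketch of how such a combination might be expressed (the module names here are hypothetical, and the key names are assumed from the NNI 3.0 config schema; check your version's documentation):

```python
# Sketch of combining pruning modes in one config_list.
config_list = [
    # these two modules share one overall sparse ratio (global mode)
    {'op_names': ['fc1', 'fc2'],
     'sparse_ratio': 0.5,
     'global_group_id': 'g1'},
    # these two modules are constrained to generate related masks
    # (dependency-aware mode)
    {'op_names': ['conv1', 'conv2'],
     'sparse_ratio': 0.4,
     'dependency_group_id': 'd1'},
]

# Entries with the same group id are handled together by the pruner.
assert config_list[0]['global_group_id'] == 'g1'
assert config_list[1]['dependency_group_id'] == 'd1'
```

The point of the design is that each mode is just a key on an ordinary config entry, so mixing modes requires no special API beyond the config list itself.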
The previous pruning speedup relied on torch.jit.trace to trace the model graph. However, this method had several limitations and required additional support to handle certain operations. These limitations resulted in excessive maintenance costs, making it difficult to continue development. To address these issues, in NNI 3.0 we refactored the pruning speedup based on concrete_trace, a utility for tracing the model graph built on torch.fx. Unlike symbolic tracing, concrete_trace executes the entire model with real inputs, resulting in a more complete graph. As a result, most operations that could not be traced by the previous pruning speedup can now be traced.
In addition to concrete_trace, users who already have a good torch.fx.GraphModule for their traced model can also use it directly. Furthermore, the new pruning speedup supports customized mask propagation logic and module replacement methods to handle the speedup of various customized modules.
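The limitation that concrete tracing sidesteps can be illustrated without torch at all. A symbolic tracer propagates value-less proxies through the model, so data-dependent control flow cannot be resolved; concrete execution with a real input simply follows the branch. The Proxy class below is a hypothetical stand-in for illustration only, not torch.fx's actual proxy.

```python
class Proxy:
    """Hypothetical stand-in for a symbolic tracer's value-less proxy."""
    def __gt__(self, other):
        return Proxy()  # comparisons yield another proxy, not a real bool
    def __bool__(self):
        raise RuntimeError("symbolic tracing cannot decide data-dependent branches")

def forward(x):
    if x > 0:          # data-dependent control flow
        return "positive branch"
    return "other branch"

# Concrete execution (the approach concrete_trace takes) just works:
assert forward(3) == "positive branch"

# A symbolic proxy hits the branch condition and fails:
try:
    forward(Proxy())
    failed_symbolically = False
except RuntimeError:
    failed_symbolically = True
assert failed_symbolically
```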
Model fusion is supported in NNI 3.0. You can use it easily by setting fuse_names in each config in the config_list. Please refer to Module Fusion for more details.
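For example (a sketch only; the module names are hypothetical and the exact schema should be checked against the Module Fusion documentation), fuse_names is just another key in a config entry:

```python
# Hypothetical module names 'conv1' and 'bn1'; fuse_names lists tuples of
# modules to be fused during simulated compression (sketch of the NNI 3.0
# config shape, not a verified schema).
config_list = [{
    'op_names': ['conv1'],
    'sparse_ratio': 0.5,
    'fuse_names': [('conv1', 'bn1')],
}]

assert ('conv1', 'bn1') in config_list[0]['fuse_names']
```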
Two distillers are supported in NNI 3.0. By fusing distillation with pruning or quantization, you can obtain better compression results and higher accuracy. Please refer to Distiller for more details.
Thanks to the new unified compression framework, it is now possible to perform pruning, quantization, and distillation simultaneously, without having to apply them one by one.
Please refer to Fusion Compression for more details.