CDARTS

Introduction

CDARTS builds a cyclic feedback mechanism between the search and evaluation networks. First, the search network generates an initial topology for evaluation, so that the weights of the evaluation network can be optimized. Second, the architecture topology in the search network is further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks, and thus enables the evolution of the topology to fit the final evaluation network.
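
The cycle above can be summarized by the following illustrative sketch. This is not the actual NNI implementation (which lives in CdartsTrainer, documented below); the function, the model interfaces and the distill helper are hypothetical and only show how the two optimization steps interleave.

import torch

def cdarts_cycle(search_net, eval_net, w_optimizer, arch_optimizer,
                 criterion, distill, train_batch, valid_batch):
    # One illustrative CDARTS cycle (hypothetical helper, not part of the NNI API).
    # Step 1: optimize the evaluation network's weights on the topology
    # currently favored by the search network.
    x, y = train_batch
    logits_e, _ = eval_net(x)                      # assumed to return (logits, features)
    w_optimizer.zero_grad()
    criterion(logits_e, y).backward()
    w_optimizer.step()

    # Step 2: optimize the search network's architecture parameters with the
    # classification loss plus feature distillation from the evaluation network.
    xv, yv = valid_batch
    logits_s, feats_s = search_net(xv)             # assumed to return (logits, features)
    with torch.no_grad():
        _, feats_e = eval_net(xv)
    arch_loss = criterion(logits_s, yv) + distill(feats_s, feats_e)
    arch_optimizer.zero_grad()
    arch_loss.backward()
    arch_optimizer.step()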

In the implementation of CdartsTrainer, two models and two mutators (one for each model) are instantiated first. The first model is the so-called "search network", which is mutated with a RegularizedDartsMutator, a mutator with subtle differences from DartsMutator. The second model is the "evaluation network", which is mutated with a discrete mutator that leverages the search network's mutator to sample a single path each time. The trainer then trains the models and mutators alternately, as sketched below. Users interested in more details on these trainers and mutators can refer to the reference section below.
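
For illustration only, the wiring described above looks roughly like the following sketch; in practice CdartsTrainer performs this internally. The builder helpers are hypothetical placeholders for the supernets defined in the CDARTS example.

from nni.nas.pytorch.cdarts import RegularizedDartsMutator, DartsDiscreteMutator

model_search = build_search_network()    # hypothetical helper returning the search supernet
model_eval = build_evaluation_network()  # hypothetical helper returning the evaluation supernet

# Search network: DARTS-style continuous relaxation, with regularization and cuttable choices.
search_mutator = RegularizedDartsMutator(model_search)
# Evaluation network: samples a single discrete path each time, based on the final
# decisions of the search mutator.
eval_mutator = DartsDiscreteMutator(model_eval, search_mutator)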

Reproduction Results

This is the CDARTS implementation based on the NNI platform, which currently supports search and retrain on CIFAR10. Search and retrain on ImageNet should also work, and we provide the corresponding interfaces. Our reproduced results on NNI are slightly lower than those in the paper, but much higher than the original DARTS. Below are the results of three independent experiments on CIFAR10 (top-1 accuracy, %).

Runs    Paper    NNI
1       97.52    97.44
2       97.53    97.48
3       97.58    97.56

Examples

Example code

# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
git clone https://github.com/Microsoft/nni.git

# install apex for distributed training.
git clone https://github.com/NVIDIA/apex
cd apex
python setup.py install --cpp_ext --cuda_ext

# search the best architecture (the following paths are relative to the root of the nni repository)
cd examples/nas/cdarts
bash run_search_cifar.sh

# train the best architecture.
bash run_retrain_cifar.sh

Reference

PyTorch

class nni.nas.pytorch.cdarts.CdartsTrainer(model_small, model_large, criterion, loaders, samplers, logger=None, regular_coeff=5, regular_ratio=0.2, warmup_epochs=2, fix_head=True, epochs=32, steps_per_epoch=None, loss_alpha=2, loss_T=2, distributed=True, log_frequency=10, grad_clip=5.0, interactive_type='kl', output_path='./outputs', w_lr=0.2, w_momentum=0.9, w_weight_decay=0.0003, alpha_lr=0.2, alpha_weight_decay=0.0001, nasnet_lr=0.2, local_rank=0, share_module=True)[source]

CDARTS trainer.

Parameters:
  • model_small (nn.Module) – PyTorch model to be trained. This is the search network of CDARTS.
  • model_large (nn.Module) – PyTorch model to be trained. This is the evaluation network of CDARTS.
  • criterion (callable) – Receives logits and ground truth labels and returns a loss tensor, e.g., nn.CrossEntropyLoss().
  • loaders (list of torch.utils.data.DataLoader) – List of train data and valid data loaders, for training weights and architecture weights respectively.
  • samplers (list of torch.utils.data.Sampler) – List of train data and valid data samplers. This can be PyTorch standard samplers if not distributed. In distributed mode, sampler needs to have set_epoch method. Refer to data utils in CDARTS example for details.
  • logger (logging.Logger) – The logger for logging. Will use nni logger by default (if logger is None).
  • regular_coeff (float) – The coefficient of the regularization loss.
  • regular_ratio (float) – The ratio of the regularization loss.
  • warmup_epochs (int) – Number of epochs to warm up the search network.
  • fix_head (bool) – True to fix the parameters of the auxiliary heads; otherwise they are trained as well.
  • epochs (int) – Number of epochs planned for training.
  • steps_per_epoch (int) – Number of steps in one epoch.
  • loss_alpha (float) – Coefficient of the distillation (interactive) loss.
  • loss_T (float) – Temperature of the distillation (interactive) loss.
  • distributed (bool) – True if using distributed training, else non-distributed training.
  • log_frequency (int) – Number of steps between each logging.
  • grad_clip (float) – Gradient clipping for weights.
  • interactive_type (string) – Type of interactive (distillation) loss; either 'kl' or 'smoothl1'.
  • output_path (string) – Log storage path.
  • w_lr (float) – Learning rate of the search network parameters.
  • w_momentum (float) – Momentum of the optimizer for the search and the evaluation network parameters.
  • w_weight_decay (float) – Weight decay of the search and the evaluation network parameters.
  • alpha_lr (float) – Learning rate of the architecture parameters.
  • alpha_weight_decay (float) – Weight decay of the architecture parameters.
  • nasnet_lr (float) – Learning rate of the evaluation network parameters.
  • local_rank (int) – Rank of the current process in distributed training.
  • share_module (bool) – True to share the stem and auxiliary heads between the search and evaluation networks; otherwise these modules are not shared.
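
A hypothetical construction sketch, assuming the models, data loaders and samplers are prepared as in the CDARTS example under examples/nas/cdarts, and that training is started through the trainer's train() method:

import torch.nn as nn
from nni.nas.pytorch.cdarts import CdartsTrainer

criterion = nn.CrossEntropyLoss()
trainer = CdartsTrainer(
    model_small=model_search,                # search network
    model_large=model_eval,                  # evaluation network
    criterion=criterion,
    loaders=[train_loader, valid_loader],    # weight / architecture data loaders
    samplers=[train_sampler, valid_sampler],
    epochs=32,
    distributed=False,                       # single-process sketch; the example scripts use distributed training
)
trainer.train()                              # alternately trains the two networks
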
class nni.nas.pytorch.cdarts.RegularizedDartsMutator(model)[source]

This is basically DartsMutator, with two differences:

1. Choices can be cut (bypassed). This is done by cut_choices. Cut choices will not be used in the forward pass and thus consume no memory.

2. Regularization on choices, to prevent the mutator from overfitting on some choices.
cut_choices(cut_num=2)[source]

Cut the choices with the smallest weights. cut_num should be the cumulative number of choices cut so far; e.g., if the first cut removes 2, the second call should pass 4 to cut another two.

Parameters: cut_num (int) – Cumulative number of choices to cut.

Warning

Though the parameters of cut choices are set to \(-\infty\) so that they are bypassed, they still receive a gradient of 0, which introduces a NaN problem when calling optimizer.step(). A simple way to solve this issue is to reset the NaN entries to \(-\infty\) each time after the parameters are updated, as sketched below.
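
A minimal illustrative sketch of the cumulative cut_num semantics and of the NaN reset described in the warning, assuming mutator is a RegularizedDartsMutator and arch_optimizer optimizes its architecture parameters (variable names are illustrative):

import torch

mutator.cut_choices(cut_num=2)   # first cut: bypass the 2 weakest choices
# ... some epochs later; cut_num is cumulative, so pass 4 to cut another two
mutator.cut_choices(cut_num=4)

# After each architecture-parameter update, reset NaN entries back to -inf
# so that the cut choices remain bypassed.
arch_optimizer.step()
with torch.no_grad():
    for p in mutator.parameters():
        p[torch.isnan(p)] = float("-inf")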

export(logger=None)[source]

Export the architecture with the given logger; the genotype will be printed through it.

Returns: A mapping from mutable keys to decisions.
Return type: dict
reset()[source]

Warning

This method has been renamed to reset_with_loss(), which returns the regularization loss on reset; use that method instead.

reset_with_loss()[source]

Resample the architecture and return the regularization loss. If the loss is 0, None is returned instead to avoid device issues.

Currently the loss penalty is proportional to the L1 norm of the parameters corresponding to modules whose type name contains certain substrings. These substrings include: poolwithoutbn, identity, dilconv.

Override this method to iterate over mutables and make decisions.

Returns: A mapping from mutable keys to decisions.
Return type: dict
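
A hypothetical architecture-update step using this method (names are illustrative; CdartsTrainer performs the equivalent internally):

# Resample the search network's architecture and collect the regularization penalty.
reg_loss = search_mutator.reset_with_loss()

logits = model_search(val_inputs)    # assuming the model returns logits here
arch_loss = criterion(logits, val_targets)
if reg_loss is not None:             # None is returned when the penalty is 0
    arch_loss = arch_loss + reg_loss

arch_optimizer.zero_grad()
arch_loss.backward()
arch_optimizer.step()
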
class nni.nas.pytorch.cdarts.DartsDiscreteMutator(model, parent_mutator)[source]

A mutator that applies the final sampling result of a parent mutator to another model for training.

Parameters:
  • model (nn.Module) – The model to which the mutator is applied.
  • parent_mutator (Mutator) – The mutator that provides the sample_final method, which will be called to obtain the architecture.

Override this method to iterate over mutables and make decisions.

Returns: A mapping from mutable keys to decisions.
Return type: dict
class nni.nas.pytorch.cdarts.RegularizedMutatorParallel(*args, **kwargs)[source]

Parallelize RegularizedDartsMutator.

This parallelizes the reset_with_loss() method, and also makes cut_choices() and export() easily accessible.

cut_choices(*args, **kwargs)[source]

Parallelized cut_choices().

export(logger)[source]

Parallelized export().

reset_with_loss()[source]

Parallelized reset_with_loss().