Strategy

Multi-trial Strategy

Random

class nni.retiarii.strategy.Random(variational=False, dedup=True, model_filter=None)[source]

Random search on the search space.

Parameters:
  • variational (bool) – Do not dry run to get the full search space. Used when the search space has variational size or candidates. Default: false.

  • dedup (bool) – Do not try the same configuration twice. When variational is true, deduplication is not supported. Default: true.

  • model_filter (Callable[[Model], bool]) – Feed the model and return a bool. This will filter the models in search space and select which to submit.
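The interplay of dedup and model_filter can be sketched in plain Python. This is only an illustration on a hypothetical dictionary-based search space, not NNI's implementation: `random_search`, `space`, `n_samples`, `seed` and `max_attempts` are names invented for the sketch, and a plain dict stands in for a Model.

```python
import random

def random_search(space, n_samples, dedup=True, model_filter=None, seed=0, max_attempts=10_000):
    """Draw configurations uniformly at random from a dict of candidate lists."""
    rng = random.Random(seed)
    seen = set()
    accepted = []
    for _ in range(max_attempts):
        if len(accepted) == n_samples:
            break
        config = {name: rng.choice(candidates) for name, candidates in space.items()}
        key = tuple(sorted(config.items()))
        if dedup and key in seen:
            continue  # skip configurations that were already drawn
        seen.add(key)
        if model_filter is not None and not model_filter(config):
            continue  # the filter decides which samples would be submitted
        accepted.append(config)
    return accepted

space = {"kernel_size": [3, 5, 7], "channels": [16, 32]}
# Keep only configurations with at most 16 channels (hypothetical filter).
samples = random_search(space, n_samples=3, model_filter=lambda c: c["channels"] <= 16)
```

Because deduplication needs a way to compare configurations, the sketch keys each sample by its sorted items.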

GridSearch

class nni.retiarii.strategy.GridSearch(shuffle=True)[source]

Traverse the search space and try all the possible combinations one by one.

Parameters:

shuffle (bool) – Shuffle the order in a candidate list, so that they are tried in a random order. Default: true.
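As a rough illustration of the traversal, the following sketch enumerates a hypothetical dictionary-based search space with `itertools.product`. It is not NNI's implementation; `grid_search`, `space` and `seed` are assumed names, and shuffling is applied within each candidate list as the parameter description suggests.

```python
import itertools
import random

def grid_search(space, shuffle=True, seed=0):
    """Yield every combination of candidates, optionally shuffling each candidate list first."""
    rng = random.Random(seed)
    names = list(space)
    candidate_lists = []
    for name in names:
        candidates = list(space[name])
        if shuffle:
            rng.shuffle(candidates)  # shuffle the order inside each candidate list
        candidate_lists.append(candidates)
    for values in itertools.product(*candidate_lists):
        yield dict(zip(names, values))

space = {"kernel_size": [3, 5, 7], "activation": ["relu", "gelu"]}
configs = list(grid_search(space))
```

Every combination is still visited exactly once; only the visiting order changes with shuffle.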

RegularizedEvolution

class nni.retiarii.strategy.RegularizedEvolution(optimize_mode='maximize', population_size=100, sample_size=25, cycles=20000, mutation_prob=0.05, dedup=False, dedup_retries=500, on_failure='ignore', model_filter=None)[source]

Algorithm for regularized evolution (i.e. aging evolution). Follows “Algorithm 1” in Real et al. “Regularized Evolution for Image Classifier Architecture Search”.

Parameters:
  • optimize_mode (str) – Can be one of “maximize” and “minimize”. Default: maximize.

  • population_size (int) – The number of individuals to keep in the population. Default: 100.

  • cycles (int) – The number of cycles (trials) the algorithm should run for. Default: 20000.

  • sample_size (int) – The number of individuals that should participate in each tournament. Default: 25.

  • mutation_prob (float) – Probability that mutation happens in each dimension. Default: 0.05.

  • dedup (bool) – Do not try the same configuration twice. Default: false.

  • dedup_retries (int) – If dedup is true, retry sampling up to dedup_retries times when a duplicate configuration is drawn. Default: 500.

  • on_failure (str) – Can be one of “ignore” and “worst”. If “ignore”, simply give up the model and find a new one. If “worst”, mark the model as -inf if maximizing (inf if minimizing), so that the algorithm “learns” to avoid such models. Default: ignore.

  • model_filter (Callable[[Model], bool]) – Feed the model and return a bool. This will filter the models in search space and select which to submit.
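A minimal sketch of aging evolution on a toy search space may help relate the parameters above to each other. It follows the structure of Algorithm 1 in the cited paper but is not NNI's implementation: `regularized_evolution`, `space`, `evaluate` and `seed` are hypothetical names, maximization is assumed, and dedup and failure handling are left out.

```python
import collections
import random

def regularized_evolution(space, evaluate, population_size=10, sample_size=3,
                          cycles=50, mutation_prob=0.05, seed=0):
    """Aging evolution: in every cycle the oldest individual is removed."""
    rng = random.Random(seed)

    def random_config():
        return {name: rng.choice(cands) for name, cands in space.items()}

    def mutate(parent):
        # Each dimension is re-sampled independently with probability mutation_prob.
        return {name: rng.choice(space[name]) if rng.random() < mutation_prob else value
                for name, value in parent.items()}

    population = collections.deque()
    history = []
    # Fill the initial population with random individuals.
    while len(population) < population_size:
        config = random_config()
        individual = (config, evaluate(config))
        population.append(individual)
        history.append(individual)

    # The initial individuals count towards the total number of cycles (trials).
    while len(history) < cycles:
        # Tournament: sample a subset and use its best member as the parent.
        tournament = rng.sample(list(population), sample_size)
        parent = max(tournament, key=lambda ind: ind[1])
        child = mutate(parent[0])
        individual = (child, evaluate(child))
        population.append(individual)
        history.append(individual)
        population.popleft()  # remove the oldest individual, not the worst one
    return max(history, key=lambda ind: ind[1])

space = {"depth": [1, 2, 3, 4], "width": [8, 16, 32]}
best_config, best_score = regularized_evolution(
    space, evaluate=lambda c: c["depth"] * c["width"], mutation_prob=0.5)
```

Removing the oldest individual rather than the worst is what distinguishes regularized (aging) evolution from plain tournament selection.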

TPE

class nni.retiarii.strategy.TPE[source]

The Tree-structured Parzen Estimator (TPE) is a sequential model-based optimization (SMBO) approach.

Find the details in Algorithms for Hyper-Parameter Optimization.

SMBO methods sequentially construct models to approximate the performance of hyperparameters based on historical measurements, and then subsequently choose new hyperparameters to test based on this model.

PolicyBasedRL

class nni.retiarii.strategy.PolicyBasedRL(max_collect=100, trial_per_collect=20, policy_fn=None)[source]

Algorithm for policy-based reinforcement learning. This is a wrapper of algorithms provided in tianshou (PPO by default), and can be easily customized with other algorithms that inherit BasePolicy (e.g., REINFORCE as in this paper).

Parameters:
  • max_collect (int) – How many times the collector runs to collect trials for RL. Default: 100.

  • trial_per_collect (int) – How many trials (trajectories) the collector collects each time. After each collection, the trainer samples a batch from the replay buffer and performs the update. Default: 20.

  • policy_fn (function) – Takes ModelEvaluationEnv as input and returns a policy. See PolicyBasedRL._default_policy_fn() for an example.

One-shot Strategy

Note

The usage of one-shot has been refreshed in v2.8. Please see legacy one-shot trainers for the old-style one-shot strategies.

DARTS

class nni.retiarii.strategy.DARTS(**kwargs)[source]

Continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Reference.

DARTS is one of the most fundamental one-shot algorithms. It repeats iterations, each consisting of 2 training phases. Phase 1 is the architecture step, in which model parameters are frozen and the architecture parameters are trained. Phase 2 is the model step, in which architecture parameters are frozen and model parameters are trained. In both phases, training_step of the Lightning evaluator is used.

The current implementation corresponds to DARTS (1st order) in the paper. Second order (unrolled 2nd-order derivatives) is not supported yet.

Note

DARTS runs a weighted sum of possible architectures under the hood. Please bear in mind that it will be slower and consume more memory than training a single architecture. The common practice is to down-scale the network (e.g., smaller depth / width) for speedup.
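The weighted sum mentioned in the note can be sketched in plain Python. This is a simplified illustration, not NNI's implementation: `mixed_op`, `candidate_ops` and `alpha` are hypothetical names, and scalar functions stand in for neural-network operations.

```python
import math

def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, candidate_ops, alpha):
    """Softmax-weighted sum over all candidate operations (every op is evaluated)."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, candidate_ops))

candidate_ops = [lambda x: x, lambda x: 2 * x, lambda x: 0.0]  # identity, scale, zero
alpha = [0.0, 0.0, 0.0]  # architecture parameters; equal values give uniform weights
y = mixed_op(3.0, candidate_ops, alpha)
```

Since every candidate is evaluated on each forward pass, cost grows with the number of candidates, which is why down-scaling the network is common.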

New in version 2.8: Supports searching for ValueChoices on operations, with the technique described in FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions. One difference is that, in DARTS, we are using Softmax instead of GumbelSoftmax.

The supported mutation primitives of DARTS are:

Warning

The strategy, under the hood, creates a Lightning module that wraps the Lightning module defined in the evaluator, and enables manual optimization, although we assume the inner evaluator has enabled automatic optimization. We call the optimizers and schedulers configured in the evaluator, following the Lightning definitions on a best-effort basis, but we make no guarantee that the behavior is exactly the same as automatic optimization. We call advance_optimization() and advance_lr_schedulers() to invoke the optimizers and schedulers configured in the evaluator. Moreover, some advanced features like gradient clipping will not be supported. If you encounter any issues, please contact us by creating an issue.

Parameters:
  • mutation_hooks (list[MutationHook]) –

    Extra mutation hooks to support customized mutation on primitives other than built-ins.

    Mutation hooks are callables that take a Module and return a BaseSuperNetModule. They are invoked in traverse_and_mutate_submodules() on each submodule. For each submodule, the hooks in the list are invoked in order, so later hooks can see the results of earlier ones. Modules processed by mutation_hooks are replaced by the returned module, stored in nas_modules, and become the focus of the NAS algorithm.

    The default_mutation_hooks of each one-shot module are appended to this hook list.

    More specifically, each hook receives four arguments:

    1. a module that might be processed,

    2. the name of the module in its parent module,

    3. a memo dict whose usage depends on the particular algorithm,

    4. keyword arguments (configurations).

    Note that the memo is meant to be read and written by the hooks themselves. No hooks are called on the root module.

    The return value can be one of three kinds:

    1. a tuple of a BaseSuperNetModule or None, and a boolean,

    2. a boolean,

    3. a BaseSuperNetModule or None.

    The boolean value, suppress, indicates whether the subsequent hooks should be skipped. When it is true, the subsequent hooks are suppressed and never invoked. When no boolean is specified, it is assumed to be false. If None appears in place of a BaseSuperNetModule, the hook suggests keeping the module unchanged, and nothing happens.

    An example of a mutation hook is given in no_default_hook(). However, it is recommended to implement mutation hooks by deriving from BaseSuperNetModule and adding its classmethod mutate to this list.

  • arc_learning_rate (float) – Learning rate for architecture optimizer. Default: 3.0e-4

  • gradient_clip_val (float) – Clip gradients before optimizing models at each step. Default: None

ENAS

class nni.retiarii.strategy.ENAS(**kwargs)[source]

An RL controller learns to generate the best network on a super-net. See ENAS paper.

Each epoch consists of 2 steps:

  • First, the model parameters are trained.

  • Second, the ENAS RL agent is trained. The agent produces a sample of the model architecture to get the best reward.

Attention

ENAS requires the evaluator to report metrics via self.log in its validation_step. See explanation of reward_metric_name for details.

The supported mutation primitives of ENAS are:

Warning

The strategy, under the hood, creates a Lightning module that wraps the Lightning module defined in the evaluator, and enables manual optimization, although we assume the inner evaluator has enabled automatic optimization. We call the optimizers and schedulers configured in the evaluator, following the Lightning definitions on a best-effort basis, but we make no guarantee that the behavior is exactly the same as automatic optimization. We call advance_optimization() and advance_lr_schedulers() to invoke the optimizers and schedulers configured in the evaluator. Moreover, some advanced features like gradient clipping will not be supported. If you encounter any issues, please contact us by creating an issue.

Parameters:
  • mutation_hooks (list[MutationHook]) –

    Extra mutation hooks to support customized mutation on primitives other than built-ins.

    Mutation hooks are callables that take a Module and return a BaseSuperNetModule. They are invoked in traverse_and_mutate_submodules() on each submodule. For each submodule, the hooks in the list are invoked in order, so later hooks can see the results of earlier ones. Modules processed by mutation_hooks are replaced by the returned module, stored in nas_modules, and become the focus of the NAS algorithm.

    The default_mutation_hooks of each one-shot module are appended to this hook list.

    More specifically, each hook receives four arguments:

    1. a module that might be processed,

    2. the name of the module in its parent module,

    3. a memo dict whose usage depends on the particular algorithm,

    4. keyword arguments (configurations).

    Note that the memo is meant to be read and written by the hooks themselves. No hooks are called on the root module.

    The return value can be one of three kinds:

    1. a tuple of a BaseSuperNetModule or None, and a boolean,

    2. a boolean,

    3. a BaseSuperNetModule or None.

    The boolean value, suppress, indicates whether the subsequent hooks should be skipped. When it is true, the subsequent hooks are suppressed and never invoked. When no boolean is specified, it is assumed to be false. If None appears in place of a BaseSuperNetModule, the hook suggests keeping the module unchanged, and nothing happens.

    An example of a mutation hook is given in no_default_hook(). However, it is recommended to implement mutation hooks by deriving from BaseSuperNetModule and adding its classmethod mutate to this list.

  • ctrl_kwargs (dict) – Optional kwargs that will be passed to ReinforceController.

  • entropy_weight (float) – Weight of sample entropy loss in RL.

  • skip_weight (float) – Weight of skip penalty loss. See ReinforceController for details.

  • baseline_decay (float) – Decay factor of reward baseline, which is used to normalize the reward in RL. At each step, the new reward baseline will be equal to baseline_decay * baseline_old + reward * (1 - baseline_decay).

  • ctrl_steps_aggregate (int) – Number of steps for which the gradients will be accumulated, before updating the weights of RL controller.

  • ctrl_grad_clip (float) – Gradient clipping value of controller.

  • log_prob_every_n_step (int) – Log the probability of choices every N steps. Useful for visualization and debugging.

  • reward_metric_name (str or None) – The name of the metric which is treated as the reward. This has no effect when the evaluator returns only one metric. If there are multiple metrics, the one named reward_metric_name is used if it is specified; otherwise the metric with key name default is used. If no such metric is found, an exception is raised indicating that multiple metrics are found.
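The baseline update and the reward lookup described in the parameters above can be sketched as two small helpers. This is an illustrative reading of the descriptions, not NNI's code: `update_baseline` and `resolve_reward` are hypothetical names, and a plain dict stands in for the metrics logged by the evaluator.

```python
def update_baseline(baseline_old, reward, baseline_decay):
    """Moving average of the reward, following the formula given for baseline_decay."""
    return baseline_decay * baseline_old + reward * (1 - baseline_decay)

def resolve_reward(metrics, reward_metric_name=None):
    """Pick the reward from a dict of logged metrics."""
    if not metrics:
        raise ValueError("No metrics were reported.")
    if len(metrics) == 1:
        # With a single metric, reward_metric_name has no effect.
        return next(iter(metrics.values()))
    name = reward_metric_name if reward_metric_name is not None else "default"
    if name not in metrics:
        raise ValueError(f"Multiple metrics found {sorted(metrics)}, but none is named {name!r}.")
    return metrics[name]

baseline = 0.0
for reward in [1.0, 1.0, 1.0]:
    baseline = update_baseline(baseline, reward, baseline_decay=0.9)
```

A decay close to 1 makes the baseline react slowly to new rewards, which smooths the normalized reward seen by the controller.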

class nni.retiarii.oneshot.pytorch.enas.ReinforceController(fields, lstm_size=64, lstm_num_layers=1, tanh_constant=1.5, skip_target=0.4, temperature=None, entropy_reduction='sum')[source]

A controller that mutates the graph with RL.

Parameters:
  • fields (list of ReinforceField) – List of fields to choose.

  • lstm_size (int) – Controller LSTM hidden units.

  • lstm_num_layers (int) – Number of layers for stacked LSTM.

  • tanh_constant (float) – Logits will be equal to tanh_constant * tanh(logits). Don’t use tanh if this value is None.

  • skip_target (float) – Target probability that a skip connection (chosen by InputChoice) will appear. If the number of chosen inputs deviates from skip_target, a sample skip penalty, which is a KL divergence, is added.

  • temperature (float) – Temperature constant that divides the logits.

  • entropy_reduction (str) – Can be one of sum and mean. How the entropy of multi-input-choice is reduced.
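The effect of temperature and tanh_constant on the controller logits can be illustrated with a small helper. This is a sketch under assumptions: `transform_logits` is a hypothetical name, plain Python lists stand in for tensors, and applying the temperature before the tanh is an assumed ordering that the descriptions above do not specify.

```python
import math

def transform_logits(logits, temperature=None, tanh_constant=1.5):
    """Optionally divide logits by a temperature, then squash them with a scaled tanh."""
    if temperature is not None:
        logits = [x / temperature for x in logits]
    if tanh_constant is not None:
        # Bounds every logit to the open interval (-tanh_constant, tanh_constant).
        logits = [tanh_constant * math.tanh(x) for x in logits]
    return logits

squashed = transform_logits([100.0, -100.0])
scaled = transform_logits([2.0, 4.0], temperature=2.0, tanh_constant=None)
```

Bounding the logits limits how confident the controller can become about any single choice, which keeps sampling exploratory.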

GumbelDARTS

class nni.retiarii.strategy.GumbelDARTS(**kwargs)[source]

Choose the best block by using Gumbel Softmax random sampling and differentiable training. See FBNet and SNAS.

This is a DARTS-based method that uses gumbel-softmax to simulate a one-hot distribution. Essentially, it tries to mimic the behavior of sampling one path on forward by gradually cooling down the temperature, aiming to bridge the gap between differentiable architecture weights and discretization of architectures.

New in version 2.8: Supports searching for ValueChoices on operations, with the technique described in FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions.

The supported mutation primitives of GumbelDARTS are:

Note

GumbelDARTS runs a weighted sum of possible architectures under the hood. Please bear in mind that it will be slower and consume more memory than training a single architecture. The common practice is to down-scale the network (e.g., smaller depth / width) for speedup.

Warning

The strategy, under the hood, creates a Lightning module that wraps the Lightning module defined in the evaluator, and enables manual optimization, although we assume the inner evaluator has enabled automatic optimization. We call the optimizers and schedulers configured in the evaluator, following the Lightning definitions on a best-effort basis, but we make no guarantee that the behavior is exactly the same as automatic optimization. We call advance_optimization() and advance_lr_schedulers() to invoke the optimizers and schedulers configured in the evaluator. Moreover, some advanced features like gradient clipping will not be supported. If you encounter any issues, please contact us by creating an issue.

Parameters:
  • mutation_hooks (list[MutationHook]) –

    Extra mutation hooks to support customized mutation on primitives other than built-ins.

    Mutation hooks are callables that take a Module and return a BaseSuperNetModule. They are invoked in traverse_and_mutate_submodules() on each submodule. For each submodule, the hooks in the list are invoked in order, so later hooks can see the results of earlier ones. Modules processed by mutation_hooks are replaced by the returned module, stored in nas_modules, and become the focus of the NAS algorithm.

    The default_mutation_hooks of each one-shot module are appended to this hook list.

    More specifically, each hook receives four arguments:

    1. a module that might be processed,

    2. the name of the module in its parent module,

    3. a memo dict whose usage depends on the particular algorithm,

    4. keyword arguments (configurations).

    Note that the memo is meant to be read and written by the hooks themselves. No hooks are called on the root module.

    The return value can be one of three kinds:

    1. a tuple of a BaseSuperNetModule or None, and a boolean,

    2. a boolean,

    3. a BaseSuperNetModule or None.

    The boolean value, suppress, indicates whether the subsequent hooks should be skipped. When it is true, the subsequent hooks are suppressed and never invoked. When no boolean is specified, it is assumed to be false. If None appears in place of a BaseSuperNetModule, the hook suggests keeping the module unchanged, and nothing happens.

    An example of a mutation hook is given in no_default_hook(). However, it is recommended to implement mutation hooks by deriving from BaseSuperNetModule and adding its classmethod mutate to this list.

  • gumbel_temperature (float) – The initial temperature used in gumbel-softmax.

  • use_temp_anneal (bool) – If true, linear annealing is applied to gumbel_temperature. Otherwise, run at a fixed temperature. See SNAS for details. Default: false.

  • min_temp (float) – The minimal temperature for annealing. No need to set this if use_temp_anneal is false.

  • arc_learning_rate (float) – Learning rate for architecture optimizer. Default: 3.0e-4

  • gradient_clip_val (float) – Clip gradients before optimizing models at each step. Default: None
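To make gumbel_temperature, use_temp_anneal and min_temp more concrete, here is a plain-Python sketch of Gumbel-softmax sampling with a linear temperature schedule. It is an illustration, not NNI's implementation: `gumbel_softmax` and `annealed_temperature` are hypothetical helpers, and the exact schedule (linear interpolation over epochs, clipped at min_temp) is an assumption.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Perturb logits with Gumbel(0, 1) noise, then apply a temperature-scaled softmax."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    shift = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def annealed_temperature(initial, min_temp, epoch, max_epochs):
    """Move linearly from `initial` at epoch 0 towards `min_temp`, never going below it."""
    fraction = epoch / max_epochs
    return max(min_temp, initial + (min_temp - initial) * fraction)

rng = random.Random(0)
alpha = [1.0, 0.5, -0.5]  # architecture parameters of one choice
hot = gumbel_softmax(alpha, annealed_temperature(1.0, 0.33, epoch=0, max_epochs=10), rng)
cold = gumbel_softmax(alpha, annealed_temperature(1.0, 0.33, epoch=10, max_epochs=10), rng)
```

Lower temperatures push the sampled weights towards a one-hot vector, which is the behavior the annealing aims for.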

RandomOneShot

class nni.retiarii.strategy.RandomOneShot(**kwargs)[source]

Train a super-net with uniform path sampling. See reference.

In each epoch, model parameters are trained after a uniformly random sampling of each choice. Notably, the exported result is also a random sample of the search space.

The supported mutation primitives of RandomOneShot are:

This strategy assumes inner evaluator has set automatic optimization to true.

Parameters:

mutation_hooks (list[MutationHook]) –

Extra mutation hooks to support customized mutation on primitives other than built-ins.

Mutation hooks are callables that take a Module and return a BaseSuperNetModule. They are invoked in traverse_and_mutate_submodules() on each submodule. For each submodule, the hooks in the list are invoked in order, so later hooks can see the results of earlier ones. Modules processed by mutation_hooks are replaced by the returned module, stored in nas_modules, and become the focus of the NAS algorithm.

The default_mutation_hooks of each one-shot module are appended to this hook list.

More specifically, each hook receives four arguments:

  1. a module that might be processed,

  2. the name of the module in its parent module,

  3. a memo dict whose usage depends on the particular algorithm,

  4. keyword arguments (configurations).

Note that the memo is meant to be read and written by the hooks themselves. No hooks are called on the root module.

The return value can be one of three kinds:

  1. a tuple of a BaseSuperNetModule or None, and a boolean,

  2. a boolean,

  3. a BaseSuperNetModule or None.

The boolean value, suppress, indicates whether the subsequent hooks should be skipped. When it is true, the subsequent hooks are suppressed and never invoked. When no boolean is specified, it is assumed to be false. If None appears in place of a BaseSuperNetModule, the hook suggests keeping the module unchanged, and nothing happens.

An example of a mutation hook is given in no_default_hook(). However, it is recommended to implement mutation hooks by deriving from BaseSuperNetModule and adding its classmethod mutate to this list.

sub_state_dict(arch)[source]

Export the state dict of a chosen architecture. This is useful in weight inheritance of subnet as was done in SPOS, OFA and AutoFormer.

Parameters:

arch (dict[str, Any]) – The architecture to be exported.

Examples

To obtain a state dict of a chosen architecture, you can use the following code:

import torch
from nni.retiarii.strategy import RandomOneShot

# Train or load a random one-shot strategy
experiment.run(...)
best_arch = experiment.export_top_models()[0]

# If users are to manipulate checkpoint in an evaluator,
# they should use this `no_fixed_arch()` statement to make sure
# instantiating model space works properly, as evaluator is running in a fixed context.
from nni.nas.fixed import no_fixed_arch
with no_fixed_arch():
    model_space = MyModelSpace()    # must create a model space again here

# If the strategy has been created previously, directly use it.
strategy = experiment.strategy

# Or load a strategy from a checkpoint
strategy = RandomOneShot()
strategy.attach_model(model_space)
strategy.model.load_state_dict(torch.load(...))

state_dict = strategy.sub_state_dict(best_arch)

The state dict can be directly loaded into a fixed architecture using fixed_arch:

with fixed_arch(best_arch):
    model = MyModelSpace()
model.load_state_dict(state_dict)

Another common use case is to search for a subnet on a supernet with a multi-trial strategy (e.g., evolution). The key step here is to write a customized evaluator that loads the checkpoint from the supernet and runs evaluations:

def evaluate_model(model_fn):
    model = model_fn()

    # Put this into `on_validation_start` or `on_train_start` if using Lightning evaluator.
    model.load_state_dict(get_subnet_state_dict())
    # Batch-norm calibration is often needed for better performance.
    # It typically runs several hundred mini-batches to
    # re-compute the running statistics of batch normalization for subnets.
    # See https://arxiv.org/abs/1904.00420 for details.
    finetune_bn(model)
    # Alternatively, you can also set batch norm to train mode to disable running statistics.
    # model.train()

    # Evaluate the model on the validation dataloader.
    evaluate_acc(model)

get_subnet_state_dict() here is a bit tricky. It’s mostly the same as the previous use case, but the architecture dict should be obtained from mutation_summary in get_current_parameter(), which corresponds to the architecture of the current trial:

import nni

def get_subnet_state_dict():
    random_oneshot_strategy = load_random_oneshot_strategy()     # Load a strategy from checkpoint, same as above
    arch_dict = nni.get_current_parameter()['mutation_summary']
    print('Architecture dict:', arch_dict)                       # Print here to see what it looks like
    return random_oneshot_strategy.sub_state_dict(arch_dict)

Proxyless

class nni.retiarii.strategy.Proxyless(**kwargs)[source]

A low-memory-consuming optimized version of differentiable architecture search. See reference.

This is a DARTS-based method that resamples the architecture to reduce memory consumption. Essentially, it samples one path on forward, and implements its own backward to update the architecture parameters based on only one path.

The supported mutation primitives of Proxyless are:

Warning

The strategy, under the hood, creates a Lightning module that wraps the Lightning module defined in the evaluator, and enables manual optimization, although we assume the inner evaluator has enabled automatic optimization. We call the optimizers and schedulers configured in the evaluator, following the Lightning definitions on a best-effort basis, but we make no guarantee that the behavior is exactly the same as automatic optimization. We call advance_optimization() and advance_lr_schedulers() to invoke the optimizers and schedulers configured in the evaluator. Moreover, some advanced features like gradient clipping will not be supported. If you encounter any issues, please contact us by creating an issue.

Parameters:
  • mutation_hooks (list[MutationHook]) –

    Extra mutation hooks to support customized mutation on primitives other than built-ins.

    Mutation hooks are callables that take a Module and return a BaseSuperNetModule. They are invoked in traverse_and_mutate_submodules() on each submodule. For each submodule, the hooks in the list are invoked in order, so later hooks can see the results of earlier ones. Modules processed by mutation_hooks are replaced by the returned module, stored in nas_modules, and become the focus of the NAS algorithm.

    The default_mutation_hooks of each one-shot module are appended to this hook list.

    More specifically, each hook receives four arguments:

    1. a module that might be processed,

    2. the name of the module in its parent module,

    3. a memo dict whose usage depends on the particular algorithm,

    4. keyword arguments (configurations).

    Note that the memo is meant to be read and written by the hooks themselves. No hooks are called on the root module.

    The return value can be one of three kinds:

    1. a tuple of a BaseSuperNetModule or None, and a boolean,

    2. a boolean,

    3. a BaseSuperNetModule or None.

    The boolean value, suppress, indicates whether the subsequent hooks should be skipped. When it is true, the subsequent hooks are suppressed and never invoked. When no boolean is specified, it is assumed to be false. If None appears in place of a BaseSuperNetModule, the hook suggests keeping the module unchanged, and nothing happens.

    An example of a mutation hook is given in no_default_hook(). However, it is recommended to implement mutation hooks by deriving from BaseSuperNetModule and adding its classmethod mutate to this list.

  • arc_learning_rate (float) – Learning rate for architecture optimizer. Default: 3.0e-4

  • gradient_clip_val (float) – Clip gradients before optimizing models at each step. Default: None

Customization

Multi-trial

class nni.retiarii.Sampler[source]

Handles Mutator.choice() calls.

class nni.retiarii.strategy.BaseStrategy[source]
nni.retiarii.execution.budget_exhausted()[source]
nni.retiarii.execution.get_and_register_default_listener(engine)[source]
nni.retiarii.execution.get_execution_engine()[source]
nni.retiarii.execution.init_execution_engine(config, port, url_prefix)[source]
nni.retiarii.execution.is_stopped_exec(model)[source]
nni.retiarii.execution.list_models(*models)[source]
nni.retiarii.execution.query_available_resources()[source]
nni.retiarii.execution.set_execution_engine(engine)[source]
nni.retiarii.execution.submit_models(*models)[source]
nni.retiarii.execution.wait_models(*models)[source]

One-shot

base_lightning

class nni.retiarii.oneshot.pytorch.base_lightning.BaseOneShotLightningModule(model, mutation_hooks=None)[source]

The base class for all one-shot NAS modules.

In NNI, we try to separate the “search” part and the “training” part in one-shot NAS. The “training” part is defined with the evaluator interface (it has to be the Lightning evaluator interface to work with one-shot). Since the Lightning evaluator has already broken down the training into minimal building blocks, we can re-assemble them after combining them with the “search” part of a particular algorithm.

After the re-assembling, this module has defined all the search + training. The experiment can use a lightning trainer (which is another part in the evaluator) to train this module, so as to complete the search process.

Essential functions such as preprocessing the user’s model, redirecting Lightning hooks for the user’s model, configuring optimizers, and exporting NAS results are implemented in this class.

nas_modules

Modules that have been mutated, which the search algorithms should care about.

Type:

list[BaseSuperNetModule]

model

PyTorch lightning module. A model space with training recipe defined (wrapped by LightningModule in evaluator).

Type:

pl.LightningModule

Parameters:
  • model (pytorch_lightning.LightningModule) – A LightningModule that defines computations, train/val loops, and optimizers in a single class. When used in NNI, it is the combination of instances of evaluator + base model (to be precise, a base model wrapped with LightningModule in evaluator).

  • mutation_hooks (list[MutationHook]) –

    Extra mutation hooks to support customized mutation on primitives other than built-ins.

    Mutation hooks are callables that take a Module and return a BaseSuperNetModule. They are invoked in traverse_and_mutate_submodules() on each submodule. For each submodule, the hooks in the list are invoked in order, so later hooks can see the results of earlier ones. Modules processed by mutation_hooks are replaced by the returned module, stored in nas_modules, and become the focus of the NAS algorithm.

    The default_mutation_hooks of each one-shot module are appended to this hook list.

    More specifically, each hook receives four arguments:

    1. a module that might be processed,

    2. the name of the module in its parent module,

    3. a memo dict whose usage depends on the particular algorithm,

    4. keyword arguments (configurations).

    Note that the memo is meant to be read and written by the hooks themselves. No hooks are called on the root module.

    The return value can be one of three kinds:

    1. a tuple of a BaseSuperNetModule or None, and a boolean,

    2. a boolean,

    3. a BaseSuperNetModule or None.

    The boolean value, suppress, indicates whether the subsequent hooks should be skipped. When it is true, the subsequent hooks are suppressed and never invoked. When no boolean is specified, it is assumed to be false. If None appears in place of a BaseSuperNetModule, the hook suggests keeping the module unchanged, and nothing happens.

    An example of a mutation hook is given in no_default_hook(). However, it is recommended to implement mutation hooks by deriving from BaseSuperNetModule and adding its classmethod mutate to this list.
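The hook protocol described for mutation_hooks (four arguments, three kinds of return value, and the suppress flag) can be sketched without any NNI dependency. This is only an illustration of the described semantics: `apply_hooks`, `replace_conv` and `count_calls` are hypothetical names, and plain strings stand in for modules.

```python
def apply_hooks(module, name, memo, mutate_kwargs, hooks):
    """Run hooks in order on one submodule and return the (possibly replaced) module."""
    for hook in hooks:
        result = hook(module, name, memo, mutate_kwargs)
        # Normalize the three accepted return kinds into (replacement, suppress).
        if isinstance(result, tuple):
            replacement, suppress = result
        elif isinstance(result, bool):
            replacement, suppress = None, result
        else:
            replacement, suppress = result, False
        if replacement is not None:
            module = replacement  # later hooks see the result of earlier hooks
        if suppress:
            break  # the remaining hooks are never invoked
    return module

def replace_conv(module, name, memo, mutate_kwargs):
    if module == "conv":
        memo[name] = "replaced"   # hooks read and write the memo themselves
        return "super_conv", True  # replace the module and suppress later hooks
    return None                    # keep the module unchanged

def count_calls(module, name, memo, mutate_kwargs):
    memo["calls"] = memo.get("calls", 0) + 1
    return False                   # no replacement, do not suppress

memo = {}
result_conv = apply_hooks("conv", "block.conv", memo, {}, [replace_conv, count_calls])
result_relu = apply_hooks("relu", "block.relu", memo, {}, [replace_conv, count_calls])
```

In this run, count_calls is skipped for the conv module because replace_conv returned a true suppress flag, and it runs once for the relu module.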

advance_lr_schedulers(batch_idx)[source]

Advance the learning rates, when manual optimization is turned on.

The full implementation is here. We only include a partial implementation here. Advanced features like Reduce-lr-on-plateau are not supported.

advance_optimization(loss, batch_idx, gradient_clip_val=None, gradient_clip_algorithm=None)[source]

Run the optimizer defined in evaluators, when manual optimization is turned on.

Call this method when the model should be optimized. To keep it as neat as possible, we only implement the basic zero_grad, backward, grad_clip, and step here. Many hooks and pre/post-processing are omitted. Override this method if you need more advanced behavior.

The full optimizer step can be found here. We only implement part of the optimizer loop here.

Parameters:

batch_idx (int) – The current batch index.

architecture_optimizers()[source]

Get the optimizers configured in configure_architecture_optimizers().

configure_architecture_optimizers()[source]

Hook kept for subclasses. A specific NAS method inheriting this base class should return its architecture optimizers here if architecture parameters are needed. Note that lr schedulers are not supported now for architecture_optimizers.

Return type:

Optimizers used by a specific NAS algorithm. Return None if no architecture optimizers are needed.

configure_optimizers()[source]

Transparently configure optimizers for the inner model, unless one-shot algorithm has its own optimizer (via configure_architecture_optimizers()), in which case, the optimizer will be appended to the list.

The return value is still one of the 6 types defined in PyTorch-Lightning.

default_mutation_hooks()[source]

Override this to define class-default mutation hooks.

export()[source]

Export the NAS result, ideally the best choice for each of the nas_modules. You may implement an export method for your customized nas_modules.

Returns:

Keys are names of nas_modules, and values are the choice indices of them.

Return type:

dict

export_probs()[source]

Export the probability that each choice in the search space is chosen.

Note

Modules that do not implement this method are simply ignored.

Returns:

In most cases, keys are names of nas_modules suffixed with / and choice name. Values are the probability / logits depending on the implementation.

Return type:

dict
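As an illustration of the key format described above, the following sketch builds such a dict for a differentiable case where the values are softmax probabilities. It is not NNI's implementation: `export_probs` here is a free function, and `alphas` and `choices` are hypothetical stand-ins for the architecture parameters and choice names of each module.

```python
import math

def export_probs(alphas, choices):
    """Map '<module name>/<choice name>' to the softmax probability of that choice."""
    probs = {}
    for name, alpha in alphas.items():
        shift = max(alpha)  # subtract the maximum for numerical stability
        exps = [math.exp(a - shift) for a in alpha]
        total = sum(exps)
        for choice, e in zip(choices[name], exps):
            probs[f"{name}/{choice}"] = e / total
    return probs

alphas = {"layer1": [0.0, 0.0], "layer2": [2.0, 0.0, 0.0]}
choices = {"layer1": ["conv", "pool"], "layer2": ["relu", "gelu", "tanh"]}
probs = export_probs(alphas, choices)
```

Other implementations may export raw logits instead, so the values should be interpreted per strategy.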

mutate_kwargs()[source]

Extra keyword arguments passed to mutation hooks. Usually algo-specific.

resample(memo=None)[source]

Trigger resampling for each of the nas_modules. Sometimes (e.g., in differentiable cases), it does nothing.

Parameters:

memo (dict[str, Any]) – Used to ensure the consistency of samples with the same label.

Returns:

Sampled architecture.

Return type:

dict
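The role of memo can be illustrated with a small pure-Python sketch. It is not NNI's implementation: ToyChoice and resample_all are hypothetical names, and the only assumption taken from the text is that modules sharing a label must receive the same sample.

```python
import random


class ToyChoice:
    """Hypothetical stand-in for a super-net module with one labelled choice."""

    def __init__(self, label, candidates):
        self.label = label
        self.candidates = candidates
        self.sampled = None

    def resample(self, memo):
        # Reuse the sample already recorded for this label, if any.
        if self.label in memo:
            self.sampled = memo[self.label]
            return {}  # nothing new was sampled
        self.sampled = random.choice(self.candidates)
        return {self.label: self.sampled}


def resample_all(modules, memo=None):
    """Resample every module, sharing results through a common memo."""
    result = dict(memo or {})
    for module in modules:
        result.update(module.resample(result))
    return result


modules = [
    ToyChoice("op", ["conv", "pool"]),
    ToyChoice("op", ["conv", "pool"]),  # same label, so it must follow the first one
    ToyChoice("width", [16, 32]),
]
arch = resample_all(modules)
```

Passing a pre-filled memo, for example resample_all(modules, {"op": "pool"}), pins that label while the remaining labels are still sampled.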

search_space_spec()[source]

Get the search space specification from nas_modules.

Returns:

Key is the name of the choice, value is the corresponding ParameterSpec.

Return type:

dict

class nni.retiarii.oneshot.pytorch.base_lightning.BaseSuperNetModule[source]

Mutated module in super-net. Usually, the feed-forward of the module itself is undefined. It has to be resampled with resample() so that a specific path is selected. (Sometimes, this is not required. For example, differentiable super-net.)

A super-net module usually corresponds to one sample. There are two exceptions:

  • A module can have multiple parameter specs. For example, a convolution-2d can sample kernel size and channels at the same time.

  • Multiple modules can share one parameter spec. For example, multiple layer choices with the same label.

For value choice compositions, the parameter specs are bound to the underlying (original) value choices, rather than their compositions.

export(memo)[source]

Export the final architecture within this module. It should have the same keys as search_space_spec().

Parameters:

memo (dict[str, Any]) – Use memo to avoid exporting the same label multiple times.

export_probs(memo)[source]

Export the probability / logits of each choice being chosen.

Parameters:

memo (dict[str, Any]) – Use memo to avoid exporting the same label multiple times.

classmethod mutate(module, name, memo, mutate_kwargs)[source]

This is a mutation hook that creates a BaseSuperNetModule. The method should be implemented in each specific super-net module, because they usually have specific rules about what kind of modules to operate on.

Parameters:
  • module (nn.Module) – The module to be mutated (replaced).

  • name (str) – Name of this module. With full prefix. For example, module1.block1.conv.

  • memo (dict) – Memo to enable sharing parameters among mutated modules. It should be read and written by mutate functions themselves.

  • mutate_kwargs (dict) – Algo-related hyper-parameters, and some auxiliary information.

Returns:

The mutation result, along with an optional boolean flag indicating whether to suppress follow-up mutation hooks. See BaseOneShotLightningModule for details.

Return type:

Union[BaseSuperNetModule, bool, tuple[BaseSuperNetModule, bool]]
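The three shapes of the return value can be handled uniformly by a caller. The sketch below shows one way to do so; normalize_hook_result is a hypothetical helper, strings stand in for modules, and reading a bare boolean as "no replacement, only a suppress flag" is an assumption rather than documented behavior.

```python
def normalize_hook_result(result):
    """Split a hook's return value into (module_or_None, suppress_flag)."""
    if isinstance(result, tuple):
        module, suppress = result
        return module, bool(suppress)
    if isinstance(result, bool):
        # Assumption: a bare bool carries only the suppress flag.
        return None, result
    # A module (or None when the hook did not handle this module).
    return result, False


# Strings stand in for BaseSuperNetModule instances.
only_module = normalize_hook_result("mixed_layer")
module_and_flag = normalize_hook_result(("mixed_layer", True))
only_flag = normalize_hook_result(True)
```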

resample(memo)[source]

Resample the super-net module.

Parameters:

memo (dict[str, Any]) – Used to ensure the consistency of samples with the same label.

Returns:

Sampled result. If nothing new is sampled, it should return an empty dict.

Return type:

dict

search_space_spec()[source]

Space specification (sample points). Mapping from spec name to ParameterSpec. The names in choices should be in the same format as in export.

For example:

{"layer1": ParameterSpec(values=["conv", "pool"])}

nni.retiarii.oneshot.pytorch.base_lightning.no_default_hook(module, name, memo, mutate_kwargs)[source]

Add this hook at the end of your hook list to raise error for unsupported mutation primitives.

nni.retiarii.oneshot.pytorch.base_lightning.traverse_and_mutate_submodules(root_module, hooks, mutate_kwargs, topdown=True)[source]

Traverse the module-tree of root_module, and call hooks on every tree node.

Parameters:
  • root_module (nn.Module) – User-defined model space. Since this method is called in the __init__ of BaseOneShotLightningModule, it’s usually a pytorch_lightning.LightningModule. The mutation will be in-place on root_module.

  • hooks (list[MutationHook]) – List of mutation hooks. See BaseOneShotLightningModule for how to write hooks. When a hook returns a module, the original module is replaced (mutated) with the new module.

  • mutate_kwargs (dict) – Extra keyword arguments passed to hooks.

  • topdown (bool, default = True) – If topdown is true, hooks are called on a node first, before traversing its sub-modules (i.e., pre-order DFS). Otherwise, sub-modules are traversed first, before calling hooks on this node (i.e., post-order DFS).

Returns:

modules – The replace result.

Return type:

dict[str, nn.Module]
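The difference between the two traversal orders can be seen in a toy version of the traversal. This is only a sketch: Node stands in for nn.Module, hooks take just (node, name), and stopping at the first hook that returns a replacement is a simplification. All names here are hypothetical.

```python
class Node:
    """Hypothetical stand-in for nn.Module: a kind plus named children."""

    def __init__(self, kind, **children):
        self.kind = kind
        self.children = children


def traverse_and_mutate(root, hooks, topdown=True, prefix=""):
    """Call hooks on every descendant; replace a child when a hook returns a Node."""
    replaced = {}
    for name, child in list(root.children.items()):
        full_name = prefix + name
        if not topdown:  # post-order: visit the sub-tree first
            replaced.update(traverse_and_mutate(child, hooks, topdown, full_name + "."))
        for hook in hooks:
            result = hook(child, full_name)
            if isinstance(result, Node):
                root.children[name] = child = result  # in-place replacement
                replaced[full_name] = result
                break
        if topdown:  # pre-order: visit the (possibly replaced) sub-tree afterwards
            replaced.update(traverse_and_mutate(child, hooks, topdown, full_name + "."))
    return replaced


order = []


def record(node, name):
    # Hook that only records the visiting order.
    order.append(name)


def swap_choice(node, name):
    # Hook that replaces every "choice" node with a "mixed" node.
    if node.kind == "choice":
        return Node("mixed", **node.children)
    return None


tree = Node("root", block=Node("seq", op=Node("choice")))
replaced = traverse_and_mutate(tree, [record, swap_choice], topdown=True)
```

With topdown=True the hooks see block before block.op; with topdown=False the order is reversed.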

dataloader

class nni.retiarii.oneshot.pytorch.dataloader.ConcatLoader(loaders, mode='min_size')[source]

This loader is the same as CombinedLoader in PyTorch-Lightning, but it concatenates sub-loaders instead of loading them in parallel.

Parameters:
  • loaders (dict[str, Any]) –

    For example,

    {
        "train": DataLoader(train_dataset),
        "val": DataLoader(val_dataset)
    }
    

    In this example, the loader will first produce the batches from “train”, then “val”.

  • mode (str) – Only “min_size” is supported for now.
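The concatenation order described above can be sketched with itertools. Plain lists stand in for DataLoader objects, and concat_loaders is a hypothetical helper, not the ConcatLoader implementation.

```python
from itertools import chain


def concat_loaders(loaders):
    """Yield every batch of each loader in turn, in the dict's insertion order."""
    return chain.from_iterable(loaders.values())


# Lists of (input, target) tuples stand in for DataLoader(train_dataset) etc.
loaders = {
    "train": [("x0", "y0"), ("x1", "y1")],
    "val": [("vx0", "vy0")],
}
batches = list(concat_loaders(loaders))
```

All "train" batches are produced before the first "val" batch, in contrast to a loader that draws from both in parallel.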

supermodule.differentiable

class nni.retiarii.oneshot.pytorch.supermodule.differentiable.DifferentiableMixedCell(op_factory, num_nodes, num_ops_per_node, num_predecessors, preprocessor, postprocessor, concat_dim, memo, mutate_kwargs, label)[source]

Implementation of Cell under differentiable context.

Similar to PathSamplingCell, this cell only handles cells of specific kinds (e.g., with loose end).

An architecture parameter is created on each edge of the fully-connected graph.

export(memo)[source]

Tricky export.

Reference: https://github.com/quark0/darts/blob/f276dd346a09ae3160f8e3aca5c7b193fda1da37/cnn/model_search.py#L135

export_probs(memo)[source]

When exporting probabilities, we follow the structure of the architecture alpha.

resample(memo)[source]

Differentiable doesn’t need to resample.

class nni.retiarii.oneshot.pytorch.supermodule.differentiable.DifferentiableMixedInput(n_candidates, n_chosen, alpha, softmax, label)[source]

Mixed input. Forward returns a weighted sum of candidates. Implementation is very similar to DifferentiableMixedLayer.

Parameters:
  • n_candidates (int) – Expected number of input candidates.

  • n_chosen (int) – Expected number of inputs finally chosen.

  • alpha (Tensor) – Tensor that stores the “learnable” weights.

  • softmax (nn.Module) – Customizable softmax function. Usually nn.Softmax(-1).

  • label (str) – Name of the choice.

label

Name of the choice.

Type:

str

export(memo)[source]

Choose the operator with the top n_chosen logits.
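A minimal sketch of a top-n_chosen selection over logits follows. The helper name and the choice to return indices in ascending order are assumptions made for illustration.

```python
def top_n_indices(logits, n_chosen):
    """Return the indices of the n_chosen largest logits, in ascending index order."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:n_chosen])


# Four input candidates, two of which are kept.
chosen = top_n_indices([0.1, 1.5, -0.3, 0.9], n_chosen=2)
```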

forward(inputs)[source]

Forward takes a list of input candidates.

named_parameters(*args, **kwargs)[source]

Named parameters excluding architecture parameters.

parameters(*args, **kwargs)[source]

Parameters excluding architecture parameters.

reduction(items, weights)[source]

Override this for customized reduction.

resample(memo)[source]

Do nothing. Differentiable layer doesn’t need resample.

class nni.retiarii.oneshot.pytorch.supermodule.differentiable.DifferentiableMixedLayer(paths, alpha, softmax, label)[source]

Mixed layer, in which fprop is decided by a weighted sum of several layers. Proposed in DARTS: Differentiable Architecture Search.

The weight alpha is usually learnable, and optimized on validation dataset.

A differentiable sampling layer requires all operators to return outputs of the same shape for a given input, because all outputs are combined in a weighted sum to produce the final output.
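The weighted sum can be sketched in plain Python. Lists of floats stand in for tensors, and softmax and mixed_forward are hypothetical helpers rather than the layer's actual code.

```python
import math


def softmax(alpha):
    shift = max(alpha)  # subtract the max for numerical stability
    exps = [math.exp(a - shift) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_forward(paths, alpha, x):
    """Weighted sum of every candidate's output; all outputs must share a shape."""
    weights = softmax(alpha)
    outputs = [op(x) for _, op in paths]
    if len({len(out) for out in outputs}) != 1:
        raise ValueError("all candidates must return outputs of the same shape")
    size = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(size)]


# Each path is a (name, callable) tuple, mirroring the paths parameter.
paths = [
    ("double", lambda v: [2 * t for t in v]),
    ("negate", lambda v: [-t for t in v]),
]
# Equal alphas give equal weights of 0.5 each.
y = mixed_forward(paths, alpha=[0.0, 0.0], x=[1.0, 2.0])
```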

Parameters:
  • paths (list[tuple[str, nn.Module]]) – Layers to choose from. Each is a tuple of name, and its module.

  • alpha (Tensor) – Tensor that stores the “learnable” weights.

  • softmax (nn.Module) – Customizable softmax function. Usually nn.Softmax(-1).

  • label (str) – Name of the choice.

op_names

Operator names.

Type:

str

label

Name of the choice.

Type:

str

export(memo)[source]

Choose the operator with the maximum logit.

forward(*args, **kwargs)[source]

The forward of mixed layer accepts same arguments as its sub-layer.

named_parameters(*args, **kwargs)[source]

Named parameters excluding architecture parameters.

parameters(*args, **kwargs)[source]

Parameters excluding architecture parameters.

reduction(items, weights)[source]

Override this for customized reduction.

resample(memo)[source]

Do nothing. Differentiable layer doesn’t need resample.

class nni.retiarii.oneshot.pytorch.supermodule.differentiable.DifferentiableMixedRepeat(blocks, depth, softmax, memo)[source]

Implementation of Repeat in a differentiable supernet. Result is a weighted sum of possible prefixes, sliced by possible depths.

If the output is not a single tensor, it will be summed at every independent dimension. See weighted_sum() for details.
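The "weighted sum of prefixes" idea can be sketched with scalars in place of tensors. mixed_repeat is a hypothetical helper, and the weights are given directly instead of being derived from learnable parameters.

```python
def mixed_repeat(blocks, depths, weights, x):
    """Weighted sum of the outputs reached after each candidate depth."""
    result = 0.0
    for depth_reached, block in enumerate(blocks, start=1):
        x = block(x)  # run the chain once; every prefix output is reused
        if depth_reached in depths:
            result += weights[depths.index(depth_reached)] * x
    return result


# Three identical blocks that add 1; candidate depths are 1 and 3.
blocks = [lambda v: v + 1.0] * 3
out = mixed_repeat(blocks, depths=[1, 3], weights=[0.25, 0.75], x=0.0)
```

Here the prefix of depth 1 yields 1.0 and the prefix of depth 3 yields 3.0, so the result is 0.25 * 1.0 + 0.75 * 3.0.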

export(memo)[source]

Choose argmax for each leaf value choice.

export_probs(memo)[source]

Export the weight for every leaf value choice.

reduction(items, weights, depths)[source]

Override this for customized reduction.

resample(memo)[source]

Do nothing.

class nni.retiarii.oneshot.pytorch.supermodule.differentiable.GumbelSoftmax(dim=-1)[source]

Wrapper of F.gumbel_softmax. dim = -1 by default.
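For readers unfamiliar with the Gumbel-softmax relaxation, the following pure-Python sketch shows the underlying computation: logits are perturbed with Gumbel(0, 1) noise, divided by a temperature, and passed through softmax. It illustrates the math only; the helper name and its rng argument are hypothetical, and it does not replace F.gumbel_softmax.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Softmax over logits perturbed by Gumbel(0, 1) noise, with temperature tau."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    shift = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


probs = gumbel_softmax([1.0, 0.5, -2.0], tau=0.5, rng=random.Random(0))
```

A lower tau pushes the result closer to a one-hot vector, while a higher tau makes it closer to uniform.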

class nni.retiarii.oneshot.pytorch.supermodule.differentiable.MixedOpDifferentiablePolicy(operation, memo, mutate_kwargs)[source]

Implements the differentiable sampling in mixed operation.

One mixed operation can have multiple value choices in its arguments. Thus the _arch_alpha here is a parameter dict, and named_parameters filters out multiple parameters with _arch_alpha as its prefix.

When this class is asked for forward_argument, it returns a distribution, i.e., a dict from int to float based on its weights.

All the parameters (_arch_alpha, parameters(), _softmax) are saved as attributes of operation, rather than self, because this class itself is not an nn.Module, and parameters saved here won’t be optimized.

export(operation, memo)[source]

Export is argmax for each leaf value choice.

export_probs(operation, memo)[source]

Export the weight for every leaf value choice.

resample(operation, memo)[source]

Differentiable. Do nothing in resample.

supermodule.sampling

class nni.retiarii.oneshot.pytorch.supermodule.sampling.MixedOpPathSamplingPolicy(operation, memo, mutate_kwargs)[source]

Implements the path sampling in mixed operation.

One mixed operation can have multiple value choices in its arguments. Each value choice can be further decomposed into “leaf value choices”. We sample the leaf nodes and compose them into the values of the arguments.
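A small sketch of sampling leaves and composing them into an argument value follows. The labels, candidates, and the width * ratio composition are all hypothetical; the point is only that leaves are sampled once per label and the argument is derived from them.

```python
import random


def sample_leaves(leaf_candidates, memo, rng=random):
    """Sample each leaf value choice once, skipping labels already in memo."""
    sampled = {}
    for label, candidates in leaf_candidates.items():
        if label not in memo:
            sampled[label] = rng.choice(candidates)
    return sampled


leaves = {"width": [16, 32], "ratio": [2, 4]}
memo = {"ratio": 4}  # "ratio" was already sampled elsewhere
memo.update(sample_leaves(leaves, memo, random.Random(0)))

# Compose the argument value from its leaves, e.g. hidden = width * ratio.
hidden = memo["width"] * memo["ratio"]
```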

export(operation, memo)[source]

Export is also random for each leaf value choice.

resample(operation, memo)[source]

Random sample for each leaf value choice.

class nni.retiarii.oneshot.pytorch.supermodule.sampling.PathSamplingCell(op_factory, num_nodes, num_ops_per_node, num_predecessors, preprocessor, postprocessor, concat_dim, memo, mutate_kwargs, label)[source]

The implementation of super-net cell follows DARTS.

When factory_used is true, it reconstructs the cell for every possible combination of operation and input index, because for different input indices, the cell factory could instantiate different operations (e.g., with different stride). On export, we first find the best (operation, input) pairs, then select the best num_ops_per_node.

loose_end is not supported yet, because it will cause more problems (e.g., shape mismatch). We assume loose_end to be all regardless of its configuration.

A supernet cell can’t slim its own weight to fit into a sub network, which is also a known issue.

export(memo)[source]

Randomly choose one to export.

classmethod mutate(module, name, memo, mutate_kwargs)[source]

Mutate only handles cells of specific configurations (e.g., with loose end). Fallback to the default mutate if the cell is not handled here.

resample(memo)[source]

Randomly choose one path if label is not found in memo.

class nni.retiarii.oneshot.pytorch.supermodule.sampling.PathSamplingInput(n_candidates, n_chosen, reduction_type, label)[source]

Mixed input. Takes a list of tensors as input, selects some of them and returns the sum.

_sampled

Sampled input indices.

Type:

int or list of int

export(memo)[source]

Randomly choose one name if label isn’t found in memo.

reduction(items, sampled)[source]

Override this to implement customized reduction.

resample(memo)[source]

Randomly choose one path / multiple paths if label is not found in memo. If one path is selected, only one integer will be in self._sampled. If multiple paths are selected, a list will be in self._sampled.
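The int-versus-list convention can be sketched as follows. resample_input is a hypothetical function, and treating n_chosen == 1 as the single-path case is an assumption made for this example.

```python
import random


def resample_input(label, n_candidates, n_chosen, memo, rng=random):
    """Return (sampled, new_entries); sampled is an int when n_chosen == 1."""
    if label in memo:
        return memo[label], {}  # reuse the existing sample, nothing new
    if n_chosen == 1:
        sampled = rng.randrange(n_candidates)
    else:
        sampled = sorted(rng.sample(range(n_candidates), n_chosen))
    return sampled, {label: sampled}


single, new_single = resample_input("in", 4, 1, {}, random.Random(0))
multiple, new_multiple = resample_input("in", 4, 2, {}, random.Random(0))
reused, new_reused = resample_input("in", 4, 2, {"in": [0, 3]})
```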

class nni.retiarii.oneshot.pytorch.supermodule.sampling.PathSamplingLayer(paths, label)[source]

Mixed layer, in which fprop is decided by exactly one inner layer or sum of multiple (sampled) layers. If multiple modules are selected, the result will be summed and returned.

_sampled

Sampled module indices.

Type:

int or list of str

label

Name of the choice.

Type:

str

export(memo)[source]

Randomly choose one name if label isn’t found in memo.

reduction(items, sampled)[source]

Override this to implement customized reduction.

resample(memo)[source]

Randomly choose one path if label is not found in memo.

class nni.retiarii.oneshot.pytorch.supermodule.sampling.PathSamplingRepeat(blocks, depth)[source]

Implementation of Repeat in a path-sampling supernet. Samples one / some of the prefixes of the repeated blocks.

_sampled

Sampled depth.

Type:

int or list of int

export(memo)[source]

Randomly choose one if no choice is found in memo.

reduction(items, sampled)[source]

Override this to implement customized reduction.

resample(memo)[source]

Since depth is based on ValueChoice, we only need to randomly sample every leaf value choice.

supermodule.proxyless

class nni.retiarii.oneshot.pytorch.supermodule.proxyless.ProxylessMixedInput(n_candidates, n_chosen, alpha, softmax, label)[source]

Proxyless version of differentiable input choice. See ProxylessMixedLayer for implementation details.

forward(inputs)[source]

Choose one single input.

resample(memo)[source]

Sample one path based on alpha if label is not found in memo.

class nni.retiarii.oneshot.pytorch.supermodule.proxyless.ProxylessMixedLayer(paths, alpha, softmax, label)[source]

Proxyless version of differentiable mixed layer. It resamples a single path every time, rather than going through the softmax.

forward(*args, **kwargs)[source]

Forward pass of one single path.

resample(memo)[source]

Sample one path based on alpha if label is not found in memo.
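Sampling "based on alpha" can be sketched as drawing one index from the categorical distribution given by softmax(alpha). This is an illustration under that assumption; sample_path is a hypothetical helper, not the layer's code.

```python
import math
import random


def sample_path(alpha, rng=random):
    """Draw one path index with probability proportional to exp(alpha[i])."""
    shift = max(alpha)  # subtract the max for numerical stability
    weights = [math.exp(a - shift) for a in alpha]
    return rng.choices(range(len(alpha)), weights=weights, k=1)[0]


# Three candidate paths; larger alpha values are sampled more often.
index = sample_path([0.2, 1.0, -0.5], rng=random.Random(0))
```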

supermodule.operation

class nni.retiarii.oneshot.pytorch.supermodule.operation.MixedBatchNorm2d(module_kwargs)[source]

Mixed BatchNorm2d operation.

Supported arguments are:

  • num_features

  • eps (only supported in path sampling)

  • momentum (only supported in path sampling)

For path-sampling, the prefixes of weight, bias, running_mean and running_var are sliced. For weighted cases, the maximum num_features is used directly.

Momentum is required to be a float. PyTorch BatchNorm supports momentum being None, which is not supported here.

bound_type

alias of BatchNorm2d

class nni.retiarii.oneshot.pytorch.supermodule.operation.MixedConv2d(module_kwargs)[source]

Mixed conv2d op.

Supported arguments are:

  • in_channels

  • out_channels

  • groups

  • stride (only supported in path sampling)

  • kernel_size

  • padding

  • dilation (only supported in path sampling)

padding will be the “max” padding in differentiable mode.

Mutable groups is NOT supported in most cases of differentiable mode. However, we do support one special case when the group number is proportional to in_channels and out_channels. This is often the case of depth-wise convolutions.

For channels, the prefix is sliced. For kernels, we take the smaller kernel from the center, rounding down (towards the top left) when it cannot be centered exactly. For example:

max_kernel = 5*5, sampled_kernel = 3*3, then we take [1: 4]
max_kernel = 5*5, sampled_kernel = 2*2, then we take [1: 3]
□ □ □ □ □   □ □ □ □ □
□ ■ ■ ■ □   □ ■ ■ □ □
□ ■ ■ ■ □   □ ■ ■ □ □
□ ■ ■ ■ □   □ □ □ □ □
□ □ □ □ □   □ □ □ □ □

bound_type

alias of Conv2d
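The centering rule above can be checked with a small helper that reproduces both documented cases. center_slice is a hypothetical name, and the formula is inferred from the two examples rather than taken from the implementation.

```python
def center_slice(max_kernel, sampled_kernel):
    """Return the [start, stop) range of the centered sub-kernel along one axis."""
    start = (max_kernel - sampled_kernel) // 2  # floor rounds towards the top left
    return start, start + sampled_kernel


def slice_kernel(kernel, sampled_kernel):
    """Cut the centered sampled_kernel x sampled_kernel window out of a square kernel."""
    start, stop = center_slice(len(kernel), sampled_kernel)
    return [row[start:stop] for row in kernel[start:stop]]


# A 5x5 kernel whose entries encode their (row, column) position.
kernel = [[10 * r + c for c in range(5)] for r in range(5)]
window = slice_kernel(kernel, 2)
```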

class nni.retiarii.oneshot.pytorch.supermodule.operation.MixedLayerNorm(module_kwargs)[source]

Mixed LayerNorm operation.

Supported arguments are:

  • normalized_shape

  • eps (only supported in path sampling)

For path-sampling, the prefixes of weight and bias are sliced. For weighted cases, the maximum normalized_shape is used directly.

eps is required to be float.

bound_type

alias of LayerNorm

class nni.retiarii.oneshot.pytorch.supermodule.operation.MixedLinear(module_kwargs)[source]

Mixed linear operation.

Supported arguments are:

  • in_features

  • out_features

Prefix of weight and bias will be sliced.
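Prefix slicing for a linear layer can be sketched with nested lists in place of tensors. The weight layout (out_features rows, in_features columns) follows nn.Linear; slice_linear and linear are hypothetical helpers.

```python
def slice_linear(weight, bias, in_features, out_features):
    """Keep the first out_features rows and the first in_features columns."""
    sub_weight = [row[:in_features] for row in weight[:out_features]]
    return sub_weight, bias[:out_features]


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# Largest candidate: 3 output features and 3 input features.
max_weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
max_bias = [0.1, 0.2, 0.3]

# Sampled sub-network: 2 input features and 1 output feature.
w, b = slice_linear(max_weight, max_bias, in_features=2, out_features=1)
y = linear([1.0, 1.0], w, b)
```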

bound_type

alias of Linear

class nni.retiarii.oneshot.pytorch.supermodule.operation.MixedMultiHeadAttention(module_kwargs)[source]

Mixed multi-head attention.

Supported arguments are:

  • embed_dim

  • num_heads (only supported in path sampling)

  • kdim

  • vdim

  • dropout (only supported in path sampling)

At init, it constructs the largest possible Q, K, V dimension. At forward, it slices the prefix to weight matrices according to the sampled value. For in_proj_bias and in_proj_weight, three parts will be sliced and concatenated together: [0, embed_dim), [max_embed_dim, max_embed_dim + embed_dim), [max_embed_dim * 2, max_embed_dim * 2 + embed_dim).
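The three index ranges listed above can be generated programmatically. in_proj_rows is a hypothetical helper that only computes which rows of the stacked projection would be kept.

```python
def in_proj_rows(embed_dim, max_embed_dim):
    """Row indices kept from the stacked Q, K, V projection of the super-net."""
    rows = []
    for part in range(3):  # 0: query, 1: key, 2: value
        offset = part * max_embed_dim
        rows.extend(range(offset, offset + embed_dim))
    return rows


# With max_embed_dim = 4 and a sampled embed_dim of 2, the ranges are
# [0, 2), [4, 6) and [8, 10).
rows = in_proj_rows(embed_dim=2, max_embed_dim=4)
```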

Warning

All candidates of embed_dim should be divisible by all candidates of num_heads.

bound_type

alias of MultiheadAttention

class nni.retiarii.oneshot.pytorch.supermodule.operation.MixedOperation(module_kwargs)[source]

This is the base class for all mixed operations. It’s what you should inherit to support a new operation with ValueChoice.

It contains commonly used utilities that ease the effort of writing customized mixed operations, i.e., operations with ValueChoice in their arguments. To customize, please write your own mixed operation, and add the hook to the mutation_hooks parameter when using the strategy.

By design, for a mixed operation to work in a specific algorithm, at least two classes are needed.

  1. One class needs to inherit this class, to control operation-related behavior, such as how to initialize the operation such that the sampled operation can be its sub-operation.

  2. The other one needs to inherit MixedOperationSamplingPolicy, which controls algo-related behavior, such as sampling.

The two classes are linked with sampling_policy attribute in MixedOperation, whose type is set via mixed_op_sampling in mutate_kwargs when MixedOperation.mutate() is called.

With this design, one mixed-operation (e.g., MixedConv2d) can work in multiple algorithms (e.g., both DARTS and ENAS), saving the engineering effort to rewrite all operations for each specific algo.

This class should also define a bound_type, to control the matching type in mutate, and an argument_list, to control which arguments can be dynamically used in forward. This list will also be used in mutate for sanity checks.
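The split between an operation-side class and an algorithm-side policy can be sketched without any framework. ToyMixedScale and ToyPolicy are hypothetical classes that only mirror the method names described above; they do not inherit from the real base classes.

```python
import random


class ToyPolicy:
    """Hypothetical algorithm-side class: decides the value of each argument."""

    def __init__(self, operation, memo, mutate_kwargs):
        self.sampled = {}

    def resample(self, operation, memo):
        for name, candidates in operation.choices.items():
            self.sampled[name] = memo[name] if name in memo else random.choice(candidates)
        return dict(self.sampled)

    def forward_argument(self, operation, name):
        return self.sampled[name]


class ToyMixedScale:
    """Hypothetical operation-side class: scales its input by a searchable factor."""

    argument_list = ["factor"]

    def __init__(self, choices, mutate_kwargs):
        self.choices = choices
        # The policy type comes from mutate_kwargs, so the algorithm can be swapped.
        policy_type = mutate_kwargs["mixed_op_sampling"]
        self.sampling_policy = policy_type(self, {}, mutate_kwargs)

    def resample(self, memo):
        return self.sampling_policy.resample(self, memo)

    def forward_argument(self, name):
        return self.sampling_policy.forward_argument(self, name)

    def forward(self, x):
        sampled = [self.forward_argument(name) for name in self.argument_list]
        return self.forward_with_args(*sampled, x)

    def forward_with_args(self, factor, x):
        # Arguments from argument_list come first, then the forward inputs.
        return factor * x


op = ToyMixedScale({"factor": [2, 3]}, {"mixed_op_sampling": ToyPolicy})
op.resample({"factor": 3})
result = op.forward(5)
```

Swapping ToyPolicy for a different policy class changes how factor is chosen without touching ToyMixedScale, which is the reuse the design aims for.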

export(memo)[source]

Delegates to MixedOperationSamplingPolicy.export().

export_probs(memo)[source]

Delegates to MixedOperationSamplingPolicy.export_probs().

forward(*args, **kwargs)[source]

First get sampled arguments, then forward with the sampled arguments (by calling forward_with_args).

forward_argument(name)[source]

Get the argument used in forward. This is often related to the algorithm, so we redirect it to the sampling policy.

forward_with_args(*args, **kwargs)[source]

To control real fprop. The accepted arguments are argument_list, appended by forward arguments in the bound_type.

classmethod mutate(module, name, memo, mutate_kwargs)[source]

Find value choices in the module’s arguments and replace the whole module.

resample(memo)[source]

Delegates to MixedOperationSamplingPolicy.resample().

slice_param(**kwargs)[source]

Slice the params and buffers for subnet forward and state dict. When there is a mapping=True in kwargs, the return result will be wrapped in dict.

super_init_argument(name, value_choice)[source]

Get the initialization argument when constructing super-kernel, i.e., calling super().__init__(). This is often related to specific operator, rather than algo.

For example:

def super_init_argument(self, name, value_choice):
    return max(value_choice.candidates)

class nni.retiarii.oneshot.pytorch.supermodule.operation.MixedOperationSamplingPolicy(operation, memo, mutate_kwargs)[source]

Algo-related part for mixed Operation.

MixedOperation delegates its resample and export to this policy (or its subclass), so that one Operation can be easily combined with different kinds of sampling.

One SamplingStrategy corresponds to one mixed operation.

export(operation, memo)[source]

The handler of MixedOperation.export().

export_probs(operation, memo)[source]

The handler of MixedOperation.export_probs().

forward_argument(operation, name)[source]

Compute the argument with the given name used in the operation’s forward. Usually a value, or a distribution of values.

resample(operation, memo)[source]

The handler of MixedOperation.resample().