Run HPO Experiment with nnictl

This tutorial has exactly the same effect as HPO 教程(PyTorch 版本).

Both tutorials optimize the model in official PyTorch quickstart with auto-tuning, while this one manages the experiment with command line tool and YAML config file, instead of pure Python code.

The tutorial consists of 4 steps:

  1. Modify the model for auto-tuning.

  2. Define hyperparameters' search space.

  3. Create config file.

  4. Run the experiment.

The first two steps are identical to quickstart.

Step 1: Prepare the model

In first step, we need to prepare the model to be tuned.

The model should be put in a separate script. It will be evaluated many times concurrently, and possibly will be trained on distributed platforms.

In this tutorial, the model is defined in model.py.

In short, it is a PyTorch model with 3 additional API calls:

  1. Use nni.get_next_parameter() to fetch the hyperparameters to be evalutated.

  2. Use nni.report_intermediate_result() to report per-epoch accuracy metrics.

  3. Use nni.report_final_result() to report final accuracy.

Please understand the model code before continue to next step.

Step 2: Define search space

In model code, we have prepared 3 hyperparameters to be tuned: features, lr, and momentum.

Here we need to define their search space so the tuning algorithm can sample them in desired range.

Assuming we have following prior knowledge for these hyperparameters:

  1. features should be one of 128, 256, 512, 1024.

  2. lr should be a float between 0.0001 and 0.1, and it follows exponential distribution.

  3. momentum should be a float between 0 and 1.

In NNI, the space of features is called choice; the space of lr is called loguniform; and the space of momentum is called uniform. You may have noticed, these names are derived from numpy.random.

For full specification of search space, check the reference.

Now we can define the search space as follow:

search_space:
  features:
    _type: choice
    _value: [ 128, 256, 512, 1024 ]
  lr:
    _type: loguniform
    _value: [ 0.0001, 0.1 ]
  momentum:
    _type: uniform
    _value: [ 0, 1 ]

Step 3: Configure the experiment

NNI uses an experiment to manage the HPO process. The experiment config defines how to train the models and how to explore the search space.

In this tutorial we use a YAML file config.yaml to define the experiment.

Configure trial code

In NNI evaluation of each hyperparameter set is called a trial. So the model script is called trial code.

trial_command: python model.py
trial_code_directory: .

When trial_code_directory is a relative path, it relates to the config file. So in this case we need to put config.yaml and model.py in the same directory.

注意

The rules for resolving relative path are different in YAML config file and Python experiment API. In Python experiment API relative paths are relative to current working directory.

Configure how many trials to run

Here we evaluate 10 sets of hyperparameters in total, and concurrently evaluate 2 sets at a time.

max_trial_number: 10
trial_concurrency: 2

You may also set max_experiment_duration = '1h' to limit running time.

If neither max_trial_number nor max_experiment_duration are set, the experiment will run forever until you stop it.

备注

max_trial_number is set to 10 here for a fast example. In real world it should be set to a larger number. With default config TPE tuner requires 20 trials to warm up.

Configure tuning algorithm

Here we use TPE tuner.

name: TPE
class_args:
  optimize_mode: maximize

Configure training service

In this tutorial we use local mode, which means models will be trained on local machine, without using any special training platform.

training_service:
  platform: local

Wrap up

The full content of config.yaml is as follow:

search_space:
  features:
    _type: choice
    _value: [ 128, 256, 512, 1024 ]
  lr:
    _type: loguniform
    _value: [ 0.0001, 0.1 ]
  momentum:
    _type: uniform
    _value: [ 0, 1 ]

trial_command: python model.py
trial_code_directory: .

trial_concurrency: 2
max_trial_number: 10

tuner:
  name: TPE
  class_args:
    optimize_mode: maximize

training_service:
  platform: local

Step 4: Run the experiment

Now the experiment is ready. Launch it with nnictl create command:

$ nnictl create --config config.yaml --port 8080

You can use the web portal to view experiment status: http://localhost:8080.

Out:

[2022-04-01 12:00:00] Creating experiment, Experiment ID: p43ny6ew
[2022-04-01 12:00:00] Starting web server...
[2022-04-01 12:00:01] Setting up...
[2022-04-01 12:00:01] Web portal URLs: http://127.0.0.1:8080 http://192.168.1.1:8080
[2022-04-01 12:00:01] To stop experiment run "nnictl stop p43ny6ew" or "nnictl stop --all"
[2022-04-01 12:00:01] Reference: https://nni.readthedocs.io/en/stable/reference/nnictl.html

When the experiment is done, use nnictl stop command to stop it.

$ nnictl stop p43ny6ew

Out:

INFO:  Stopping experiment 7u8yg9zw
INFO:  Stop experiment success.