Remote Training Service
NNI can run one experiment on multiple remote machines through SSH; this is called remote mode. It works like a lightweight training platform: you start NNI from your own computer, and it dispatches trials to the remote machines in parallel.
The remote machines can run Linux, Windows 10, or Windows Server 2019.
Make sure the default environment of the remote machines meets the requirements of your trial code. If it does not, add a setup script to the command field of the NNI config.
Make sure the remote machines can be accessed through SSH from the machine that runs the nnictl command. Both password and key authentication are supported. For advanced usage, refer to RemoteConfig in the reference.
Make sure every machine runs the same NNI version. Follow the install guide to install NNI.
If you want to use remote Linux and Windows machines together, make sure the trial command is compatible with both operating systems. For example, the default Python 3.x executable is called python3 on Linux and python on Windows.
In addition, a Windows server requires several extra steps.
Install and start OpenSSH Server.
Open the Settings app on Windows.
Click Apps, then click Optional features.
Click Add a feature, search for and select OpenSSH Server, and then click Install.
Once it is installed, run the commands below to start the service and set it to start automatically.
    sc config sshd start=auto
    net start sshd
Make sure the remote account is an administrator, so that it can stop running trials.
Make sure there is no welcome message beyond the default one, since extra output causes ssh2 in Node.js to fail. For example, on an Azure Data Science VM you need to remove the extra echo commands that print additional welcome text.
Output like the following is fine when opening a new command window.
    Microsoft Windows [Version 10.0.17763.1192]
    (c) 2018 Microsoft Corporation. All rights reserved.

    (py37_default) C:\Users\AzureUser>
Use examples/trials/mnist-pytorch as the example. Suppose there are two machines, each of which can be logged in to with a username and either a password or an SSH key. Here is a template configuration.
    searchSpaceFile: search_space.json
    trialCommand: python3 mnist.py
    trialGpuNumber: 0
    trialConcurrency: 4
    maxTrialNumber: 20
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize
    trainingService:
      platform: remote
      machineList:
        - host: 192.0.2.1
          user: alice
          ssh_key_file: ~/.ssh/id_rsa
        - host: 192.0.2.2
          port: 10022
          user: bob
          password: bob123
The example configuration is saved in examples/trials/mnist-pytorch/config_remote.yml.
You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:
nnictl create --config examples/trials/mnist-pytorch/config_remote.yml
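Before launching the experiment, it can help to confirm that each machine accepts connections on its SSH port. The following is a minimal sketch, not part of NNI: the helper names ssh_port_open and check_machines are hypothetical, the hosts and ports are copied from the template configuration above, and port 22 is assumed for the machine that does not set one. The check only shows that something is listening on the port; it does not verify SSH authentication or the NNI installation.

```python
import socket

# Hosts and ports taken from the template configuration above; the first
# entry has no explicit port, so the standard SSH port 22 is assumed.
MACHINES = [("192.0.2.1", 22), ("192.0.2.2", 10022)]


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    This only proves that the port is reachable. It does not check SSH
    credentials or whether NNI is installed on the remote machine.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_machines(machines, timeout=3.0):
    """Map each (host, port) pair to its reachability as True/False."""
    return {(h, p): ssh_port_open(h, p, timeout) for h, p in machines}
```

Calling check_machines(MACHINES) from the machine that runs nnictl returns a dictionary keyed by (host, port); any entry mapped to False is worth fixing before starting the experiment.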
When remote machines or clusters serve as your training service, NNI limits the number of files to 2000 and the total size to 300MB to avoid putting too much pressure on the network. If your trial code directory contains too many files, you can exclude files and subfolders by adding a .nniignore file, which works like a .gitignore file. For more details on how to write this file, see the git documentation.
Example: config_detailed.yml and .nniignore
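To see whether a trial code directory fits within the limits above, a rough local check can be sketched as follows. This is an illustration under stated assumptions, not NNI's own logic: the function names are hypothetical, "300MB" is taken to mean 300 * 1024 * 1024 bytes, and the pattern matching is a simplified stand-in for real .gitignore rules (no negation, anchoring, or "**" handling).

```python
import fnmatch
import os

MAX_FILES = 2000                 # file-count limit quoted above
MAX_BYTES = 300 * 1024 * 1024    # assumption: "300MB" means 300 MiB


def read_patterns(code_dir):
    """Return the non-empty, non-comment lines of <code_dir>/.nniignore."""
    path = os.path.join(code_dir, ".nniignore")
    if not os.path.isfile(path):
        return []
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    return [p for p in lines if p and not p.startswith("#")]


def is_ignored(rel_path, patterns):
    """Simplified matching: a pattern hits if it matches the whole relative
    path or any single path component. Full .gitignore semantics are NOT
    implemented here."""
    parts = rel_path.split("/")
    for pat in patterns:
        pat = pat.rstrip("/")
        if fnmatch.fnmatch(rel_path, pat):
            return True
        if any(fnmatch.fnmatch(part, pat) for part in parts):
            return True
    return False


def measure(code_dir):
    """Return (file_count, total_bytes) for files that are not ignored."""
    patterns = read_patterns(code_dir)
    n_files, n_bytes = 0, 0
    for root, _dirs, files in os.walk(code_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, code_dir).replace(os.sep, "/")
            if is_ignored(rel, patterns):
                continue
            n_files += 1
            n_bytes += os.path.getsize(full)
    return n_files, n_bytes


def within_limits(code_dir):
    """True if the directory stays within both limits after exclusions."""
    n_files, n_bytes = measure(code_dir)
    return n_files <= MAX_FILES and n_bytes <= MAX_BYTES
```

Running measure on the trial code directory before and after adding a .nniignore entry (for example a data folder) shows how much the exclusion saves; the result is only an estimate, since NNI applies its own matching rules.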
Configure Python environment
By default, commands and scripts are executed in the default environment of the remote machine. If the remote machine has multiple Python virtual environments and you want to run experiments in a specific one, use pythonPath to specify that environment.
For example, with Anaconda you can set pythonPath to the bin directory of the conda environment you want to use.
Monitor via TensorBoard
The remote training service supports trial visualization via TensorBoard. Follow the guide Visualize Trial with TensorBoard to learn how to use TensorBoard.