Run an Experiment on Remote Machines
NNI can run one experiment on multiple remote machines through SSH; this is called
remote mode. It works like a lightweight training platform: NNI is started from your computer and dispatches trials to the remote machines in parallel.
Supported operating systems for remote machines are Linux, Windows 10, and Windows Server 2019.
- Make sure the default environment of the remote machines meets the requirements of your trial code. If it does not, a setup script can be added to the command field of the NNI config.
- Make sure the remote machines can be accessed through SSH from the machine that runs the nnictl command. Both password and key authentication are supported. For advanced usage, please refer to the machineList part of the configuration.
- Make sure the NNI version on each machine is consistent.
- Make sure the trial command is compatible with the remote OSes if you want to use remote Linux and Windows machines together. For example, the default Python 3.x executable is called python3 on Linux, but the name can differ on Windows.
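The executable-name issue above can be checked on each remote machine before writing the trial command. The following is a minimal sketch, not part of NNI: pick_python is a hypothetical helper, and the candidate name python is an assumption about how Python is exposed on a given Windows machine.

```python
import shutil
from typing import Callable, Optional, Sequence


def pick_python(
    candidates: Sequence[str] = ("python3", "python"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate name that resolves on PATH, or None.

    The candidate names are assumptions; adjust them to your machines.
    """
    for name in candidates:
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    # Prints the name found on this machine, or None if neither resolves.
    print(pick_python())
```

Running this on every remote machine shows whether a single trial command can work everywhere, or whether the environments need to be aligned first.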
Follow installation to install NNI on the remote machine.
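Install and start OpenSSH Server.
- Open the Settings app on Windows.
- Click Apps, then click Optional features.
- Click Add a feature, search for and select OpenSSH Server, and then click Install.
- Once it’s installed, run the commands below to start the service and set it to start automatically.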
sc config sshd start=auto
net start sshd
Make sure the remote account is an administrator, so that it can stop running trials.
Make sure there is no welcome message beyond the default, since extra output causes ssh2 to fail in Node.js. For example, if you’re using a Data Science VM on Azure, you need to remove the extra echo commands in
Output like the following is fine when opening a new command window.
Microsoft Windows [Version 10.0.17763.1192]
(c) 2018 Microsoft Corporation. All rights reserved.

(py37_default) C:\Users\AzureUser>
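Once the SSH server is set up, it can be useful to confirm from the machine that runs nnictl that the remote SSH port accepts connections. The sketch below uses only the Python standard library; ssh_port_open is a hypothetical helper name, and it checks TCP reachability only, not authentication or the welcome-message issue described above.

```python
import socket


def ssh_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, ssh_port_open("10.1.1.1") returns False if the sshd service is not running or a firewall blocks port 22 on that machine.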
Run an experiment
For example, suppose there are three machines that can be logged into with a username and password.
Install and run NNI on one of those three machines, or on another machine that has network access to them.
Use examples/trials/mnist-annotation as the example. Below is the content of examples/trials/mnist-annotation/config_remote.yml:
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
# search space file
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: true
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
#machineList can be empty if the platform is local
machineList:
  - ip: 10.1.1.1
    username: bob
    passwd: bob123
    #port can be skip if using default ssh port 22
    #port: 22
  - ip: 10.1.1.2
    username: bob
    passwd: bob123
  - ip: 10.1.1.3
    username: bob
    passwd: bob123
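With many remote machines, writing the machineList section by hand gets repetitive. The following is a minimal sketch that renders the section from a list of dictionaries, using only the field names that appear in the config above (ip, username, passwd, port). render_machine_list is a hypothetical helper, not an NNI API, and it writes values unquoted, so values containing YAML special characters would need quoting.

```python
from typing import Dict, List, Union

Machine = Dict[str, Union[str, int]]


def render_machine_list(machines: List[Machine]) -> str:
    """Render machine entries as the machineList section of the config."""
    lines = ["machineList:"]
    for machine in machines:
        for field in ("ip", "username"):
            if field not in machine:
                raise ValueError(f"machine entry is missing '{field}': {machine}")
        lines.append(f"  - ip: {machine['ip']}")
        lines.append(f"    username: {machine['username']}")
        # passwd and port are optional here: passwd is not needed with key
        # authentication, and port defaults to the standard SSH port 22.
        for field in ("passwd", "port"):
            if field in machine:
                lines.append(f"    {field}: {machine[field]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    hosts = ["10.1.1.1", "10.1.1.2", "10.1.1.3"]
    entries = [{"ip": ip, "username": "bob", "passwd": "bob123"} for ip in hosts]
    print(render_machine_list(entries), end="")
```

The printed text can be pasted in place of the machineList section of the config file.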
The files in codeDir are uploaded to the remote machines automatically. You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:
nnictl create --config examples/trials/mnist-annotation/config_remote.yml
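If the experiment is launched from a script rather than a terminal, the same command can be wrapped with the standard subprocess module. This is a sketch under the assumption that nnictl is on PATH; the function names are hypothetical, and only the create subcommand and --config option shown above are used.

```python
import shutil
import subprocess
from typing import List


def nnictl_create_command(config_path: str) -> List[str]:
    """Build the argument list for the nnictl create command shown above."""
    return ["nnictl", "create", "--config", config_path]


def create_experiment(config_path: str) -> int:
    """Run nnictl create and return its exit code."""
    if shutil.which("nnictl") is None:
        raise FileNotFoundError("nnictl was not found on PATH; install NNI first")
    # Passing a list avoids shell quoting differences across operating systems.
    return subprocess.run(nnictl_create_command(config_path)).returncode
```

Calling create_experiment("examples/trials/mnist-annotation/config_remote.yml") is then equivalent to typing the command by hand.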