Overview

NNI supports the training services listed below. Each service has its own page that explains how to configure it. NNI is extensible by design, so users can implement a custom training service for their own resources, platforms, or needs.

Training Service        Description
----------------        -----------
Local                   The whole experiment runs on your dev machine (i.e., a single local machine)
Remote                  Trials are dispatched to your configured SSH servers
OpenPAI                 Runs trials on OpenPAI, a DNN model training platform based on Kubernetes
Kubeflow                Runs trials with Kubeflow, a DNN model training framework based on Kubernetes
AdaptDL                 Runs trials on AdaptDL, an elastic DNN model training platform
FrameworkController     Runs trials with FrameworkController, a DNN model training framework on Kubernetes
AML                     Runs trials on the Azure Machine Learning (AML) cloud service
PAI-DLC                 Runs trials on PAI-DLC, deep learning containers based on Alibaba ACK
Hybrid                  Uses several of the training services above jointly

Training Service Under Reuse Mode

Since NNI v2.0, NNI has had two sets of training service implementations. The newer one is called reuse mode. When reuse mode is enabled, a cluster, such as a remote machine or a compute instance on AML, launches a long-running environment, and NNI submits trials to that environment one after another, which saves the time otherwise spent creating a new job for each trial. For instance, using the OpenPAI training platform in reuse mode avoids the overhead of repeatedly pulling Docker images, creating containers, and downloading data.
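
To make the saving concrete, here is a toy cost model in Python. It is only an illustrative sketch and not NNI code: the class `CostModel`, both functions, and the timing numbers are hypothetical, and it assumes trials run one after another in a single environment.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Hypothetical per-job costs in seconds; the numbers are illustrative only."""
    setup_seconds: float  # image pull + container creation + data download
    trial_seconds: float  # time spent actually running one trial


def total_time_without_reuse(n_trials: int, cost: CostModel) -> float:
    # Every trial is submitted as a new job, so it pays the setup cost each time.
    return n_trials * (cost.setup_seconds + cost.trial_seconds)


def total_time_with_reuse(n_trials: int, cost: CostModel) -> float:
    # One long-running environment is set up once, then trials run in it in turn.
    if n_trials == 0:
        return 0.0
    return cost.setup_seconds + n_trials * cost.trial_seconds


cost = CostModel(setup_seconds=120.0, trial_seconds=60.0)
print(total_time_without_reuse(10, cost))  # 1800.0
print(total_time_with_reuse(10, cost))     # 720.0
```

Under these assumed numbers the setup cost is paid once instead of ten times. The real saving depends on the platform, image size, and trial concurrency.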

Note

In reuse mode, users need to make sure that each trial can run independently within the same job (e.g., avoid loading checkpoints from previous trials).
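
One way to keep trials independent is to give each trial its own working directory and always start from fresh state. The Python sketch below illustrates that idea under stated assumptions: `run_trial`, the directory layout, and the checkpoint format are hypothetical and are not part of NNI's API.

```python
import json
import tempfile
from pathlib import Path


def run_trial(trial_id: str, params: dict, work_root: Path) -> Path:
    """Hypothetical trial entry point; not part of NNI's API."""
    # Each trial gets its own directory, so it never sees another trial's checkpoint.
    trial_dir = work_root / trial_id
    trial_dir.mkdir(parents=True, exist_ok=False)
    checkpoint = trial_dir / "checkpoint.json"
    # Start from fresh state rather than resuming from whatever is already on disk.
    state = {"params": params, "step": 0}
    checkpoint.write_text(json.dumps(state))
    return checkpoint


with tempfile.TemporaryDirectory() as root:
    root_path = Path(root)
    # Two trials executed back to back in the same long-running environment.
    ckpt_a = run_trial("trial_a", {"lr": 0.1}, root_path)
    ckpt_b = run_trial("trial_b", {"lr": 0.01}, root_path)
    state_a = json.loads(ckpt_a.read_text())
    state_b = json.loads(ckpt_b.read_text())
```

Because `mkdir` is called with `exist_ok=False`, a trial that tried to reuse another trial's directory would fail immediately instead of silently picking up stale files.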