PAI-DLC Training Service

NNI supports running an experiment on PAI-DSW and submitting trials to PAI-DLC, a deep learning containers service based on Alibaba ACK.

The PAI-DSW server plays the role of submitting jobs, while PAI-DLC is where the training jobs actually run.

Prerequisites

Step 1. Install NNI by following the install guide.
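
For a typical setup this is a single pip command:

pip install nni  # also installs the nnictl CLI used later in this guide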

Step 2. Create a PAI-DSW server following this link. Note that since the training jobs run on PAI-DLC, the PAI-DSW server itself does not need many resources; a CPU-only server may be enough.

Step 3. Open PAI-DLC here and select the same region as your PAI-DSW server. Go to the dataset configuration and mount the same NAS disk as the PAI-DSW server does. (Note: currently only the PAI-DLC public cluster is supported.)

Step 4. Open a command line on your PAI-DSW server, then download and install the PAI-DLC Python SDK, which is used to submit DLC tasks; refer to this link. Skip this step if the SDK is already installed.

wget https://sdk-portal-cluster-prod.oss-cn-zhangjiakou.aliyuncs.com/downloads/u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
unzip u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
pip install ./pai-dlc-20201203  # pai-dlc-20201203 is the name of the unzipped SDK directory; replace it accordingly.
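
To quickly confirm the installation, you can search pip's package list (the installed distribution name may differ from the directory name, so a case-insensitive search is safest):

# look for the DLC SDK among installed packages
pip list | grep -i dlc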

Usage

Take examples/trials/mnist-pytorch as an example. The content of the NNI config YAML file looks like this:

# working directory on DSW, please provide the FULL path
experimentWorkingDirectory: /home/admin/workspace/{your_working_dir}
searchSpaceFile: search_space.json
# the command on the trial runner (i.e., the DLC container); be aware of data_dir
trialCommand: python mnist.py --data_dir /root/data/{your_data_dir}
trialConcurrency: 1  # NOTE: please use a number <= 3 due to a DLC system limit.
maxTrialNumber: 10
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
# ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
trainingService:
  platform: dlc
  type: Worker
  image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
  jobType: PyTorchJob                             # choices: [TFJob, PyTorchJob]
  podCount: 1
  ecsSpec: ecs.c6.large
  region: cn-hangzhou
  workspaceId: ${your_workspace_id}
  accessKeyId: ${your_ak_id}
  accessKeySecret: ${your_ak_key}
  nasDataSourceId: ${your_nas_data_source_id}     # NAS datasource ID, e.g., datat56by9n1xt0a
  ossDataSourceId: ${your_oss_data_source_id}     # OSS datasource ID, in case your data is on OSS
  localStorageMountPoint: /home/admin/workspace/  # default NAS path on DSW
  containerStorageMountPoint: /root/data/         # default NAS path in the DLC container; change it according to your setting

Note: you should set platform: dlc in the NNI config YAML file if you want to start an experiment in dlc mode.

Compared with the Local Training Service, the training service configuration in dlc mode has additional keys such as type/image/jobType/podCount/ecsSpec/region/nasDataSourceId/accessKeyId/accessKeySecret; for detailed explanations refer to this link.

Also, as dlc mode requires DSW and DLC to mount the same NAS disk to share information, there are two extra keys related to this: localStorageMountPoint and containerStorageMountPoint, as illustrated below.
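
As a sketch of how the two mount points relate (paths follow the sample config above; mydata is a hypothetical subdirectory): a file written under localStorageMountPoint on DSW appears under containerStorageMountPoint inside the DLC container.

# On DSW: write a file under the shared NAS mount
echo hello > /home/admin/workspace/mydata/sample.txt
# Inside the DLC container, the same file is visible at
#   /root/data/mydata/sample.txt
# which is why trialCommand reads its data from a path under /root/data/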

Run the following commands to start the example experiment:

git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/trials/mnist-pytorch

# modify config_dlc.yml ...

nnictl create --config config_dlc.yml

Replace ${NNI_VERSION} with a released version name or branch name, e.g., v2.3.
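
Once the experiment starts, nnictl prints the Web UI URL where you can follow trial progress. A couple of standard nnictl commands are also useful:

# show details of the running experiment
nnictl experiment show

# stop the experiment when you are done
nnictl stop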

Monitor your job

To monitor your job on DLC, visit the DLC console to check its status.
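
From the NNI side, you can also list trials and their status with nnictl:

# list the trials of the current experiment and their status
nnictl trial ls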