Run an Experiment on Aliyun PAI-DSW + PAI-DLC

NNI supports running an experiment on PAI-DSW and submitting its trials to PAI-DLC. This is called dlc mode.

The PAI-DSW server submits the jobs, while PAI-DLC is where the training jobs run.

Setup environment

Step 1. Install NNI by following the install guide here.

Step 2. Create a PAI-DSW server following this link. Note that because the training itself runs on PAI-DLC, the PAI-DSW server needs few resources, so a CPU-only PAI-DSW server is usually enough.

Step 3. Open PAI-DLC here and select the same region as your PAI-DSW server. Go to the dataset configuration and mount the same NAS disk that the PAI-DSW server uses. (Note: currently only the PAI-DLC public cluster is supported.)

Step 4. Open the command line of your PAI-DSW server, then download and install the PAI-DLC Python SDK, which is used to submit DLC tasks; refer to this link. Skip this step if the SDK is already installed.

wget https://sdk-portal-cluster-prod.oss-cn-zhangjiakou.aliyuncs.com/downloads/u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
unzip u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
pip install ./pai-dlc-20201203  # pai-dlc-20201203 is the name of the unzipped SDK directory; replace it accordingly.

Run an experiment

Use examples/trials/mnist-pytorch as an example. The content of the NNI config YAML file looks like this:

# working directory on DSW, please provide the FULL path
experimentWorkingDirectory: /home/admin/workspace/{your_working_dir}
searchSpaceFile: search_space.json
# the command run on the trial runner (i.e., the DLC container); be aware of data_dir
trialCommand: python mnist.py --data_dir /root/data/{your_data_dir}
trialConcurrency: 1  # NOTE: please provide number <= 3 due to DLC system limit.
maxTrialNumber: 10
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
# ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
trainingService:
  platform: dlc
  type: Worker
  image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
  jobType: PyTorchJob                             # choices: [TFJob, PyTorchJob]
  podCount: 1
  ecsSpec: ecs.c6.large
  region: cn-hangzhou
  accessKeyId: ${your_ak_id}
  accessKeySecret: ${your_ak_key}
  nasDataSourceId: ${your_nas_data_source_id}     # NAS datasource ID, e.g., datat56by9n1xt0a
  localStorageMountPoint: /home/admin/workspace/  # default NAS path on DSW
  containerStorageMountPoint: /root/data/         # default NAS path on DLC container, change it according to your setting

Note: You should set platform: dlc in the NNI config YAML file if you want to start the experiment in dlc mode.

Compared with LocalMode, the training service configuration in dlc mode has these additional keys: type, image, jobType, podCount, ecsSpec, region, nasDataSourceId, accessKeyId, and accessKeySecret. For a detailed explanation, refer to this link.
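Before launching, it can help to confirm that none of these keys is missing. The following is a minimal sketch of such a check, assuming the config has already been loaded into a Python dict. check_dlc_config is a hypothetical helper and not part of NNI; the concurrency limit and the jobType choices are taken from the comments in the example config above.

```python
# Hypothetical helper, not part of NNI: sanity-check a dlc-mode config dict.
DLC_KEYS = (
    "type", "image", "jobType", "podCount", "ecsSpec", "region",
    "nasDataSourceId", "accessKeyId", "accessKeySecret",
)
JOB_TYPES = ("TFJob", "PyTorchJob")   # choices listed in the example config
MAX_CONCURRENCY = 3                   # DLC limit noted in the example config


def check_dlc_config(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    service = config.get("trainingService", {})
    if service.get("platform") != "dlc":
        problems.append("trainingService.platform must be 'dlc'")
    for key in DLC_KEYS:
        if key not in service:
            problems.append(f"missing trainingService.{key}")
    if "jobType" in service and service["jobType"] not in JOB_TYPES:
        problems.append(f"jobType must be one of {JOB_TYPES}")
    if config.get("trialConcurrency", 1) > MAX_CONCURRENCY:
        problems.append(f"trialConcurrency must be <= {MAX_CONCURRENCY}")
    return problems
```

This only checks that the keys are present. It does not verify that the values (for example the image name or the access keys) are accepted by DLC.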

Also, as dlc mode requires DSW/DLC to mount the same NAS disk to share information, there are two extra keys related to this: localStorageMountPoint and containerStorageMountPoint.
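Because both mount points refer to the same NAS disk, a file under localStorageMountPoint on DSW appears at the same relative location under containerStorageMountPoint in the DLC container. The sketch below illustrates that correspondence using the default paths from the example config. to_container_path is a hypothetical helper written for this illustration, not an NNI function.

```python
from pathlib import PurePosixPath


def to_container_path(local_path,
                      local_mount="/home/admin/workspace/",
                      container_mount="/root/data/"):
    """Translate a path under the DSW mount point to the matching DLC container path.

    Raises ValueError if local_path is not located under local_mount.
    """
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(local_mount))
    return str(PurePosixPath(container_mount) / relative)


# A data directory stored on DSW and the path to pass as --data_dir in trialCommand.
print(to_container_path("/home/admin/workspace/my_dir/mnist"))
# prints: /root/data/my_dir/mnist
```

This is why the example trialCommand passes a data_dir under /root/data/ while experimentWorkingDirectory is given under /home/admin/workspace/.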

Run the following commands to start the example experiment:

git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/trials/mnist-pytorch

# modify config_dlc.yml ...

nnictl create --config config_dlc.yml

Replace ${NNI_VERSION} with a released version name or branch name, e.g., v2.3.

Monitor your job

To monitor your job, visit DLC and check the job status there.