Just focus on the ML algorithm, and leave the model training to Amazon SageMaker

Posted Jun 5, 2020 · 13 min read


Where does all the development time go?

  • The code is written, and now it is compiling...
  • The development environment is being configured, dependencies are downloading...
  • A machine learning project? The model is training...
  • Killing the wait with a game? Oops, the GPU is busy training the model...

Cloud computing is now ubiquitous: compute, storage, databases, and other resources in enterprise application environments can all run in the cloud. So for machine learning projects, can the most time-consuming part, model training, also be offloaded to the cloud? It can: you can use Amazon SageMaker to run distributed TensorFlow training to meet exactly this need.

Basic concepts

TensorFlow is an open-source machine learning (ML) library widely used to develop large-scale deep neural networks (DNNs). Such DNNs require distributed training, using multiple GPUs across multiple hosts.

Amazon SageMaker is a managed service that simplifies the ML workflow, starting from labeling data with active learning, through hyperparameter optimization, distributed model training, and monitoring of training progress, to deploying trained models as auto-scaling RESTful services and centrally managing concurrent ML experiments.

This article focuses on using Amazon SageMaker for distributed TensorFlow training. Although many of the distributed training concepts discussed here apply broadly to many types of TensorFlow models, the article concentrates on distributed TensorFlow training of the Mask R-CNN model on the Common Objects in Context (COCO) 2017 dataset.


The Mask R-CNN model is used for object instance segmentation: the model generates a pixel-level mask (sigmoid binary classification) and a bounding box (smooth L1 regression) annotated with the object category (softmax classification) to delineate each object instance in an image. Common use cases for Mask R-CNN include perception in autonomous vehicles, surface defect detection, and geospatial image analysis.
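The three prediction heads named above correspond to three standard per-element functions. As a rough, pure-Python illustration of their shapes (a sketch for intuition, not the Tensorpack implementation):

```python
import math

def softmax(logits):
    """Category head: softmax over class logits."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Mask head: per-pixel binary classification score."""
    return 1.0 / (1.0 + math.exp(-x))

def smooth_l1(pred, target, beta=1.0):
    """Box head: smooth L1 loss for one regression coordinate."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta
```

Smooth L1 is quadratic near zero and linear for large errors, which keeps box regression less sensitive to outliers than plain L2.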

There are three key reasons for choosing the Mask R-CNN model in this article:

  1. Distributed data-parallel training of Mask R-CNN on a large dataset increases image throughput through the training pipeline and shortens training time.
  2. The Mask R-CNN model has many open-source TensorFlow implementations. This article uses Tensorpack Mask/Faster-RCNN as the main example, but the highly optimized AWS Samples Mask-RCNN is also recommended.
  3. The Mask R-CNN model was evaluated as a large object-detection model in the MLPerf results.

The following figure is a schematic diagram of the Mask R-CNN deep neural network architecture:


Synchronized Allreduce of gradients in distributed training

The main challenge of distributed DNN training is this: before the gradients can be applied to update the model weights on multiple GPUs across multiple nodes, the gradients computed during back-propagation on all GPUs must be Allreduced (averaged) in a synchronization step.

The synchronous Allreduce algorithm must be highly efficient; otherwise, any training-speed gains from distributed data-parallel training are lost to the inefficiency of the synchronous Allreduce step.

To make the synchronous Allreduce algorithm highly efficient, three main challenges must be overcome:

  • The algorithm needs to scale as the number of nodes and GPUs in the distributed training cluster increases.
  • The algorithm needs to exploit the high-speed GPU-to-GPU interconnect topology within a single node.
  • The algorithm needs to batch communication with other GPUs effectively, so that computation on a GPU is interleaved with its communication with other GPUs.

Uber's open-source library Horovod overcomes these three challenges as follows:

  • Horovod provides an efficient synchronous Allreduce algorithm that scales as the number of GPUs and nodes increases.
  • The Horovod library uses the Nvidia Collective Communications Library (NCCL) communication primitives, which take advantage of knowledge of the Nvidia GPU topology.
  • Horovod includes Tensor Fusion, which interleaves communication and computation efficiently by batching Allreduce data communications.
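To make the averaging step concrete, here is a toy, single-process sketch of what Allreduce computes. A real implementation such as NCCL's ring Allreduce splits each tensor into chunks and pipelines them around a ring of GPUs to overlap communication, but the end result on every worker is the same element-wise mean:

```python
def allreduce_average(worker_grads):
    """Toy stand-in for synchronous Allreduce (average) of gradients.

    worker_grads: one gradient vector per 'worker'. Every worker must
    end up with the identical element-wise mean, so each can apply the
    same weight update and the replicas stay in sync.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    summed = [sum(g[i] for g in worker_grads) for i in range(length)]
    averaged = [s / n for s in summed]
    # Every worker receives the same averaged result.
    return [list(averaged) for _ in range(n)]
```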

Many ML frameworks (including TensorFlow) support Horovod. TensorFlow distribution strategies also use NCCL and provide an alternative to Horovod for distributed TensorFlow training. This article uses Horovod.

Training large DNNs such as Mask R-CNN places high memory demands on each GPU, so that one or more high-resolution images can be pushed through the training pipeline. It also requires high-speed GPU-to-GPU interconnects and high-speed networking between machines to synchronize the Allreduce of gradients efficiently. The Amazon SageMaker ml.p3.16xlarge and ml.p3dn.24xlarge instance types meet all of these requirements. For more information, see Amazon SageMaker ML Instance Types. With eight Nvidia Tesla V100 GPUs, 128–256 GB of GPU memory, 25–100 Gbps network interconnects, and high-speed Nvidia NVLink GPU-to-GPU interconnects, they are ideal for distributed TensorFlow training on Amazon SageMaker.

Message Passing Interface

The next challenge of distributed TensorFlow training is to arrange the training algorithm processes across multiple nodes and to associate each process with a unique global rank. The Message Passing Interface (MPI) is a collective communication protocol widely used in parallel computing, and it is very useful for managing a group of training-algorithm worker processes across multiple nodes.

MPI is used to arrange the training algorithm processes across multiple nodes and to associate each algorithm process with a unique global and local rank. Horovod is used to logically pin the algorithm process on a given node to a specific GPU. Synchronous Allreduce of gradients requires that each algorithm process be logically pinned to a specific GPU.
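The relationship between global rank, local rank, and the GPU a process is pinned to can be sketched with simple arithmetic. This assumes processes are laid out node by node, i.e. node 0 holds global ranks 0..gpus_per_node-1, and so on (an illustrative assumption about slot layout, not something the platform mandates):

```python
def local_rank(global_rank, gpus_per_node):
    """Index of the GPU a process is pinned to on its own node,
    assuming ranks fill each node's slots before moving to the next."""
    return global_rank % gpus_per_node

def node_index(global_rank, gpus_per_node):
    """Which node (host) a process runs on under the same layout."""
    return global_rank // gpus_per_node
```

For example, with eight GPUs per node, global rank 11 runs on the second node and is pinned to that node's GPU 3.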

The main MPI concept to understand in this article is that MPI uses mpirun on the master node to launch concurrent processes across multiple nodes. The master node then uses MPI to centrally manage the life cycles of the distributed training processes running across those nodes. To use Amazon SageMaker for distributed training through MPI, you must integrate MPI with Amazon SageMaker's native distributed training capabilities.
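As an illustration of what the master node ends up launching, the following sketch assembles an mpirun invocation as a Python argument list (built, not executed). The training script name is a hypothetical placeholder; only the standard Open MPI options -np (total process count) and -H (host:slots list) are used:

```python
def build_mpirun_command(hosts, gpus_per_host, train_script, script_args=()):
    """Assemble an mpirun command that starts one algorithm process
    per GPU on every host in the training job (illustrative sketch)."""
    num_processes = len(hosts) * gpus_per_host
    host_list = ",".join("%s:%d" % (h, gpus_per_host) for h in hosts)
    return ["mpirun", "-np", str(num_processes), "-H", host_list,
            "python", train_script, *script_args]
```

For a two-node job with eight GPUs per node, this yields 16 processes spread as algo-1:8,algo-2:8.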

Integrate MPI and Amazon SageMaker distributed training

To understand how to integrate MPI and Amazon SageMaker distributed training, you need to have a good understanding of the following concepts:

  • Amazon SageMaker requires the training algorithm and frameworks to be packaged in a Docker image.
  • The Docker image must be enabled for Amazon SageMaker training. This enabling can be simplified by using Amazon SageMaker Containers, a library that helps create Amazon SageMaker-enabled Docker images.
  • You need to provide an entry point script (usually a Python script) in the Amazon SageMaker training image to act as an intermediary between Amazon SageMaker and your algorithm code.
  • To start training on a specified host, Amazon SageMaker runs a Docker container from the training image and then calls the entry point script with entry point environment variables that provide information such as hyperparameters and the input data location.
  • The entry point script uses the information passed to it in the entry point environment variables to start the algorithm program with the correct args, and then polls the running algorithm process.
  • When the algorithm process exits, the entry point script exits with the algorithm process's exit code. Amazon SageMaker uses this exit code to determine whether the training job succeeded.
  • The entry point script redirects the stdout and stderr of the algorithm process to its own stdout. In turn, Amazon SageMaker captures the stdout of the entry point script and sends it to Amazon CloudWatch Logs. Amazon SageMaker parses the stdout output for the algorithm metrics defined in the training job, and then sends the metrics to Amazon CloudWatch Metrics.
  • When Amazon SageMaker starts a training job that requests multiple training instances, it creates a set of hosts and logically names each host algo-k, where k is the host's global rank. For example, if a training job requests four training instances, Amazon SageMaker names the hosts algo-1, algo-2, algo-3, and algo-4. On the network, the hosts can connect to each other using these host names.
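Because the host names encode the rank, an entry point script can recover a host's rank simply by parsing its own name. A minimal sketch (mapping to zero-based ranks, a common MPI convention; the function name is illustrative):

```python
def host_rank(hostname):
    """Recover a training host's global rank from its SageMaker name.

    SageMaker names the hosts algo-1 .. algo-n; this sketch maps
    algo-1 -> 0, algo-2 -> 1, and so on.
    """
    prefix, _, k = hostname.partition("-")
    if prefix != "algo" or not k.isdigit():
        raise ValueError("unexpected host name: %s" % hostname)
    return int(k) - 1
```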

If distributed training uses MPI, you need an mpirun command running on the master node (host) that controls the life cycles of all algorithm processes distributed across multiple nodes (from algo-1 to algo-n, where n is the number of training instances requested in your Amazon SageMaker training job). However, Amazon SageMaker is unaware of MPI, or of any other parallel-processing framework you might use to distribute algorithm processes across nodes. Amazon SageMaker calls the entry point script on the Docker container running on each node. This means the entry point script needs to know the global rank of its node and execute different logic depending on whether it is invoked on the master node or on a non-master node.

Specifically, for MPI, the entry point script invoked on the master node needs to run the mpirun command to start the algorithm processes on all nodes in the host set of the current Amazon SageMaker training job. When invoked by Amazon SageMaker on any non-master node, the same entry point script periodically checks whether the algorithm processes on that node, which mpirun manages remotely from the master node, are still running, and exits when they are not.

A master node in MPI is a logical concept: it depends on the entry point script designating one host as the master among all the hosts in the current training job. This designation must be made in a decentralized manner. A simple approach is to designate algo-1 as the master node and all other hosts as non-master nodes. Because Amazon SageMaker provides each node with its logical host name in the entry point environment variables, a node can straightforwardly determine whether it is the master node or a non-master node.
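A minimal sketch of that decentralized designation, which every node can evaluate independently and arrive at the same answer:

```python
def is_master(current_host, all_hosts):
    """Decentralized master designation: every node sorts the full
    host list and treats the first entry as master. With SageMaker's
    naming this picks algo-1. Note: plain string sort, so a job with
    more than nine hosts would need a numeric sort key instead."""
    return current_host == sorted(all_hosts)[0]
```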
The code included in the attached GitHub repository, packaged in a Docker image together with the [Tensorpack Mask/Faster-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) algorithm, follows the logic outlined in this section.

With this conceptual background in place, you can proceed to the step-by-step tutorial on using Amazon SageMaker to run distributed TensorFlow training for Mask R-CNN.

Solution overview

This tutorial has the following key steps:

  1. Use an AWS CloudFormation automation script to create a private Amazon VPC and an Amazon SageMaker notebook instance attached to this private VPC.
  2. From the Amazon SageMaker notebook instance, start a distributed training job in the Amazon VPC network managed by Amazon SageMaker and attached to your private VPC. You can use Amazon S3, Amazon EFS, or Amazon FSx as the data source for the training data pipeline.


The following are the prerequisites that must be met:

  1. Create and activate an AWS account or use an existing AWS account.
  2. Manage your Amazon SageMaker instance limits. You need at least two ml.p3dn.24xlarge or two ml.p3.16xlarge instances; a service limit of four of each is recommended. Remember that each AWS Region has specific service limits.
  3. Clone the GitHub repository of this article and follow the steps in this article. All paths in this article are relative to the root of the GitHub repository.
  4. Use any AWS region that supports Amazon SageMaker, EFS, and Amazon FSx. This article uses us-west-2.
  5. Create a new S3 bucket or select an existing one.

Create an Amazon SageMaker notebook instance attached to a VPC

The first step is to run the AWS CloudFormation automation script to create an Amazon SageMaker notebook instance attached to a private VPC. To run this script, you need IAM user permissions matching those of a network administrator. If you do not have such permissions, you may need the help of a network administrator to run the AWS CloudFormation automation script in this tutorial. For more information, see AWS Managed Policies for Job Functions.

Use the AWS CloudFormation template cfn-sm.yaml to create an AWS CloudFormation stack, which in turn creates a notebook instance attached to the private VPC. You can create the stack with cfn-sm.yaml in the AWS CloudFormation service console, or you can customize the variables in the script and run the script from any location where the AWS CLI is installed.

To use the AWS CLI method, perform the following steps:

  1. Install the AWS CLI and configure it.
  2. In the script, set AWS_REGION and S3_BUCKET to your AWS Region and your S3 bucket, respectively. These two variables are used throughout.
  3. Alternatively, if you want to use an existing EFS file system, set the EFS_ID variable. If EFS_ID is left blank, a new EFS file system is created. If you choose to use an existing EFS file system, make sure it does not have any existing mount targets. For more information, see Managing Amazon EFS File Systems.
  4. You can also specify GIT_URL to add a GitHub repository to the Amazon SageMaker notebook instance. For GitHub repositories, you can specify the GIT_USER and GIT_TOKEN variables.
  5. Run the customized script to create an AWS CloudFormation stack using the AWS CLI.

Save the summary output of the AWS CloudFormation script for later use. You can also view it under the Output tab of the AWS CloudFormation stack in the AWS Management Console.

Start an Amazon SageMaker training job

Open the notebook instance you created in the Amazon SageMaker console. This notebook instance contains three Jupyter notebooks that can be used to train Mask R-CNN:

  1. Mask R-CNN notebook that uses the S3 bucket as the data source: mask-rcnn-s3.ipynb.
  2. Mask R-CNN notebook that uses the EFS file system as the data source: mask-rcnn-efs.ipynb.
  3. Mask R-CNN notebook that uses the Amazon FSx Lustre file system as the data source: mask-rcnn-fsx.ipynb.

For the Mask R-CNN model selected in this article and the COCO 2017 dataset, the training-time performance of the three data-source options is similar, though not identical. Each data source has a different cost structure, and they differ in how long it takes to set up the training data pipeline:

  • For the S3 data source, each time a training job starts, it takes approximately 20 minutes to copy the COCO 2017 dataset from the S3 bucket to the storage volume attached to each training instance.
  • For the EFS data source, it takes approximately 46 minutes to copy the COCO 2017 dataset from the S3 bucket to the EFS file system. You only need to copy the data once. During training, data is read through the network interface from the shared EFS file system mounted on all training instances.
  • For Amazon FSx, it takes about 10 minutes to create a new Amazon FSx Lustre file system and import the COCO 2017 dataset from your S3 bucket into it. You only need to do this once. During training, data is read through the network interface from the shared Amazon FSx Lustre file system mounted on all training instances.

If you are not sure which data source option is more suitable for you, try S3 first. If the training-data download time at the start of each training job is unacceptable, explore EFS or Amazon FSx. Do not make assumptions about the training-time performance of any data source: it depends on many factors, and the best practice is to experiment and measure.
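The setup times quoted above imply a simple break-even calculation: the per-job S3 copy (~20 minutes) overtakes the one-time EFS copy (~46 minutes) after a few jobs. A sketch of the cumulative setup cost, using only the figures quoted above (actual copy times vary with dataset and configuration):

```python
def s3_setup_minutes(num_jobs, copy_minutes=20):
    """S3 data source: the dataset is copied at the start of every job."""
    return num_jobs * copy_minutes

def efs_setup_minutes(num_jobs, one_time_copy_minutes=46):
    """EFS data source: one up-front copy shared by all subsequent jobs."""
    return one_time_copy_minutes  # independent of num_jobs
```

With these figures, S3 accumulates more setup time than EFS from the third training job onward (60 minutes vs. 46).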

In all three cases, the logs and model checkpoints output during training are written to the storage volume attached to each training instance, and then uploaded to your S3 bucket when training completes. The logs are also fed into Amazon CloudWatch Logs during training, and you can inspect them while training is in progress. System and algorithm training metrics are fed into Amazon CloudWatch Metrics during training, and you can visualize them in the Amazon SageMaker service console.
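The metric scraping works by applying regexes to the entry point script's stdout. The following sketch illustrates the mechanism with a hypothetical metric definition (the name and pattern are illustrative examples, not the repository's actual definitions):

```python
import re

# Hypothetical metric definition of the kind passed to a training job:
# a name plus a regex with one capture group, applied to stdout lines.
metric_definitions = [
    {"Name": "mAP(bbox)", "Regex": r"mAP\(bbox\)/IoU=0\.5:0\.95:\s*([0-9.]+)"},
]

def extract_metrics(log_line, definitions):
    """Apply each metric regex to a log line, as the metric scraper
    does when parsing algorithm output for CloudWatch Metrics."""
    found = {}
    for d in definitions:
        m = re.search(d["Regex"], log_line)
        if m:
            found[d["Name"]] = float(m.group(1))
    return found
```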

Training results

The following figures show sample results of the two algorithms on the COCO 2017 dataset after 24 training epochs.
You can view the sample results of the Tensorpack Mask/Faster-RCNN algorithm below. The diagrams can be grouped into three buckets:

  1. Plots of the mean average precision (mAP) of bounding-box detection at various intersection-over-union (IoU) thresholds and for small, medium, and large object sizes
  2. Plots of the mean average precision (mAP) of object instance segmentation (segm) at various intersection-over-union (IoU) thresholds and for small, medium, and large object sizes
  3. Other metrics related to training loss or label accuracy


You can view the sample results of the optimized AWS Samples Mask R-CNN algorithm below. The aggregate mAP metrics shown are almost identical to those of the previous algorithm, although convergence progresses differently.


Conclusion

Amazon SageMaker provides a simplified, Docker-based platform for distributed TensorFlow training, allowing you to focus on the ML algorithm without being distracted by lower-level concerns such as infrastructure availability, scalability mechanisms, and concurrent experiment management. When model training is complete, you can use Amazon SageMaker's integrated model deployment capability to create an auto-scaling RESTful service endpoint for your model and start testing it.

For more information, see Deploying a Model on Amazon SageMaker Hosting Services. When the model is ready, you can seamlessly deploy its RESTful service to production.
