The interest and demand for training deep neural networks have been experiencing rapid growth, spanning a wide range of applications in both academia and industry. However, training them at scale in a distributed fashion remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility
of orchestrating these complex components is often left to one-off scripts and glue code customized for specific problems. To address these restrictions, we introduce Alchemist – an internal service built at Apple from the ground up for easy, fast, and scalable distributed training. We discuss its design, implementation, and examples of running
different flavors of distributed training. We also present case studies of its internal adoption in the development of autonomous systems, where training times have been reduced by 10x to keep up with the ever-growing data collection.
Some of the key challenges in engineering deep learning models include:
- Both batch and interactive training modes have to be supported.
- A POSIX interface is often preferred by users for its simplicity when interacting with datasets.
- The software stack for training DNNs has become more complex than ever, ranging from low-level GPU libraries (CUDA, cuDNN) to high-level frameworks.
- Integrating these software tools comes with dependency-management overhead.
- The software environment has to be identical in the cloud and on the engineer's desk to simplify development, debugging, and code sharing.
To address the above challenges, Apple engineers built a system known as Alchemist. Alchemist adopts a cloud-native architecture and is portable across private and public clouds. It supports multiple training frameworks, such as TensorFlow and PyTorch, and multiple distributed training paradigms. The compute cluster is managed by Kubernetes, though the design is not tied to it.
Storage Layer. This layer contains the distributed file systems and object storage systems that maintain the training and evaluation datasets, the trained model artifacts, and
the code assets. Depending on the size and location of the datasets and annotations, the options range from using solely NFS, to solely an object store, to a mixture of different storage systems.
Compute Layer. This layer contains the compute-intensive workloads; in particular, the training and evaluation tasks orchestrated by the job scheduler. To improve compute resource utilization, a cluster auto-scaler watches the job queues and dynamically adjusts the size of the cluster.
API Layer. This layer exposes services that allow users to upload and browse the code assets, submit distributed jobs, and query the logs and metrics.
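To make the API layer concrete, here is a minimal sketch of what a job-submission payload to such a service might look like. The field names, the `JobSpec` class, and the `/jobs`-style serialization helper are illustrative assumptions, not the actual Alchemist API.

```python
from dataclasses import dataclass, asdict

@dataclass
class JobSpec:
    """Hypothetical job-submission spec for an Alchemist-style API layer."""
    name: str
    image: str            # container image holding the training environment
    command: list         # entry point, e.g. ["python", "train.py"]
    num_workers: int = 1  # number of distributed tasks (gang-scheduled together)
    gpus_per_worker: int = 0
    queue: str = "default"  # job queue the scheduler watches

def to_request_body(spec: JobSpec) -> dict:
    """Serialize the spec for a hypothetical POST /jobs endpoint."""
    body = asdict(spec)
    body["total_gpus"] = spec.num_workers * spec.gpus_per_worker
    return body

spec = JobSpec(
    name="resnet50-imagenet",
    image="registry.example.com/train:latest",
    command=["python", "train.py", "--epochs", "90"],
    num_workers=8,
    gpus_per_worker=4,
)
print(to_request_body(spec)["total_gpus"])  # 32
```

Keeping the entire environment in the container image is what makes the submitted job reproducible on both the cloud cluster and the engineer's desk.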
Gang scheduling. A scheduler allocates and manages the set of tasks required for a training job. A submitted job is placed in a job queue for the scheduler to process. When the required CPU, GPU, memory, and storage resources become available, the scheduler
launches the necessary task containers. It then monitors their status, and in case one fails, it will either terminate all of the job’s tasks or re-launch the failed task.
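The all-or-nothing behavior described above can be sketched in a few lines: a job launches only when resources for every one of its tasks are available at once, and a single task failure tears down the whole gang. The class and policy below are a simplified illustration, not Alchemist's internals.

```python
from collections import deque

class GangScheduler:
    """Toy gang scheduler: allocate all of a job's tasks or none of them."""

    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self.queue = deque()   # FIFO job queue: (name, total gpus needed)
        self.running = {}      # job name -> gpus held

    def submit(self, name, num_tasks, gpus_per_task):
        self.queue.append((name, num_tasks * gpus_per_task))

    def schedule(self):
        """Launch queued jobs whose *full* resource demand fits."""
        launched = []
        while self.queue and self.queue[0][1] <= self.free_gpus:
            name, gpus = self.queue.popleft()
            self.free_gpus -= gpus      # all-or-nothing allocation
            self.running[name] = gpus
            launched.append(name)
        return launched

    def on_task_failure(self, name):
        """One failed task terminates the whole job; free its resources."""
        self.free_gpus += self.running.pop(name)

sched = GangScheduler(total_gpus=16)
sched.submit("job-a", num_tasks=2, gpus_per_task=4)  # needs 8 GPUs
sched.submit("job-b", num_tasks=4, gpus_per_task=4)  # needs 16, must wait
print(sched.schedule())  # ['job-a']
```

Gang scheduling avoids the deadlock where two jobs each hold half of their required workers and neither can make progress.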
Storage System Choices
- There are mainly four types of files stored in the system: container images, training code, datasets, and model artifacts.
- Container images are treated separately and stored in an image registry.
- Users' code is stored in an object storage system.
- For datasets and model artifacts, Alchemist allows the use of either an object store or a shared filesystem.
- For small datasets, users can download the datasets from an object storage system before training starts.
- For medium-sized datasets, a shared distributed filesystem with a POSIX-compatible interface can be mounted inside the task containers.
- For large datasets and large-scale hyperparameter tuning jobs, a high-performance in-memory storage system or streaming system is often preferred.
- Leveraging gang scheduling, Alchemist can launch task containers on memory-optimized instances alongside the GPU instances, caching and streaming the datasets while the GPU instances train the model.
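The tiered choice above can be expressed as a simple size-based policy. The thresholds and backend names below are illustrative assumptions; the paper does not prescribe specific cutoffs.

```python
GiB = 1024 ** 3

def choose_storage(dataset_bytes):
    """Pick a storage strategy by dataset size (illustrative thresholds)."""
    if dataset_bytes < 50 * GiB:
        return "object-store-download"    # small: fetch once before training
    if dataset_bytes < 1024 * GiB:
        return "shared-filesystem-mount"  # medium: POSIX mount in the container
    return "in-memory-streaming"          # large: cache nodes stream to GPUs

print(choose_storage(10 * GiB))    # object-store-download
print(choose_storage(2048 * GiB))  # in-memory-streaming
```

In practice, the location of the data (same region as the cluster or not) matters as much as its size, which is why Alchemist leaves the choice to the user.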
Alchemist automatically scales the compute cluster based on job requests and utilization statistics; the auto-scaling service is optimized for distributed training workloads.
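A queue-driven auto-scaler of this kind can be reduced to a target-size calculation: provision enough nodes to cover the GPUs already in use plus the aggregate demand sitting in the job queues. The policy below is a simplified assumption, not Alchemist's actual algorithm.

```python
import math

def desired_nodes(queued_gpu_demand, busy_gpus, gpus_per_node, min_nodes=1):
    """Nodes needed to cover running work plus everything in the queue."""
    needed = busy_gpus + queued_gpu_demand
    return max(min_nodes, math.ceil(needed / gpus_per_node))

# Example: 24 GPUs requested by queued jobs, 16 already busy, 8-GPU nodes.
print(desired_nodes(queued_gpu_demand=24, busy_gpus=16, gpus_per_node=8))  # 5
```

Because distributed jobs are gang-scheduled, scaling up in whole-job increments (rather than one node at a time) avoids paying for nodes that sit idle while the rest of a gang's capacity arrives.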
Alchemist has been adopted by internal teams doing research and development in autonomous systems, especially perception. Due to the huge diversity of training examples and the complexity of the models, DNN training requires efficient machinery to enable a rapid development cycle. Alchemist has helped extensively in areas such as 2D object detection in images, 3D object detection in point clouds, and semantic segmentation by reducing training time.
Read the original paper here.