A high standard deviation indicates that the metric varies substantially throughout the job's execution.

Still, our data hints that memory bandwidth is another resource that may be overprovisioned on average in HPC. However, we cannot relate our findings to the impact on application performance if we were to reduce available memory bandwidth, because applications may exhibit brief but important phases of high memory bandwidth usage that would be penalized by a lower available memory bandwidth.

Figure 4 shows a CDF of the average idle time among all compute cores in nodes, expressed as a percentage. These statistics were generated using "proc" reports of idle kernel cycles for each hardware thread. To generate a sample for each node every 1 s, we average the idle percentage of all hardware threads in that node. As shown, about half of the time Haswell nodes have at most a 49.9% CPU idle percentage and KNL nodes 76.5%. For Haswell nodes, the average system-wide CPU idle time in each sampling period never drops below 28% over a 30 s period, and for KNL below 30%. These statistics are largely due to the two hardware threads per compute core in Haswell and four in KNL, because in Cori Haswell nodes use only one hardware thread 80% of the time and KNL nodes 50% of the time. Similarly, many jobs reserve entire nodes but do not use all cores in those nodes. Datacenters have also reported 28%–55% CPU idle in the case of Google trace data and 20%–50% most of the time in Alibaba.
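The collection pipeline is not detailed beyond the "proc" idle-cycle reports; the following minimal sketch shows one plausible way to aggregate per-hardware-thread /proc/stat counters into the per-node 1 s idle samples described above. The helper names are illustrative and are not the tooling used on Cori.

```python
# Sketch: derive a per-node CPU idle sample from per-hardware-thread
# /proc/stat counters (one plausible reading of the method described above).
from typing import Dict, Tuple

def read_cpu_ticks(path: str = "/proc/stat") -> Dict[str, Tuple[int, int]]:
    """Return {hw_thread: (idle_ticks, total_ticks)} parsed from /proc/stat."""
    ticks = {}
    with open(path) as f:
        for line in f:
            # Per-thread lines are "cpu0", "cpu1", ...; skip the aggregate "cpu" line.
            if line.startswith("cpu") and line[3].isdigit():
                fields = line.split()
                vals = [int(v) for v in fields[1:]]
                idle = vals[3] + vals[4]          # idle + iowait ticks
                ticks[fields[0]] = (idle, sum(vals))
    return ticks

def node_idle_percent(prev: Dict[str, Tuple[int, int]],
                      curr: Dict[str, Tuple[int, int]]) -> float:
    """Average idle percentage over all hardware threads between two samples,
    i.e., one per-node data point for the CDF in Figure 4."""
    per_thread = []
    for cpu, (idle_c, total_c) in curr.items():
        idle_p, total_p = prev[cpu]
        dt = total_c - total_p
        per_thread.append(100.0 * (idle_c - idle_p) / dt if dt else 0.0)
    return sum(per_thread) / len(per_thread)
```

Calling read_cpu_ticks once per second and feeding consecutive samples to node_idle_percent yields the 1 s per-node idle series whose distribution is summarized above.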

We measure per-node injection bandwidth at every NIC by using hardware counters in the Cray Aries interconnect. These counters record how many payload bytes each node sent to and received from the Aries network. We report results for each node as a percentage utilization of the maximum per-node NIC bandwidth of 16 GB/s per direction. We also verify against similar statistics generated using NIC flit counters multiplied by the flit size. In Cori, access to the global file system uses the Aries network, so our statistics include file system accesses. Figure 4 shows a CDF of node-wide NIC bandwidth utilization. As shown, 75% of the time Haswell nodes use at most 0.5% of available NIC bandwidth; for KNL nodes the corresponding percentage is 1.25%. In addition, NIC bandwidth consistently exhibits sustained bursty behavior: in a two-week period, the sustained 30 s average NIC bandwidth exceeded the overall average by more than 3× on about 60 separate occasions.

In this section, we analyze how much metrics change across a job's lifetime. Figure 5 shows a CDF of the standard deviation of all values throughout each job's execution, calculated separately for different metrics. This graph was generated by calculating the standard deviation of the values of each metric that each job has throughout the job's execution. To normalize for the different absolute values of each job, the standard deviation is expressed as a percentage of the per-job average value of each metric. A value of 50% indicates that the job's standard deviation is half of the job's average for that metric. As shown, occupied memory and CPU idle percentages do not vary much during job execution, but memory and NIC bandwidths do. The variability of memory and NIC bandwidths is intuitive, because many applications exhibit phases of low and high memory bandwidth. Network and memory bandwidth have previously been observed to be bursty for many applications. In contrast, once an application completes reserving memory capacity, the reservation's size typically does not change significantly until the application terminates. These observations are important for provisioning resources for disaggregation.
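The per-job normalization used for Figure 5 amounts to a coefficient of variation expressed as a percentage. A minimal sketch of that calculation follows; the example trace is synthetic and purely illustrative.

```python
import numpy as np

def normalized_stddev_percent(samples: np.ndarray) -> float:
    """Standard deviation of a job's time series for one metric, expressed as a
    percentage of that job's mean (as in Figure 5). A value of 50 means the
    deviation is half of the job's average for that metric."""
    mean = samples.mean()
    if mean == 0:
        return 0.0
    return 100.0 * samples.std() / mean

# Illustrative memory-bandwidth trace (GB/s) for a job that alternates
# between low- and high-bandwidth phases: the normalized deviation is large.
trace = np.array([2.0, 14.0, 2.5, 13.0, 3.0, 12.5])
print(normalized_stddev_percent(trace))
```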

For metrics that do not vary considerably throughout a job's execution, average system-wide or per-job measurements of those metrics are more representative of future behavior. Therefore, provisioning for average utilization, perhaps with an additional margin such as one standard deviation, will likely satisfy application requirements for those metrics the majority of the time. In contrast, for metrics that vary considerably, the average is less representative; for those metrics, resource disaggregation should consider the maximum or a near-maximum value.

The workloads we study are part of the MLPerf benchmark suite version 0.7. MLPerf is maintained by a large consortium of companies that are pioneering AI in terms of neural network design and system architectures for both training and inference. The workloads in the suite have been developed by world-leading research organizations and are representative of workloads that execute in actual production systems. We select a range of different neural networks, representing different applications:

Transformer: Transformer's novel multi-head attention mechanism allows for better parallel processing of the input sequence and therefore faster training times, and it also overcomes the vanishing-gradient issue that typical RNNs suffer from. For these reasons, Transformers became the state of the art for natural language processing tasks. Such tasks include machine translation, time-series prediction, and text understanding and generation. Transformers are the fundamental building block for networks such as bidirectional encoder representations from transformers (BERT) and GPT, and they have also been demonstrated on vision tasks.

BERT: The BERT network implements only the encoder and is designed to be a language model.

Training is often done in two phases: the first phase is unsupervised, to learn representations, and the second phase fine-tunes the network with labeled data. Language models are deployed in translation systems and human-to-machine interactions. We focus on supervised fine-tuning.

ResNet50: Vision tasks, particularly image classification, were among the first to make DL popular. First developed by Microsoft, ResNet50 is often regarded as a standard benchmark for DL tasks and is one of the most used DL networks for image classification and segmentation, object detection, and other vision tasks.

DLRM: The last benchmark in our study is the deep-learning recommendation model (DLRM). Recommender systems differ from the other networks in that they deploy vast embedding tables. These tables are sparsely accessed before a dense representation is fed into a more classical neural network. Many companies deploy these systems to offer customers recommendations based on the items they bought or the content they enjoyed.

All workloads are run using the official MLPerf Docker containers on the datasets that are also used for the official benchmark results. Given the large number of hyperparameters needed to run these networks, we refer to the Docker containers and scripts used to run the benchmarks. We only adapt batch sizes and denote them in our results.

Machine learning typically consists of two phases: training and inference. During training, the network learns and optimizes parameters from a carefully curated dataset. Training is a throughput-critical task, and input samples are batched together to increase efficiency. Inference, however, is done on a trained and deployed model and is often latency-sensitive. Input batches are usually smaller, and there is less computation and a lower memory footprint, since no errors need to be backpropagated and parameters are not optimized. We measure various metrics for training and inference runs of BERT and ResNet50. Results are shown in Figure 12. GPU utilization is high for BERT during both training and inference, in terms of both compute and memory capacity utilization. However, CPU compute and memory capacity utilization is low. ResNet50, in contrast, shows high CPU compute utilization, which is even higher during inference. Inference requires significantly less computation, which means the CPU is more utilized relative to training in order to provide the data and launch the work on the GPU. Training consumes significantly more GPU resources, especially memory capacity. This is not surprising, as these workloads were designed for maximal performance on GPUs. Certain parts of the system, notably CPU resources, remain underutilized, which motivates disaggregation. Further, training and inference have different requirements, and disaggregation helps to provision resources accordingly. The need for disaggregation is also evident in NVIDIA's introduction of the multi-instance GPU, which allows a GPU to be partitioned into seven smaller, independent GPUs. Our work takes this further and considers disaggregation at the rack scale.
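The measurement tooling behind these CPU and GPU utilization figures is not specified here; one plausible way to collect such 1 s samples, assuming the psutil and pynvml packages are available, is sketched below.

```python
# Sketch of a 1 s utilization sampler for CPU and GPU metrics of the kind
# reported in Figure 12. Using psutil/pynvml is an assumption for illustration;
# it is not the paper's stated measurement method.
import time
import psutil
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

while True:
    cpu_util = psutil.cpu_percent(interval=None)         # % CPU compute
    host_mem = psutil.virtual_memory().percent           # % host memory capacity
    gpu_samples = []
    for h in handles:
        rates = pynvml.nvmlDeviceGetUtilizationRates(h)  # .gpu / .memory, in %
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # bytes used / total
        gpu_samples.append((rates.gpu, 100.0 * mem.used / mem.total))
    print(cpu_util, host_mem, gpu_samples)
    time.sleep(1.0)                                      # 1 s sampling interval
```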

Figure 12 shows resource utilization during the training of various MLPerf workloads. These benchmarks are run on a single DGX1 system with 8 Volta V100 GPUs. While CPU utilization is generally low, GPU utilization is consistently high across all workloads. We also depict the bandwidth utilization of NVLink and PCIe. Note that bandwidth is shown as the average effective bandwidth across all GPUs for the entire measurement period. In addition, we can only sample in intervals of 1 s, which limits our ability to capture peaks in high-speed links like NVLink. Nonetheless, we observe that the overall effective bandwidth is low, which suggests that the links are not highly utilized on average. All of the shown workloads are data-parallel, while DLRM also implements model parallelism. In data parallelism, parameters need to be reduced across all workers, resulting in an all-reduce operation for every optimization step. As a result, the network can be underutilized during the computation of parameter gradients. The highest bandwidth utilization is from DLRM, which is attributed to its model-parallel phase and its all-to-all communication.

Another factor in utilization is inter-node scaling. We run BERT and ResNet50 on up to 16 DGX1 systems connected via InfiniBand EDR. The results are depicted in Figure 13. For ResNet50, we also distinguish between weak scaling and strong scaling; BERT is shown for weak scaling only. Weak scaling is preferred, as it generally leads to higher efficiency and utilization, as shown by our results. In data parallelism, this means we keep the number of input samples per GPU, referred to as the sub-batch, constant, so the effective global batch size increases as we scale out. At some point the global batch size reaches a critical limit after which the network stops converging, or converges more slowly so that any performance benefit diminishes. At this point, strong scaling becomes the only option to further reduce training time. As Figure 13 shows, strong scaling increases the bandwidth requirements, both intra- and inter-node, but reduces the compute and memory utilization of individual GPUs. While some underutilization is still beneficial in terms of total training time, it eventually becomes too inefficient. Many neural networks train for hours or even days on small-scale systems, rendering large-scale training necessary. This naturally leads to some underutilization of certain resources, which motivates disaggregation to allow resources to be used for other tasks.

To understand what our observations mean for resource disaggregation, we assume a Cori cabinet that contains Haswell nodes. That is, each cabinet has 384 Haswell CPUs; 768 memory modules of 17 GB/s and 16 GB each; and 192 NICs that connect to nodes at 16 GB/s per direction. We use memory bandwidth measurements from Haswell nodes, and memory capacity and NIC bandwidth from both Haswell and KNL nodes. Since a benefit of resource disaggregation is an increased utilization factor that can help reduce overall resources, we define and sweep a resource reduction factor percentage for non-CPU resources. For instance, a resource reduction factor of 50% means that our cabinet has half the memory and NIC resources of a Cori Haswell cabinet. The resource reduction factor is a useful way to illustrate the tradeoff between the range of disaggregation and how aggressively we reduce available resources.
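As a concrete illustration of the resource reduction factor, the following sketch applies it to the cabinet configuration quoted above; the type and function names are illustrative.

```python
# Sketch: aggregate per-cabinet resource pools under a given resource
# reduction factor, using the Cori Haswell cabinet figures quoted above.
from dataclasses import dataclass

@dataclass
class CabinetPool:
    mem_capacity_gb: float     # total disaggregated memory capacity
    mem_bandwidth_gbs: float   # total memory bandwidth
    nic_bandwidth_gbs: float   # total NIC injection bandwidth (per direction)

def reduced_cabinet(reduction_pct: float) -> CabinetPool:
    """Pools of a Cori Haswell cabinet after removing reduction_pct percent
    of the non-CPU resources (the 384 CPUs are left unchanged)."""
    keep = 1.0 - reduction_pct / 100.0
    return CabinetPool(
        mem_capacity_gb=768 * 16 * keep,     # 768 modules x 16 GB
        mem_bandwidth_gbs=768 * 17 * keep,   # 768 modules x 17 GB/s
        nic_bandwidth_gbs=192 * 16 * keep,   # 192 NICs x 16 GB/s per direction
    )

print(reduced_cabinet(50.0))  # half the memory and NIC resources of a cabinet
```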
We assume resource disaggregation hardware capable of allocating resources in a fine-grained manner, such as allocating fractions of the memory capacity of the same memory module to different jobs or CPUs.

We perform two kinds of analyses. The first analysis is agnostic to job characteristics and quantifies the probability that a CPU will have to connect to resources in other racks. This more readily translates to inter-rack bandwidth requirements for system-wide disaggregation. To that end, we assume that a CPU can be allocated to any application running in the system with uniform random probability, and we use the CDFs of resource usage from Section 3 as probability distributions. As shown in Figure 14, with no resource reduction a CPU has to cross a rack 0.01% of the time to find additional memory bandwidth, 0.16% to find additional memory capacity, and 0.28% to find additional NIC bandwidth. With a 50% resource reduction factor, these numbers become 20.2%, 11%, and 2%, respectively.
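The exact accounting behind these crossing probabilities is not spelled out here; one simple way to estimate such a probability from the measured utilization distributions, under the uniform-random allocation assumption, is sketched below. The synthetic samples are purely illustrative, and the paper's actual methodology may differ.

```python
# Sketch: estimate how often a CPU must reach outside its rack for a resource,
# by treating measured utilization samples as an empirical distribution and
# shrinking the locally available amount by the resource reduction factor.
import numpy as np

def cross_rack_probability(util_pct_samples: np.ndarray,
                           reduction_pct: float) -> float:
    """Fraction of samples whose demand (as % of the baseline per-node resource)
    exceeds what remains locally after the reduction."""
    available_pct = 100.0 - reduction_pct
    return float(np.mean(util_pct_samples > available_pct))

# Illustrative use with synthetic utilization percentages:
samples = np.clip(np.random.default_rng(0).gamma(shape=2.0, scale=10.0,
                                                 size=100_000), 0, 100)
for r in (0.0, 25.0, 50.0):
    print(r, cross_rack_probability(samples, r))
```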