# Reduction Operations on Modern Supercomputers: Challenges and Solutions

Mohammadreza Bayatpour, Jahanzeb Maqbool Hashmi, Sourav Chakraborty, Hari Subramoni, Dhabaleswar K. Panda

# Importance of Reduction Operations


- One of the most popular MPI collectives
- Widely used in Deep Learning frameworks and Scientific applications
- Extensive usage of compute resources as well as network

# **Research Challenges:**

- Efficient usage of network offload mechanisms and high-throughput network
- Enhanced usage of one-sided semantics and cache locality
- Efficient pipeline and overlap across various design phases
- Dynamic and adaptive communication

# Small Messages

### Approaches

- Onloading approach: CPU-assisted approach
- Offloading approach: using HCA (CORE-Direct) or Switch (SHArP)

### Scalable Hierarchical Aggregation Protocol (SHArP)

Manipulation of data while it is being transferred in the switch network



# Challenges

- Current designs are not NUMA-aware
- Limited performance due to extra cross-socket transfers
- Low performance for medium and large message ranges

Figure courtesy of Mellanox Technologies

# Naive SHArP Design

- SHArP is used only in the inter-node reduction operation
- Step 1: Intra-node reduction by one process in each node
- Step 2: Inter-node Allreduce using SHArP
- Step 3: Broadcast of the final result from the node leader to the other processes
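A minimal Python sketch of the three steps above, simulated in a single process with plain lists rather than MPI. The function name `naive_hierarchical_allreduce` and the nested-list data layout are illustrative assumptions. Step 2 is computed here as an ordinary sum on the host; on real hardware that step is the one offloaded to the SHArP-capable switches.

```python
from typing import List

Vector = List[float]


def naive_hierarchical_allreduce(nodes: List[List[Vector]]) -> List[List[Vector]]:
    """Simulate the three-step design; nodes[n][p] is the vector of process p on node n."""
    # Step 1: one leader per node reduces the vectors of all local processes.
    node_sums = [[sum(vals) for vals in zip(*procs)] for procs in nodes]
    # Step 2: inter-node Allreduce among the node leaders
    # (assumption: modelled as a plain element-wise sum instead of a switch offload).
    total = [sum(vals) for vals in zip(*node_sums)]
    # Step 3: every node leader broadcasts the final result to its local processes.
    return [[list(total) for _ in procs] for procs in nodes]


# Two nodes with two processes each, every process holding a 2-element vector.
data = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
print(naive_hierarchical_allreduce(data)[0][0])  # [16.0, 20.0]
```

Only the node leaders take part in Step 2, so every other process on the node is idle during the inter-node phase.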

# NUMA-Aware SHArP Design (a)

- Mixture of CPU-assisted designs and offloaded approaches
- Topology-aware (hierarchical): two-level designs
- Introducing a socket-level leader process to limit QPI transfers
- Allowing the leader process in each socket to use SHArP
- Using the CPU for intra-socket reduction operations
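The socket-leader idea can be sketched in the same single-process style. The names `numa_aware_allreduce` and `cross_socket_reads`, the `nodes[n][s][p]` layout, and the placement of the single node leader on socket 0 are assumptions made for illustration; the transfer count is a deliberately simplified model, not a measurement.

```python
from typing import List

Vector = List[float]


def vec_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def numa_aware_allreduce(nodes: List[List[List[Vector]]]) -> List[List[List[Vector]]]:
    """nodes[n][s][p] is the vector of process p on socket s of node n."""
    # CPU-assisted part: each socket leader reduces the processes on its own socket.
    socket_sums = [[vec_sum(procs) for procs in sockets] for sockets in nodes]
    # Offloaded part: every socket leader joins the inter-node Allreduce
    # (assumption: modelled as a host-side sum in place of SHArP).
    total = vec_sum([s for sockets in socket_sums for s in sockets])
    # Each socket leader hands the result to the processes on its socket.
    return [[[list(total) for _ in procs] for procs in sockets] for sockets in nodes]


def cross_socket_reads(nodes: List[List[List[Vector]]], socket_leaders: bool) -> int:
    """Count buffers that cross the socket interconnect during the intra-node reduce."""
    if socket_leaders:
        return 0  # each leader only reads buffers from its own socket
    # Single node leader on socket 0 reads every buffer that lives on another socket.
    return sum(len(procs) for sockets in nodes for procs in sockets[1:])


# Two nodes, two sockets per node, two processes per socket.
data = [[[[1.0], [2.0]], [[3.0], [4.0]]], [[[5.0], [6.0]], [[7.0], [8.0]]]]
print(numa_aware_allreduce(data)[1][1][0])  # [36.0]
print(cross_socket_reads(data, socket_leaders=False))  # 4
print(cross_socket_reads(data, socket_leaders=True))   # 0
```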



References:

- **a** Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, **Bayatpour** et al., Supercomputing '17, Denver, CO
- **b** SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives, Bayatpour et al., IEEE Cluster '18, Belfast, UK
- **c** Baidu Allreduce Design: https://github.com/baidu-research/baidu-allreduce

# **Proposed Solutions:**

- Enhanced SHArP network offload
  - Target: Small Messages
- Data Partitioning-based Multi-Leader design
  - Target: Medium Messages
- XPMEM/SHMEM-based scalable and adaptive design
  - Target: Large Messages

# Medium Messages

### Approaches

- Topology-aware (hierarchical): Two-level designs (intra-node reduce + inter-node Allreduce)
- Flat designs: Tree-based designs

#### **Communication Characteristics of Modern Architectures**



### Challenges

- Hierarchical designs: do not take advantage of the high concurrency in new architectures
- Tree-based designs: too many inter-node communications and a deep hierarchy

# Data Partitioning-based Multi-Leader (DPML) Design (a)

- Using shallow hierarchies: small depth and a large number of children per parent
- Taking advantage of the high throughput of concurrent medium messages
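One way to read the data-partitioning idea is sketched below in Python, again simulated with plain lists instead of MPI. The function name `dpml_allreduce`, the even partitioning rule, and the `num_leaders` parameter are assumptions for illustration: each of the `num_leaders` leaders on a node owns one slice of the message, reduces that slice locally, and then exchanges it only with the leaders that own the same slice on other nodes.

```python
from typing import List

Vector = List[float]


def dpml_allreduce(nodes: List[List[Vector]], num_leaders: int) -> List[List[Vector]]:
    """nodes[n][p] is the vector of process p on node n; every vector has equal length."""
    length = len(nodes[0][0])
    # Leader j owns the index range [bounds[j], bounds[j + 1]).
    bounds = [j * length // num_leaders for j in range(num_leaders + 1)]
    result = [0.0] * length
    for j in range(num_leaders):  # on a real system these leaders run concurrently
        lo, hi = bounds[j], bounds[j + 1]
        # Intra-node: leader j on each node reduces slice j of all local processes.
        partials = [[sum(p[k] for p in procs) for k in range(lo, hi)] for procs in nodes]
        # Inter-node: the leaders with index j combine their slice across nodes.
        result[lo:hi] = [sum(vals) for vals in zip(*partials)]
    # Every process ends up with the full reduced vector.
    return [[list(result) for _ in procs] for procs in nodes]


data = [
    [[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]],  # node 0
    [[2.0, 2.0, 2.0, 2.0], [0.0, 1.0, 0.0, 1.0]],  # node 1
]
print(dpml_allreduce(data, num_leaders=2)[0][0])  # [4.0, 6.0, 6.0, 8.0]
```

With `num_leaders=1` the sketch degenerates to a single-leader hierarchical design; larger values turn one message into several smaller inter-node messages that can be in flight at the same time.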



#### Performance of DPML Designs on KNL+Omni-Path

[Figure: two panels, MiniAMR (32 PPN) and MPI Allreduce (4,096 Processes, 64 Nodes, 64 PPN), comparing MVAPICH2, DPML, and IMPI; one panel carries a 52% annotation.]






# Approaches

- Intra-node zero copy mechanism
- Pipelined inter-node Allreduce
- **Communication Adaptive**



# Challenges

- Efficient pipeline of various steps and usage of XPMEM/SHMEM
- Efficient utilization of compute resources in all processes
- Orchestrating the data transfers to effectively utilize the network bandwidth without oversubscribing a particular link

# Scalable and Adaptive Designs for Large Message Reduction Collectives (SALaR)

2. SALaR-Inter: An efficient one-sided Inter-node Allreduce








# Large Messages

- Inter-node one-sided communications
- Inter-node pipelining with intra-node operations

Optimization Methods (columns 1 to 5) by design:

| Designs     | Applicability | 1 | 2 | 3 | 4 | 5 |
|-------------|---------------|---|---|---|---|---|
| …educe      | GPU           | × | × | ✓ | ✓ | × |
| …elining    | GPU           | × | × | ✓ | ✓ | × |
| …-Allgather | CPU/GPU       | × | × | × | × | × |
| …I Ring     | GPU/CPU       | × | × | ✓ | ✓ | × |
| … Reduction | CPU           | ✓ | × | × | × | × |
| SALaR       | CPU           | ✓ | ✓ | ✓ | ✓ | ✓ |

1. SALaR-SHMEM/XPMEM: A pipelined Allreduce design that uses XPMEM/SHMEM for intra-node reduction and SALaR-Inter for inter-node reduction. The intra-node operation is overlapped with the inter-node operation.

#### SALaR-SHMEM Timeline

Processes in Node 0, one row per process:

| Iteration {i}  |      |                   | Iteration {i+1}               |      |                 |
|----------------|------|-------------------|-------------------------------|------|-----------------|
| …e Chunk {i-1} |      | Bcast Chunk …     | Inter-node Allreduce Chu…     |      | Bcast Chunk {i} |
| …uce           | Wait | Bcast Chunk {i-1} | Intra-node Reduce Chunk {i+1} | Wait | Bcast Chunk {i} |
| …uce           | Wait | Bcast Chunk {i-1} | Intra-node Reduce Chunk {i+1} | Wait | Bcast Chunk {i} |
| ...            |      |                   | ...                           |      |                 |
| …uce           | Wait | Bcast Chunk {i-1} | Intra-node Reduce Chunk {i+1} | Wait | Bcast Chunk {i} |
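The chunked pipeline in the timeline can be sketched as follows. This is a sequential, single-process simulation under stated assumptions: the name `pipelined_allreduce`, the `chunk_size` parameter, and the schedule labels are hypothetical, and the overlap is only recorded in the returned schedule rather than executed concurrently. In iteration `i` the intra-node reduce of chunk `i` is paired with the inter-node Allreduce and broadcast of chunk `i-1`.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def pipelined_allreduce(nodes: List[List[Vector]], chunk_size: int) -> Tuple[Vector, List[List[str]]]:
    """nodes[n][p] is the vector of process p on node n. Returns (result, schedule)."""
    length = len(nodes[0][0])
    chunks = [(s, min(s + chunk_size, length)) for s in range(0, length, chunk_size)]
    pending: Dict[int, List[Vector]] = {}  # chunk index -> per-node partial sums
    final = [0.0] * length
    schedule: List[List[str]] = []
    # One extra iteration drains the last chunk out of the pipeline.
    for it in range(len(chunks) + 1):
        step: List[str] = []
        if it < len(chunks):
            lo, hi = chunks[it]
            # Intra-node reduce of chunk `it` (XPMEM/SHMEM on a real system).
            pending[it] = [[sum(p[k] for p in procs) for k in range(lo, hi)] for procs in nodes]
            step.append(f"intra-node reduce chunk {it}")
        if it >= 1:
            lo, hi = chunks[it - 1]
            # Inter-node Allreduce of the previous chunk, then its broadcast.
            final[lo:hi] = [sum(vals) for vals in zip(*pending.pop(it - 1))]
            step.append(f"inter-node allreduce chunk {it - 1}")
            step.append(f"bcast chunk {it - 1}")
        schedule.append(step)
    return final, schedule


data = [[[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]],
        [[2.0, 2.0, 2.0, 2.0], [0.0, 1.0, 0.0, 1.0]]]
result, schedule = pipelined_allreduce(data, chunk_size=2)
print(result)       # [4.0, 6.0, 6.0, 8.0]
print(schedule[1])  # ['intra-node reduce chunk 1', 'inter-node allreduce chunk 0', 'bcast chunk 0']
```

Smaller chunks give the pipeline more stages to overlap but add per-chunk overhead, so `chunk_size` is a tuning knob in this sketch.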

