# **LOWAIN Project** Low Arithmetic INtensity specific architectures

# Running HPCG is computationally inefficient



## LOWAIN assumptions and goals

#### **LOWAIN assumptions:**

- a simulation specific architecture is economically justified
- most simulation programs behave in a way similar to HPCG

# LOWAIN goal: "Exascale-equivalent" computer

| Performance estimate [PFlop/s] | Summit-like exascale | "Exascale-equivalent" |
|--------------------------------|----------------------|-----------------------|
| DP peak                        | 1000                 | 30-50                 |
| SP peak                        | 2000                 | 60-100                |
| DP HPCG                        | ~15                  | ~15                   |
| Simulations (DP)               | ~15-30               | ~15-30                |
| Simulations (SP)               | ~50-60               | ~50-60                |

#### F/B of Matrix-Vector Product

$$A_0 = M_{00}*a_0 + M_{01}*a_1 + M_{02}*a_2 + M_{03}*a_3$$

$$A_1 = M_{10}*a_0 + M_{11}*a_1 + M_{12}*a_2 + M_{13}*a_3$$

$$A_2 = M_{20}*a_0 + M_{21}*a_1 + M_{22}*a_2 + M_{23}*a_3$$

$$A_3 = M_{30}*a_0 + M_{31}*a_1 + M_{32}*a_2 + M_{33}*a_3$$

Each matrix element is used only once (all accesses result in cache misses).

Only two operations (MPY and ADD) are done with any **non-zero** matrix element (vector loads not considered).

Flop/Byte of the matrix-vector product: 2 operations per 8-byte number, so the ratio is at most **0.25**.

DP HPCG Flop/Byte ratio is similar
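The arithmetic behind this bound can be sketched in a few lines. This is only an illustration of the counting argument above: the function name `matvec_flop_per_byte` and its parameters are hypothetical, and the optional `vector_bytes` term is an added assumption to show that counting vector traffic can only lower the ratio.

```python
def matvec_flop_per_byte(nonzeros, bytes_per_number=8, vector_bytes=0):
    """Flop/Byte of a matrix-vector product.

    Each non-zero matrix element is loaded once and takes part in
    one multiplication and one addition (2 flops).
    vector_bytes is optional extra traffic for the vectors (assumption).
    """
    flops = 2 * nonzeros
    bytes_moved = nonzeros * bytes_per_number + vector_bytes
    return flops / bytes_moved


# The 4x4 double-precision example: 16 matrix elements of 8 bytes each.
matrix_only = matvec_flop_per_byte(16)

# Same product, also counting one load of a_0..a_3 and one store of A_0..A_3.
with_vectors = matvec_flop_per_byte(16, vector_bytes=2 * 4 * 8)

print(matrix_only, with_vectors)
```

With 4-byte (single precision) numbers the same count gives 0.5, which matches the intuition that single precision halves the memory traffic per operation.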

# Poor HPCG behavior is caused by low Flop/Byte ratio

Peak, LINPACK, and HPCG performance of the Top10 supercomputers (November 2018) - a graph and a table

| Processor                 | Memory bandwidth [GB/s] | Enough data for DP HPCG [GFlop/s] | Peak performance [GFlop/s] | Bound to efficiency [%] |
|---------------------------|-------------------------|-----------------------------------|----------------------------|-------------------------|
| NVIDIA Volta-100          | 900                     | 0.25*900=225                      | 7800                       | 2.88                    |
| Volta-100/NVLink          | 300                     | 0.25*300= 75                      | 7800                       | 0.96                    |
| Intel Xeon Phi "KNL"      | 480+120                 | 0.25*600=150                      | 3000                       | 5.00                    |
| KNL (using external DRAM) | 120                     | 0.25*120= 30                      | 3000                       | 1.00                    |

The processor-memory bandwidth performance limit and the peak performance
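The table entries follow from two multiplications, which the short sketch below reproduces. The bandwidth and peak figures are the ones quoted in the table; the function name `bandwidth_bound` and the data layout are illustrative assumptions.

```python
FLOP_PER_BYTE = 0.25  # DP matrix-vector / HPCG-like ratio used in the table

# (processor, memory bandwidth [GB/s], DP peak [GFlop/s]) as quoted in the table
PROCESSORS = [
    ("NVIDIA Volta-100", 900, 7800),
    ("Volta-100/NVLink", 300, 7800),
    ('Intel Xeon Phi "KNL"', 480 + 120, 3000),
    ("KNL (using external DRAM)", 120, 3000),
]


def bandwidth_bound(bandwidth_gbs, peak_gflops, flop_per_byte=FLOP_PER_BYTE):
    """Return (performance bound [GFlop/s], bound as % of peak)."""
    bound_gflops = flop_per_byte * bandwidth_gbs
    return bound_gflops, 100 * bound_gflops / peak_gflops


for name, bandwidth, peak in PROCESSORS:
    bound, efficiency = bandwidth_bound(bandwidth, peak)
    print(f"{name:28s} {bound:6.0f} GFlop/s  {efficiency:5.2f} %")
```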

#### The first LOWAIN phase

The processor peak performance cannot be fully used.

The LOWAIN program suggests reducing the computing power and/or the number of cores of processors. The first LOWAIN research goal is to determine by how much, by measuring the Flop/Byte ratio of simulation programs.

### Exploiting Flop/Byte ratio

| Computer | Processor              | Bound to efficiency [%] | HPCG efficiency [%] | Use of the memory bandwidth bound [%] |
|----------|------------------------|-------------------------|---------------------|---------------------------------------|
| SX-ACE   | NEC SX-ACE             | 25                      | 11                  | 44                                    |
| K        | Fujitsu SPARC64 VIIIfx | 12                      | 6                   | 50                                    |
| Cori     | Intel Xeon Phi "KNL"   | 5.0                     | 1.5                 | 30                                    |
| Summit   | NVIDIA Volta-100       | 2.9                     | 1.5                 | 52                                    |

The percentage of the use of the memory bandwidth when running the HPCG
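The last column of the table is the ratio of the two efficiency figures before it. A minimal sketch of that calculation, using the values quoted in the table (the name `bound_utilization` is a hypothetical helper, not part of any LOWAIN tool):

```python
# computer: (memory-bandwidth bound on efficiency [%], efficiency reached by HPCG [%])
SYSTEMS = {
    "SX-ACE": (25, 11),
    "K": (12, 6),
    "Cori": (5.0, 1.5),
    "Summit": (2.9, 1.5),
}


def bound_utilization(bound_pct, measured_pct):
    """Share of the memory-bandwidth bound that is actually reached, in %."""
    return 100 * measured_pct / bound_pct


for computer, (bound, measured) in SYSTEMS.items():
    print(f"{computer:7s} {round(bound_utilization(bound, measured))} %")
```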

#### The second LOWAIN phase

The real processor simulation performance is substantially worse than the memory bandwidth upper bound.

The LOWAIN project suggests using an intelligent memory controller to make full use of the memory bandwidth upper bound.


#### Weather Research & Forecast



Flop/Byte of Microphysics Driver of Weather Research & Forecast as a function of the cache size (Single Precision configuration)
A LOWAIN 1st phase result; input "Central Europe, June 6, 2013"
Presented at General Assembly of European Geosci. Union, April 20

# Features of the Exascale-Equivalent Architecture

# Very wide and fast memory bus

to guarantee very high memory bandwidth. It is not the goal of LOWAIN to prepare a HW design of a high-bandwidth memory bus, but to suggest measures to use a given bus optimally.

## **Reduced Number and/or Power of Processor Cores**


Just as many cores as the memory bandwidth can keep busy.

- Simpler and cheaper processors and/or more space for caches
- Optionally using a less advanced CMOS process
- Higher ..., lower leakage, etc.
- Using the 28 nm CMOS process mastered in Europe to make a fully European processor
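As a rough illustration of "just as many cores as the memory bandwidth can keep busy", the core count can be estimated from the bandwidth, the Flop/Byte ratio of the workload, and the sustained performance of one core. The function name and the per-core figure of 10 GFlop/s are hypothetical values chosen only for this example; the 900 GB/s and 0.25 Flop/Byte figures are the ones used elsewhere on this poster.

```python
import math


def cores_to_saturate(bandwidth_gbs, flop_per_byte, core_gflops):
    """Smallest number of cores whose combined performance covers
    what the memory bandwidth can feed (all inputs are estimates)."""
    sustainable_gflops = bandwidth_gbs * flop_per_byte
    return math.ceil(sustainable_gflops / core_gflops)


# 900 GB/s at 0.25 Flop/Byte feeds 225 GFlop/s;
# with an assumed 10 GFlop/s per core this needs 23 cores.
needed = cores_to_saturate(900, 0.25, 10.0)
print(needed)
```

Any cores beyond this number would wait for data most of the time, which is the motivation for spending the silicon on caches or on a cheaper process instead.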

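Example loop and the load/store stream it generates (loads marked `<-`, stores marked `->`):

    DO I=1,X
      A(2*I) = B(I)
      C(I+1) = D(I)+2
    ENDDO

`<-B(1)`, `A(2)->`, `<-D(1)`, `C(2)->`, `<-B(2)`, ...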

## **Intelligent Memory Controller**

Necessary to use the limited memory bandwidth efficiently. The standard pre-fetching and cache-miss procedures are too weak to take full advantage of simulation-specific features.

**Main Program Backbone**

An off-processor controller runs the load/store backbone of the main program to deliver a data stream to/from the processor optimally and just in time, with very limited communication with the program cores. The present LOWAIN research shows that, in simulation programs, the backbone can run well ahead of the main program most of the time, and hence it has enough time to prepare the data flow for the processor.
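One way to picture the backbone is as the main program with all arithmetic stripped out, leaving only the sequence of loads and stores. The sketch below does this for the example loop `DO I=1,X; A(2*I) = B(I); C(I+1) = D(I)+2; ENDDO`. It is an illustration only, not the LOWAIN controller: the generator name `backbone` and the tuple encoding of memory operations are assumptions made for this example.

```python
from itertools import islice


def backbone(x):
    """Yield the load/store stream of the loop
        DO I=1,X
          A(2*I) = B(I)
          C(I+1) = D(I)+2
        ENDDO
    as (operation, array, index) tuples, without doing any arithmetic."""
    for i in range(1, x + 1):
        yield ("load", "B", i)
        yield ("store", "A", 2 * i)
        yield ("load", "D", i)
        yield ("store", "C", i + 1)


# The addresses depend only on the loop index, never on the loaded data,
# so this stream can be produced arbitrarily far ahead of the computation.
first_ops = list(islice(backbone(1000), 5))
print(first_ops)
```

Because no address in this loop depends on a computed value, a controller running such a stream needs no feedback from the cores, which is consistent with the "very limited communication" described above.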

# Pursued Approach and Methodology


### 1st Phase

Standard profiling tools can measure execution times, the number of executed operations, and the number of loads/stores across the processor-memory interface, from which the Flop/Byte ratio of the studied programs can be determined. However, the number of loads/stores across the processor-memory interface depends on the cache sizes, which are fixed when profiling on a given computer. Therefore, an emulator of plain or optimized code with variable cache size is being developed to measure exactly how the Flop/Byte ratio depends on the cache size.
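The idea of measuring Flop/Byte as a function of cache size can be illustrated with a toy model. The sketch below is not the LOWAIN emulator. It assumes a single fully associative LRU cache with 64-byte lines, counts only loads (no write-backs), and uses a dense double-precision matrix-vector product as the traced kernel; all names are hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache-line size


def count_misses(addresses, cache_bytes, line_bytes=LINE_BYTES):
    """Count misses of a fully associative LRU cache for a byte-address trace."""
    capacity = max(1, cache_bytes // line_bytes)
    cache = OrderedDict()  # keys are line numbers, least recently used first
    misses = 0
    for address in addresses:
        line = address // line_bytes
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def matvec_trace(n, word=8):
    """Byte addresses loaded by y = M x for a dense n x n DP matrix
    stored row by row, with x placed right after M (stores to y omitted)."""
    x_base = n * n * word
    for i in range(n):
        for j in range(n):
            yield (i * n + j) * word  # load M[i][j]
            yield x_base + j * word   # load x[j]


def flop_per_byte(n, cache_bytes):
    """Flops divided by bytes fetched from memory for the traced kernel."""
    flops = 2 * n * n  # one multiply and one add per matrix element
    traffic = count_misses(matvec_trace(n), cache_bytes) * LINE_BYTES
    return flops / traffic


for size in (128, 1024, 1 << 20):
    print(f"cache {size:8d} B: {flop_per_byte(64, size):.3f} Flop/Byte")
```

In this toy model the ratio rises once the cache is large enough to keep the vector resident, and then saturates just below 0.25 because every matrix element still has to be fetched once.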

### 2nd Phase

- Study the patterns of processor-memory data traffic that are specific to the computer simulations listed above, and use them to design memory handling algorithms.
- Extend the emulator developed in the first phase to study the behavior and properties of different intelligent memory controllers implementing the algorithms of the previous item.
- Insert a low-level model of the RISC-V architecture into the emulator to verify details of the LOWAIN processor design.

# The LOWAIN Project Roadmap

# 1st Phase (Jan 2019 - June 2020)

An emulator with variable cache size (June 2019)

Analysis of NWP & climate programs (WRF, RegCM, ECMWF, ESiWACE)

Analysis of **CFD programs** (OpenFOAM, NEK5000, Fluent)

Analysis of mechanical deformation programs (PAM-Crash)

Analysis of other simulation programs (combustion,...)

# 2nd Phase (Jan 2020 - June 2021)

Study of patterns of processor-memory data traffic (June 2020)

The 1st phase emulator extended to a model with an abstract memory controller (Oct 2020)

Evaluation of variants of smart memory controllers (Feb 2021)

Extension of the model by a low level model of RISC-V cores for final verification of the computer

# The LOWAIN Project Consortium

Czech Technical University, Faculty of Information Technologies (Coordinator)

Czech Technical University, Faculty of Mechanical Engineering

Charles University, Department of Atmospheric Physics

Charles University, Department of Applied Mathematics

Skoda Auto, a.s.

Mecas ESI (suggested)

Codasip, s.r.o. (2nd phase)

# **Poster Author**

Ludek Kucera, LOWAIN Project, Czech Technical University & Charles University, Prague, Czech Republic, ludek@kam.mff.cuni.cz