# A First-Principles Approach to Performance and Power Models for Contemporary Multi- and Many-Core Processors

Johannes Hofmann <johannes.hofmann@fau.de>, Computer Architecture, Friedrich-Alexander-University Erlangen-Nuremberg

Vision: Understand the performance, power, and energy properties of modern processors and recommend best practices

# Improve Execution-Cache-Memory (ECM) Performance Model<sup>[1-4]</sup>

- Separate general principles from microarchitecture-dependent behavior by splitting the model into a general ECM model and ECM machine models
- Model additional hardware designs
  - Write-through caches
  - Streaming stores
  - Cluster-on-Die mode
  - Separate frequency domains
  - Victim caches
  - Partially and fully overlapping data transfers
  - Dynamic Uncore frequency
- Devise machine models for
  - Intel Haswell-EP and Broadwell-EP
  - Intel Skylake-SP
  - IBM POWER8
  - AMD Zen (Ryzen, Epyc)
- Reduce ECM model error near the saturation point by an order of magnitude

# **Quantitative Power Model**<sup>[5]</sup>

Quantitative power estimate as a function of active core count $n$, CPU core frequency, and Uncore frequency for arbitrary scalable and saturating steady-state codes

- Full chip power is the sum of baseline power and aggregate core power:  $P_{\rm chip}(f_{\rm Uncore}, f_{\rm core}, n, \varepsilon_n) = P_{\rm base}(f_{\rm Uncore}) + nP_{\rm core}(f_{\rm core}, \varepsilon_n)$
- Baseline power depends on the processor's Uncore frequency:  $P_{\text{base}}(f_{\text{Uncore}}) = p_0^{\text{base}} + p_1^{\text{base}} f_{\text{Uncore}} + p_2^{\text{base}} f_{\text{Uncore}}^2$
- Per-core power depends on the CPU core frequency and the code's scalability $\varepsilon_n$:  $P_{\text{core}}(f_{\text{core}},\varepsilon_n) = \left(p_0^{\text{core}} + p_1^{\text{core}}f_{\text{core}} + p_2^{\text{core}}f_{\text{core}}^2\right) \cdot \varepsilon_n^{\alpha}$

- Validated on Intel Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and AMD Zen
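The three power equations can be evaluated directly once the fit parameters are known. The following is a minimal sketch, not the author's implementation: the coefficient values, the exponent `ALPHA`, and the names `Quadratic` and `chip_power` are hypothetical and chosen only for illustration, with frequencies in GHz and power in W.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadratic:
    """Coefficients of p0 + p1*f + p2*f**2 (f in GHz, result in W)."""
    p0: float
    p1: float
    p2: float

    def __call__(self, f: float) -> float:
        return self.p0 + self.p1 * f + self.p2 * f * f


# Hypothetical coefficients for illustration only (not measured values).
BASE = Quadratic(p0=15.0, p1=2.0, p2=1.0)   # baseline power vs. Uncore frequency
CORE = Quadratic(p0=1.0, p1=1.5, p2=1.0)    # per-core power vs. core frequency
ALPHA = 0.5                                  # hypothetical scalability exponent


def chip_power(f_uncore, f_core, n, eps_n=1.0,
               base=BASE, core=CORE, alpha=ALPHA):
    """P_chip = P_base(f_Uncore) + n * P_core(f_core, eps_n)."""
    p_core = core(f_core) * eps_n ** alpha
    return base(f_uncore) + n * p_core


# Eight active cores at 2.0 GHz core and Uncore frequency.
print(chip_power(2.0, 2.0, 8))              # scalable code, eps_n = 1
print(chip_power(2.0, 2.0, 8, eps_n=0.25))  # code with reduced scalability
```

Keeping the baseline and core polynomials as separate objects mirrors the split into Uncore-dependent and core-dependent terms, so either can be refitted without touching the other.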

## Model setup

1. Derive the baseline power for each Uncore frequency by linearly extrapolating the measured chip power towards zero active cores
2. Fit the baseline power parameters to the data obtained in step 1
3. Derive the core power contribution from the empirical data by subtracting the baseline estimate from step 2, then fit the core parameters
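The three setup steps can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the procedure used for the published results: the "measurements" are synthesized from hypothetical coefficients (`TRUE_BASE`, `TRUE_CORE`), the code is assumed to scale perfectly ($\varepsilon_n = 1$), and `polyfit` is a small helper written here that solves the least-squares normal equations.

```python
def polyfit(xs, ys, deg):
    """Least-squares polynomial fit; returns [c0, c1, ..., c_deg]."""
    m = deg + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y*x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        s = sum(a[i][k] * coeffs[k] for k in range(i + 1, m))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def quad(p, f):
    return p[0] + p[1] * f + p[2] * f * f


# Hypothetical coefficients, used only to synthesize measurement data.
TRUE_BASE = (15.0, 2.0, 1.0)
TRUE_CORE = (1.0, 1.5, 1.0)


def measure(f_uncore, f_core, n):
    """Stand-in for a chip power measurement of a scalable code."""
    return quad(TRUE_BASE, f_uncore) + n * quad(TRUE_CORE, f_core)


CORES = list(range(1, 9))
F_UNCORE = [1.2, 1.6, 2.0, 2.4, 2.8]   # GHz
F_CORE = [1.2, 1.6, 2.0, 2.4]          # GHz
F_REF = 2.0                            # frequency held fixed in each sweep

# Step 1: per Uncore frequency, extrapolate chip power linearly to n = 0.
baseline = []
for fu in F_UNCORE:
    power = [measure(fu, F_REF, n) for n in CORES]
    intercept, _slope = polyfit(CORES, power, 1)
    baseline.append(intercept)

# Step 2: fit the quadratic baseline model to the extrapolated values.
base_fit = polyfit(F_UNCORE, baseline, 2)

# Step 3: subtract the baseline estimate, average per core, fit core model.
p_base_ref = quad(base_fit, F_REF)
per_core = []
for fc in F_CORE:
    samples = [(measure(F_REF, fc, n) - p_base_ref) / n for n in CORES]
    per_core.append(sum(samples) / len(samples))
core_fit = polyfit(F_CORE, per_core, 2)

print("baseline parameters:", base_fit)
print("core parameters:", core_fit)
```

With noise-free synthetic data the fit recovers the coefficients it was generated from; with real measurements the residuals of steps 2 and 3 give a first indication of model error. Fitting the exponent $\alpha$ for saturating codes is not covered by this sketch.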

Figure: Model validation. Measurement versus model for DGEMM and STREAM on Xeon E5-2680 at $f_{\rm core}$ = 2.7, 2.2, and 1.7 GHz.

Relative model error [%] for DGEMM on Xeon E5-2680

| Frequency [GHz] |     |     |     |     |     |     |     |     |
|-----------------|-----|-----|-----|-----|-----|-----|-----|-----|
| 2.7             | 0.8 | 0.1 | 0.1 | 0.8 | 0.6 | 0.5 | 0.8 | 1.8 |
| 2.5             | 0.8 | 0.1 | 0.2 | 0.3 | 0.2 | 0.3 | 1.3 | 1.3 |
|                 | 0.6 | 0.0 | 0.4 | 0.0 | 0.3 | 0.0 | 0.8 | 0.9 |
| 2.3             | 1.1 | 0.7 | 0.1 | 0.3 | 0.0 | 0.1 | 0.5 | 0.8 |
| 2.2             | 1.0 | 0.8 | 0.8 | 0.5 | 0.1 | 0.4 | 0.7 | 0.6 |
|                 | 1.0 | 0.4 | 0.1 | 0.8 | 0.0 | 0.4 | 0.7 | 0.3 |
| 1.9             | 0.6 | 1.0 | 0.8 | 0.6 | 0.1 | 0.9 | 0.2 | 0.9 |
| 1.8             | 1.1 | 1.0 | 1.1 | 0.9 | 0.9 | 0.8 | 0.5 | 0.9 |
| 1.7             | 0.7 | 0.0 | 0.1 | 0.3 | 0.6 | 0.5 | 0.2 | 0.8 |
|                 | 1.0 | 0.0 | 0.7 | 0.6 | 0.8 | 1.3 | 0.3 | 1.0 |



## **Derived Energy Model**<sup>[5]</sup>

An analytic energy model can be derived by combining the ECM model estimate  $\Pi_{\rm ECM}$  and the power model estimate  $P_{\rm chip}$ 

$$E(f_{\text{Uncore}}, f_{\text{core}}, n) = \frac{P_{\text{chip}}(f_{\text{Uncore}}, f_{\text{core}}, n, \varepsilon_n)}{\Pi_{\text{ECM}}(f_{\text{Uncore}}, f_{\text{core}}, n)}$$

## Analytic deductions

Minimum energy w.r.t. number of active cores $n$

- Scalable codes: use all available cores

$$\frac{\partial E}{\partial n} = -\frac{P_{\text{base}}(f_{\text{Uncore}})}{n^2 \cdot \Pi_{\text{ECM}}(f_{\text{Uncore}}, f_{\text{core}}, 1)} < 0$$
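One way to see the sign is sketched here under the assumption of perfect scaling, i.e. $\varepsilon_n = 1$ and $\Pi_{\rm ECM}(f_{\rm Uncore}, f_{\rm core}, n) = n \cdot \Pi_1$, where $\Pi_1$ is a shorthand introduced here for the single-core performance $\Pi_{\rm ECM}(f_{\rm Uncore}, f_{\rm core}, 1)$:

$$E(n) = \frac{P_{\rm base}(f_{\rm Uncore}) + n P_{\rm core}(f_{\rm core}, 1)}{n\,\Pi_1} = \frac{P_{\rm base}(f_{\rm Uncore})}{n\,\Pi_1} + \frac{P_{\rm core}(f_{\rm core}, 1)}{\Pi_1}$$

$$\frac{\partial E}{\partial n} = -\frac{P_{\rm base}(f_{\rm Uncore})}{n^2\,\Pi_1} < 0$$

Only the baseline term depends on $n$: each additional core amortizes the baseline power over more work, so energy falls monotonically up to the full core count.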

### Model validation

Figure: timeline panels for core 0, core 1, core 2, and MEM (x-axis: time).



## **Empirical observations and best practices**

- Saturating codes: to minimize energy, use only the number of cores $n$ required to saturate the bottleneck

$$E = \frac{P_{\text{base}}(f_{\text{Uncore}}) + n\varepsilon_n^{\alpha} \cdot P_{\text{core}}(f_{\text{core}}, 1)}{\Pi_{\text{Sat}}}$$
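A small numerical sketch illustrates why the minimum sits at the saturation point. It rests on assumptions not stated on the poster: performance is simplified to $\min(n \cdot \Pi_1, \Pi_{\rm Sat})$ rather than the full ECM estimate, $\varepsilon_n$ is taken to be the parallel efficiency $\Pi(n) / (n \cdot \Pi_1)$, and all numbers and function names are hypothetical.

```python
def performance(n, perf_single, perf_sat):
    """Simplified scaling: linear in n until the bottleneck saturates."""
    return min(n * perf_single, perf_sat)


def energy(n, p_base, p_core_single, perf_single, perf_sat, alpha):
    """Energy per unit of work: chip power divided by performance."""
    perf = performance(n, perf_single, perf_sat)
    # Assumption: eps_n is the parallel efficiency relative to one core.
    eps_n = perf / (n * perf_single)
    power = p_base + n * p_core_single * eps_n ** alpha
    return power / perf


# Hypothetical values for illustration only.
P_BASE = 23.0       # W, baseline power at fixed Uncore frequency
P_CORE = 8.0        # W, per-core power at eps_n = 1
PERF_SINGLE = 1.0   # work units per second on one core
PERF_SAT = 4.0      # bottleneck limit, reached with four cores
ALPHA = 0.5         # hypothetical scalability exponent

energies = {
    n: energy(n, P_BASE, P_CORE, PERF_SINGLE, PERF_SAT, ALPHA)
    for n in range(1, 9)
}
n_opt = min(energies, key=energies.get)

for n, e in energies.items():
    print(f"n={n}: E={e:.2f}")
print("energy-optimal core count:", n_opt)
```

Below saturation, extra cores amortize the baseline power; beyond it, performance stays at $\Pi_{\rm Sat}$ while the numerator keeps growing, so energy rises again.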

Minimum energy w.r.t. CPU core frequency $f_{\rm core}$

- The optimum frequency depends on the clock domains; with core and Uncore in a common clock domain:

$$f_{\rm core}^{\rm opt}(n) = \sqrt{\frac{p_0^{\rm base} + np_0^{\rm core}}{p_2^{\rm base} + np_2^{\rm core}}}$$
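The expression can be recovered with a short derivation, sketched here under assumptions that are not spelled out on the poster: core and Uncore share one clock $f$, the code scales perfectly ($\varepsilon_n = 1$), and performance is proportional to frequency, $\Pi = n f \Pi_0$, where $\Pi_0$ is a hypothetical constant denoting per-core work per cycle. With $a_i = p_i^{\rm base} + n p_i^{\rm core}$:

$$E(f) = \frac{a_0 + a_1 f + a_2 f^2}{n f \Pi_0}$$

$$\frac{\partial E}{\partial f} = \frac{1}{n \Pi_0}\left(-\frac{a_0}{f^2} + a_2\right) = 0 \quad\Longrightarrow\quad f^{\rm opt} = \sqrt{\frac{a_0}{a_2}} = \sqrt{\frac{p_0^{\rm base} + np_0^{\rm core}}{p_2^{\rm base} + np_2^{\rm core}}}$$

The linear coefficients $p_1$ drop out because their contribution to $E$ does not depend on $f$. The second derivative $2a_0 / (n \Pi_0 f^3)$ is positive for $a_0 > 0$, so the stationary point is a minimum.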






#### References

- [1] J. Hofmann, D. Fey, M. Riedmann, J. Eitzinger, G. Hager, and G. Wellein: Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors. Concurrency and Computation: Practice and Experience, ISSN: 1532-0634
- [2] J. Hofmann, D. Fey, J. Eitzinger, G. Hager, and G. Wellein: Analysis of Intel's Haswell Microarchitecture Using the ECM Model and Microbenchmarks. Architecture of Computing Systems - ARCS 2016: 29th International Conference, Nuremberg, Germany, April 4-7, 2016
- [3] J. Hofmann, D. Fey: An ECM-based energy-efficiency optimization approach for bandwidth-limited streaming kernels on recent Intel Xeon processors. 4th International Workshop on Energy Efficient Supercomputing, Salt Lake City, UT, USA, November 14, 2016
- [4] J. Hofmann, G. Hager, G. Wellein, D. Fey: An analysis of core- and chip-level architectural features in four generations of Intel server processors. High Performance Computing: 32nd International Conference, ISC High Performance 2017, Frankfurt, Germany, June 18-22, 2017
- [5] J. Hofmann, G. Hager, D. Fey: On the accuracy and usefulness of analytic energy models for contemporary multicore processors. Accepted for High Performance Computing: 33rd International Conference, ISC High Performance 2018, Frankfurt, Germany, June 24-28, 2018