BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200227T164259Z
LOCATION:Analog 1\, 2
DTSTART;TZID=Europe/Stockholm:20190619T101000
DTEND;TZID=Europe/Stockholm:20190619T110000
UID:isc_hpc_ISC High Performance 2019_sess251_post111@linklings.com
SUMMARY:(RP28) Performance Tuning of Deep Learning Framework Chainer on th
e K Computer.
DESCRIPTION:HPC in Asia\n\n(RP28) Performance Tuning of Deep Learning
 Framework Chainer on the K Computer.\n\nKuroda\, Kumahata\, Chiba\,
 Takashina\, Minami\n\nRecently\, applications and research in machine
 learning by deep learning have become popular using GPUs. However\, it
 seems possible to perform many of these calculations on the CPUs of
 massively parallel computers. Here\, we introduce some performance
 tuning procedures for Chainer\, a representative machine learning
 framework\, on the K computer.\nChainer expresses the hierarchical
 structure of deep learning in Python\, and all calculations can be
 realized with NumPy without special libraries. By optimizing the
 floating-point underflow exception handling when building Python\, the
 elapsed time was reduced to 1/3.39. Moreover\, by replacing the SSL2
 gemm library called from Python with its thread-parallel version\, the
 section elapsed time was reduced to 1/4.54\, the total elapsed time to
 1/1.15\, and the performance efficiency was improved by about
 47.0%.\nMuch of the remaining cost was in the square-root and
 arithmetic operations performed during filter updates and in the
 activation functions. These operations are not optimized when
 calculated with NumPy and are particularly slow on the K computer. By
 replacing these kernels with a Fortran library applying software
 pipelining and SIMD optimization\, the kernel elapsed time was reduced
 to 1/11.08 and the total elapsed time to 1/16.23.\nThere are some
 limitations on the use of Chainer on the K computer. However\, deep
 learning calculations have become possible on the K computer and the
 Post-K computer using these tunings and the CPU-parallel version of
 Chainer.\n\nPasses: Conference Pass\, AI/Machine Learning/Deep
 Learning\, Performance Analysis and Optimization\n\nTag: Conference
 Pass\, AI/Machine Learning/Deep Learning\, Performance Analysis and
 Optimization
URL:https://2019.isc-program.com/presentation/?id=post111&sess=sess251
END:VEVENT
END:VCALENDAR