JUNE 18–22, 2017
FRANKFURT AM MAIN, GERMANY

Presentation Details

 
Name: (RP16) Optimizing Massive Data Access for Large Scale Population Genomics Analysis Using HDF5
 
Time: Tuesday, June 20, 2017
08:35 am - 09:45 am
 
Room:   Substanz 1+2  
 
Breaks: 07:30 am - 10:00 am Welcome Coffee
 
Presenter:   Hui Yan, NSCCGZ
 
Abstract:  
A growing volume of DNA sequencing data is being generated, which enables population-scale modeling for both scientific and clinical purposes. The traditional flat organization and layout of these data volumes does not fit large-scale analysis well. Genotype imputation analyzes the same genome region across all individuals, so a small portion of each of a large number of files must be read. This access pattern significantly increases the workload of the parallel file system and causes a performance bottleneck. To tackle this, the HDF5 file format is employed as a container for the raw data files. Each human chromosome is naturally stored in a single HDF5 file, and two layouts inside that file are proposed and tested. The first is one-dimensional, with data organized by individual/sample. The second is two-dimensional, with data organized by both fixed-size genome regions and individuals/samples. Our experiments show that both layouts improve performance significantly, with an observed speedup of 3.4x. The two-dimensional layout performs even better because a given region can be located directly. Our work thus relieves metadata congestion and improves data access performance.
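
The difference between the two layouts can be illustrated with a minimal sketch of which datasets inside a per-chromosome HDF5 file a region query would touch. This is not the authors' implementation: the region width `REGION_SIZE`, the path scheme (`/sample` and `/region_NNNNN/sample`), and all function names are hypothetical, chosen only to show the idea of fixed-size region binning.

```python
# Hypothetical sketch: the region width, path scheme, and function names
# are assumptions for illustration, not details from the presentation.
REGION_SIZE = 1_000_000  # assumed fixed region width in base pairs


def region_indices(start, end, region_size=REGION_SIZE):
    """Indices of the fixed-size regions overlapping the 1-based,
    inclusive coordinate range [start, end]."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    first = (start - 1) // region_size
    last = (end - 1) // region_size
    return list(range(first, last + 1))


def paths_1d(samples):
    """One-dimensional layout: one dataset per sample covering the whole
    chromosome, so a region query opens every sample's full dataset."""
    return [f"/{sample}" for sample in samples]


def paths_2d(samples, start, end, region_size=REGION_SIZE):
    """Two-dimensional layout: one dataset per (region, sample) pair, so a
    region query opens only the datasets of the overlapping regions."""
    return [
        f"/region_{index:05d}/{sample}"
        for index in region_indices(start, end, region_size)
        for sample in samples
    ]
```

Under these assumptions, a query for positions 2,500,000 to 2,600,000 resolves to region index 2 alone, so only the datasets under that region group would need to be read for each sample.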

Authors:
Junrong Yang, South China University of Technology
Peihao Liu, National University of Defense Technology
Guixin Guo, National Supercomputer Center in Guangzhou
Hanquan Liang, National Supercomputer Center in Guangzhou
BingQiang Wang, National Supercomputer Center in Guangzhou
Shoubin Dong, South China University of Technology
Yutong Lu, National Supercomputer Center in Guangzhou
 
 
Download

RP16_Yan.pdf (864 KB)