h5part AT lists.psi.ch

Subject: H5Part development and discussion

Re: [H5part] Performance

From: Achim Gsell <Achim.Gsell AT psi.ch>
To: h5part AT lists.psi.ch
Subject: Re: [H5part] Performance
Date: Wed, 19 Mar 2008 16:50:33 +0100
List-archive: <https://lists.web.psi.ch/pipermail/h5part/>
List-id: H5Part development and discussion <h5part.lists.psi.ch>
Organization: Paul Scherrer Institut

On Wednesday 19 March 2008 15:45, Andreas Adelmann wrote:

> > I recently ran some IO tests on a blue gene machine using 32-2048
> > processors for the writes of data. The tests were designed to mimic the
> > type of usage we expect from SPH codes which have approx 5000 particles
> > per processor (total in this case for 2048 nodes = 10 million).
> > Performance using H5Part was very slow compared to using a single block
> > write using raw HDF5 calls. The main difference was H5Part writes each
> > scalar array independently (more calls to write, more cache hit/miss
> > tests)

What do you mean by "writes each scalar array independently"? Are you
talking about independent I/O or the fact that we write one dataset per
scalar array? If we put all arrays into one big array, I/O will of course be
faster ...
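
To make the "one big array" idea concrete, here is a minimal sketch that interleaves several per-particle scalar arrays into one contiguous buffer, so a single block write could replace one write per array. It is plain Python and does not call H5Part or HDF5; the function name pack_fields and the field names x, y, z are assumptions made up for illustration.

```python
from array import array


def pack_fields(fields):
    """Interleave equal-length scalar arrays into one contiguous buffer.

    fields maps a (hypothetical) field name to a sequence of floats.
    Returns (names, buf), where buf stores the particles row by row:
    x0, y0, z0, x1, y1, z1, ...
    """
    names = list(fields)
    if not names:
        raise ValueError("at least one field is required")
    n = len(fields[names[0]])
    if any(len(fields[name]) != n for name in names):
        raise ValueError("all fields must have the same length")

    width = len(names)
    buf = array("d", [0.0]) * (n * width)  # zero-filled doubles
    for col, name in enumerate(names):
        # Every width-th slot, starting at col, belongs to this field.
        buf[col::width] = array("d", fields[name])
    return names, buf


names, buf = pack_fields({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})
print(names, list(buf))
```

With this layout the three fields become one buffer of 2 x 3 doubles. The price is that a reader needs the field order (names) to pull a single column back out.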

> > Has anyone done any performance tests/tuning on large numbers of
> > processors? Based on the results I have been getting, we will not use
> > H5Part on the Blue Gene (unless I can find a way of speeding it up).

Kurt Stockinger from LBNL ran some performance tests last year, but not on a
large number of nodes.

> > Thanks. (NB. I am in the process of collecting statistics which I'll
> > happily share with you all, but I would like to save myself the trouble
> > if any of you have done similar studies and published interesting
> > workarounds etc)

> Hello John, good to hear from you! Sorry for the delay, I am at Los Alamos
> right now!
> Attached is a paper with old numbers and an XLS sheet John Shalf has
> compiled comparing different architectures, i.e. machines (in the XLS file,
> HDF5 == H5Part).
> The data are not for very large numbers of nodes.

I cannot read the XLS file, but the numbers in the paper are misleading. The
raw-MPI tests use collective I/O and the H5Part tests use independent I/O.
Surprisingly, independent I/O is faster than collective I/O - this is very
strange and even the HDF5 guys have no idea why ... It is possible to compile
H5Part in a way that HDF5 uses collective I/O (add -DCOLLECTIVE_IO to the
CFLAGS). By tuning MPI (setting MPI hints) we were able to get the same I/O
rate for collective and independent I/O. But collective I/O was never faster.
As far as I remember, we ran tests with up to 256 nodes. It is possible that
H5Part behaves differently on a BlueGene and/or on a very large number of
nodes.

> However, you address a point that has to be looked at: in my opinion we
> have to avoid writing small chunks of data, and instead cache as much as
> possible and then write large chunks of data.

Nice idea, but it will not help ;-(

With a cache you cannot reduce the I/O: you still have to write the same
number of datasets - it's exactly the same number of HDF5 calls etc.
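
The counting argument can be sketched as a toy model. It assumes one dataset per (time step, scalar array) pair and one write call per dataset; the function count_dataset_writes and its parameters are hypothetical and do not model H5Part internals or actual I/O cost.

```python
def count_dataset_writes(num_steps, num_arrays, flush_every=1):
    """Count dataset writes when time steps are buffered before flushing.

    Assumption: the file layout stays one dataset per (step, array) pair,
    so every buffered step still needs num_arrays writes at flush time.
    """
    if flush_every < 1:
        raise ValueError("flush_every must be at least 1")
    writes = 0
    pending = 0  # steps held in the cache, not yet written
    for _ in range(num_steps):
        pending += 1
        if pending == flush_every:
            writes += pending * num_arrays
            pending = 0
    writes += pending * num_arrays  # flush whatever is left at the end
    return writes


# Same total whether we flush after every step or after every 4 steps.
print(count_dataset_writes(10, 6, flush_every=1))
print(count_dataset_writes(10, 6, flush_every=4))
```

Under these assumptions, buffering only changes when the writes happen, not how many there are. The count drops only if the layout itself changes, for example by merging several arrays into one dataset.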

Achim
--
Paul Scherrer Institut; Villigen; Schweiz
Allgemeine Informationstechnologie (AIT)

[H5part] Performance, John Biddiscombe, 03/19/2008
- Re: [H5part] Performance, Andreas Adelmann, 03/19/2008
  - Re: [H5part] Performance, Achim Gsell, 03/19/2008
    - Re: [H5part] Performance, John Biddiscombe, 03/19/2008