
h5part - Re: [H5part] SVN access available for contributions?


Re: [H5part] SVN access available for contributions?

  • From: Achim Gsell <Achim.Gsell AT>
  • To: h5part AT
  • Subject: Re: [H5part] SVN access available for contributions?
  • Date: Fri, 26 Jan 2007 16:17:16 +0100
  • List-archive: <>
  • List-id: H5Part development and discussion <>
  • Organization: Paul Scherrer Institut

On Friday 26 January 2007 06:54, John Shalf wrote:
> On Jan 25, 2007, at 1:42 AM, John Biddiscombe wrote:
> > John
> >>
> >> So I assume you want a convenience routine that can reassemble a
> >> list of arrays (stored on disk as scalar fields) and interlace
> >> them as vectors in memory. I assume you are *not* storing the
> >> data as N-component vector fields in the file though (am I
> >> correct?). If you've already done that, then it would be an
> >> excellent addition to the API. But we definitely want the disk
> >> image of the fields to be distinct vectors for various reasons.
> >
> >> So, to see if I understand this correctly, you would like a
> >> convenience function that allows you to specify a vector of
> >> dataset names (rather than a single name) and it would naturally
> >> interlace them in memory upon read? Is that the request? I think
> >> that should be pretty reasonable as well given HDF's support for
> >> strided memory spaces. Do you have a proposed appearance for this
> >> API? (something that doesn't use var_args since it would
> >> complicate the F90 bindings). Overall, this is also quite
> >> reasonable provided the on-disk image has them laid out as scalars
> >> (they can be reconstituted in memory as vectors).
> > I have already implemented the write and read back of N component
> > variables as N single components fields. I've tested it in parallel
> > with the combination of the memory space for the N-tuple arrays in
> > memory and the Dataspace for parallel IO and it's good.
> > I would like to contribute this to the main H5Part API. At the
> > moment, the writing out is fine, but I am still playing with the
> > read back and the interface is something along these lines
> > ReadNComponentArray(int NComponents, float/double/etc *dest, char
> > **arrayOfNames)
> > so the array of names chosen to be read is passed in and assumed to
> > be the same length as the number of components desired.
> That's great. We should get it integrated straight away.
Mmm, still thinking about this. I don't see the benefit of this yet; can you
give us an example where it is really useful and makes things faster or
simpler? We should keep the H5Part API as simple as possible and only add
functions that give us a real benefit.
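To make the question concrete, here is a minimal sketch (plain C, illustrative only; `interlace` is a hypothetical helper, not the proposed H5Part signature) of the in-memory interlacing such a convenience routine would perform after reading the N scalar component arrays from disk:

```c
/* Hypothetical helper: weave N scalar component arrays (as stored on
 * disk) into one interleaved vector array in memory, i.e.
 * x0 y0 z0 x1 y1 z1 ...  With HDF5, a strided memory space could do
 * this directly during the read instead of a copy loop. */
static void interlace(int ncomp, long npoints,
                      const double *const *scalars, double *dest)
{
    for (long i = 0; i < npoints; i++)
        for (int c = 0; c < ncomp; c++)
            dest[i * ncomp + c] = scalars[c][i];
}
```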

> >> The current sorting algorithm for the file format will be able to
> >> accommodate the different numbering formats you propose, so your
> >> proposed change would be backward compatible with existing readers
> >> (that's a good thing). So this addition could be implemented as a
> >> convention rather than a requirement.
> > OK. I'm not familiar with existing sorting algorithms in the
> > context of H5Part. I find that I often browse files using NCSA's (I
> > think) HDF5 viewer package and it lists things using a straight
> > alphabetical sort.
> It's not so much for the viewer as for how the API sorts the steps as
> a sequence. The algorithm currently used within the API to sort the
> steps for reading will accept either format (the sorting algorithm
> used by h5dump and such is more picky).
> >> We do encourage liberal use of attributes to serve the individual
> >> needs of groups though, so you should definitely implement storage
> >> of the TimeValue attribute. We should probably document the
> >> attributes that various groups have proposed for their own local
> >> conventions.
> > OK.
> >
> >>
> >> When I say "convention" I mean additional features that can be
> >> used to extend the content of the file format using attributes.
> >> When it is a convention, then readers can be coded to look for
> >> them for added value, but they should also be prepared for their
> >> absence. We are trying to minimize the "requirements" so as to
> >> keep the readers as simple as possible.
> > I do think that a primary dataset group "Name"/prefix should be a
> > requirement. You already have backward incompatibility between
> > "particles1" and "steps1" - had this been a requirement previously,
> > then the files would be compatible. ($0.02)
> Well "particlesX" was in the prototype (not the production release
> per se).
> The dataset group must be a requirement because it is the schema for
> the object storage format (irrespective of whether HDF5 is the
> underlying storage format). Attributes are quite different from data
> schemas.

As far as I can see, this can be implemented without problems. The name (or
better, the format) of the dataset group can be defined with a special file
attribute. If this attribute is missing, the default Step#1 ... will be used.

The name of the dataset group must be defined after opening the file for
writing, but before writing the first data. When opening the file, we can
check the special file attribute and set the dataset group name (format) in
the internal file structure. The API function could be something like:

h5part_int64_t H5PartDefineStepName (
        H5PartFile *f,
        const char *name,
        int stepno_width );

With

H5PartDefineStepName( f, "Step", 6 );

we define dataset group names Step#000001, Step#000002, ...
and with

H5PartDefineStepName( f, "Step", 0 );

we have Step#1, Step#2, ...
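As a sketch of how the (name, width) pair could map onto group names (illustrative only; `step_group_name` is a hypothetical helper, not part of the proposed API):

```c
#include <stdio.h>

/* Hypothetical helper: build the dataset group name for one step.
 * width > 0 zero-pads the step number to that many digits;
 * width == 0 prints it unpadded. */
static void step_group_name(char *buf, size_t len,
                            const char *name, int width, long step)
{
    if (width > 0)
        snprintf(buf, len, "%s#%0*ld", name, width, step);
    else
        snprintf(buf, len, "%s#%ld", name, step);
}
```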

> >> The decision to go with limited type support for the API was for
> >> two reasons
> > understood, I'd still like to add prototypes for the main types
> > commonly supported on all platforms. Not complex user defined
> > structures.
> No problem. We expected to expand the API on an as-needed basis as
> more types were encountered. So if you are encountering those types,
> then the API should be expanded to accommodate.

Introducing data types shorter than 64 bit makes sense if we have a large
amount of data of this type (compared to 64-bit data): if we have 90% 32-bit
data and 10% 64-bit data, it might be a good idea to store 32-bit as 32-bit.
If we have only 10% 32-bit data but 90% 64-bit, there is little or no
advantage. We should implement new data types only if there is a real
benefit (performance, file size).
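A quick back-of-the-envelope calculation makes that trade-off concrete (illustrative sketch; the function and numbers are not from the original mail):

```c
/* Relative file-size overhead of storing everything as 64-bit when a
 * fraction frac32 of the values would fit into 32 bits.  With 90%
 * 32-bit data, the all-64-bit file is ~82% larger than needed; with
 * only 10% 32-bit data, it is only ~5% larger. */
static double overhead_all64(double frac32)
{
    /* average bytes per value with mixed 32/64-bit storage */
    double needed = frac32 * 4.0 + (1.0 - frac32) * 8.0;
    return 8.0 / needed - 1.0;
}
```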

> >> For various reasons, it is better for us to keep each vector
> >> component as separate scalar arrays on disk
> > I'm happy with this and am already doing it.
> >
> > I did find some other bugs which I'm hoping have been fixed. I had
> > problems when I compiled the code with parallel support, but was
> > not using Parallel IO and put a couple of extra checks in. I also
> > found a bug when the number of particles is dynamic and new mem/
> > data spaces are needed which wasn't handled correctly.
> I think those are bugs that Achim has dealt with. Also, Achim made
> the error checking far more robust.

Please check whether the bugs are fixed; otherwise, send me a bug report or
fixes.


PS: I will be offline from 2007-01-28 until 2007-02-08 (cross-country
skiing :-) )
