h5part AT lists.psi.ch

Subject: H5Part development and discussion

Re: [H5part] SVN access available for contributions?

From: John Shalf <jshalf AT lbl.gov>
To: John Biddiscombe <biddisco AT cscs.ch>
Cc: h5part AT lists.psi.ch
Subject: Re: [H5part] SVN access available for contributions?
Date: Wed, 24 Jan 2007 22:14:13 -0800
List-archive: <https://lists.web.psi.ch/pipermail/h5part/>
List-id: H5Part development and discussion <h5part.lists.psi.ch>

On Jan 24, 2007, at 1:46 AM, John Biddiscombe wrote:

Dear H5Part users,

I'd like, if I may, to access the svn repository so that I can
a) update my copy of H5Part which is no doubt quite old
b) potentially contribute some improvements which I'll list below for your scrutiny

1) I found that reading particles into {x,y,z} arrays rather than into {x}, {y}, {z} separately was desirable from a visualization point of view, so I have added some logic to my own code which handles the memory space issues around this. The same applies when writing data which is held in memory as an N-component vector field {v0,v1,v2....vn} rather than as N separate arrays.
I have found that to facilitate the display of data using any scalar field as the x,y,z coordinate it's nice to be able to write coordinate or vector fields as single component datasets (as is currently done with H5Part), but read them back in as a vector array again. I also find a lot of data with vector fields as u,v,w in separate arrays and wish to recombine these into a single {u,v,w} array.

So I assume you want a convenience routine that can reassemble a list of arrays (stored on disk as scalar fields) and interlace them as vectors in memory. I assume you are *not* storing the data as N-component vector fields in the file though (am I correct?). If you've already done that, then it would be an excellent addition to the API. But we definitely want the disk image of the fields to be distinct vectors for various reasons.

2) With reference to the above, when reading N components into a single vector field, one wishes to supply an array of names. For example, the default H5Part "x", "y", "z" names are used for coordinate variables, but if the vector field "velocity" is written to file as "velocity_0", "velocity_1", "velocity_2" (for 3D), I may wish to read them back as the coordinate array later, so supplying an array of names (e.g. "x", "velocity_2", "y") to a read function allows one to arbitrarily map datasets onto components.

So, to see if I understand this correctly, you would like a convenience function that allows you to specify a vector of dataset names (rather than a single name) and it would naturally interlace them in memory upon read? Is that the request? I think that should be pretty reasonable as well given HDF's support for strided memory spaces. Do you have a proposed appearance for this API? (something that doesn't use var_args since it would complicate the F90 bindings). Overall, this is also quite reasonable provided the on-disk image has them laid out as scalars (they can be reconstituted in memory as vectors).
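
A minimal sketch of what such a name-vector read could look like in C, assuming the scalar datasets of one step are already available in memory. The scalar_field struct and the read_interleaved function are hypothetical names for illustration, not part of the H5Part API, and the table lookup stands in for the actual HDF5 reads. The names are passed as a plain array plus a count, so no varargs are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one scalar dataset of a step, held in memory. */
typedef struct {
    const char   *name;  /* dataset name, e.g. "x" or "velocity_2" */
    const double *data;  /* one value per particle */
} scalar_field;

/* Look a dataset up by name; returns NULL if it is not present. */
static const scalar_field *find_field(const scalar_field *fields,
                                      size_t nfields, const char *name)
{
    for (size_t i = 0; i < nfields; ++i)
        if (strcmp(fields[i].name, name) == 0)
            return &fields[i];
    return NULL;
}

/*
 * Interlace the datasets listed in names[0..ncomp-1] so that component c
 * of particle p ends up in out[p * ncomp + c].
 * Returns 0 on success, -1 if one of the names is missing.
 */
int read_interleaved(const scalar_field *fields, size_t nfields,
                     const char *const *names, size_t ncomp,
                     size_t nparticles, double *out)
{
    assert(fields != NULL && names != NULL && out != NULL);
    for (size_t c = 0; c < ncomp; ++c) {
        const scalar_field *f = find_field(fields, nfields, names[c]);
        if (f == NULL)
            return -1;
        for (size_t p = 0; p < nparticles; ++p)
            out[p * ncomp + c] = f->data[p];
    }
    return 0;
}
```

Calling it with the names {"x", "velocity_2", "y"} would fill each particle's record with those three datasets in that order, while the data itself stays stored as separate scalar arrays.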

3) The H5Part naming convention uses "Particles1", "Particles2", etc. as the group names (renamed, I understand, in later versions to steps1,2 or similar). I would prefer to use an attribute attached at the top level, which would be a standard string preserved for all time:
"ParticleTimeStepName". This attribute would be set to "Particles" in earlier files and "steps" in later versions. Additionally, I would like to add a second attribute which would be the format string:
"ParticleTimeFormatString". By default one would use %i to represent a simple integer, but if %05i were set, then particles would be stored as Particles00001, Particles00002, which makes ASCII listings of dataset contents more intuitive as they appear in numerical order (as opposed to the current particles1 particles10 particles100 particles11 particles12 ....).

The current sorting algorithm for the file format will be able to accommodate the different numbering formats you propose, so your proposed change would be backward compatible with existing readers (that's a good thing). So this addition could be implemented as a convention rather than a requirement.
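
To illustrate why padded and unpadded names can coexist, here is a small sketch in C. It is not the H5Part sorting code; format_step_name, parse_step_number and compare_step_names are hypothetical helpers. The padding width is passed as an integer rather than as a free-form format string, which is an assumption made here to keep the formatting call safe.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "<prefix><step>", zero-padded to `width` digits (0 = no padding).
 * Returns the snprintf result, i.e. the length the full name needs. */
int format_step_name(char *buf, size_t size, const char *prefix,
                     int width, int step)
{
    assert(width >= 0 && step >= 0);
    return snprintf(buf, size, "%s%0*d", prefix, width, step);
}

/* Extract the trailing decimal number of a group name.
 * Returns -1 if the name does not end in digits. */
long parse_step_number(const char *name)
{
    size_t end = strlen(name);
    size_t start = end;
    while (start > 0 && isdigit((unsigned char)name[start - 1]))
        --start;
    if (start == end)
        return -1;
    return strtol(name + start, NULL, 10);
}

/* qsort comparator: order group names by their numeric step,
 * regardless of how many leading zeros were written. */
int compare_step_names(const void *a, const void *b)
{
    long sa = parse_step_number(*(const char *const *)a);
    long sb = parse_step_number(*(const char *const *)b);
    return (sa > sb) - (sa < sb);
}
```

A reader that orders groups by the parsed number rather than by the raw string treats Particles12 and Particles00012 identically, so the padding only affects how plain listings look.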

Although most simulations use time steps, it is frequently the case that data is dropped due to IO issues and using a real time value might be desirable (e.g. 0.005, 0.0075, 0.1) if time steps are not regular, but rather adaptive. In this case a special array which holds the actual time values should be present:
"ParticleTimeValues", where the number of entries should match the number of groups or timesteps.

We chose to punt on requiring time values because not all simulations are time-domain. Given the actual meaning of the steps is simulation dependent (and hence could be controversial), you can certainly use the attributes interface provided in the API to store said attribute. However, it would be a convention rather than a requirement given it is of limited use.

We do encourage liberal use of attributes to serve the individual needs of groups though, so you should definitely implement storage of the TimeValue attribute. We should probably document the attributes that various groups have proposed for their own local conventions.

When I say "convention" I mean additional features that can be used to extend the content of the file format using attributes. When it is a convention, then readers can be coded to look for them for added value, but they should also be prepared for their absence. We are trying to minimize the "requirements" so as to keep the readers as simple as possible.
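
As an illustration of a reader that treats the time values as a convention, here is a minimal sketch in C. The step_info struct and step_time function are hypothetical, and falling back to the step index when the array is absent is an assumption made for the example, not documented H5Part behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory view of what a reader found in the file. */
typedef struct {
    const double *time_values; /* NULL if "ParticleTimeValues" is absent */
    size_t        nvalues;     /* number of entries in time_values */
    size_t        nsteps;      /* number of step groups in the file */
} step_info;

/*
 * Return the time associated with a step. The optional array is used
 * only if it is present and has one entry per step; otherwise the
 * step index itself is returned (assumed fallback).
 */
double step_time(const step_info *info, size_t step)
{
    assert(info != NULL && step < info->nsteps);
    if (info->time_values != NULL && info->nvalues == info->nsteps)
        return info->time_values[step];
    return (double)step;
}
```

A reader written this way gains the extra information when a file carries it and still works on files that do not.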

4) Support for other data types is lacking - I regularly use int, short, float, double, byte and even boolean. Since HDF5 supports them all, so should H5Part. Since the majority of my code is in C++, I have used a combination of templates and macros to allow any data type to be read/written and it works fine. I would like to explore ways of extending this to other languages and adding support in the main h5part interface.

The decision to go with limited type support for the API was for two reasons:
1) We didn't want to expose users to HDF5's rich type system. It is daunting for many users and can also result in very complex code for the readers. The current choice of data types is largely driven by the application codes that are currently writing to the file format. Most of them are doing Float32, Float64, or Int64. We didn't have much in the way of bool or byte data as of yet, but if you do have a code that writes bool or byte, we would certainly like to understand the requirement.
2) Half of our users are using the F90 bindings, so whatever solution we come up with must work across both the C and F90 bindings.

This can be the subject of an expanded discussion, but the current type support is purposely limited in order to simplify the logic for readers that employ languages that do not support templates/macros for type-independent algorithms. Perhaps there can be a variant on the read/write routines that does expose a more complex type system.
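
One possible shape for such a variant, sketched in C under the assumption that a small integer type tag is passed at run time instead of relying on templates (an integer tag can also be passed from F90). The names elem_type, elem_size and copy_elements are hypothetical and not part of H5Part, and the memcpy stands in for the actual dataset transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical run-time type tag covering a few element types. */
typedef enum {
    ELEM_FLOAT32,
    ELEM_FLOAT64,
    ELEM_INT64,
    ELEM_INT8
} elem_type;

/* Size in bytes of one element, or 0 for an unknown tag. */
size_t elem_size(elem_type type)
{
    switch (type) {
    case ELEM_FLOAT32: return sizeof(float);
    case ELEM_FLOAT64: return sizeof(double);
    case ELEM_INT64:   return sizeof(int64_t);
    case ELEM_INT8:    return sizeof(int8_t);
    }
    return 0;
}

/*
 * Single type-tagged entry point: copies `count` elements of the given
 * type from src to dst. Returns 0 on success, -1 for an unknown tag.
 */
int copy_elements(void *dst, const void *src, size_t count, elem_type type)
{
    size_t size = elem_size(type);
    assert(dst != NULL && src != NULL);
    if (size == 0)
        return -1;
    memcpy(dst, src, count * size);
    return 0;
}
```

The existing fixed-type routines could remain as they are, with a tagged routine of this kind offered alongside them for codes that need other element types.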

5) One of my emails appears in the archive where I mentioned DEM and SPH data and saw a reply that discussed meshed data. In this context, I was referring to Discrete Element Method particle data and Smoothed Particle Hydrodynamics particle data. I am encouraging users in both these domains to use H5Part as a storage form. They invariably have vector fields - so the use of multi-component scalars is essential. In the case of SPH data various kernel functions exist and I would like to store these as string attributes - I wonder if standard names (like those suggested above) could be chosen for Particle formats and enshrined somewhere so that

For various reasons, it is better for us to keep each vector component as separate scalar arrays on disk rather than using HDF's type interface (which doesn't allow reads of subcomponents of aggregate types). I think we can find some way to express writes of interlaced data so that they are laid out as non-interlaced datasets on disk. It is much more efficient to reorganize the data in memory (interlacing and deinterlacing) than to do sparse reads/writes to disk blocks.
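
The in-memory reorganization for the write direction can be sketched in C as follows. The function name deinterleave is hypothetical, and the sketch assumes double-precision data; each resulting array would then be written as its own scalar dataset.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split interlaced records {v0,v1,...} into ncomp contiguous scalar
 * arrays. comps[c] must have room for nparticles values; component c
 * of particle p is read from interleaved[p * ncomp + c].
 */
void deinterleave(const double *interleaved, size_t nparticles,
                  size_t ncomp, double *const *comps)
{
    assert(interleaved != NULL && comps != NULL);
    for (size_t c = 0; c < ncomp; ++c)
        for (size_t p = 0; p < nparticles; ++p)
            comps[c][p] = interleaved[p * ncomp + c];
}
```

For a {u,v,w} velocity array this yields three contiguous buffers, so each write touches a single dataset in one pass instead of issuing strided accesses against the file.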

My own svn repository of H5Part related material is here, should anyone wish to browse it:
https://svn.cscs.ch/vtkContrib/trunk/vtkCSCS/vtkH5Part/

Thanks for reading. I'm sure most of these issues have already been dealt with in your most recent versions of H5Part. I would very much like to contribute any that have not.

regards

JB

--
John Biddiscombe, email:biddisco @ cscs.ch
http://www.cscs.ch/about/BJohn.php
CSCS, Swiss National Supercomputing Centre | Tel: +41 (91) 610.82.07
Via Cantonale, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82

_______________________________________________
H5Part mailing list
H5Part AT lists.psi.ch
https://lists.web.psi.ch/mailman/listinfo/h5part
